Normalization between two different data sets

ThiruThiru Member Posts: 100 Guru
hi all,

I understand the 'Normalize' operator normalizes *within* the attributes of a particular data set.

However, I have a case : 

I've trained and tested a classification model with a particular data set (A).

While deploying with a new, fresh data set (B), the attributes are not given on the same scale as in data set (A).

e.g.  attribute 'X' in data set 'A' is on the scale 0 to 100
       attribute 'X' in data set 'B' is on the scale 0 to 350

My question is: 

Does RapidMiner have any operator to normalize *between* two different data sets, or do we have to do it manually before feeding the data in?

 Kindly let me know. Thanks. 



    rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Yes, unfortunately. RapidMiner doesn't have a way to know that both data sets share the same structure, so it doesn't know what kind of preparation it needs. But today I have a trick for you, right up my sleeve. It's very basic but it might help you.

    For example, for training, I have this simple process:

    Instead of retrieving data and building a decision tree inside that process, I make some small modifications:

    ...and I then create a "main" process from which I call the rest:

    Do you see that "Execute Prepare Data" operator being called twice? It is the result of dragging and dropping the process you want to execute. You can save a lot of time if you embed your processes like this, as you can reuse your filters.
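    The reuse idea above can be sketched outside RapidMiner as well. This is a rough programming analogy, not RapidMiner code: the shared "Prepare Data" process is a single function that both the training path and the scoring path call, so any filter change happens in one place. The filter itself (dropping missing rows) is a hypothetical example.

```python
# Analogy for embedding one shared "Prepare Data" process:
# both training and deployment call the same preparation step.

def prepare_data(rows):
    """Shared preparation; the filter here (drop missing values)
    is a hypothetical stand-in for whatever filters you reuse."""
    return [r for r in rows if r is not None]

def train(rows):
    return prepare_data(rows)   # same preparation at training time...

def score(rows):
    return prepare_data(rows)   # ...reused unchanged at deployment time
```

    Because both paths go through `prepare_data`, training and scoring can never drift apart in how they filter the data.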

    Hope this helps,

    BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @Thiru,

    this is a conceptual question.
    What does it mean for a model that attribute X has a value of 30 (normalized, e.g., to -0.2)? Should a value of 30 in example set B be handled by the model in the same way?

    RapidMiner lets you store the "preprocessing model" from the Normalize operator and apply it (Retrieve => Apply Model) to the new data. That makes sure the predictive model sees the same normalized input for identical raw numbers. (In your case, the normalization model from A will assign a high value in B if X = 350, but that's the correct approach.)
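    The same idea, sketched in plain Python rather than RapidMiner: fit the z-transformation parameters on training set A only, store them, and reuse them on new data B instead of refitting. The function names are illustrative, not part of any library.

```python
# Fit normalization parameters on the training data (set A) only.
def fit_znorm(values):
    """Learn mean and standard deviation from the training data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var ** 0.5

def apply_znorm(values, mean, std):
    """Apply the stored parameters to any data set, old or new."""
    return [(v - mean) / std for v in values]

# Attribute X in training set A, on a 0..100 scale
a_x = [0, 25, 50, 75, 100]
mean, std = fit_znorm(a_x)       # "store the preprocessing model"

# New data B on a 0..350 scale: reuse A's parameters, do NOT refit.
b_x = apply_znorm([350], mean, std)
# X = 350 gets a high z-score, exactly as described above: the model
# sees it as unusually large relative to the training data.
```

    Refitting the normalization on B would instead squash 350 back into a "normal" range and silently change what the model's inputs mean.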

    It's even more elegant to build one stacked model from the normalization and the predictive model, using Group Models. (The example process in the operator's help illustrates the concept.) You would do this inside a cross validation on the left side, and then just apply the grouped model on the right side. This is the conceptually correct approach.
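    A minimal sketch of the Group Models idea in plain Python (the class names are made up for illustration): the normalization step and a toy predictive model are stacked into one object, so a single "apply" always normalizes consistently before predicting.

```python
class ZNorm:
    """Normalization stage: fit on training data, apply anywhere."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        self.std = (sum((x - self.mean) ** 2 for x in xs) / len(xs)) ** 0.5
        return self
    def apply(self, xs):
        return [(x - self.mean) / self.std for x in xs]

class ThresholdModel:
    """Toy classifier: 'high' if the normalized value exceeds 0."""
    def apply(self, xs):
        return ["high" if x > 0 else "low" for x in xs]

class GroupedModel:
    """Applies each stage in order, like a RapidMiner grouped model."""
    def __init__(self, *stages):
        self.stages = stages
    def apply(self, xs):
        for stage in self.stages:
            xs = stage.apply(xs)
        return xs

# Training side: fit normalization on A, group it with the predictor.
norm = ZNorm().fit([0, 25, 50, 75, 100])
model = GroupedModel(norm, ThresholdModel())

# Deployment side: one apply call normalizes and predicts in one step.
print(model.apply([10, 90]))
```

    The point of the grouping is that the deployment side cannot forget (or redo differently) the normalization: it travels inside the model.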

    The process from @rfuentealba is correct in the generic case. However, what would it mean for your data and your model to normalize each set differently? And how would you normalize *one* example later, if you're applying the model to single examples? Unless you have very good reasons to normalize differently, you should keep the normalization parameters learned for the model, using one of the methods described above.
