# Normalization between two different data sets

Hi all,

I understand that the "Normalize" operator normalizes *within* the attributes of a particular data set.

However, I have the following case:

I've trained and tested a classification model on a particular data set (A).

When deploying on a fresh data set (B), the attributes are not on the same scale as in data set A.

E.g.: attribute 'X' in data set A is on a scale of 0 to 100,

while attribute 'X' in data set B is on a scale of 0 to 350.

My question is:

Does RapidMiner have an operator to normalize *between* the two data sets, or do we have to do this manually before feeding the data in?

Kindly let me know. Thanks.

thiru

## Answers

This is a conceptual question.

What does it mean for the model that attribute X has a value of 30 (normalized to, e.g., -0.2)? Should a value of 30 in example set B be handled by the model in the same way?

RapidMiner lets you store the "preprocessing model" produced by the normalization and apply it (Retrieve => Apply Model) to the new data. That makes sure the predictive model sees the same normalized input for identical raw values. (In your case, the normalization model from A will assign a high value to B if X = 350, but that's the correct approach.)
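To make the idea concrete outside of RapidMiner, here is a minimal plain-Python sketch of "fit the normalization on A, then apply the stored parameters to B". It is only an illustration: the `ZScoreModel` class, the z-transformation and the sample values are my own assumptions and stand in for the stored preprocessing model, they are not RapidMiner's implementation.

```python
from statistics import mean, pstdev


class ZScoreModel:
    """Hypothetical stand-in for a stored normalization (preprocessing) model."""

    def fit(self, values):
        # Learn the parameters from the training data only (data set A).
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid division by zero for constant columns
        return self

    def apply(self, values):
        # Reuse the stored parameters; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]


# Assumed sample values for attribute X, made up for illustration.
x_a = [0.0, 25.0, 50.0, 75.0, 100.0]   # training data, scale 0 to 100
x_b = [30.0, 350.0]                    # new data, scale 0 to 350

norm_model = ZScoreModel().fit(x_a)    # "store" the preprocessing model
a_norm = norm_model.apply(x_a)
b_norm = norm_model.apply(x_b)         # same parameters applied to B

print(b_norm)
```

With this setup a raw value of 30 gets the same normalized value whether it shows up in A or in B, and 350 simply lands far outside the range the model saw during training.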

It's even more elegant to build one stacked model from the normalization and the predictive model, using Group Models. (The example process in the help illustrates the concept.) You would do this inside a cross validation on the left side, and then just apply the grouped model on the right side. This is the conceptually correct approach.
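The same grouping idea, sketched in plain Python rather than as a RapidMiner process: the normalization and the learner are wrapped into one object, fitted on the training part of each fold and applied as a unit to the test part. All class and function names (`ZScore`, `NearestCentroid`, `GroupedModel`, `cross_validate`) and the toy data are hypothetical, and the nearest-centroid learner is just a placeholder for whatever classifier you actually use.

```python
from statistics import mean, pstdev


class ZScore:
    """Hypothetical normalization step (z-transformation)."""

    def fit(self, xs):
        self.mean_ = mean(xs)
        self.std_ = pstdev(xs) or 1.0
        return self

    def transform(self, xs):
        return [(x - self.mean_) / self.std_ for x in xs]


class NearestCentroid:
    """Placeholder one-attribute classifier: predicts the label of the closest class mean."""

    def fit(self, xs, ys):
        self.centroids_ = {
            label: mean([x for x, y in zip(xs, ys) if y == label])
            for label in set(ys)
        }
        return self

    def predict(self, xs):
        return [
            min(self.centroids_, key=lambda label: abs(x - self.centroids_[label]))
            for x in xs
        ]


class GroupedModel:
    """Normalization + learner chained into one model, applied as a unit."""

    def __init__(self, preprocessing, learner):
        self.preprocessing = preprocessing
        self.learner = learner

    def fit(self, xs, ys):
        self.preprocessing.fit(xs)  # parameters come from training data only
        self.learner.fit(self.preprocessing.transform(xs), ys)
        return self

    def predict(self, xs):
        return self.learner.predict(self.preprocessing.transform(xs))


def cross_validate(xs, ys, k=2):
    """Fit the grouped model on each training split, score it on the held-out split."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = GroupedModel(ZScore(), NearestCentroid())
        model.fit([xs[i] for i in train], [ys[i] for i in train])
        preds = model.predict([xs[i] for i in test])
        correct += sum(p == ys[i] for p, i in zip(preds, test))
    return correct / len(xs)


# Assumed toy data for attribute X with two classes.
xs = [5.0, 10.0, 20.0, 30.0, 70.0, 80.0, 90.0, 95.0]
ys = ["low"] * 4 + ["high"] * 4

accuracy = cross_validate(xs, ys, k=2)

# Final model trained on all of A, then applied to new raw values from B.
final_model = GroupedModel(ZScore(), NearestCentroid()).fit(xs, ys)
predictions = final_model.predict([30.0, 350.0])
print(accuracy, predictions)
```

Because the normalization lives inside the grouped model, the test fold never influences the normalization parameters, and applying the model to a single new example needs nothing beyond the model itself.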

The process from @rfuentealba is correct in the generic case. However, what does it mean for your data and your model to normalize in a different way? How would you normalize *one* example later, if you're applying the model to single examples? If you don't have very good reasons to normalize in a different way, you should keep the normalization-for-the-model parameters using one of the described methods.

Regards,

Balázs