Normalization between two different data sets

Thiru · February 2020

Hi all,

I understand that the "Normalize" operator normalizes "within" the attributes of a particular data set.

However, I have the following case:

I've trained and tested a classification model with a particular data set (A).

When deploying it on a fresh data set (B), the attributes are not on the same scale as in data set (A).

e.g.: attribute 'X' in data set 'A' is on a scale of 0 to 100
attribute 'X' in data set 'B' is on a scale of 0 to 350

My question is:

Does RapidMiner have any operator to normalize "between" two different data sets, or do we have to do it manually before feeding the data in?

Kindly let me know. Thanks.

thiru

rfuentealba · February 2020

Hello,

Yes, unfortunately you have to do it yourself. RapidMiner has no way to know that both datasets share the same structure, so it doesn't know what kind of preparation each one needs. But I have a trick up my sleeve for you today. It's very basic, but it might help.

For example, for training, I have this simple process:

Image: https://us.v-cdn.net/6030995/uploads/editor/lp/gbpu2koyzyfr.png

Instead of retrieving data and making a decision tree inside the process, I make some small modifications:

Image: https://us.v-cdn.net/6030995/uploads/editor/re/qedndn41wqlu.png

...and I then create a "main" process from where I call the rest:

Image: https://us.v-cdn.net/6030995/uploads/editor/20/0h4450uvn2z7.png

Do you see that "Execute Prepare Data" operator being called twice? That operator is what you get when you drag and drop the process you want to execute into another process. You can save a lot of time if you embed your processes like this, as you can reuse your filters.
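For readers who think in code, the same idea can be sketched in Python: one shared preparation step, called once for the training data and once for the new data. This is only an analogy. `prepare_data` and its missing-value filter are hypothetical stand-ins, since the actual contents of the "Prepare Data" process are only visible in the screenshots.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def prepare_data(rows: List[Row]) -> List[Row]:
    """Hypothetical shared preparation step: drop rows with missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]


# The "main" process calls the same preparation for both data sets,
# so any filter is defined once and reused.
training_rows = prepare_data([{"X": 10.0}, {"X": None}, {"X": 90.0}])
scoring_rows = prepare_data([{"X": 350.0}, {"X": None}])
```

The benefit is that a change to the preparation logic is made in one place and reaches both the training and the deployment path.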

Hope this helps,

Rodrigo.

BalazsBarany · February 2020

Hi @Thiru,

This is a conceptual question.
What does it mean for a model that attribute X has a value of 30 (normalized to e.g. -0.2)? Should a value of 30 in example set B be handled by the model in the same way?

RapidMiner lets you store the "preprocessing model" from the Normalize operator and apply it (Retrieve => Apply Model) to the new data. That makes sure that the predictive model sees the same normalized input for identical numbers. (In your case, the normalization model from A will assign a high value to an example in B with X = 350, but that's the correct approach.)
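As a rough illustration of what storing and re-applying the preprocessing model amounts to, here is a minimal Python sketch. It assumes a z-transformation with the population standard deviation; the function names and the sample values are made up for the example and are not RapidMiner's API.

```python
from statistics import mean, pstdev
from typing import Dict, List


def fit_normalization(values: List[float]) -> Dict[str, float]:
    """Learn the z-transformation parameters from the training data only."""
    return {"mean": mean(values), "std": pstdev(values)}


def apply_normalization(params: Dict[str, float], values: List[float]) -> List[float]:
    """Apply previously stored parameters; nothing is re-estimated here."""
    return [(v - params["mean"]) / params["std"] for v in values]


x_in_a = [0.0, 25.0, 50.0, 75.0, 100.0]  # illustrative values on the 0..100 scale
x_in_b = [30.0, 350.0]                   # new data, including an out-of-range value

params_from_a = fit_normalization(x_in_a)  # this is the "preprocessing model"
normalized_b = apply_normalization(params_from_a, x_in_b)
```

With these numbers, 30 lands slightly below zero, exactly as it would have in A, while 350 becomes a large positive value because it lies far outside the range the parameters were learned from.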

It's even more elegant to build one stacked model from the normalization model and the predictive model, using Group Models. (The example process in the operator's help illustrates the concept.) You would do this inside a Cross Validation on the left (training) side, and then just apply the grouped model on the right (testing) side. This is the conceptually correct approach.
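The grouping idea can be sketched in Python as a chain of stages that is always applied in the same order. This is a conceptual sketch, not how Group Models is implemented: `group_models`, the toy threshold classifier, and the training values are all assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import Callable, List


def fit_normalizer(values: List[float]) -> Callable[[float], float]:
    """Z-transformation with parameters frozen at training time."""
    mu, sigma = mean(values), pstdev(values)
    return lambda v: (v - mu) / sigma


def fit_threshold_classifier(values: List[float], labels: List[str]) -> Callable[[float], str]:
    """Toy stand-in for a predictive model: cut halfway between the class means."""
    high = mean(v for v, label in zip(values, labels) if label == "high")
    low = mean(v for v, label in zip(values, labels) if label == "low")
    cut = (high + low) / 2
    return lambda v: "high" if v > cut else "low"


def group_models(*stages: Callable) -> Callable:
    """Chain the stages so callers can only apply them together, in order."""
    def apply(value):
        for stage in stages:
            value = stage(value)
        return value
    return apply


x_train = [0.0, 20.0, 80.0, 100.0]
y_train = ["low", "low", "high", "high"]

normalizer = fit_normalizer(x_train)
classifier = fit_threshold_classifier([normalizer(v) for v in x_train], y_train)
grouped = group_models(normalizer, classifier)

prediction_for_30 = grouped(30.0)    # raw value, normalized inside the group
prediction_for_350 = grouped(350.0)
```

Because the classifier is only reachable through the grouped model, it cannot accidentally be fed raw or differently normalized values.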

The process from @rfuentealba is correct in the generic case. However, what does it mean for your data and your model to normalize in a different way? How would you normalize *one* example later, if you're applying the model to single examples? If you don't have very good reasons to normalize in a different way, you should keep the normalization-for-the-model parameters using one of the described methods.
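To make the concern concrete, here is a small worked example in Python using the ranges from the original question. It assumes a range (min-max) transformation to 0..1; the helper name is made up for the illustration.

```python
def range_normalize(value: float, lo: float, hi: float) -> float:
    """Min-max scaling of a value to 0..1 given an observed range."""
    return (value - lo) / (hi - lo)


raw = 30.0

# The same raw value, normalized with each data set's own range.
scaled_with_a = range_normalize(raw, 0.0, 100.0)  # range of X in A
scaled_with_b = range_normalize(raw, 0.0, 350.0)  # range of X in B

# A single example has no range of its own: min and max coincide,
# so it cannot be normalized without stored parameters.
single_example = [raw]
own_range = max(single_example) - min(single_example)
```

The model was trained on inputs where 30 maps to about 0.30, but normalizing B on its own turns the same 30 into roughly 0.09, and a lone example has a range of zero, so there is nothing to scale by.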

Regards,
Balázs
