normalising data before data split or after?

Fred12 · August 2016

I have a question: I want to split my data into a training and a test set. Should I apply normalisation before or after the split?

Someone told me it would make more sense to normalise after the split, separately for the train and test data. But why? If I do so, I normalise on the specific value ranges of the train or test set, whereas if I normalise before the split, I normalise on the whole range. Isn't that more general, and therefore more representative of my dataset? Or does it make no difference at all?

IngoRM · August 2016

Hi,

It makes a HUGE difference and is one of the most common errors in data science. Part of the reason is that most software tools do not allow you to do this the right way. Luckily, RapidMiner is not "most software tools" and lets you do this right.

Here is the answer: you should NEVER do anything BEFORE a split that leaks information about your testing data. If you normalize before the split, you use the testing data to calculate the range or distribution of the data, which leaks this information into the training data. That "contaminates" your data and can lead to over-optimistic performance estimates on your testing data. By the way, this is true not just for normalization but for all preprocessing steps that change data based on all data points, including feature selection. Just to be clear: this contamination does not have to lead to over-optimistic performance estimates, but often it will.

What you SHOULD do instead is to create the normalization only on the training data and use the preprocessing model coming out of the normalization operator. This preprocessing model can then be applied like any other model on the testing data as well and will change the testing data based on the training data (which is ok) but not the other way around.

The process below shows how this works in general. It can also be opened directly in RapidMiner Studio.

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="7.2.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="136">
        <parameter key="target_function" value="sum classification"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="7.2.000" expanded="true" height="103" name="Split Data" width="90" x="179" y="136">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
      </operator>
      <operator activated="true" class="normalize" compatibility="7.2.000" expanded="true" height="103" name="Normalize" width="90" x="313" y="34"/>
      <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model" width="90" x="447" y="136">
        <list key="application_parameters"/>
        <description align="center" color="transparent" colored="false" width="126">Apply Preprocessing Model to Testing Data</description>
      </operator>
      <operator activated="true" class="k_nn" compatibility="7.2.000" expanded="true" height="82" name="k-NN" width="90" x="447" y="34"/>
      <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="34">
        <list key="application_parameters"/>
        <description align="center" color="transparent" colored="false" width="126">Apply Prediction Model to Testing Data</description>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance" width="90" x="715" y="34">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Normalize" from_port="example set output" to_op="k-NN" to_port="training set"/>
      <connect from_op="Normalize" from_port="preprocessing model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="k-NN" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
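
For readers without RapidMiner at hand, the same flow as the process above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `fit_range` and `apply_range` are hypothetical, the data is randomly generated, and a simple range (min-max) transformation stands in for whatever method the Normalize operator is set to.

```python
import random

def fit_range(values):
    """Learn the range-transformation parameters (min and max) from the given values."""
    return min(values), max(values)

def apply_range(values, params):
    """Rescale values with previously learned (min, max); nothing is recomputed here."""
    lo, hi = params
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical one-attribute data set, generated only for this sketch
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(100)]

# Split first (70/30, mirroring the Split Data operator above)
train, test = data[:70], data[70:]

# Fit on the training partition only, then apply the same parameters to both
params = fit_range(train)
train_scaled = apply_range(train, params)
test_scaled = apply_range(test, params)

# Leaky variant, shown only for contrast: the test rows influence min and max
leaky_params = fit_range(data)
```

With this setup the scaled training values span exactly 0 to 1, while scaled test values may fall slightly outside that interval. That is expected: the test data is being treated as genuinely unseen.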

Hope this helps,

Ingo

IngoRM · August 2016

The type of transformation is the same as the setting of the operator that produced the model. So if the original setting was z-transformation, then a z-transformation is also applied to the test data. And yes, the distribution information (for a z-transformation, the mean and standard deviation) is taken from the training data, which was originally normalized with the operator. Those same parameters are stored in the model and are then applied to the test data.
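
As a rough illustration of what such a preprocessing model holds, here is a minimal Python sketch. The class name `ZTransformModel` and its methods are hypothetical and not RapidMiner's API, and the use of the population standard deviation is an assumption made for simplicity.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass(frozen=True)
class ZTransformModel:
    """Hypothetical stand-in for a preprocessing model: it only stores statistics."""
    mu: float
    sigma: float

    @classmethod
    def fit(cls, train_values):
        # Statistics come from the training data only
        # (population standard deviation is an assumption of this sketch)
        return cls(mu=mean(train_values), sigma=pstdev(train_values))

    def apply(self, values):
        # The stored mu and sigma are reused as-is for any data set
        return [(v - self.mu) / self.sigma for v in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

model = ZTransformModel.fit(train)
train_z = model.apply(train)
test_z = model.apply(test)
```

Applying the model to the test values never looks at the test mean or spread, so a test value of 5.0 (the training mean) maps to 0 regardless of what else is in the test set.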

Cheers,

Ingo

IngoRM · August 2016

Leaking information into the training set and, in general, validating models the wrong way are by far the most common issues. Not doing any feature engineering is probably the next one. Of course, somebody with more experience might get to better models in general, and most likely faster as well, but I personally always prefer weaker models that are well validated over highly optimized models where I cannot trust the validation or which are no longer robust. So not focusing too much on optimizing for accuracy alone is, in general, good advice.

We are adding more educational material as we go, so stay tuned. But some of this is also covered in "5 Minutes with Ingo", for example:

https://www.youtube.com/playlist?list=PLssWC2d9JhOZZ6PCzJt2L2zUwA3RozrP_

And of course in our trainings:

https://rapidminer.com/learning/training/

Cheers,

Ingo

Fred12 · August 2016

Thanks, that was a nice answer, but can you explain in more detail what exactly the preprocessing model is? Is it the z- or range-transformation computed on the training data only, so that the same transformation, based on those data, is then applied to the test data?

Fred12 · August 2016

Really nice, thanks.

Are there any more "tripping stones" like that which a beginner should be aware of?

I mean, I would probably never have arrived at that kind of solution if I hadn't been aware that this kind of problem exists...

Is there perhaps some documentation on what a beginner should consider when designing data mining processes?

cyborghijacker · August 2017

Hi Ingo

Regarding your statement: 'This preprocessing model can then be applied like any other model on the testing data as well and will change the testing data based on the training data (which is ok) but not the other way around.'

Why is changing the test data based on the training data OK, but not the other way around?

Regards

Ben

MartinLiebig · August 2017

Dear Ben,

I think there are two steps in this line of logic:

1) Normalization is part of the model. Let's say you run a linear regression model on just one attribute. The result is an equation like

y = a*x+b

If you normalize beforehand, you still get an equation, but it has a different form:

y = a'*norm(x)+b'

You could now rightfully argue that norm() is part of the model. One important point is that norm() is calculated using the full training set; it is not a constant function like log or sqrt.

So once this is clear, we talk about validation:

2) You want to validate in order to measure predictive performance, which is the performance on an unknown (= not used for training) data set. You build the full model on the training set and apply it to the test set. Since normalizing is part of your model, you need to apply it to the test set as well, using the parameters learned on the training set.

This statement is true for all transformations that need information about the full data (or the full attribute), for example PCA.
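
The argument that the normalization belongs to the model can be sketched in a few lines of Python. This is an illustration only: `fit_model` is a hypothetical helper, the toy data is made up, and a z-transformation with the population standard deviation is assumed for norm().

```python
from statistics import mean, pstdev

def fit_model(xs, ys):
    """Fit a one-attribute linear regression on normalized x; returns a predictor."""
    # norm() is learned here, from the training xs only
    mu, sigma = mean(xs), pstdev(xs)
    zs = [(x - mu) / sigma for x in xs]

    # Ordinary least squares for a single attribute (closed form)
    z_bar, y_bar = mean(zs), mean(ys)
    slope = (sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
             / sum((z - z_bar) ** 2 for z in zs))
    intercept = y_bar - slope * z_bar

    def predict(x):
        # The stored training mu and sigma are part of the model;
        # the statistics of the data being predicted are never used
        return slope * (x - mu) / sigma + intercept

    return predict

train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

predict = fit_model(train_x, train_y)
test_x = [10.0, 20.0]
predictions = [predict(x) for x in test_x]
```

Because `predict` carries the training mean and standard deviation with it, applying the model to test data automatically applies the training-based normalization. Re-normalizing the test set with its own statistics would amount to evaluating a different model from the one that was trained.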

Best,

Martin

cyborghijacker · August 2017

Thank you Martin, that was a concise and clear explanation.
