RapidMiner

How to use standardization /normalization correctly on test/Train data set?

Elite II


hi,

I read that normalization/standardization should be fitted on the training set only, and that the resulting preprocessing model should then be applied to the test set.

But what about the validation set when I am doing cross-validation? Should I also apply a separate normalization inside the X-Validation, where the normalization ranges fitted on the training folds are applied to the held-out validation fold?
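The principle in question can be sketched outside RapidMiner. This is a hypothetical illustration (plain numpy, not RapidMiner code) of fitting normalization statistics on the training folds only and reusing those same statistics on the held-out fold, which is what passing the preprocessing model achieves:

```python
import numpy as np

# Fit the normalization on the training data only, then apply the SAME
# parameters to the held-out data -- never refit on the held-out fold.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
holdout = np.array([[2.5], [10.0]])

mu, sigma = train.mean(axis=0), train.std(axis=0)

train_z = (train - mu) / sigma       # zero mean, unit std on train
holdout_z = (holdout - mu) / sigma   # reuses the training statistics
```

The held-out values are rescaled with the training mean and standard deviation, so a genuinely out-of-range value stays visibly out of range after normalization.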

 

For now, my process looks like this:

Unbenannt.PNG

 

I apply normalization once in the outer "big" process, but inside the grid optimizer there is an X-Validation with an SVM, and I do not apply any further normalization there. My question is: would it be better if my process looked like this:

 

Unbenannt2.PNG

 

 

where I also apply normalization to the inner X-Validation's validation data (or is it called the test data?). And if so, what about the normalization in the outer process? How should I use that normalization for my outer test data without already applying it to the training data that goes into the X-Validation?

 

last question:

some people say (including my supervisor) that the held-out data inside cross-validation is called test data, not validation data, and that validation data is the separate data evaluated outside, entirely independent of the X-Validation splits. Is it not the other way around?

7 REPLIES
Elite II

Re: How to use standardization /normalization correctly on test/Train data set?

Can nobody answer this question? Should normalization be carried from training data to test data only, or also applied to the validation data?

Re: How to use standardization /normalization correctly on test/Train data set?

I am not sure if I understand your question, but see the attached process. This is how I typically use preprocessing models during X-Validation.

 

The grouped model now has two inner models: first the normalization, then the decision tree.

Please note that the order of the input models on Group Models is important.

Attachments

Community Manager

Re: How to use standardization /normalization correctly on test/Train data set?

[ Edited ]

Hi Fred,

 

Sometimes the terms are a bit confusing. Validation is the process in which you train and test on your data to determine how meaningful your model is; cross-validation is one such method. In order to properly validate and interpret your results, you must generate a training set and a test set.

 

I won't go into a discussion about cross-validation - that's been covered here in the forum - but our X-Validation operator automatically and iteratively creates training and test sets. Certain algorithms, such as k-NN, are susceptible to scaling problems, so you need to normalize the data. If you use 10-fold X-Validation with k-NN, you will train 10 different k-NN models and evaluate them on 10 different test sets. Therefore the normalization needs to happen inside each of the 10 folds.
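As a stand-in for the RapidMiner setup, the same per-fold normalization can be sketched with scikit-learn, where a Pipeline plays the role of Group Models: the scaler is refitted inside every fold, so the test fold never leaks into the normalization statistics. The dataset and the k value are illustrative choices only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler and k-NN are grouped into one model; cross_val_score refits
# the whole pipeline (including the scaler) on each training fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))
scores = cross_val_score(model, X, y, cv=10)  # one accuracy per fold
```

Putting the scaler outside the cross-validation loop would instead fit it on all the data at once, which is exactly the leakage this thread is about.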

 

You're awfully close to achieving this in your screenshots above, but I can offer a more elegant solution using the Group Models operator.

 

Z-normalization.png

 

In my screenshot above you should see a Normalize operator connected to a k-NN learner and then a Group Models operator. The training data gets normalized and passed to k-NN. Group Models is a handy operator that lets you apply models (preprocessing ones too) in a specific order to the test set. In this case, the preprocessing model from normalizing the training data is applied to the test set first, and then the trained k-NN model is applied and evaluated for performance.

 

Why the heck do we do that? The Normalize operator's default setting is Z-transformation. Z-normalization rescales your training data to zero mean and a standard deviation of 1. This preprocessing model is then applied to the test data so it is transformed with the same mean and standard deviation, making the two sets comparable. If you instead normalized the test set on its own, you would risk different scaling in your training and test sets, and you would not be honestly evaluating your data.
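The risk of normalizing the test set on its own can be made concrete with hypothetical numbers: a test set on a completely different scale looks perfectly normal if you z-normalize it with its own statistics, while the training statistics expose the shift.

```python
import numpy as np

train = np.array([1.0, 2.0, 3.0])
test = np.array([101.0, 102.0, 103.0])   # clearly on a different scale

mu, sigma = train.mean(), train.std()

honest = (test - mu) / sigma                    # train stats: shift is visible
dishonest = (test - test.mean()) / test.std()   # own stats: shift is hidden
```

`honest` has a mean far from zero, flagging the mismatch; `dishonest` has mean zero and looks indistinguishable from the training data, which is exactly the dishonest evaluation described above.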

 

Hope this helps. 

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Elite II

Re: How to use standardization /normalization correctly on test/Train data set?

This is even more confusing to me... so Group Models basically just passes the first model to the testing side first (and why not to the Apply Model operator?),

and the second model is then passed to the Apply Model's model input...

Why is it not the other way around? Because I think the testing input on the other half of the window comes second...?

 

But nonetheless, it should be just the same as my normalization process, or not? It's just more explicit...

 

Moreover, when I use this method, I get noticeably worse performance: 70% as opposed to 84% previously..??

 

 

Moderator

Re: How to use standardization /normalization correctly on test/Train data set?

Fred,

 

I think you have not quite grasped the Group Models operator. It creates a list of models, so you have something like:

grouped_model = [Normalization, k-NN]

 

When you apply the grouped model, the models are applied one after another: first the normalization, then the k-NN. That is why you move both over to the other side, and the same normalization equation is used on the testing side.
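The "list of models applied in order" idea can be sketched in a few lines. This is a toy illustration with hypothetical names, not the RapidMiner API:

```python
# A grouped model stores its sub-models in a list and applies them in
# order: preprocessing first, then the learner's prediction step.
class GroupedModel:
    def __init__(self, models):
        self.models = models          # e.g. [normalization, k-NN]

    def apply(self, data):
        for model in self.models:     # first transform, then predict
            data = model(data)
        return data

# A normalization step (subtract a "training mean" of 2.0) followed by
# a trivial stand-in for a learner's prediction.
normalize = lambda xs: [x - 2.0 for x in xs]
threshold = lambda xs: ["pos" if x > 0 else "neg" for x in xs]

grouped = GroupedModel([normalize, threshold])
result = grouped.apply([1.0, 3.0])   # -> ['neg', 'pos']
```

Swapping the two entries in the list would feed unnormalized data to the "learner", which is why the input order on Group Models matters.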

 

~martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Contributor

Re: How to use standardization /normalization correctly on test/Train data set?

So a better name for this operator would be "Sequence of Models", right?

Community Manager

Re: How to use standardization /normalization correctly on test/Train data set?

Kinda, maybe. What Group Models does is create a group of models (preprocessing models included) on the training side so you can apply them together on the test side.

 

You could still do it the old hardcore way: add the Normalize operator on the training side before the learner and pass its pre port through a thr port to the testing side. Then you need two Apply Model operators.

 

This gets messy, like so:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="124" name="Validation" width="90" x="380" y="34">
        <parameter key="sampling_type" value="2"/>
        <process expanded="true">
          <operator activated="true" class="normalize" compatibility="7.4.000" expanded="true" height="94" name="Normalize" width="90" x="112" y="120"/>
          <operator activated="true" class="k_nn" compatibility="7.4.000" expanded="true" height="76" name="k-NN" width="90" x="246" y="30">
            <parameter key="k" value="11"/>
          </operator>
          <connect from_port="training" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="k-NN" to_port="training set"/>
          <connect from_op="Normalize" from_port="preprocessing model" to_port="through 1"/>
          <connect from_op="k-NN" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="76" name="Apply Model (2)" width="90" x="45" y="75">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="76" name="Apply Model" width="90" x="246" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.4.000" expanded="true" height="76" name="Performance" width="90" x="380" y="30"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_port="through 1" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="source_through 2" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

The Group Models operator was introduced to clean this up. The same process as above gets cleaner below:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="124" name="Validation" width="90" x="380" y="34">
        <parameter key="sampling_type" value="2"/>
        <process expanded="true">
          <operator activated="true" class="normalize" compatibility="7.4.000" expanded="true" height="103" name="Normalize" width="90" x="45" y="34"/>
          <operator activated="true" class="k_nn" compatibility="7.4.000" expanded="true" height="82" name="k-NN" width="90" x="246" y="30">
            <parameter key="k" value="11"/>
          </operator>
          <operator activated="true" class="group_models" compatibility="7.4.000" expanded="true" height="103" name="Group Models" width="90" x="313" y="187"/>
          <connect from_port="training" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="k-NN" to_port="training set"/>
          <connect from_op="Normalize" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
          <connect from_op="k-NN" from_port="model" to_op="Group Models" to_port="models in 2"/>
          <connect from_op="Group Models" from_port="model out" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.4.000" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

It takes a few minutes to wrap your head around it, but then it makes a lot of sense. The neat thing is that you can have Normalize, Sample, and other preprocessing operators on the training side and then apply those preprocessing models to the testing side in an honest way, with no data snooping.

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott