Modified cross validation [Solved]

ammargh Member Posts: 27 Maven
edited November 2018 in Help
I am playing with imbalanced data (28 positive examples and 444 negative examples). I randomly sampled 30 examples of the majority class, merged them with the minority class, and used the result for learning. During the test step of the cross validation, I append the remaining majority-class examples to the test data set (retrieved from the tes port) in order to test the model's performance on imbalanced data.

The results of 2-, 3-, and 4-fold cross validation were reasonable: the performance vector showed that the model was tested on 28 positive examples. However, with 5-fold cross validation and more, things get weird. The performance vector shows more positive examples than there are in the original data set. This is expected for the negative examples, because I am using almost all of them in each iteration. The positive examples, however, should add up to exactly the original number, because in each iteration I am using only the portion provided by the tes port.
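
To make the expectation concrete, here is a minimal Python sketch of the totals the performance vector should report for a given number of folds. The counts (28, 444, 30) come from the description above; the function name `expected_test_totals` is hypothetical and only serves the illustration.

```python
# Counts taken from the description above.
N_POS = 28          # minority (positive) examples
N_NEG = 444         # majority (negative) examples
N_NEG_SAMPLED = 30  # negatives sampled into the balanced learning set


def expected_test_totals(k):
    """Return (positives, negatives) scored in total over all k folds."""
    held_out_neg = N_NEG - N_NEG_SAMPLED          # negatives appended in every fold
    positives = N_POS                             # each positive is tested exactly once
    negatives = N_NEG_SAMPLED + k * held_out_neg  # sampled once, held-out ones k times
    return positives, negatives


for k in (2, 3, 4, 5, 6):
    print(k, expected_test_totals(k))
```

Whatever the number of folds, the positive total stays at 28; only the negative total grows with it.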

Can you please help me understand what I am doing wrong?

The data I used can be downloaded from

http://sci2s.ugr.es/keel/keel-dataset/datasets/imbalanced/imb_IRhigherThan9p1/page-blocks-1-3_vs_4.zip
(CSV format with a few comment lines)

The code I use is

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve play imbalanced" width="90" x="45" y="30">
       <parameter key="repository_entry" value="//Local Repository/data/play imbalanced"/>
     </operator>
     <operator activated="true" class="generate_id" compatibility="5.3.013" expanded="true" height="76" name="Generate ID" width="90" x="45" y="120"/>
     <operator activated="true" class="multiply" compatibility="5.3.013" expanded="true" height="94" name="Multiply" width="90" x="45" y="345"/>
     <operator activated="true" class="filter_examples" compatibility="5.3.013" expanded="true" height="76" name="Filter Examples (2)" width="90" x="246" y="345">
       <parameter key="condition_class" value="attribute_value_filter"/>
       <parameter key="parameter_string" value="att11=positive"/>
       <parameter key="invert_filter" value="true"/>
     </operator>
     <operator activated="true" class="multiply" compatibility="5.3.013" expanded="true" height="94" name="Multiply (2)" width="90" x="380" y="345"/>
     <operator activated="true" class="sample" compatibility="5.3.013" expanded="true" height="76" name="Sample" width="90" x="581" y="345">
       <parameter key="sample_size" value="30"/>
       <list key="sample_size_per_class"/>
       <list key="sample_ratio_per_class"/>
       <list key="sample_probability_per_class"/>
       <parameter key="use_local_random_seed" value="true"/>
     </operator>
     <operator activated="true" class="multiply" compatibility="5.3.013" expanded="true" height="94" name="Multiply (3)" width="90" x="715" y="390"/>
     <operator activated="true" class="set_minus" compatibility="5.3.013" expanded="true" height="76" name="Set Minus" width="90" x="581" y="660"/>
     <operator activated="true" class="remember" compatibility="5.3.013" expanded="true" height="60" name="Remember" width="90" x="782" y="750">
       <parameter key="name" value="NegDat"/>
       <parameter key="io_object" value="ExampleSet"/>
     </operator>
     <operator activated="true" class="filter_examples" compatibility="5.3.013" expanded="true" height="76" name="Filter Examples" width="90" x="246" y="75">
       <parameter key="condition_class" value="attribute_value_filter"/>
       <parameter key="parameter_string" value="att11=positive"/>
     </operator>
     <operator activated="true" class="append" compatibility="5.3.013" expanded="true" height="94" name="Append" width="90" x="581" y="120"/>
     <operator activated="true" class="shuffle" compatibility="5.3.013" expanded="true" height="76" name="Shuffle" width="90" x="715" y="120"/>
     <operator activated="true" class="x_validation" compatibility="5.3.013" expanded="true" height="112" name="Validation" width="90" x="916" y="75">
       <parameter key="number_of_validations" value="6"/>
       <process expanded="true">
         <operator activated="true" class="neural_net" compatibility="5.3.013" expanded="true" height="76" name="Neural Net" width="90" x="246" y="75">
           <list key="hidden_layers"/>
         </operator>
         <connect from_port="training" to_op="Neural Net" to_port="training set"/>
         <connect from_op="Neural Net" from_port="model" to_port="model"/>
         <portSpacing port="source_training" spacing="0"/>
         <portSpacing port="sink_model" spacing="0"/>
         <portSpacing port="sink_through 1" spacing="0"/>
       </process>
       <process expanded="true">
         <operator activated="true" class="recall" compatibility="5.3.013" expanded="true" height="60" name="Recall" width="90" x="45" y="390">
           <parameter key="name" value="NegDat"/>
           <parameter key="io_object" value="ExampleSet"/>
           <parameter key="remove_from_store" value="false"/>
         </operator>
         <operator activated="true" class="append" compatibility="5.3.013" expanded="true" height="94" name="Append (2)" width="90" x="179" y="300"/>
         <operator activated="true" class="apply_model" compatibility="5.3.013" expanded="true" height="76" name="Apply Model" width="90" x="179" y="75">
           <list key="application_parameters"/>
         </operator>
         <operator activated="true" class="performance" compatibility="5.3.013" expanded="true" height="76" name="Performance" width="90" x="380" y="30"/>
         <connect from_port="model" to_op="Apply Model" to_port="model"/>
         <connect from_port="test set" to_op="Append (2)" to_port="example set 1"/>
         <connect from_op="Recall" from_port="result" to_op="Append (2)" to_port="example set 2"/>
         <connect from_op="Append (2)" from_port="merged set" to_op="Apply Model" to_port="unlabelled data"/>
         <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
         <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
         <portSpacing port="source_model" spacing="0"/>
         <portSpacing port="source_test set" spacing="0"/>
         <portSpacing port="source_through 1" spacing="0"/>
         <portSpacing port="sink_averagable 1" spacing="0"/>
         <portSpacing port="sink_averagable 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Retrieve play imbalanced" from_port="output" to_op="Generate ID" to_port="example set input"/>
     <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
     <connect from_op="Generate ID" from_port="original" to_port="result 2"/>
     <connect from_op="Multiply" from_port="output 1" to_op="Filter Examples" to_port="example set input"/>
     <connect from_op="Multiply" from_port="output 2" to_op="Filter Examples (2)" to_port="example set input"/>
     <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
     <connect from_op="Multiply (2)" from_port="output 1" to_op="Sample" to_port="example set input"/>
     <connect from_op="Multiply (2)" from_port="output 2" to_op="Set Minus" to_port="example set input"/>
     <connect from_op="Sample" from_port="example set output" to_op="Multiply (3)" to_port="input"/>
     <connect from_op="Multiply (3)" from_port="output 1" to_op="Append" to_port="example set 2"/>
     <connect from_op="Multiply (3)" from_port="output 2" to_op="Set Minus" to_port="subtrahend"/>
     <connect from_op="Set Minus" from_port="example set output" to_op="Remember" to_port="store"/>
     <connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 1"/>
     <connect from_op="Append" from_port="merged set" to_op="Shuffle" to_port="example set input"/>
     <connect from_op="Shuffle" from_port="example set output" to_op="Validation" to_port="training"/>
     <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
   </process>
 </operator>
</process>

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    It's something to do with the way the cross validation operator produces its final average performance vector. I noticed the order of the confusion matrices for the inner performances sometimes swapped the order of positive and negative. The cross validation operator uses the position within the matrix to calculate the average since it assumes that the orders will always be the same.

    If you remove the Shuffle operator, does the behaviour change?

    regards

    Andrew
  • ammargh Member Posts: 27 Maven
    Thank you Andrew.

    Removing the Shuffle operator did change the behavior, and I got the expected results. However, I don't think it is good practice to pass on the examples in the order in which they leave the Append operator.

    Do you think there is something wrong in the performance calculation method, or was I mistaken in using the Shuffle operator?
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    The cross validation will perform a fair amount of shuffling as it divides the data into partitions. One of the options is shuffled sampling.

    regards

    Andrew
  • ammargh Member Posts: 27 Maven
    Thank you very much Andrew.

    I see your point. However, conceptually, I do not think that additional shuffling should affect the result.
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello ammargh,

    I agree that shuffling shouldn't make a difference. You've discovered a feature of the cross validation operator.

    Regards

    Andrew
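
The explanation given in this thread (per-fold results combined by position rather than by class label) can be illustrated with a small Python sketch. This is only a model of the suspected behaviour, not RapidMiner's actual implementation. The fold counts are hypothetical, the per-class totals are a simplification of a full confusion matrix, and the names `sum_by_position` and `sum_by_label` are made up for the example.

```python
def sum_by_position(folds):
    """Add per-class counts slot by slot, ignoring which label each slot holds."""
    totals = [0, 0]
    for _labels, counts in folds:
        for i, count in enumerate(counts):
            totals[i] += count
    # Report the totals under the label order of the first fold.
    return dict(zip(folds[0][0], totals))


def sum_by_label(folds):
    """Add per-class counts keyed by the class label itself."""
    totals = {}
    for labels, counts in folds:
        for label, count in zip(labels, counts):
            totals[label] = totals.get(label, 0) + count
    return totals


# Hypothetical folds: (label order, number of test examples per class).
folds = [
    (("positive", "negative"), (5, 419)),
    (("negative", "positive"), (419, 5)),  # class order swapped in this fold
]

print(sum_by_position(folds))  # negatives leak into the positive slot
print(sum_by_label(folds))     # totals stay attached to the right class
```

Under this assumption, a single fold with the class order swapped is enough to push the reported number of positive examples above the number that exists in the data set, which matches the symptom described in the question.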