Different classification results from the Performance operator and the ExampleSet output

Lara Member Posts: 5 Contributor II
edited November 2018 in Help
Hello RapidMiner Experts,
In my process I train a classifier and measure its performance on both the training and the testing partition. If I count the correctly predicted cases of the training and the testing partition together, I get 953 correct predictions.
Looking at the ExampleSet output of the validation operator at the root level, I expected to receive the whole classified data set (training and testing data together), but there I count 968 correct predictions.
Maybe there is a simple reason for this, but I do not understand why the numbers differ.
I attached an example process that might help to understand my problem.
Thank you very much for your help. Lara
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Root">
    <process expanded="true" height="584" width="962">
      <operator activated="true" class="generate_direct_mailing_data" compatibility="5.0.8" expanded="true" height="60" name="DirectMailingExampleSetGenerator" width="90" x="45" y="30">
        <parameter key="number_examples" value="1000"/>
      </operator>
      <operator activated="true" class="split_validation" compatibility="5.0.8" expanded="true" height="130" name="SimpleValidation" width="90" x="179" y="30">
        <process expanded="true" height="731" width="500">
          <operator activated="true" class="decision_tree" compatibility="5.0.8" expanded="true" height="76" name="Decision Tree" width="90" x="44" y="30"/>
          <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="179" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.0.8" expanded="true" height="76" name="Performance (2)" width="90" x="313" y="120"/>
          <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Decision Tree" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Apply Model" from_port="model" to_port="model"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="through 1"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
        <process expanded="true" height="731" width="500">
          <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="ModelApplier" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.0.8" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
          <connect from_port="model" to_op="ModelApplier" to_port="model"/>
          <connect from_port="test set" to_op="ModelApplier" to_port="unlabelled data"/>
          <connect from_port="through 1" to_port="averagable 2"/>
          <connect from_op="ModelApplier" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="72"/>
          <portSpacing port="source_through 2" spacing="324"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="90"/>
          <portSpacing port="sink_averagable 3" spacing="342"/>
        </process>
      </operator>
      <connect from_op="DirectMailingExampleSetGenerator" from_port="output" to_op="SimpleValidation" to_port="training"/>
      <connect from_op="SimpleValidation" from_port="model" to_port="result 1"/>
      <connect from_op="SimpleValidation" from_port="training" to_port="result 2"/>
      <connect from_op="SimpleValidation" from_port="averagable 1" to_port="result 3"/>
      <connect from_op="SimpleValidation" from_port="averagable 2" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hi Lara,
    if you connect the model output of a validation operator, a model is built on the complete training data using the training subprocess. Hence the training subprocess is executed a second time, this time on all available data. Along with learning this final model, your process performs a new prediction, so the results differ from those measured during the performance evaluation.
    You might find it useful to insert breakpoints after the model appliers if you want to check this yourself.

    Greetings,
      Sebastian
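The effect Sebastian describes is not specific to RapidMiner. A minimal sketch of the same situation in scikit-learn terms (an illustration only; the thread itself is about RapidMiner's Split Validation operator): one tree is trained on the training partition and scored on both partitions, while a second tree, refit on the complete data set, produces its own predictions, so the two counts of correct predictions need not agree.

```python
# Illustrates why a model refit on ALL data can predict differently
# than the model that was evaluated on a train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data set, 1000 examples (mirroring the example process).
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Validation phase: fit on the training partition only,
# then count correct predictions on both partitions.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
correct_split = (tree.predict(X_tr) == y_tr).sum() + (tree.predict(X_te) == y_te).sum()

# Final model: refit on the complete data set (what the validation
# operator does a second time when its model output is connected).
final = DecisionTreeClassifier(random_state=0).fit(X, y)
correct_full = (final.predict(X) == y).sum()

# The two counts generally differ, because they come from different models.
print(correct_split, correct_full)
```

The unpruned tree refit on all data typically classifies its own training data perfectly, while the split-evaluated tree makes mistakes on the held-out partition, which is exactly the kind of discrepancy Lara observed.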
  • Lara Member Posts: 5 Contributor II
    Good Morning Sebastian,
    thanks for the hint about using breakpoints - I should have hit on that on my own :-)
    Does it make sense to build a new model? If I wanted to create a model based on the whole data set, I would leave out the Split Validation operator.

    Greetings, Lara
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hi Lara,
    in fact, it's a convenient way of generating a complete model while also getting an estimate of its performance.
    And the other way around: if you just want to build a model on a subset, you might use sampling instead of the split validation.
    If you want to extract the single model built on the subset, you could take a look at the Remember and Recall operators, which allow you to pass objects through "wormholes" in the process even when no direct connection is possible :)

    Greetings,
      Sebastian