
Problems SplitValidation

Micha Member Posts: 3 Contributor I
edited November 2018 in Help

I have some problems understanding the Split Validation. I thought the model learned on the "training" side (left) is the same model which is applied on the testing side (right). But when I store the models on both sides and retrieve them in another process, they are different (picture).

Is this a bug?



  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,750  RM Founder
    Hi Micha,

    Welcome to our RapidMiner forum. Did you connect the model output port of your Split Validation operator? In that case, the model which is delivered by the whole validation would be produced again on the complete data set and hence also stored with Store (3).
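
    To make that behaviour concrete, here is a minimal Python sketch (not RapidMiner code; the mean-predictor "learner" and all names are invented for illustration) of what Split Validation does when its model output port is connected: the inner model is trained on the split and used for testing, but the delivered model is trained again on the complete data set.

    ```python
    # Toy sketch of Split Validation semantics. The "model" is just a mean predictor.

    def train(rows):
        """Toy learner: predicts the mean of the label column."""
        mean = sum(label for _, label in rows) / len(rows)
        return {"prediction": mean, "trained_on": len(rows)}

    def split_validation(data, split_ratio=0.7):
        cut = int(len(data) * split_ratio)
        train_set, test_set = data[:cut], data[cut:]

        # Left subprocess: a model is learned on the training partition only...
        inner_model = train(train_set)
        # ...right subprocess: that same model is applied to the test partition.
        errors = [abs(inner_model["prediction"] - label) for _, label in test_set]
        performance = sum(errors) / len(errors)

        # Model output port: the learner runs AGAIN on the complete data set,
        # so the delivered model differs from the one used during testing.
        final_model = train(data)
        return performance, final_model

    data = [(x, float(x % 3)) for x in range(10)]
    perf, model = split_validation(data)
    ```

    Running this, `model["trained_on"]` equals the full data-set size, not the size of the training split, which is exactly the difference Micha observed.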

  • Micha Member Posts: 3 Contributor I
    Hi Ingo,

    Thank you for the fast reply. You are absolutely right: when I connect the model output of the Split Validation operator, it computes the model for the whole data set (good to know ;) ). But I think this is a little bit counterintuitive, because in both cases (model output of the Split Validation connected or not) the model applied to the test data is the model learned on the training data. So when I connect the Split Validation model output, I see a different model than the actually applied model. I don't know if I am the only one who thinks this is counterintuitive. Maybe as a solution you could add another model output to the Split Validation for the actually applied model.

    Thanks again

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531   Unicorn
    Hi Micha,
    The reason why we didn't do this is that the applied model was only trained on a subset of the data. It is most probable that the model trained on all available training data will perform much better on new, unseen data, because it simply saw more of "the world". So you are strongly discouraged from using this model anyway.
    If you want to have it anyway, you can make use of the modular concept of RapidMiner and use a Remember / Recall operator pair to tunnel the objects out of the subprocess. Here's a small example:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="197" width="614">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="41" y="77">
            <parameter key="target_function" value="random classification"/>
          </operator>
          <operator activated="true" class="split_validation" expanded="true" height="112" name="Validation" width="90" x="246" y="75">
            <parameter key="split_ratio" value="0.1"/>
            <process expanded="true" height="444" width="389">
              <operator activated="true" breakpoints="before" class="decision_tree" expanded="true" height="76" name="Decision Tree" width="90" x="112" y="30"/>
              <operator activated="true" class="remember" expanded="true" height="60" name="Remember" width="90" x="246" y="30">
                <parameter key="name" value="model%{a}"/>
                <parameter key="io_object" value="Model"/>
              </operator>
              <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_op="Remember" to_port="store"/>
              <connect from_op="Remember" from_port="stored" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="444" width="389">
              <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="recall" expanded="true" height="60" name="Recall" width="90" x="447" y="30">
            <parameter key="name" value="model2"/>
            <parameter key="io_object" value="Model"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="model" to_port="result 2"/>
          <connect from_op="Recall" from_port="result" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Please note that we have to introduce another advanced RapidMiner technique: macro handling. We have used the predefined macro a, accessed by %{a}, which gives the apply count of the operator. So we are remembering each application of the models that are generated in the learning subprocess of the Split Validation. After the Split Validation operator has been executed (take a look at the execution order to be sure: menu Process / Operator Execution Order / Show...), we can recall the remembered objects by their name. Note that we have replaced the macro here with the constant 2, since the complete model will be trained in the second run. You will see this when reaching the breakpoint I set in the above process.
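
    The Remember / Recall mechanism and the %{a} apply-count macro can be sketched in plain Python (this is an illustration of the pattern, not RapidMiner's actual implementation; the class and names are invented): a process-wide object store keyed by name, where the name template expands the operator's apply count.

    ```python
    # Sketch of the Remember / Recall pattern with an apply-count macro.

    store = {}  # process-wide object store shared by Remember and Recall

    class Remember:
        """Stores its input under a name; '%{a}' expands to the apply count."""
        def __init__(self, name_template):
            self.name_template = name_template
            self.apply_count = 0

        def __call__(self, obj):
            self.apply_count += 1  # %{a}: how often this operator has run
            name = self.name_template.replace("%{a}", str(self.apply_count))
            store[name] = obj
            return obj             # pass the object through, like the port

    def recall(name):
        return store[name]

    remember = Remember("model%{a}")
    remember("model trained on the 10% split")   # first run: stored as "model1"
    remember("model trained on the full set")    # second run: stored as "model2"

    full_model = recall("model2")  # the model from the second (complete) run
    ```

    This is why the Recall operator in the process above uses the constant name "model2": the learning subprocess runs twice, and the second run is the one on the complete data set.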

  • Micha Member Posts: 3 Contributor I
    Hi Sebastian,

    Thank you for showing me the Remember / Recall functionality - works fine. The reasoning why you provide the model trained on the whole data set instead of the model actually applied to the test data is clear. But I still think it's counterintuitive. In our special use case the two models (whole data set and training data) lead to completely different results (this was due to using very sparse data, which is of course a problem in itself), and I didn't know that the operator applies a different model than the one which is delivered at the output. So I couldn't explain the result with the given model - which confused me  :). Maybe you should provide, as mentioned earlier, two outputs for both models (they are computed anyway).


  • xkuba Member Posts: 3 Contributor I

    I'm a RapidMiner newbie. I like the program very much, it's really amazing!

    I just struggled with the same problem as Micha for several hours. Finally, I found this thread, which made it clear to me what the model returned from Split Validation is - the training part is run a second time with the whole data set. I understand the reasoning for recalculating the model, but I find it counterintuitive as well.

    I'd suggest either adding the original model used for training to the output of Split Validation, as suggested before, or at least adding one sentence describing the behaviour to the documentation. It could save some time for another newbie...

    Otherwise RapidMiner rocks! :-)

  • dragoljub Member Posts: 241  Maven
    Hi Everyone,

    Let me make sure I understand this. The operator called 'Split Validation' splits the data, trains a model on a subset of the data on the left side, then applies this "same" model trained on a subset of data to classify the unseen data on the right? Now the confusion comes in when you use the "model output" of the 'Split Validation' operator, which will produce a more general model based on all training data. I guess this makes sense from the perspective that we want to estimate model generality from our training data but use all training data to train the best model, which we will actually deploy.

    Thanks for the clarification,  ;D
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531   Unicorn
    Yes, a short sentence in the operator documentation could help a lot. Unfortunately it's currently quite difficult to just add such a sentence, but a solution is near:
    We are going to set up a wiki containing all the operator documentation. Then you could just drop a sentence there if you feel it is needed, and each RapidMiner user can see it in the help window if he is online.
