Extract example set from Crossvalidation operator?

Fred12 Member Posts: 344 Unicorn
edited November 2018 in Help



I want to extract the training and test example sets separately from the inner process of the Cross-Validation operator. Is that somehow possible? At the moment I only get performance vectors as output.


  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I don't believe this is possible. But I'm also not sure what you intend by this request because, by definition, k-fold cross-validation puts every example into a test set exactly once and into the training set in the other k-1 iterations.


    As you already know, the model produced by cross-validation is based on the entire dataset.  The cross-validation procedure is simply designed to estimate how the model might perform on unseen data in a more statistically robust way than the older approach of a static two-way split into a training versus testing set.  So why would you need to extract the specific example sets used in cross-validation?  The entire dataset is ultimately used both for training and testing in cross-validation.


    If you really need to do this, then I think you are going to have to set up a kind of manual cross-validation by creating static segments and then building the model and running the test statistics on each segment separately using loops.  But it seems like a lot of effort to build manually what cross-validation already does automatically.  
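The manual approach described above can be sketched outside RapidMiner. This is only an illustration under stated assumptions: the function name `manual_k_fold`, the toy `examples` list, and the choice of k=5 are all hypothetical, and the model-building step is left as a comment. It shows static segments being formed once and each fold's training and test sets being kept for later inspection.

```python
import random


def manual_k_fold(examples, k=10, seed=42):
    """Split examples into k static segments and return one (train, test) pair per fold."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    # Segment i takes every k-th shuffled index, so segment sizes differ by at most one.
    segments = [indices[i::k] for i in range(k)]
    folds = []
    for test_idx in segments:
        held_out = set(test_idx)
        train = [examples[j] for j in indices if j not in held_out]
        test = [examples[j] for j in test_idx]
        folds.append((train, test))
    return folds


# Hypothetical toy data standing in for an example set.
examples = [{"id": n, "value": n * 0.5} for n in range(25)]
folds = manual_k_fold(examples, k=5)

for fold_no, (train, test) in enumerate(folds, start=1):
    # A model would be trained on `train` and evaluated on `test` here;
    # both sets remain available afterwards because they are stored in `folds`.
    print(fold_no, len(train), len(test))
```

Because the segments are fixed up front, every example lands in exactly one test set and in k-1 training sets, which matches the definition of k-fold cross-validation given earlier in this reply.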

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    You can add a Store operator on both the training side and the testing side, and then some macro logic.

    I'm not sure why you would want to store it, but hopefully the attached example will give you some ideas.

  • earmijo Member Posts: 270 Unicorn

    May I ask why you want to do that?


    1) Because you want to use the predictions


    Then you can use the operator X-Prediction. 


    2) Because you want to do something else


    One possibility here is to define the k different samples yourself outside RapidMiner and then define a batch variable. After that you can use Batch-X-Validation.
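As a rough sketch of the "define the samples outside RapidMiner" step, the snippet below assigns each row a fold number and writes the result as CSV text. The column name `batch`, the helper `add_batch_column`, and the toy rows are hypothetical; how the column is then declared as the batch variable inside RapidMiner is not shown here.

```python
import csv
import io
import random


def add_batch_column(rows, k=10, seed=1, column="batch"):
    """Return copies of the rows with a fold number 0..k-1 in `column`."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    labelled = [dict(row) for row in rows]
    for position, row_index in enumerate(order):
        # Dealing shuffled rows out round-robin keeps the batches balanced.
        labelled[row_index][column] = position % k
    return labelled


# Hypothetical toy rows standing in for the real data set.
rows = [{"id": n, "x": n * 2, "label": n % 2} for n in range(12)]
batched = add_batch_column(rows, k=3)

# Written to an in-memory buffer here; a real run would write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(batched[0].keys()))
writer.writeheader()
writer.writerows(batched)
csv_text = buffer.getvalue()
print(csv_text)
```

Since the fold membership is now an explicit column, the training and test rows of any fold can be recovered at any time by filtering on that column.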

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,453 RM Data Scientist



    A few remarks:

    RapidMiner has two operators for this: X-Validation, which returns a performance, and X-Prediction, which returns a scored sample. Sadly, there is no built-in operator that does both at once. I use the attached building block for this.


    Why shouldn't I do this? To be honest, it is quite dangerous. People tend to look at the scored data set and build new variables that fix issues with single examples. This is overtraining by hand; it should be treated with care or, better, avoided.


    Why should I do this? I personally use it in regression problems to get a scatter plot of true versus predicted values. In this plot you can see overall biases, biases in particular regions, nonlinearities, and so on. I think this is very useful.
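To illustrate the idea of getting both a performance and a scored sample from one cross-validation, and then looking for regional bias, here is a small sketch outside RapidMiner. Everything in it is an assumption made for illustration: the helper names `fit_line` and `cross_val_predict`, the deliberately nonlinear toy target, and the three x regions. It prints mean residuals per region instead of drawing the scatter plot.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cross_val_predict(xs, ys, k=5, seed=0):
    """Return out-of-fold predictions for every example plus the mean fold RMSE."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    predictions = [None] * len(xs)
    fold_rmse = []
    for i in range(k):
        test_idx = order[i::k]
        held_out = set(test_idx)
        train_idx = [j for j in order if j not in held_out]
        slope, intercept = fit_line([xs[j] for j in train_idx],
                                    [ys[j] for j in train_idx])
        squared_errors = []
        for j in test_idx:
            # Each example is scored by a model that never saw it in training.
            predictions[j] = slope * xs[j] + intercept
            squared_errors.append((ys[j] - predictions[j]) ** 2)
        fold_rmse.append(mean(squared_errors) ** 0.5)
    return predictions, mean(fold_rmse)


# Hypothetical data: a quadratic target fitted with a straight line.
xs = [i / 10 for i in range(100)]
ys = [x * x for x in xs]
predictions, rmse = cross_val_predict(xs, ys)

# Mean residual (true minus predicted) in three regions of x.
region_bias = []
for low, high in [(0, 3.3), (3.3, 6.6), (6.6, 10)]:
    residuals = [y - p for x, y, p in zip(xs, ys, predictions) if low <= x < high]
    region_bias.append(mean(residuals))
    print(f"x in [{low}, {high}): mean residual = {region_bias[-1]:+.2f}")
```

A sign change in the mean residual between regions is the numeric counterpart of the curved pattern one would see in a true-versus-predicted scatter plot when the model misses a nonlinearity.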




    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
      <operator activated="true" class="process" compatibility="7.2.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.2.001" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="7.2.001" expanded="true" height="124" name="X-Val with X-Pred" width="90" x="313" y="30">
            <process expanded="true">
              <operator activated="true" class="x_validation" compatibility="7.2.001" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
                <parameter key="sampling_type" value="shuffled sampling"/>
                <process expanded="true">
                  <operator activated="true" class="linear_regression" compatibility="7.2.001" expanded="true" height="94" name="Linear Regression" width="90" x="45" y="30"/>
                  <connect from_port="training" to_op="Linear Regression" to_port="training set"/>
                  <connect from_op="Linear Regression" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" compatibility="7.2.001" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
                  <operator activated="true" class="handle_exception" compatibility="7.2.001" expanded="true" height="76" name="Handle Exception" width="90" x="179" y="165">
                    <process expanded="true">
                      <operator activated="true" class="recall" compatibility="7.2.001" expanded="true" height="60" name="Recall" width="90" x="112" y="30">
                        <parameter key="name" value="labeledData"/>
                      </operator>
                      <operator activated="true" class="append" compatibility="7.2.001" expanded="true" height="94" name="Append" width="90" x="246" y="120"/>
                      <connect from_port="in 1" to_op="Append" to_port="example set 2"/>
                      <connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 1"/>
                      <connect from_op="Append" from_port="merged set" to_port="out 1"/>
                      <portSpacing port="source_in 1" spacing="72"/>
                      <portSpacing port="source_in 2" spacing="0"/>
                      <portSpacing port="sink_out 1" spacing="0"/>
                      <portSpacing port="sink_out 2" spacing="0"/>
                    </process>
                    <process expanded="true">
                      <connect from_port="in 1" to_port="out 1"/>
                      <portSpacing port="source_in 1" spacing="0"/>
                      <portSpacing port="source_in 2" spacing="0"/>
                      <portSpacing port="sink_out 1" spacing="0"/>
                      <portSpacing port="sink_out 2" spacing="0"/>
                    </process>
                  </operator>
                  <operator activated="true" class="remember" compatibility="7.2.001" expanded="true" height="60" name="Remember" width="90" x="313" y="165">
                    <parameter key="name" value="labeledData"/>
                  </operator>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <connect from_op="Performance" from_port="example set" to_op="Handle Exception" to_port="in 1"/>
                  <connect from_op="Handle Exception" from_port="out 1" to_op="Remember" to_port="store"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
                <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a linear regression model.</description>
              </operator>
              <operator activated="true" class="recall" compatibility="7.2.001" expanded="true" height="60" name="Recall (2)" width="90" x="179" y="120">
                <parameter key="name" value="labeledData"/>
              </operator>
              <connect from_port="in 1" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="model" to_port="out 1"/>
              <connect from_op="Validation" from_port="averagable 1" to_port="out 2"/>
              <connect from_op="Recall (2)" from_port="result" to_port="out 3"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
              <portSpacing port="sink_out 3" spacing="0"/>
              <portSpacing port="sink_out 4" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Sonar" from_port="output" to_op="X-Val with X-Pred" to_port="in 1"/>
          <connect from_op="X-Val with X-Pred" from_port="out 2" to_port="result 1"/>
          <connect from_op="X-Val with X-Pred" from_port="out 3" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>


    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany