"How do I split the data into training, validation and testing subsets?"

Curious Member Posts: 12 Newbie
edited June 12 in Help
How do I split the data into training, validation and testing subsets? (Not just training and testing)
Answers

  • varunm1 Moderator, Member Posts: 817   Unicorn
    Hi @Curious

    As @mschmitz mentioned, you can split the data using the Split Data operator. You provide one ratio per partition, for example 0.7 for training, 0.1 for validation and 0.2 for testing. See the sample process below. The order in which you enter the ratios also defines the order of the outputs.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="9.1.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="85">
            <parameter key="target_function" value="random"/>
            <parameter key="number_examples" value="100"/>
            <parameter key="number_of_attributes" value="5"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="9.1.000" expanded="true" height="124" name="Split Data" width="90" x="246" y="136">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.1"/>
              <parameter key="ratio" value="0.2"/>
            </enumeration>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_port="result 1"/>
          <connect from_op="Split Data" from_port="partition 2" to_port="result 2"/>
          <connect from_op="Split Data" from_port="partition 3" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
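
For readers who want to see the idea outside of RapidMiner, here is a minimal Python sketch of a ratio-based three-way split. It is not how the Split Data operator is implemented; the function name `split_by_ratios` is hypothetical, and the seed simply reuses the 1992 value from the process above for illustration. It shows that the order of the ratios determines the order of the returned partitions.

```python
import random


def split_by_ratios(rows, ratios, seed=1992):
    """Shuffle rows and cut them into consecutive partitions, one per ratio."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local seed keeps the split repeatable
    partitions, start, cumulative = [], 0, 0.0
    for ratio in ratios:
        cumulative += ratio
        end = round(cumulative * len(shuffled))  # boundary of this partition
        partitions.append(shuffled[start:end])
        start = end
    return partitions


rows = list(range(100))  # stand-in for 100 examples
# Partition order follows ratio order: training, validation, testing.
train, validation, test = split_by_ratios(rows, [0.7, 0.1, 0.2])
print(len(train), len(validation), len(test))
```

With 100 examples this yields partitions of 70, 10 and 20 rows, and every row appears in exactly one partition.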
    

    Thanks,
    Varun
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,247   Unicorn
    But based on modern data science best practice, you should be using Cross Validation and not Split Validation. There are very few cases left (if any) where a simple split validation is better than cross validation for estimating future model performance. Split validation provides a single point estimate based on one test sample only. Cross validation uses all of the data for both training and testing.
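
To make the contrast concrete, here is a minimal Python sketch of how k-fold cross validation partitions the data. It is an illustration only, not RapidMiner's Cross Validation operator; the helper name `k_fold_indices` is hypothetical, and the folds are taken in order without shuffling to keep the example short. The point is that every example is used for testing exactly once and for training in all other folds.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; each index is in exactly one test fold."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Collect every index that was ever used for testing.
tested = sorted(i for _, test_idx in folds for i in test_idx)
print(tested == list(range(10)))
```

A single split would correspond to keeping only one of these pairs, which is why it gives just one point estimate.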
    Brian T.
  • varunm1 Moderator, Member Posts: 817   Unicorn
    edited January 24
    Hi @Telcontar120

    I just want to clarify: is there any use for a separate validation set when we apply cross validation? I get this question a lot in deep learning, where I skip the validation set during training because I apply cross validation most of the time. The main purpose of a validation set is to avoid overfitting during training, but I think cross validation reduces overfitting as well.

    Thanks,
    Varun
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,661  RM Founder
    edited January 24
    Hi,
    If a cross validation is the outermost step and includes all preprocessing and modeling, then an additional validation set is indeed not necessary. However, this is not always feasible, most often for runtime reasons, and sometimes because the complexity of the processes gets a bit out of control.
    In those cases, I would still keep some fraction of the original data (set aside before I do anything to it!) as a validation set, to make sure that I did not accidentally leak any information as part of my data processing.
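
As a rough illustration of that advice, the following Python sketch sets a holdout sample aside first and only then fits a preprocessing step on the remaining data. Everything here is an assumption made for the example: the function names are hypothetical, the data is a toy list of numbers, and z-score normalization merely stands in for whatever preprocessing a real process would do.

```python
import random
import statistics


def hold_out_first(values, holdout_ratio=0.2, seed=2001):
    """Set aside a holdout sample before any preprocessing touches the data."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_ratio))
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(values):
    """Learn normalization statistics from the working data only."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Apply previously learned statistics without refitting them."""
    return [(v - mean) / std for v in values]


raw = [float(v) for v in range(50)]       # toy stand-in for one attribute
working, holdout = hold_out_first(raw)    # split BEFORE any processing
mean, std = fit_scaler(working)           # the holdout never influences these
working_scaled = apply_scaler(working, mean, std)
holdout_scaled = apply_scaler(holdout, mean, std)
print(len(working_scaled), len(holdout_scaled))
```

Because the statistics come from the working data alone, a model evaluated on `holdout_scaled` has not seen any information derived from the holdout rows.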
    Hope this helps,
    Ingo
  • Curious Member Posts: 12 Newbie
    Thank you so much everyone!