Splitting data

frankie Member Posts: 26 Contributor II
edited November 2018 in Help
Hi,

I have what I consider a simple problem, but due to poor understanding or perhaps poor documentation I cannot figure out how to:
Split a dataset of, say, 1000 observations into two separate datasets of 700 and 300 observations respectively. That is, I need an operator that has one input and two outputs.

Is this done with the "Split Data" operator? If so, what are these "partitions" I need to define?
The split should be random, preferably with a predefined seed for reproducibility.


-frankie

Answers

  • earmijo Member Posts: 270 Unicorn
    Frankie:

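    Yes, you can do it easily in RM. Take a look at the process below. It uses the "Split Data" operator to split the Iris dataset into two partitions with ratios 0.7 and 0.3. You enter these ratios by clicking the "Edit Enumeration" button of the "partitions" parameter; each row you add is one partition and its ratio is the fraction of examples it receives. With your 1000 observations, the same ratios would give 700 and 300. You can create k partitions by adding k ratios.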

    If you select the option "use local random seed", the partitions will be the same in repeated runs.

    Hope this helps.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
        <process expanded="true" height="179" width="346">
          <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="74" y="62">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="5.1.001" expanded="true" height="94" name="Split Data" width="90" x="246" y="75">
            <!-- One ratio per partition: 70% and 30% of the examples -->
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <!-- Fixed seed so repeated runs produce the same split -->
            <parameter key="use_local_random_seed" value="true"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_port="result 1"/>
          <connect from_op="Split Data" from_port="partition 2" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
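
    The k-partition case mentioned above can be sketched as a variant of the "Split Data" operator with three ratios. This is only an illustration: the 0.6/0.2/0.2 ratios are arbitrary, and the `local_random_seed` key with the value 1992 is an assumption that does not appear in the process above, so check the operator's parameter panel in your RM version.

    ```xml
    <!-- Sketch only: replaces the "Split Data" operator in the process above -->
    <operator activated="true" class="split_data" compatibility="5.1.001" expanded="true" name="Split Data">
      <!-- Three partitions; the ratios should sum to 1 -->
      <enumeration key="partitions">
        <parameter key="ratio" value="0.6"/>
        <parameter key="ratio" value="0.2"/>
        <parameter key="ratio" value="0.2"/>
      </enumeration>
      <parameter key="use_local_random_seed" value="true"/>
      <!-- Assumed parameter key and value; pick any fixed integer -->
      <parameter key="local_random_seed" value="1992"/>
    </operator>
    ```

    With three ratios the operator should offer a third output port, which would need its own connection to a result port in the same way as "partition 1" and "partition 2".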
  • frankie Member Posts: 26 Contributor II
    Thank you!