Split data

oli · May 2013

Hi,

I have got another question, hopefully someone might be able to point me in the right direction.

I am using the knn process and looping through a lot of data based on the name.

I want to split my data but by different amounts depending on where I am through the data.

My decisions are done on a time basis, the top part of my data is the earliest observations and the bottom the later observations. I will try and show an simple example below.

Example Name 1 is in the data set 10 times. The first time it appears in the data set it will have no previous results so a KNN can not be done, so I would discard this example.

The second time the name appears I want to base the KNN on the example that has happened before, so the top 10% of the data for Name 1 would go into creating the model. Then the current example would go into apply model and I would discard the other 80% (as from this examples point of view it has not happened yet so it is information I would not have at the time).

The third time the name appears I would base the KNN on the two above examples, so the top 20% of data would go into creating the model. The the current example would go into apply model and I would discard 70%.

I want to carry on doing this as per the below table.

Make Model Apply Model Discard
4th 30% 10% 60%
5th 40% 10% 50%
6th 50% 10% 40%
7th 60% 10% 30%
8th 70% 10% 20%
9th 80% 10% 10%
10th 90% 10% 0%

I should also note that names might occur different times sometimes just once other times over 20.

I was hoping to use the split data function with a macro to split the data. I have the percentages in my data, but I am struggling to get the figures into my split data operator.

This is my current operation, I have tried to use macros but have taken them out as it did not work and replaced them with some random ratio.

Any help would be much appreciated.

Thanks,

Oli

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="">
    <process expanded="true">
      <operator activated="true" class="free_memory" compatibility="5.3.008" expanded="true" height="60" name="Free Memory" width="90" x="179" y="75"/>
      <operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve Names2" width="90" x="45" y="345">
        <parameter key="repository_entry" value="data/Names2"/>
      </operator>
      <operator activated="true" class="loop_values" compatibility="5.3.008" expanded="true" height="76" name="Loop Values" width="90" x="179" y="210">
        <parameter key="attribute" value="NAME"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve data aw (2)" width="90" x="45" y="435">
            <parameter key="repository_entry" value="data/data aw"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="5.3.008" expanded="true" height="76" name="Filter Examples (2)" width="90" x="45" y="255">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="NAME=%{loop_value}"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="5.3.008" expanded="true" height="60" name="Free Memory (2)" width="90" x="112" y="75"/>
          <operator activated="true" class="split_data" compatibility="5.3.008" expanded="true" height="94" name="Split Data" width="90" x="179" y="165">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.2"/>
              <parameter key="ratio" value="0.8"/>
            </enumeration>
            <parameter key="sampling_type" value="linear sampling"/>
          </operator>
          <operator activated="true" class="k_nn" compatibility="5.3.008" expanded="true" height="76" name="k-NN (2)" width="90" x="315" y="30"/>
          <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (2)" width="90" x="450" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance (2)" width="90" x="585" y="30"/>
          <operator activated="true" class="materialize_data" compatibility="5.3.008" expanded="true" height="76" name="Materialize Data" width="90" x="514" y="210"/>
          <connect from_op="Retrieve data aw (2)" from_port="output" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="k-NN (2)" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="k-NN (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Performance (2)" from_port="example set" to_op="Materialize Data" to_port="example set input"/>
          <connect from_op="Materialize Data" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="5.3.008" expanded="true" height="76" name="Append" width="90" x="380" y="165"/>
      <operator activated="true" class="write_csv" compatibility="5.3.008" expanded="true" height="76" name="Write CSV" width="90" x="447" y="300">
        <parameter key="csv_file" value="C:\Users\Oliver\Documents\Gambling\Dump\write test.CSV"/>
        <parameter key="column_separator" value=","/>
      </operator>
      <connect from_op="Retrieve Names2" from_port="output" to_op="Loop Values" to_port="example set"/>
      <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_op="Write CSV" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

oli · May 2013

Hi,

Just wondered if anyone had any suggestions on this, all help very much appreciated.

Thanks,

Oli

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Split data

Answers