Find the lowest value of an attribute in relation to the other

oneponep Member Posts: 20  Maven
edited November 2018 in Help

Hello.

I'm just getting started to learn Radpidminer and would like some help to my first project.

I've managed to import my data into RM and made a predictive process on some training data and can predict my unlabeled data based on the trained data.

Now, I've trained my model with the raw training data, but I had realized that I need to cleanse the data first. See I only want my model to be trained with the data with the lowest possible value of the column B.

 

So here is my setup;

Column A - Labeled data

Column B - Data where i only want the lowest value

Coumn C - Data

 

Example:

First row

Column A: 100

Column B: 20

Column C: 50

Second row

Column A: 100

Column B: 19

Column C: 50

 

Now, since A+C is matching I would only want my model to be trained with the second row, since the first row has a higher value of column B.

 

Is this achiveable in Rapidminer?

I do not hope that my explanation is too confusing :)

 

 

Best Answers

  • oneponep Posts: 20  Maven
    Solution Accepted

    I only want to maintain the lowest value if the other columns are the same. If I invent an example with a car, it might make more sense :).

    I have the following data from the car in my database;

    Speed, gear selection, fuel consumption.

     

    I've got alot of rows with data from the car, for example;

    100 km / h, 5th gear, 0.3 l / km

    100 km / h, 4th gear, 0.35 km / l

     

    Gear selection is my label. In this case, I keep only the row with the 5th gear because it gives me the lowest fuel consumption.

     

    But if I have a row with the following data;

    95 km / h, 4th gear, 0.35 km / l

    I would still keep this row, becuase now speed has changed.

  • mschmitzmschmitz Posts: 2,131  RM Data Scientist
    Solution Accepted

    Mathias,

     

    have a look at the attached process, i think this is what you want to do.

     

    Best,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="7.2.001" expanded="true" height="82" name="Subprocess" width="90" x="45" y="85">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="speed" value="100"/>
    <parameter key="gear" value="5"/>
    <parameter key="fuel consumption" value="0.3"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="136">
    <list key="attribute_values">
    <parameter key="speed" value="100"/>
    <parameter key="gear" value="4"/>
    <parameter key="fuel consumption" value="0.35"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.001" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="45" y="238">
    <list key="attribute_values">
    <parameter key="speed" value="95"/>
    <parameter key="gear" value="4"/>
    <parameter key="fuel consumption" value="0.35"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.2.001" expanded="true" height="124" name="Append" width="90" x="179" y="85"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="Append" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Get sample data</description>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.2.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="136"/>
    <operator activated="true" class="aggregate" compatibility="7.2.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="85">
    <list key="aggregation_attributes">
    <parameter key="fuel consumption" value="minimum"/>
    </list>
    <parameter key="group_by_attributes" value="speed"/>
    </operator>
    <operator activated="true" class="join" compatibility="7.2.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="minimum(fuel consumption)" value="fuel consumption"/>
    <parameter key="speed" value="speed"/>
    </list>
    </operator>
    <connect from_op="Subprocess" from_port="out 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
    <connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
    <connect from_op="Join" from_port="join" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120Telcontar120 Posts: 1,230   Unicorn
    Solution Accepted

    I think there may be an even easier solution here, using only one aggregate operator.  Just group by the two attributes that define a unique case and set the third attribute to capture the minimum value. Take a look at the attached sample process and dataset and see if it works.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="7.2.002" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
    <parameter key="csv_file" value="C:\Users\brian\Downloads\dedupe test.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Att_A.true.integer.attribute"/>
    <parameter key="1" value="Att_B.true.integer.attribute"/>
    <parameter key="2" value="Att_C.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
    <list key="aggregation_attributes">
    <parameter key="Att_C" value="minimum"/>
    <parameter key="Att_C" value="count"/>
    </list>
    <parameter key="group_by_attributes" value="Att_A|Att_B"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <connect from_op="Aggregate" from_port="original" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,131  RM Data Scientist

    Hi Mathias,

     

    welcome to the community!

     

    For sure it is doable, but i am not totally sure what you want do.

     

    For your given example, i guess you would first calculate the min per example (horizontal aggregation)

    so you would get

    id    min(A,B,C)

    1     20

    2     19

     

    And then take the min of the mins (vertical aggregation), which yields to take the 2?

     

    If yes, its a combination of: Generate Aggregation to get min(A,B,C), Aggregation to get min of min(A,B,C) and i think two joins.

     

    Best,

    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • oneponep Member Posts: 20  Maven

    Thank you for your help - just what I was looking for :)

Sign In or Register to comment.