Noob question? Finding maximum of subset of data

sgt Member Posts: 2 Contributor I
edited November 2018 in Help

Hi,

 

I have a data source (CSV) with TestRun,Time,Result.

TestRun identifies when a test was run.

Time is time since the start of the test.

Result is the measured value at that time.

 

I have a large number of unique TestRuns. I'd like to perform some calculations, but for the life of me I can't figure out how any of the loops etc. work. I've also tried the tutorials and the time series extension, and tried cutting and pasting XML from other answers in this forum. Nothing seems to work. (BTW, there is a smiley face in one of the XML examples which screws it up.)

 

The calculations I'd like to perform are:

What is the maximum value of Result for each TestRun? (From this result, I can find the earliest time at which the maximum occurs.)

 

I can think of a number of ways to do this with a programming language, but just can't get my head around doing it within RapidMiner.

 

Any help would be appreciated.

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,524 RM Data Scientist

    Hi,

     

    I think an Aggregate does the trick if it is only the maximum. For complex calculations with "group by" TestRun I would use Loop Values. I attach two processes to demonstrate the two ideas.

     

    Best,

    Martin

     

    Easy Aggregation

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.1.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
    <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.1.001" expanded="true" height="82" name="Aggregate" width="90" x="246" y="85">
    <list key="aggregation_attributes">
    <parameter key="Age" value="maximum"/>
    </list>
    <parameter key="group_by_attributes" value="Passenger Class"/>
    <description align="center" color="transparent" colored="false" width="126">Calculate max age per cabin class</description>
    </operator>
    <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Loop Values to generate average(cumulative sum) per class

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.1.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
    <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
    </operator>
    <operator activated="true" class="loop_values" compatibility="7.1.001" expanded="true" height="82" name="Loop Values" width="90" x="246" y="85">
    <parameter key="attribute" value="Passenger Class"/>
    <process expanded="true">
    <operator activated="true" class="filter_examples" compatibility="7.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="45" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Passenger Class.equals.%{loop_value}"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Filter for current class</description>
    </operator>
    <operator activated="true" class="sort" compatibility="7.1.001" expanded="true" height="82" name="Sort" width="90" x="179" y="34">
    <parameter key="attribute_name" value="Age"/>
    </operator>
    <operator activated="true" class="series:integrate_series" compatibility="5.3.000" expanded="true" height="82" name="Integrate" width="90" x="313" y="34">
    <parameter key="attribute_name" value="Age"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.1.001" expanded="true" height="82" name="Aggregate" width="90" x="514" y="34">
    <list key="aggregation_attributes">
    <parameter key="cumulative(Age)" value="average"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.1.001" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="34">
    <list key="function_descriptions">
    <parameter key="Passenger Class" value="%{loop_value}"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Generate an attribute indicating the current passenger class</description>
    </operator>
    <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_op="Integrate" to_port="example set input"/>
    <connect from_op="Integrate" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_example set" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="173" resized="true" width="291" x="163" y="11">Generate a column with the *hidden* sum</description>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Loop over Passenger Class and get each value once as a macro</description>
    </operator>
    <operator activated="true" class="append" compatibility="7.1.001" expanded="true" height="82" name="Append" width="90" x="380" y="85">
    <description align="center" color="transparent" colored="false" width="126">Append the individual results</description>
    </operator>
    <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Loop Values" to_port="example set"/>
    <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
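
For readers who think in code rather than operators, the Easy Aggregation process above corresponds roughly to a group-by maximum. The following is only a minimal Python sketch of that idea, not how RapidMiner implements Aggregate; the helper `max_per_group` and the sample rows are made up for illustration, using the TestRun/Time/Result columns from the question.

```python
def max_per_group(rows, group_key, value_key):
    """Return {group: maximum value}, similar in spirit to Aggregate with a group-by attribute."""
    best = {}
    for row in rows:
        group, value = row[group_key], row[value_key]
        # Keep the largest value seen so far for this group.
        if group not in best or value > best[group]:
            best[group] = value
    return best


# Hypothetical sample data (not from the original post).
rows = [
    {"TestRun": "A", "Time": 0.0, "Result": 1.5},
    {"TestRun": "A", "Time": 1.0, "Result": 3.0},
    {"TestRun": "B", "Time": 0.0, "Result": 2.0},
    {"TestRun": "B", "Time": 1.0, "Result": 0.5},
]

maxima = max_per_group(rows, "TestRun", "Result")
print(maxima)
```

In the process itself, the same roles are played by the `aggregation_attributes` entry (value column and function) and the `group_by_attributes` parameter.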

     

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • sgt Member Posts: 2 Contributor I

    Thanks for the fast response Martin. Aggregate is what I was looking for - it is much faster than using the loop function, and I'm starting to see how the loop function works.

     

    I'm guessing that my next problem to solve will require the loop function - I have now got a maximum value, and I want to find the corresponding time where the maximum occurred. If I have several maximum events, I only want to see the first one.

     

    So, I think I need to multiply my original example set so I can keep the data through the aggregation, then join that example set with the example set containing the aggregated maximums, then loop by TestRun, filtering by TestRun=%{loop_value} and Result=maximum(Result). That should give me an example set that includes the maximum, and the time of the maximum, for each TestRun.

     

    Or am I overthinking it? It sounds way too complicated for what I'm trying to do...

     

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,524 RM Data Scientist

    I think you are overthinking it a bit. Or rather, you are thinking too much in a programming world and not in an ETL/SQL world.

     

    Why don't you join the maximum per TestRun onto the original example set and take the first match (with Filter Examples to remove the missing values, and Remove Duplicates to keep the first)? That should be way faster and easier to build.
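
The join-then-take-first idea can be sketched in plain Python to make the data flow explicit. This is only an illustration under assumptions: the function name `first_time_of_maximum` and the sample rows are hypothetical, and the explicit sort by Time is an added assumption so that "first" means "earliest" (if the data is already ordered by Time, that step changes nothing).

```python
def first_time_of_maximum(rows):
    """Return {TestRun: row} holding the earliest row that reaches the run's maximum Result."""
    # Step 1 (Aggregate): maximum Result per TestRun.
    maxima = {}
    for row in rows:
        run = row["TestRun"]
        if run not in maxima or row["Result"] > maxima[run]:
            maxima[run] = row["Result"]

    # Step 2 (Join + Filter Examples): keep only rows whose Result equals
    # the maximum of their TestRun; all other rows have no join partner.
    matches = [r for r in rows if r["Result"] == maxima[r["TestRun"]]]

    # Step 3 (assumed Sort, then Remove Duplicates): order by Time and keep
    # the first remaining row per TestRun.
    matches.sort(key=lambda r: r["Time"])
    first = {}
    for r in matches:
        first.setdefault(r["TestRun"], r)
    return first


# Hypothetical sample data; run "A" reaches its maximum twice.
rows = [
    {"TestRun": "A", "Time": 4.0, "Result": 3.0},
    {"TestRun": "A", "Time": 2.0, "Result": 3.0},
    {"TestRun": "A", "Time": 0.0, "Result": 1.0},
    {"TestRun": "B", "Time": 1.0, "Result": 2.0},
    {"TestRun": "B", "Time": 3.0, "Result": 0.5},
]

result = first_time_of_maximum(rows)
for run, row in sorted(result.items()):
    print(run, row["Time"], row["Result"])
```

No loop over TestRun values is needed: the join does the per-run matching for all runs at once, which is why this tends to be simpler than a Loop Values process.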

     

    ~Martin
