Noob question? Finding maximum of subset of data

sgt · June 2016

Hi,

I have a data source (CSV) with the columns TestRun, Time, Result.

TestRun identifies when a test was run.

Time is time since the start of the test.

Result is the measured value at that time.

I have a large number of unique TestRuns. I'd like to perform some calculations, but for the life of me I can't figure out how any of the loops etc. work. I've tried the tutorials and the time series extension, and I've tried cutting and pasting XML from other answers in this forum. Nothing seems to work. (BTW, there is a smiley face in one of the XML examples which screws it up.)

The calculations I'd like to perform are:

What is the maximum value of Result for the TestRun? (From this result, I can find the earliest time for the maximum result)

I can think of a number of ways to do this with a programming language, but just can't get my head around doing it within RapidMiner.

Any help would be appreciated.

MartinLiebig · June 2016

Hi,

I think an Aggregate does the trick if it is only the maximum. For complex calculations with "group by" TestRun I would use Loop Values. I attach two processes to demonstrate the two ideas.

Best,

Martin

Easy Aggregation

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.1.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.1.001" expanded="true" height="82" name="Aggregate" width="90" x="246" y="85">
        <list key="aggregation_attributes">
          <parameter key="Age" value="maximum"/>
        </list>
        <parameter key="group_by_attributes" value="Passenger Class"/>
        <description align="center" color="transparent" colored="false" width="126">Calculate max age per cabin class</description>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
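
For readers who think in code first, the grouping that Aggregate performs can be sketched in plain Python. This is only an illustration of the logic, not a RapidMiner process: the sample data and the function name `max_result_per_run` are made up for the example, and the column names follow the TestRun, Time, Result layout from the question.

```python
import csv
import io

# Hypothetical sample data in the TestRun,Time,Result layout from the question.
SAMPLE_CSV = """TestRun,Time,Result
A,0,1.0
A,1,3.5
A,2,3.5
B,0,2.0
B,1,1.5
"""


def max_result_per_run(rows):
    """Return {TestRun: maximum Result}, i.e. a group-by with a maximum aggregation."""
    maxima = {}
    for row in rows:
        run = row["TestRun"]
        result = float(row["Result"])
        # Keep the largest Result seen so far for this TestRun.
        if run not in maxima or result > maxima[run]:
            maxima[run] = result
    return maxima


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
maxima = max_result_per_run(rows)
print(maxima)  # {'A': 3.5, 'B': 2.0}
```

In the process above, "group by" plays the role of the dictionary key and "maximum" the role of the comparison.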

Loop Values to generate average(cumulative sum) per class

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.1.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="loop_values" compatibility="7.1.001" expanded="true" height="82" name="Loop Values" width="90" x="246" y="85">
        <parameter key="attribute" value="Passenger Class"/>
        <process expanded="true">
          <operator activated="true" class="filter_examples" compatibility="7.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="45" y="34">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Passenger Class.equals.%{loop_value}"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Filter for current class</description>
          </operator>
          <operator activated="true" class="sort" compatibility="7.1.001" expanded="true" height="82" name="Sort" width="90" x="179" y="34">
            <parameter key="attribute_name" value="Age"/>
          </operator>
          <operator activated="true" class="series:integrate_series" compatibility="5.3.000" expanded="true" height="82" name="Integrate" width="90" x="313" y="34">
            <parameter key="attribute_name" value="Age"/>
          </operator>
          <operator activated="true" class="aggregate" compatibility="7.1.001" expanded="true" height="82" name="Aggregate" width="90" x="514" y="34">
            <list key="aggregation_attributes">
              <parameter key="cumulative(Age)" value="average"/>
            </list>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="7.1.001" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="34">
            <list key="function_descriptions">
              <parameter key="Passenger Class" value="%{loop_value}"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Generate an attribute indicating the current passenger class</description>
          </operator>
          <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Sort" to_port="example set input"/>
          <connect from_op="Sort" from_port="example set output" to_op="Integrate" to_port="example set input"/>
          <connect from_op="Integrate" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <description align="center" color="yellow" colored="false" height="173" resized="true" width="291" x="163" y="11">Generate a column with the *hidden* sum</description>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Loop over Passenger Class and get each value once as a macro</description>
      </operator>
      <operator activated="true" class="append" compatibility="7.1.001" expanded="true" height="82" name="Append" width="90" x="380" y="85">
        <description align="center" color="transparent" colored="false" width="126">Append the individual results</description>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Loop Values" to_port="example set"/>
      <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
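
The Loop Values process follows a filter, sort, cumulative sum, average, label pattern for each distinct group value, then appends the per-group results. A rough Python sketch of that pattern is below. It is an analogy only: the data, the function name `average_cumulative_sum` and the output keys are hypothetical, and the groups are iterated in sorted order purely to make the output deterministic.

```python
from itertools import accumulate
from statistics import mean

# Hypothetical (group, value) pairs standing in for (Passenger Class, Age).
records = [
    ("First", 30.0),
    ("First", 10.0),
    ("Second", 20.0),
    ("Second", 40.0),
    ("First", 20.0),
]


def average_cumulative_sum(records):
    """For each group: filter, sort, build the running sum, then average it."""
    results = []
    # Visit each distinct group value once, like the loop macro does.
    for group in sorted({g for g, _ in records}):
        values = sorted(v for g, v in records if g == group)  # filter + sort
        running = list(accumulate(values))                    # cumulative sum
        # One summary row per group, labelled with the group value.
        results.append({"group": group, "average_cumulative": mean(running)})
    return results  # appending the per-group rows gives the final table


summary = average_cumulative_sum(records)
for row in summary:
    print(row["group"], row["average_cumulative"])
```

For a plain per-group maximum this loop is unnecessary; it only pays off when the per-group calculation has several steps.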

sgt · June 2016

Thanks for the fast response, Martin. Aggregate is what I was looking for. It is much faster than using the loop function, and I'm starting to see how the loop function works.

I'm guessing that my next problem will require the loop function. I now have a maximum value, and I want to find the corresponding time at which the maximum occurred. If I have several maximum events, I only want to see the first one.

So, I think I need to multiply my original example set so I can keep the data through the aggregation, then join that copy with the example set containing the aggregated maximums, then loop by TestRun, filtering on testrun=%{loop_value} and Result=maximum(result). That should give me an example set that includes the maximum, and the time of the maximum, for each TestRun.

Or am I overthinking it? It sounds way too complicated for what I'm trying to do...

MartinLiebig · June 2016

I think you are overthinking it a bit. Or rather, you are thinking too much in a programming world and not in an ETL/SQL world.

Why don't you join the maximum per TestRun onto the original data and take the first match (use Filter Examples to remove the missing values, and Remove Duplicates to keep the first)? That should be much faster and easier to build.

~Martin
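
The join-then-deduplicate idea can be sketched in plain Python as follows. This is an illustration of the logic under stated assumptions, not a RapidMiner process: the sample rows and the function name `first_time_of_max` are hypothetical, and the explicit sort by Time is an added assumption so that "first" means "earliest" even if the rows are not already in time order.

```python
# Hypothetical rows in the TestRun, Time, Result layout from the question.
rows = [
    {"TestRun": "A", "Time": 0.0, "Result": 1.0},
    {"TestRun": "A", "Time": 2.0, "Result": 3.5},
    {"TestRun": "A", "Time": 1.0, "Result": 3.5},
    {"TestRun": "B", "Time": 0.0, "Result": 2.0},
    {"TestRun": "B", "Time": 1.0, "Result": 1.5},
]


def first_time_of_max(rows):
    """Return one row per TestRun: the earliest row that reaches the run's maximum."""
    # Step 1 (aggregate): maximum Result per TestRun.
    maxima = {}
    for r in rows:
        run = r["TestRun"]
        maxima[run] = max(r["Result"], maxima.get(run, r["Result"]))

    # Step 2 (join + filter): keep only rows whose Result equals their run's maximum.
    hits = [r for r in rows if r["Result"] == maxima[r["TestRun"]]]

    # Step 3 (assumed sort): order by TestRun, then Time, so the earliest hit is first.
    hits.sort(key=lambda r: (r["TestRun"], r["Time"]))

    # Step 4 (remove duplicates): keep the first remaining row per TestRun.
    first = {}
    for r in hits:
        first.setdefault(r["TestRun"], r)
    return list(first.values())


winners = first_time_of_max(rows)
for r in winners:
    print(r["TestRun"], r["Time"], r["Result"])
# A 1.0 3.5
# B 0.0 2.0
```

No loop over TestRun values is needed: the aggregate and the join handle every run in one pass each.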
