Median calculation problem in Aggregate function

luqieluqie Member Posts: 2 Contributor I
edited November 2018 in Help
Hi guys,

I'm using RM versions 5.3 and 6 and trying to compute a median for my data (aggregated by an attribute value).

I noticed that the median calculation is not correct. Neither version seems to take the average of the two middle values when the number of values is even.

As an example, use the following data and calculate the median of DOM (aggregated by DATE):

DOM  DATE
33 537
47 537
49 537
57 537
79 537
91 537
102 537
123 537
133 537
134 537
149 537
155 537
186 537
238 537

The correct answer should be 112.5 (the average of the two middle values, 102 and 123).
RM gives the median as 102.
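
For reference, here is a minimal sketch of the calculation I expected, written in Python purely for illustration (it is not RapidMiner code, and the function name `median` is my own). It sorts the values and averages the two middle ones when the count is even. Python's standard `statistics.median_low` is included only to show that picking the lower middle value reproduces the 102 that RM reports; whether RM does this internally is an assumption.

```python
import statistics

# DOM values for DATE = 537, copied from the table above.
dom = [33, 47, 49, 57, 79, 91, 102, 123, 133, 134, 149, 155, 186, 238]


def median(values):
    """Conventional median: mean of the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


expected = median(dom)                      # 112.5
lower_middle = statistics.median_low(dom)   # 102, the value RM reports
print(expected, lower_middle)
```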

Thanks!

Answers

  • awchisholmawchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Good find.

    Interestingly, if you sort the examples, the answer changes, as the attached process shows:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="6.0.008" expanded="true" height="60" name="Generate Data" width="90" x="112" y="75">
            <parameter key="number_examples" value="10"/>
            <parameter key="attributes_lower_bound" value="-1.0"/>
            <parameter key="attributes_upper_bound" value="5.0"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="6.0.008" expanded="true" height="76" name="Generate Attributes" width="90" x="112" y="165">
            <list key="function_descriptions">
              <parameter key="att1" value="round(att1)"/>
              <parameter key="constant" value="1"/>
            </list>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="6.0.008" expanded="true" height="76" name="Select Attributes" width="90" x="112" y="255">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="constant|att1"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="sort" compatibility="6.0.008" expanded="true" height="76" name="Sort" width="90" x="313" y="75">
            <parameter key="attribute_name" value="att1"/>
            <parameter key="sorting_direction" value="decreasing"/>
          </operator>
          <operator activated="true" class="aggregate" compatibility="6.0.008" expanded="true" height="76" name="Aggregate" width="90" x="313" y="165">
            <list key="aggregation_attributes">
              <parameter key="att1" value="median"/>
            </list>
            <parameter key="group_by_attributes" value="constant"/>
          </operator>
          <operator activated="true" class="sort" compatibility="6.0.008" expanded="true" height="76" name="Sort (2)" width="90" x="313" y="255">
            <parameter key="attribute_name" value="att1"/>
          </operator>
          <operator activated="true" class="aggregate" compatibility="6.0.008" expanded="true" height="76" name="Aggregate (2)" width="90" x="313" y="345">
            <list key="aggregation_attributes">
              <parameter key="att1" value="median"/>
            </list>
            <parameter key="group_by_attributes" value="constant"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Sort" to_port="example set input"/>
          <connect from_op="Sort" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
          <connect from_op="Aggregate" from_port="original" to_op="Sort (2)" to_port="example set input"/>
          <connect from_op="Sort (2)" from_port="example set output" to_op="Aggregate (2)" to_port="example set input"/>
          <connect from_op="Aggregate (2)" from_port="example set output" to_port="result 2"/>
          <connect from_op="Aggregate (2)" from_port="original" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
    I suppose you could manually calculate the median you're after by averaging the two values (one from each sort order). It's a bit ugly, but it would work.

    regards

    Andrew
  • luqieluqie Member Posts: 2 Contributor I
    Thanks for the workaround, Andrew. Ugly, but it works. It gets a bit unwieldy, though, if I have lots of values to aggregate in the same column (I only have one value to aggregate by in the above example). Any suggestions for that?

    Also, will other functions and charts that rely on the engine's median function be affected (e.g. k-medoids, box plots)?

    Thanks!
  • awchisholmawchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    I imagine it would turn into a complicated process with multiple aggregation groups. It's slightly more gymnastics than I have time for at the moment, but at a high level I would use Loop Values over the aggregation groups: for each value, filter the examples for that value, do the ugly sorting thing, and then store the result somewhere.

    I don't know what would happen elsewhere regarding median calculations - we have to wait for one of those nice developers to say whilst we remain vigilant.

    regards

    Andrew
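
The per-group workaround discussed in this thread can be sketched outside RapidMiner as follows. This is an illustrative Python sketch, not a RapidMiner process: the function name `grouped_median` and the second group (DATE 538) are hypothetical, and it assumes that the two sort orders yield the lower and upper middle values, which the thread suggests but does not confirm. For each group it takes both middle values and averages them, which is the same as the conventional median.

```python
from collections import defaultdict
import statistics


def grouped_median(rows):
    """Return {group: median} by averaging the lower and upper middle values per group."""
    groups = defaultdict(list)
    for value, group in rows:
        groups[group].append(value)

    result = {}
    for group, values in groups.items():
        low = statistics.median_low(values)    # lower middle value (102 in the thread's example)
        high = statistics.median_high(values)  # upper middle value (123 in the thread's example)
        result[group] = (low + high) / 2       # equal to low when the count is odd
    return result


# DOM values for DATE = 537, copied from the original post.
dom_537 = [33, 47, 49, 57, 79, 91, 102, 123, 133, 134, 149, 155, 186, 238]
rows = [(value, 537) for value in dom_537]

# Hypothetical second group, added only to show more than one aggregation group.
rows += [(10, 538), (20, 538), (40, 538)]

print(grouped_median(rows))
```

Grouping first and computing both middle values per group avoids repeating the sort-and-aggregate step by hand for every group value.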