RapidMiner

Possible bug with Aggregate (mode) Function?

Status: Resolved

Hi,

I tried to aggregate a set of values using the mode (aggregate) function. See input data below.

User_ID Month Coupon
12245 Aug-17 A123
55645 Aug-17 B774
99987 Aug-17 B376
9890 Aug-17 B456
9890 Aug-17 B456
9890 Aug-17 B457
9891 Aug-17 ?
9891 Aug-17 ?

When aggregating, RM appears to randomly assign a value as the mode for the missing values, when the answer for 9891 should be 0. Please see the XML below. Is this a bug?

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve RM_Test" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Local Repository/RM_Test"/>
</operator>
<operator activated="true" class="aggregate" compatibility="8.1.003" expanded="true" height="82" name="Aggregate" width="90" x="246" y="85">
<parameter key="use_default_aggregation" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default_aggregation_function" value="average"/>
<list key="aggregation_attributes">
<parameter key="Coupon" value="mode"/>
</list>
<parameter key="group_by_attributes" value="User_ID|Month"/>
<parameter key="count_all_combinations" value="false"/>
<parameter key="only_distinct" value="false"/>
<parameter key="ignore_missings" value="false"/>
</operator>
<connect from_op="Retrieve RM_Test" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
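For reference, the grouping this process performs can be sketched outside RapidMiner. This is only an illustration in plain Python, not RapidMiner's implementation: the function name mode_by_group is made up here, None stands in for the "?" entries, and missing values are counted as a value of their own (one possible reading of ignore_missings set to false).

```python
from collections import Counter, defaultdict

# Input rows as posted; None stands in for the missing value shown as "?".
ROWS = [
    (12245, "Aug-17", "A123"),
    (55645, "Aug-17", "B774"),
    (99987, "Aug-17", "B376"),
    (9890, "Aug-17", "B456"),
    (9890, "Aug-17", "B456"),
    (9890, "Aug-17", "B457"),
    (9891, "Aug-17", None),
    (9891, "Aug-17", None),
]


def mode_by_group(rows):
    """Most frequent Coupon per (User_ID, Month).

    Hypothetical helper: a missing value (None) is counted like any other value.
    """
    groups = defaultdict(list)
    for user_id, month, coupon in rows:
        groups[(user_id, month)].append(coupon)
    # most_common(1) returns a one-element list of (value, count) pairs.
    return {key: Counter(values).most_common(1)[0][0]
            for key, values in groups.items()}


result = mode_by_group(ROWS)
for key in sorted(result):
    print(key, result[key])
```

Under these assumptions the group for 9891 contains nothing but missing values, so its mode is itself missing. Neither 0 nor A123 can be derived from that group's own rows.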

Comments
Contributor II

This is the result I get. The answer for 9891 should be 0 and not A123.

 

1 9890.0 Tue Aug 01 00:00:00 SGT 2017 B456
2 9891.0 Tue Aug 01 00:00:00 SGT 2017 A123
3 12245.0 Tue Aug 01 00:00:00 SGT 2017 A123
4 55645.0 Tue Aug 01 00:00:00 SGT 2017 B774
5 99987.0 Tue Aug 01 00:00:00 SGT 2017 B376

Hello @data123,

 

Indeed, you have discovered some strange behaviour.

 

Until this is explained, as a temporary workaround you can first replace the missing value(s) with 0 using the Replace Missing Values operator.

Here is the process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="8.1.003" expanded="true" height="68" name="Read Excel" width="90" x="112" y="85">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Bug_Aggregate\Bug_Aggregate.xlsx"/>
        <parameter key="imported_cell_range" value="A1:C9"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="User_ID.true.integer.attribute"/>
          <parameter key="1" value="Month.true.polynominal.attribute"/>
          <parameter key="2" value="Coupon.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="8.1.003" expanded="true" height="103" name="Replace Missing Values" width="90" x="313" y="85">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Coupon"/>
        <parameter key="default" value="value"/>
        <list key="columns"/>
        <parameter key="replenishment_value" value="0"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="8.1.003" expanded="true" height="82" name="Aggregate" width="90" x="514" y="85">
        <list key="aggregation_attributes">
          <parameter key="Coupon" value="mode"/>
        </list>
        <parameter key="group_by_attributes" value="User_ID"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Regards,

Lionel

Community Manager

Hmm, well, I'm not sure 0 would be the expected behavior IMHO, but it seems to do what I would expect...

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Book2" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//RapidMiner OneDrive/Misc/Book2"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="8.1.001" expanded="true" height="82" name="Aggregate" width="90" x="246" y="85">
        <list key="aggregation_attributes">
          <parameter key="Coupon" value="mode"/>
        </list>
        <parameter key="group_by_attributes" value="User_ID|Month"/>
      </operator>
      <connect from_op="Retrieve Book2" from_port="output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Screen Shot 2018-04-30 at 9.50.24 AM.png

Contributor II

Thanks guys. If we replace the missing values with 0, they are no longer declared as "missing", so the aggregate (mode) function counts them like any other value. The result is mathematically correct but not the desired one (e.g. A123, A123, 0, 0, 0 gives 0 instead of A123).
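To make the example concrete, here is a small plain-Python sketch (not RapidMiner code; the helper name mode is made up for illustration) of what happens once the missing coupons have been turned into the placeholder "0" before aggregating.

```python
from collections import Counter


def mode(values):
    """Return the most frequent value in a non-empty list (hypothetical helper)."""
    return Counter(values).most_common(1)[0][0]


# Missing coupons already replaced by the placeholder "0" before aggregating.
replaced = ["A123", "A123", "0", "0", "0"]
placeholder_wins = mode(replaced)

# Dropping the placeholder first recovers the most frequent real coupon.
real_only = [v for v in replaced if v != "0"]
real_mode = mode(real_only)

print(placeholder_wins, real_mode)
```

The placeholder outnumbers the real coupon, so it becomes the mode. Only when the placeholder entries are left out of the count does A123 come back.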

RM Staff

Hi all,

 

We have this on our radar and are working on it. To keep everything consistent, the mode aggregation will in future return a missing value if most entries are missing values. So you would still have to use the Replace Missing Values operator afterwards. Will keep you posted.

 

Cheers

Jan
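The order described in the post above (aggregate first, replace missing results afterwards) can be sketched in plain Python. This is an assumption-laden illustration, not the actual RapidMiner code: the functions mode and mode_then_replace are hypothetical, None marks a missing value, and the ignore_missings flag only mirrors the name of the parameter in the posted process XML.

```python
from collections import Counter


def mode(values, ignore_missings=False):
    """Most frequent value, with None marking a missing value (hypothetical helper).

    With ignore_missings=True, missing entries are dropped before counting.
    Returns None when nothing is left to count or when missing is most frequent.
    """
    if ignore_missings:
        values = [v for v in values if v is not None]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


def mode_then_replace(values, default="0", ignore_missings=False):
    """Aggregate first, then replace a missing result with a default value."""
    result = mode(values, ignore_missings)
    return default if result is None else result


only_missing = [None, None]                       # like the group for 9891
mostly_missing = ["A123", "A123", None, None, None]

print(mode_then_replace(only_missing))
print(mode(mostly_missing))
print(mode(mostly_missing, ignore_missings=True))
```

With this ordering, a group made up entirely of missing values ends up as "0" after the replacement step, while a group that is mostly missing yields either a missing result or A123, depending on the flag.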

RM Staff

Quick update: the mode aggregation was always ignoring missing values, regardless of whether the corresponding parameter was set or not. This will be fixed in the next patch release (8.2.1).

 

Cheers
Jan

Community Manager
Status: Investigating
 
Community Manager
Status: Resolved

Fixed in version 8.2.1.