Possible bug with Aggregate (mode) Function?

data123 Member Posts: 23 Maven
edited August 2020 in Product Feedback - Resolved

Hi,

I tried to aggregate a set of values using the mode (aggregate) function. See input data below.

User_ID Month Coupon
12245 Aug-17 A123
55645 Aug-17 B774
99987 Aug-17 B376
9890 Aug-17 B456
9890 Aug-17 B456
9890 Aug-17 B457
9891 Aug-17 ?
9891 Aug-17 ?

When aggregating, RM appears to assign an arbitrary value as the mode for the group whose values are all missing, when the answer for 9891 should be 0. Please see the XML below. Is this a bug?

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve RM_Test" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Local Repository/RM_Test"/>
</operator>
<operator activated="true" class="aggregate" compatibility="8.1.003" expanded="true" height="82" name="Aggregate" width="90" x="246" y="85">
<parameter key="use_default_aggregation" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default_aggregation_function" value="average"/>
<list key="aggregation_attributes">
<parameter key="Coupon" value="mode"/>
</list>
<parameter key="group_by_attributes" value="User_ID|Month"/>
<parameter key="count_all_combinations" value="false"/>
<parameter key="only_distinct" value="false"/>
<parameter key="ignore_missings" value="false"/>
</operator>
<connect from_op="Retrieve RM_Test" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>


Fixed and Released · Last Updated

Comments

  • data123 Member Posts: 23 Maven

    This is the result I get. The answer for 9891 should be 0 and not A123.

     

    1 9890.0 Tue Aug 01 00:00:00 SGT 2017 B456
    2 9891.0 Tue Aug 01 00:00:00 SGT 2017 A123
    3 12245.0 Tue Aug 01 00:00:00 SGT 2017 A123
    4 55645.0 Tue Aug 01 00:00:00 SGT 2017 B774
    5 99987.0 Tue Aug 01 00:00:00 SGT 2017 B376
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hello @data123,

     

    Indeed, you have discovered some strange behaviour.

     

    Until this behaviour is explained, a workaround is to first replace the missing value(s) with 0 using the Replace Missing Values operator.

    Here is the process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.1.003" expanded="true" height="68" name="Read Excel" width="90" x="112" y="85">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Bug_Aggregate\Bug_Aggregate.xlsx"/>
    <parameter key="imported_cell_range" value="A1:C9"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="User_ID.true.integer.attribute"/>
    <parameter key="1" value="Month.true.polynominal.attribute"/>
    <parameter key="2" value="Coupon.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="replace_missing_values" compatibility="8.1.003" expanded="true" height="103" name="Replace Missing Values" width="90" x="313" y="85">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Coupon"/>
    <parameter key="default" value="value"/>
    <list key="columns"/>
    <parameter key="replenishment_value" value="0"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="8.1.003" expanded="true" height="82" name="Aggregate" width="90" x="514" y="85">
    <list key="aggregation_attributes">
    <parameter key="Coupon" value="mode"/>
    </list>
    <parameter key="group_by_attributes" value="User_ID"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

    Lionel

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hmm, well, IMHO I'm not sure 0 would be the expected behavior, but it seems to do what I would expect...

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Book2" width="90" x="112" y="85">
    <parameter key="repository_entry" value="//RapidMiner OneDrive/Misc/Book2"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="8.1.001" expanded="true" height="82" name="Aggregate" width="90" x="246" y="85">
    <list key="aggregation_attributes">
    <parameter key="Coupon" value="mode"/>
    </list>
    <parameter key="group_by_attributes" value="User_ID|Month"/>
    </operator>
    <connect from_op="Retrieve Book2" from_port="output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    [Attached screenshot: Screen Shot 2018-04-30 at 9.50.24 AM.png]

  • data123 Member Posts: 23 Maven

    Thanks, guys. If we replace the missing values with 0, they are no longer declared as "missing", so the aggregate (mode) function counts them like any other value. The result is mathematically correct but not the desired one (e.g. A123, A123, 0, 0, 0 gives 0 instead of A123).
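
The trade-off described above can be reproduced outside RapidMiner. What follows is a minimal Python sketch, not RapidMiner's implementation: the helper name `mode_of` is hypothetical, and breaking ties by first occurrence is an assumption.

```python
from collections import Counter

def mode_of(values):
    """Return the most frequent value (ties: the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Coupons for one user after missing values were replaced with "0".
replaced = ["A123", "A123", "0", "0", "0"]
print(mode_of(replaced))   # 0

# Dropping the placeholder before counting gives the desired answer.
real_only = [c for c in replaced if c != "0"]
print(mode_of(real_only))  # A123
```

Once the placeholder is an ordinary value it competes in the count, so it wins whenever it outnumbers the real coupons.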

  • jczogalla Employee, Member Posts: 144 RM Engineering

    Hi all,

     

    We have this on our radar and are working on it. To keep everything consistent, the aggregation function will in future return a missing value if most entries are missing values. So you would still have to use the Replace Missing Values operator afterwards. Will keep you posted.

     

    Cheers

    Jan
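
The order of operations described in this reply (aggregate first, then replace a missing result) could look roughly like the Python sketch below. It is an illustration only, not RapidMiner code. It reads "most entries are missing" as "missing is the most frequent entry", which is an interpretation; `MISSING`, `mode_counting_missing` and `replace_missing` are hypothetical names.

```python
from collections import Counter

MISSING = None  # stand-in for RapidMiner's "?" (assumption)

def mode_counting_missing(values):
    """Mode where a missing entry competes like any other value."""
    return Counter(values).most_common(1)[0][0]

def replace_missing(value, placeholder="0"):
    """Post-aggregation step: swap a missing result for a placeholder."""
    return placeholder if value is MISSING else value

group_9891 = [MISSING, MISSING]
result = mode_counting_missing(group_9891)
print(result)                   # None
print(replace_missing(result))  # 0
```

Replacing after the aggregation avoids the problem raised earlier in the thread, because the placeholder never takes part in the count.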

  • jczogalla Employee, Member Posts: 144 RM Engineering

    Quick update: the mode aggregation was always ignoring missing values, regardless of whether the corresponding parameter was set. This will be fixed in the next patch release (8.2.1).

     

    Cheers
    Jan
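
To illustrate what honouring such a parameter might mean, here is a small Python sketch that groups the thread's sample data by User_ID. It is not RapidMiner's implementation: the function `grouped_mode` is hypothetical, its `ignore_missings` flag only mirrors the name of the operator parameter, and treating a missing entry as a countable candidate when the flag is off follows the description in this thread rather than any verified behaviour.

```python
from collections import Counter, defaultdict

MISSING = None  # stand-in for "?" in the sample data (assumption)

rows = [
    (12245, "A123"), (55645, "B774"), (99987, "B376"),
    (9890, "B456"), (9890, "B456"), (9890, "B457"),
    (9891, MISSING), (9891, MISSING),
]

def grouped_mode(rows, ignore_missings):
    """Mode of Coupon per User_ID, optionally dropping missing entries."""
    groups = defaultdict(list)
    for user_id, coupon in rows:
        groups[user_id].append(coupon)
    result = {}
    for user_id, coupons in groups.items():
        if ignore_missings:
            coupons = [c for c in coupons if c is not MISSING]
        # A group with nothing left to count yields a missing result.
        result[user_id] = (
            Counter(coupons).most_common(1)[0][0] if coupons else MISSING
        )
    return result

print(grouped_mode(rows, ignore_missings=False)[9891])  # None
print(grouped_mode(rows, ignore_missings=True)[9890])   # B456
```

In this sketch, user 9891 comes out as missing under either setting, never as a coupon borrowed from another group.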

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Fixed in version 8.2.1.
