"PIVOT operator creates erroneous output"

mmaragmmarag Member Posts: 35 Maven
edited June 2019 in Help

Hello, i was testing the results of pivot in Excel and Rapidminer and i found out that they are completely different, whith those of RM being wrong! Any ideas?

 

Here is the dataset i used in both cases:

Student Item Amount
Id 1 Item 1 1
Id 2 Item 2 8
Id 3 Item 1 1
Id 3 Item 1 1
Id 2 Item 2 1
Id 2 Item 1 1
Id 3 Item 1 1
Id 2 Item 1 1
Id 2 Item 2 5
Id 2 Item 1 1

and here is the result from Pivot operation in Excel (the correct one)

 

Row Labels Item 1 Item 2
Id 1 1  
Id 2 3 14
Id 3 3  

and here is the result from the Pivot operator in RM. i selected the student attr as ID and Itema s Index and SUM as weight aggregation.

Row No Item1 Item 2
1 1.0 ?
2 1.0 5.0
3 1.0 ?

here is the process XML

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_transaction_data" compatibility="7.4.000" expanded="true" height="68" name="Generate Transaction Data" width="90" x="45" y="340">
        <parameter key="number_transactions" value="10"/>
        <parameter key="number_customers" value="3"/>
        <parameter key="number_items" value="2"/>
        <parameter key="number_clusters" value="2"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.4.000" expanded="true" height="103" name="Multiply" width="90" x="246" y="289"/>
      <operator activated="true" class="pivot" compatibility="7.4.000" expanded="true" height="82" name="Pivot" width="90" x="246" y="85">
        <parameter key="group_attribute" value="Id"/>
        <parameter key="index_attribute" value="Item"/>
      </operator>
      <operator activated="true" class="rename_by_replacing" compatibility="7.4.000" expanded="true" height="82" name="Rename by Replacing" width="90" x="447" y="34">
        <parameter key="replace_what" value="Amount_"/>
      </operator>
      <connect from_op="Generate Transaction Data" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Pivot" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_port="result 2"/>
      <connect from_op="Pivot" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
      <connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Tagged:

Best Answer

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    @mmarag, and if you want to actually get the data structure similar to the Excel output, where the row is by id and there is a seprate column for each item sum, then you can add a pivot after the Aggregate operator that @Thomas_Ott showed in his process.  Here's an updated process that adds that piece:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_transaction_data" compatibility="7.4.000" expanded="true" height="68" name="Generate Transaction Data" width="90" x="45" y="34">
    <parameter key="number_transactions" value="10"/>
    <parameter key="number_customers" value="3"/>
    <parameter key="number_items" value="2"/>
    <parameter key="number_clusters" value="2"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.4.000" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
    <operator activated="true" class="aggregate" compatibility="7.4.000" expanded="true" height="82" name="Aggregate" width="90" x="514" y="289">
    <list key="aggregation_attributes">
    <parameter key="Amount" value="sum"/>
    </list>
    <parameter key="group_by_attributes" value="Id|Item"/>
    </operator>
    <operator activated="true" class="pivot" compatibility="7.4.000" expanded="true" height="82" name="Pivot (2)" width="90" x="648" y="289">
    <parameter key="group_attribute" value="Id"/>
    <parameter key="index_attribute" value="Item"/>
    </operator>
    <operator activated="true" class="pivot" compatibility="7.4.000" expanded="true" height="82" name="Pivot" width="90" x="447" y="34">
    <parameter key="group_attribute" value="Id"/>
    <parameter key="index_attribute" value="Item"/>
    </operator>
    <operator activated="true" class="rename_by_replacing" compatibility="7.4.000" expanded="true" height="82" name="Rename by Replacing" width="90" x="581" y="34">
    <parameter key="replace_what" value="Amount_"/>
    </operator>
    <connect from_op="Generate Transaction Data" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Pivot" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_op="Pivot (2)" to_port="example set input"/>
    <connect from_op="Pivot (2)" from_port="example set output" to_port="result 2"/>
    <connect from_op="Pivot" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
    <connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

     

    But keep in mind that for certain types of analysis, the unpivoted format (long and skinny) is actually more efficient.  So the right structure depends what you are trying to accomplish.

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    What you want is the Aggregate operator, not Pivot. There's much confusion as to what Excel calls Pivot and what the rest of the world means by Pivot, but that's a long discussion over a few beers.

     

    I added an Aggregate to your process, see attached.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_transaction_data" compatibility="7.4.000" expanded="true" height="68" name="Generate Transaction Data" width="90" x="45" y="34">
    <parameter key="number_transactions" value="10"/>
    <parameter key="number_customers" value="3"/>
    <parameter key="number_items" value="2"/>
    <parameter key="number_clusters" value="2"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.4.000" expanded="true" height="124" name="Multiply" width="90" x="246" y="136"/>
    <operator activated="true" class="aggregate" compatibility="7.4.000" expanded="true" height="82" name="Aggregate" width="90" x="514" y="289">
    <list key="aggregation_attributes">
    <parameter key="Amount" value="sum"/>
    </list>
    <parameter key="group_by_attributes" value="Id|Item"/>
    </operator>
    <operator activated="true" class="pivot" compatibility="7.4.000" expanded="true" height="82" name="Pivot" width="90" x="447" y="34">
    <parameter key="group_attribute" value="Id"/>
    <parameter key="index_attribute" value="Item"/>
    </operator>
    <operator activated="true" class="rename_by_replacing" compatibility="7.4.000" expanded="true" height="82" name="Rename by Replacing" width="90" x="581" y="34">
    <parameter key="replace_what" value="Amount_"/>
    </operator>
    <connect from_op="Generate Transaction Data" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Pivot" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_port="result 2"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 3"/>
    <connect from_op="Pivot" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
    <connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
Sign In or Register to comment.