[SOLVED] Filtering Duplicate Table Data

Mckenzie Member Posts: 2 Contributor I
edited September 2019 in Help
Hi all, I have a Similarity to Data module set up and I'm getting the following output columns:

FIRST_ID, SECOND_ID, SIMILARITY

Because of the way the pages are compared, each pair of IDs is displayed twice, once in each order. For example:

3, 2, 1.0
2, 3, 1.0

Both rows describe the same pair, just in a different order: 3,2 and 2,3.

I've been looking at the Remove Duplicates module under Filtering, but I can't find the right rule or expression to return each combination of first and second ID only once.

Many thanks,

Mckenzie

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,326  RM Data Scientist
    Hi,

    I do not have a single-operator solution for you, but the process below solves the problem. I do not know whether there is an easier way to do it.



    Cheers,
    Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.3.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.3.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="120">
            <list key="attribute_values">
              <parameter key="id1" value="1"/>
              <parameter key="id2" value="2"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="6.3.000" expanded="true" height="76" name="Generate ID" width="90" x="179" y="75"/>
          <operator activated="true" class="multiply" compatibility="6.3.000" expanded="true" height="94" name="Multiply (2)" width="90" x="179" y="210"/>
          <operator activated="true" class="append" compatibility="6.3.000" expanded="true" height="94" name="Append (2)" width="90" x="313" y="210"/>
          <operator activated="true" class="set_data" compatibility="6.3.000" expanded="true" height="76" name="Set Data" width="90" x="447" y="120">
            <parameter key="example_index" value="1"/>
            <parameter key="attribute_name" value="id1"/>
            <parameter key="value" value="2"/>
            <list key="additional_values">
              <parameter key="id2" value="1"/>
            </list>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.3.000" expanded="true" height="94" name="Multiply" width="90" x="581" y="120"/>
          <operator activated="true" class="rename" compatibility="6.3.000" expanded="true" height="76" name="Rename" width="90" x="715" y="30">
            <parameter key="old_name" value="id1"/>
            <parameter key="new_name" value="test_id1"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="rename" compatibility="6.3.000" expanded="true" height="76" name="Rename (2)" width="90" x="849" y="30">
            <parameter key="old_name" value="id2"/>
            <parameter key="new_name" value="test_id2"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="rename" compatibility="6.3.000" expanded="true" height="76" name="Rename (3)" width="90" x="715" y="165">
            <parameter key="old_name" value="id2"/>
            <parameter key="new_name" value="test_id1"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="rename" compatibility="6.3.000" expanded="true" height="76" name="Rename (4)" width="90" x="849" y="165">
            <parameter key="old_name" value="id1"/>
            <parameter key="new_name" value="test_id2"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="append" compatibility="6.3.000" expanded="true" height="94" name="Append" width="90" x="983" y="120"/>
          <operator activated="true" class="remove_duplicates" compatibility="6.3.000" expanded="true" height="76" name="Remove Duplicates" width="90" x="1117" y="120">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="test_id1|test_id2"/>
          </operator>
          <operator activated="true" class="remove_duplicates" compatibility="6.3.000" expanded="true" height="76" name="Remove Duplicates (2)" width="90" x="1251" y="120">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="attributes" value="test_id1|test_id2"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_op="Append (2)" to_port="example set 2"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Append (2)" to_port="example set 1"/>
          <connect from_op="Append (2)" from_port="merged set" to_op="Set Data" to_port="example set input"/>
          <connect from_op="Set Data" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Rename" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Rename (3)" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_op="Rename (2)" to_port="example set input"/>
          <connect from_op="Rename (2)" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Rename (3)" from_port="example set output" to_op="Rename (4)" to_port="example set input"/>
          <connect from_op="Rename (4)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Remove Duplicates (2)" to_port="example set input"/>
          <connect from_op="Remove Duplicates (2)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Mckenzie Member Posts: 2 Contributor I
    Hi Martin,

    Thanks for the reply. In the end I created an aggregate attribute similar to what you did: I compared, ordered and concatenated the first and second IDs (using RegEx) to give a new unique ID, then removed duplicates on it.

    Many thanks for your help.

    Mckenzie
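
The idea described in the last reply (order the two IDs, join them into a single key, then remove duplicates on that key) can be sketched outside RapidMiner as follows. This is only a minimal Python illustration, not the process Mckenzie actually built: the function name `drop_mirrored_pairs`, the underscore separator, and the third sample row are assumptions made for the example.

```python
def drop_mirrored_pairs(rows):
    """Keep the first row seen for each unordered (first_id, second_id) pair."""
    seen = set()
    unique = []
    for first_id, second_id, similarity in rows:
        # Order the two IDs so that (3, 2) and (2, 3) produce the same key.
        low, high = sorted((first_id, second_id))
        # The separator keeps keys unambiguous, e.g. (1, 23) vs (12, 3).
        key = f"{low}_{high}"
        if key not in seen:
            seen.add(key)
            unique.append((first_id, second_id, similarity))
    return unique


if __name__ == "__main__":
    # The first two rows come from the question; the third is made up.
    rows = [(3, 2, 1.0), (2, 3, 1.0), (1, 4, 0.5)]
    for row in drop_mirrored_pairs(rows):
        print(row)
```

The combined key plays the role of the "new unique ID": once both orderings map to the same value, an ordinary duplicate filter on that one attribute is enough.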