Remove duplicates

ga34hox Member Posts: 6 Contributor I
edited December 2018 in Help

Hello everybody,

 

I hope somebody can help. I've got a data set of about 50,000 rows and 2 columns (att1, att2). I want to remove duplicates if (and only if) the value of att1(row1) is equal to the value of att2(row2) and the value of att2(row1) is equal to att1(row2).

 

Example:

        att1   att2
row1    100    200
row2    200    100

 

So row2 should be eliminated. Does anybody have an idea or a solution?

Thanks so much!


Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @ga34hox,

     

    I think that the Lag Series operator from the Value Series extension can help you.

    Does this process meet your need? (I only tested it on a partial dataset.)

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.2.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Remove_Duplicates\Remove_Duplicates.xlsx"/>
    <parameter key="imported_cell_range" value="A1:B7"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Att1.true.integer.attribute"/>
    <parameter key="1" value="Att2.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="series:lag_series" compatibility="7.4.000" expanded="true" height="82" name="Lag Series" width="90" x="246" y="34">
    <list key="attributes">
    <parameter key="Att1" value="1"/>
    <parameter key="Att2" value="1"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="380" y="34">
    <list key="function_descriptions">
    <parameter key="Duplicates" value="if((Att1==[Att2-1])&amp;&amp;(Att2==[Att1-1]),1,0)"/>
    </list>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="34">
    <parameter key="parameter_string" value="Duplicates &lt;&gt; 1"/>
    <parameter key="condition_class" value="attribute_value_filter"/>
    <list key="filters_list"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="648" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Att1|Att2"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Lag Series" to_port="example set input"/>
    <connect from_op="Lag Series" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
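One caveat, as far as the process itself shows: with a lag of 1, each row is compared only with the row directly above it, so a reversed pair is flagged only when the two rows are adjacent. If the reversed rows can sit anywhere in the data set, a set-based pass is one possible alternative. The following is a minimal sketch in plain Python, outside RapidMiner; the function name `drop_reversed_pairs` and the sample rows are hypothetical, and it assumes the first occurrence of a pair is the one to keep.

```python
def drop_reversed_pairs(rows):
    """Keep a row unless its reversed pair (att2, att1) was already kept.

    Assumption: the first occurrence wins; exact duplicates such as
    (100, 200) twice are NOT removed, only reversed pairs are.
    """
    seen = set()   # pairs that have been kept so far
    kept = []
    for att1, att2 in rows:
        if (att2, att1) in seen:
            continue  # reversed duplicate of an earlier kept row
        seen.add((att1, att2))
        kept.append((att1, att2))
    return kept


# Hypothetical sample: the reversed pair is not adjacent here.
rows = [(100, 200), (300, 400), (200, 100)]
result = drop_reversed_pairs(rows)
print(result)
```

Because membership tests on a set take roughly constant time, this stays a single pass over the rows, which should be fine for about 50,000 rows. Note that a row with equal values, such as (5, 5), counts as its own reverse, so a second copy of it would be dropped.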

    Regards,

     

     

    Lionel

     

  • ga34hox Member Posts: 6 Contributor I
     

    lionelderkrikor

     

    Thanks for your answer, but I do not yet see how this operator can help me. Can you explain in more detail?

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @ga34hox,

     

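    Have you tried importing and executing the process I shared?

    The Lag Series operator shifts the data of a column. With a lag of 1, each row gets the values of the previous row in new attributes (Att1-1 and Att2-1 in the process I shared), so Generate Attributes can compare a row with the row above it and flag reversed pairs.

    To understand it better, set breakpoints on the different operators to see the transformations and calculations applied to the data at each step.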
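For readers who want to follow the logic without opening RapidMiner, here is a rough sketch in plain Python of what the three main steps appear to do: lag by one row, flag the row if it is the reverse of the previous one, then filter out flagged rows. The function name `drop_adjacent_reversed` and the sample rows are hypothetical, and keeping the first row (which has no lagged values) is an assumption of this sketch, not something confirmed for the process.

```python
def drop_adjacent_reversed(rows):
    """Mimic lag (1 row) -> flag reversed pair -> filter flagged rows."""
    kept = []
    prev = None  # stands in for the lagged values; missing for the first row
    for att1, att2 in rows:
        # Same test as the Generate Attributes expression:
        # Att1 == [Att2-1] and Att2 == [Att1-1]
        is_duplicate = (
            prev is not None and att1 == prev[1] and att2 == prev[0]
        )
        if not is_duplicate:
            kept.append((att1, att2))
        # The lag is computed before filtering, so the previous row is
        # always the original neighbour, whether or not it was kept.
        prev = (att1, att2)
    return kept


# Sample based on the example in the question, plus one unrelated row.
rows = [(100, 200), (200, 100), (300, 400)]
result = drop_adjacent_reversed(rows)
print(result)
```

As in the process, only directly neighbouring rows are compared, so sorting or another approach would be needed if reversed pairs can be far apart.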

     

    Regards,

     

     

    Lionel
