ga34hox Member Posts: 6 Contributor I
Hello everybody,


I hope somebody can help. I've got a data set of about 50,000 rows and 2 columns (att1, att2). I want to remove duplicates if (and only if) the value of att1 in row1 is equal to the value of att2 in row2 and the value of att2 in row1 is equal to the value of att1 in row2.



        att1   att2
row1    100    200
row2    200    100


So row2 should be eliminated. Does anybody have an idea or a solution?

Thanks so much!



  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @ga34hox,


    I think that the Lag Series operator from the Value Series extension can help you.

    Does this process meet your needs? (I have only tested it on a partial dataset.)

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
      <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="8.2.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
            <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Remove_Duplicates\Remove_Duplicates.xlsx"/>
            <parameter key="imported_cell_range" value="A1:B7"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Att1.true.integer.attribute"/>
              <parameter key="1" value="Att2.true.integer.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="series:lag_series" compatibility="7.4.000" expanded="true" height="82" name="Lag Series" width="90" x="246" y="34">
            <list key="attributes">
              <parameter key="Att1" value="1"/>
              <parameter key="Att2" value="1"/>
            </list>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="380" y="34">
            <list key="function_descriptions">
              <parameter key="Duplicates" value="if((Att1==[Att2-1])&amp;&amp;(Att2==[Att1-1]),1,0)"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="34">
            <parameter key="parameter_string" value="Duplicates &lt;&gt; 1"/>
            <parameter key="condition_class" value="attribute_value_filter"/>
            <list key="filters_list"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="648" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Att1|Att2"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Lag Series" to_port="example set input"/>
          <connect from_op="Lag Series" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
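
    For readers who prefer to see the logic outside RapidMiner, here is a rough Python sketch of what the process does: remember the previous row (the lagged columns Att1-1 and Att2-1), flag a row when it mirrors that previous row, and filter the flagged rows out. The function name and the sample rows are made up for illustration, and the sketch assumes the first row, which has no predecessor, is always kept.

```python
def drop_adjacent_mirrors(rows):
    """Keep a row unless it mirrors the row directly before it."""
    kept = []
    previous = None  # stands in for the lagged columns Att1-1 / Att2-1
    for att1, att2 in rows:
        # Same test as the "Duplicates" expression: Att1 == Att2-1 and Att2 == Att1-1
        is_mirror = (
            previous is not None
            and att1 == previous[1]
            and att2 == previous[0]
        )
        if not is_mirror:
            kept.append((att1, att2))
        # The lag always refers to the original previous row, kept or not
        previous = (att1, att2)
    return kept


# Hypothetical sample data: row 2 mirrors row 1, row 4 mirrors row 1 but is not adjacent
rows = [(100, 200), (200, 100), (300, 400), (200, 100)]
result = drop_adjacent_mirrors(rows)
print(result)  # [(100, 200), (300, 400), (200, 100)]
```

    With a lag of 1, only a mirror that sits directly below its counterpart is removed, which is why the last row of the sample survives.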






  • ga34hox Member Posts: 6 Contributor I



    Thanks for your answer, but I don't yet see how this operator can help me. Can you explain in more detail?

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @ga34hox,


    Have you tried to import and execute the process I shared?

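    The Lag Series operator shifts the data of a column by a given number of rows (here 1), so each row also carries the values of the previous row and the two can be compared.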

    To understand it better, set breakpoints on the individual operators to see the intermediate transformations and calculations on the data.
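
    Note that shifting by one row only compares a row with the row directly above it. If the mirrored rows can sit anywhere in the data set, a different approach is needed. The Python sketch below is one possible way to express it, not a RapidMiner operator: it keeps a set of the pairs already kept and drops a row when its swapped pair is in that set. The function name and sample rows are hypothetical, and it assumes that exact repeats and rows where att1 equals att2 should be kept, since the question only asks about mirrored pairs.

```python
def drop_mirrored_rows(rows):
    """Drop a row if its swapped pair (att2, att1) was already kept earlier."""
    seen = set()  # pairs of rows that were kept so far
    kept = []
    for att1, att2 in rows:
        # Assumption: a row with att1 == att2 is its own mirror and is never dropped
        if att1 != att2 and (att2, att1) in seen:
            continue
        seen.add((att1, att2))
        kept.append((att1, att2))
    return kept


# Hypothetical sample data: the third row mirrors the first, but is not adjacent to it
sample = [(100, 200), (300, 400), (200, 100), (100, 200)]
deduplicated = drop_mirrored_rows(sample)
print(deduplicated)  # [(100, 200), (300, 400), (100, 200)]
```

    Because set lookups take roughly constant time, a single pass like this should be comfortable for about 50,000 rows.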






