RapidMiner

Remove duplicates

Learner III ga34hox

Hello everybody,

 

I hope somebody can help. I have a data set of about 50,000 rows and 2 columns (att1, att2). I want to remove duplicates if, and only if, the value of att1(row1) is equal to the value of att2(row2) and the value of att2(row1) is equal to the value of att1(row2).

 

Example:

        att1   att2
row1    100    200
row2    200    100

 

So row2 would be eliminated. Does anybody have an idea or a solution?

Thanks so much!
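For readers who want to check the rule outside RapidMiner, here is a minimal Python sketch of what the question describes. It is only an illustration, not a RapidMiner process: the function name `drop_reversed_pairs` is hypothetical, and it assumes that a row is compared against the rows already kept, so exact repeats such as (100, 200) followed by (100, 200) are left alone.

```python
def drop_reversed_pairs(rows):
    """Keep a row unless an earlier kept row is its mirror image (att2, att1)."""
    seen = set()  # pairs already kept, stored as (att1, att2)
    kept = []
    for att1, att2 in rows:
        if (att2, att1) in seen:
            continue  # mirror of an earlier kept row: drop it
        seen.add((att1, att2))
        kept.append((att1, att2))
    return kept


# The mirrored row does not have to sit directly below the original.
rows = [(100, 200), (300, 400), (200, 100)]
result = drop_reversed_pairs(rows)
print(result)  # [(100, 200), (300, 400)]
```

Because the lookup uses a set, the sketch makes a single pass over the data, which should be comfortable for about 50,000 rows.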

3 REPLIES
Moderator Moderator

Re: Remove duplicates

Hi @ga34hox,

 

I think the Lag Series operator from the Value Series extension can help you.

Does this process meet your needs? (I only tested it on a partial dataset.)

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="8.2.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Remove_Duplicates\Remove_Duplicates.xlsx"/>
        <parameter key="imported_cell_range" value="A1:B7"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Att1.true.integer.attribute"/>
          <parameter key="1" value="Att2.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="series:lag_series" compatibility="7.4.000" expanded="true" height="82" name="Lag Series" width="90" x="246" y="34">
        <list key="attributes">
          <parameter key="Att1" value="1"/>
          <parameter key="Att2" value="1"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="380" y="34">
        <list key="function_descriptions">
          <parameter key="Duplicates" value="if((Att1==[Att2-1])&amp;&amp;(Att2==[Att1-1]),1,0)"/>
        </list>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="34">
        <parameter key="parameter_string" value="Duplicates &lt;&gt; 1"/>
        <parameter key="condition_class" value="attribute_value_filter"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="648" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Att1|Att2"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Lag Series" to_port="example set input"/>
      <connect from_op="Lag Series" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
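As a reading aid, the comparison in the Generate Attributes step, `(Att1==[Att2-1])&&(Att2==[Att1-1])`, can be sketched in Python as below. This is an assumption-laden illustration, not an export of the process: the function name `drop_adjacent_reversed` is hypothetical, and the sketch assumes the first row (which has no lagged values) is kept. It also shows that a lag of 1 only compares each row with the row directly above it.

```python
def drop_adjacent_reversed(rows):
    """Drop a row when it mirrors the row directly above it (lag of 1)."""
    kept = []
    prev = None  # lagged row; the first row has no predecessor (assumed kept)
    for att1, att2 in rows:
        # Same test as the expression: Att1 == lagged Att2 and Att2 == lagged Att1
        is_duplicate = prev is not None and att1 == prev[1] and att2 == prev[0]
        if not is_duplicate:
            kept.append((att1, att2))
        prev = (att1, att2)  # the lag runs over the original rows, not the filtered ones
    return kept


adjacent = [(100, 200), (200, 100), (300, 400)]
separated = [(100, 200), (300, 400), (200, 100)]
print(drop_adjacent_reversed(adjacent))   # mirrored row directly below: removed
print(drop_adjacent_reversed(separated))  # mirrored row two rows below: not detected
```

If the mirrored rows in the real data are not guaranteed to be adjacent, a lag of 1 alone will miss some of them.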

Regards,

 

 

Lionel

 

Learner III ga34hox

Re: Remove duplicates

 

Hi @

 

Moderator Moderator

Re: Remove duplicates

Hi @ga34hox,

 

Have you tried importing and executing the process I shared?

This operator lets you shift the data of a column.

To understand it better, set breakpoints on the individual operators so you can see each transformation and calculation applied to the data.
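To picture what shifting a column means, here is a small Python sketch. The helper `lag` is hypothetical and uses `None` as a stand-in for a missing value. It only mirrors the idea of a lag of 1 as configured in the shared process, not the operator's actual implementation.

```python
def lag(values, steps=1):
    """Shift a column down by `steps` rows (steps >= 0), padding the top with None."""
    n = len(values)
    steps = min(steps, n)  # never pad beyond the column length
    return [None] * steps + list(values[: n - steps])


att1 = [100, 200, 300]
att2 = [200, 100, 400]

# Each row now also carries the previous row's values, so both can be compared in place.
for row in zip(att1, att2, lag(att1), lag(att2)):
    print(row)
```

In the second printed row the original pair (200, 100) sits next to the lagged pair (100, 200), which is the situation the Duplicates expression tests for.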

 

Regards,

 

 

Lionel