Remove Duplicates

MuehliMan Member Posts: 85 Maven
edited November 2018 in Help
Hi,

I have a problem using the Remove Duplicates Operator. Here is my workflow:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="646" width="1095">
      <operator activated="true" class="read_excel" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="E:\RM\Test_Duplicates.xls"/>
        <list key="annotations"/>
      </operator>
      <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
        <parameter key="name" value="CID"/>
        <parameter key="target_role" value="id"/>
      </operator>
      <operator activated="true" class="remove_duplicates" expanded="true" height="76" name="Remove Duplicates" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="CID"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

And here is the table I am using to test it.

CID Value
3596 X
4054 X
4054 X
3000 S
32135 S

When I enable "invert selection" to get only the duplicates, it gives me 2 examples that are not duplicates. Could someone tell me where I went wrong?

Cheers,
Markus

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    if you select the single attribute CID and then invert the selection, the only remaining attribute is Value. If you compare the examples on this single attribute, there are only two distinct examples: one containing X, one containing S.
    So, where's the problem?

    Greetings,
      Sebastian
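
To make the explanation above concrete, here is a minimal Python sketch (not RapidMiner code) of what happens to the test table. The helper `remove_duplicates` is hypothetical and assumes the first example of each group is the one that is kept; it only models the comparison logic described in the answer.

```python
# Test table from the question: (CID, Value) rows.
rows = [(3596, "X"), (4054, "X"), (4054, "X"), (3000, "S"), (32135, "S")]


def remove_duplicates(rows, key):
    """Keep one row per distinct key (assumption: the first one seen)."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Selecting CID and inverting the selection leaves only Value
# as the attribute that the examples are compared on.
by_value = remove_duplicates(rows, key=lambda r: r[1])

# Without the inversion, the examples are compared on CID.
by_cid = remove_duplicates(rows, key=lambda r: r[0])

print(by_value)  # two rows: one with X, one with S
print(by_cid)    # four rows: the second 4054 row is dropped
```

In both cases the result is a de-duplicated table, never the set of duplicates itself, which is why the inverted run returns two unrelated examples.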
  • MuehliManMuehliMan Member Posts: 85 Maven
    Hi Sebastian,

    sorry, I think I was unclear. What I would like is something that returns the examples with duplicate CIDs. So the desired output would be:

    4054  X
    4054  X

    The Value column is not of interest. It is the CID column that must not contain duplicates, because I do a join afterwards.

    Cheers,
    Markus



  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Markus,
    then you misunderstood the "invert" parameter: it applies to the attribute selection, not to the example selection.

    Getting the duplicate examples themselves is a little more complex, but I designed a process for this task and uploaded it to myExperiment. You can access it using the Community Extension: open the process (currently last in the list) "Finding all Examples that have duplicate values in certain attributes".

    Greetings,
      Sebastian
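
The myExperiment process itself is not reproduced in this thread. As a rough illustration of the underlying idea only, the following Python sketch (an assumption, not Sebastian's process and not RapidMiner code) counts how often each CID occurs and keeps every example whose CID occurs more than once. The function name `rows_with_duplicate_key` is hypothetical.

```python
from collections import Counter

# Test table from the question: (CID, Value) rows.
rows = [(3596, "X"), (4054, "X"), (4054, "X"), (3000, "S"), (32135, "S")]


def rows_with_duplicate_key(rows, key):
    """Return every row whose key value occurs more than once."""
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] > 1]


# Compare on CID only; the Value column is ignored.
duplicates = rows_with_duplicate_key(rows, key=lambda r: r[0])

for cid, value in duplicates:
    print(cid, value)
```

For the test table this yields the two 4054 rows that Markus listed as the desired output.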