Remove Duplicates
Hi,
I have a problem using the Remove Duplicates Operator. Here is my workflow:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="646" width="1095">
<operator activated="true" class="read_excel" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="E:\RM\Test_Duplicates.xls"/>
<list key="annotations"/>
</operator>
<operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
<parameter key="name" value="CID"/>
<parameter key="target_role" value="id"/>
</operator>
<operator activated="true" class="remove_duplicates" expanded="true" height="76" name="Remove Duplicates" width="90" x="313" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="CID"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
And here is the table I am using to test it.
CID Value
3596 X
4054 X
4054 X
3000 S
32135 S
When I use it with "invert selection" to get only the duplicates, it gives me 2 examples which are not duplicates. Could someone tell me where I am going wrong?
Cheers,
Markus
Answers
If you select the single attribute CID and then invert the selection, the only remaining attribute is Value. If you compare the examples on this single attribute, there are only two distinct examples: one containing X, one containing S.
So, where's the problem?
Greetings,
Sebastian
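To illustrate the point above outside RapidMiner, here is a minimal Python sketch. It assumes that duplicate removal keeps the first example for each distinct combination of the compared attributes; the helper `remove_duplicates` is a hypothetical stand-in, not the operator's actual implementation. Comparing on CID keeps four examples, while comparing on Value (the effect of inverting the attribute selection) keeps only two.

```python
# Rows from the test table in the question: (CID, Value)
rows = [(3596, "X"), (4054, "X"), (4054, "X"), (3000, "S"), (32135, "S")]


def remove_duplicates(rows, key):
    """Keep the first row seen for each distinct key.

    Hypothetical stand-in for the operator; assumes first-occurrence-wins.
    """
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Attribute selection = CID: rows are compared on CID only.
by_cid = remove_duplicates(rows, key=lambda r: r[0])

# Inverted attribute selection: everything except CID, i.e. Value only.
by_value = remove_duplicates(rows, key=lambda r: r[1])

print(by_cid)
print(by_value)
```

The second result contains one example with X and one with S, which matches the two examples reported in the question.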
Sorry, I think I was unclear: what I would like is something that gives back the examples with duplicate CIDs. So the desired output would be:
4054 X
4054 X
The Value column is not of interest, it is the CID column, where I cannot have duplicates, because I do a join afterwards.
Cheers,
Markus
Then you misunderstood the "invert" parameter: it applies to the attribute selection, not to the example selection.
Getting the duplicates themselves is a little more complex, but I designed a process for this task and uploaded it to myExperiment. You can access it using the Community Extension: open the process "Finding all Examples that have duplicate values in certain attributes" (currently last in the list).
Greetings,
Sebastian
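For readers without access to the myExperiment process, the underlying logic can be sketched in plain Python. This is only an illustration of the idea (count how often each CID occurs, then keep every example whose CID occurs more than once) and not a description of how the uploaded process is built; the variable names are made up for the example.

```python
from collections import Counter

# Rows from the test table in the question: (CID, Value)
rows = [(3596, "X"), (4054, "X"), (4054, "X"), (3000, "S"), (32135, "S")]

# Count how many times each CID appears.
cid_counts = Counter(cid for cid, _ in rows)

# Keep every row whose CID appears more than once (all copies, not just the extras).
duplicate_rows = [row for row in rows if cid_counts[row[0]] > 1]

print(duplicate_rows)
```

On the test table this keeps both rows with CID 4054, which is the output asked for above.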