Identify Duplicate examples

aliasgarscool Member Posts: 2 Contributor I
edited November 2018 in Help

Hi,

 

I have a dataset in which I want to identify the duplicates (unlike Remove Duplicates, I want to keep only the duplicated rows).

 

For example, I have the data below:

Month     Name    Amount
Jul-15    John    10$
Aug-15    Alex    15$
Sep-15    John    5$
Jul-15    John    10$

 

 

If the above table is my input, then I want only the rows below in my results:

Month     Name    Amount
Jul-15    John    10$
Jul-15    John    10$

 

Best Answer

  • dr-connie-brett RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 4 Contributor I
    Solution Accepted

    If you don't actually need the duplicated examples themselves, but rather a count of how many times each one appears, this is how I would handle it:

    1 - Aggregate the table (Aggregate operator: group by all attributes and count one of them)

    2 - Filter Examples, keeping only rows where count(attribute) > 1

    Since there is no unique identifier that you are ignoring, I'm assuming you don't really need each duplicate repeated as many times as it appears, but it might be useful to know how many times each one appears!
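The two steps above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the rows are the sample data from the question, and `count_duplicates` and `keep_duplicated_rows` are hypothetical helper names, not RapidMiner operators. The second helper goes one step further and returns every copy of a duplicated row, which is the output the question asks for.

```python
from collections import Counter

# Sample rows from the question: (Month, Name, Amount)
rows = [
    ("Jul-15", "John", "10$"),
    ("Aug-15", "Alex", "15$"),
    ("Sep-15", "John", "5$"),
    ("Jul-15", "John", "10$"),
]

def count_duplicates(rows):
    """Group by all attributes, count, and keep only groups with count > 1."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}

def keep_duplicated_rows(rows):
    """Return every original row whose attribute combination occurs more than once."""
    counts = Counter(rows)
    return [row for row in rows if counts[row] > 1]

duplicate_counts = count_duplicates(rows)
duplicated_rows = keep_duplicated_rows(rows)

print(duplicate_counts)  # {('Jul-15', 'John', '10$'): 2}
print(duplicated_rows)   # [('Jul-15', 'John', '10$'), ('Jul-15', 'John', '10$')]
```

Counting first and filtering the original rows afterwards keeps both views available: the counts per combination and the full duplicated rows.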

     

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi... that was a good puzzle. I would do it this way:

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.2.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_id" compatibility="7.2.002" expanded="true" height="82" name="Generate ID" width="90" x="179" y="136"/>
          <operator activated="true" class="multiply" compatibility="7.2.002" expanded="true" height="103" name="Multiply" width="90" x="313" y="136"/>
          <operator activated="true" class="remove_duplicates" compatibility="7.2.002" expanded="true" height="82" name="Remove Duplicates" width="90" x="514" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Amount|Month|Name"/>
          </operator>
          <operator activated="true" class="set_minus" compatibility="7.2.002" expanded="true" height="82" name="Set Minus" width="90" x="715" y="136"/>
          <connect from_port="input 1" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Set Minus" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
          <connect from_op="Set Minus" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Scott
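For readers who want to trace the logic of the process above (Generate ID, then Remove Duplicates on Amount, Month and Name, then Set Minus), here is a rough Python sketch of the same idea. It rests on two assumptions that are not stated in the thread: that Remove Duplicates keeps the first occurrence of each combination, and that Set Minus matches examples by their ID. The variable names are hypothetical. Under those assumptions the result holds only the later copies of a duplicated row, not the first one.

```python
# Sample rows from the question: (Month, Name, Amount)
rows = [
    ("Jul-15", "John", "10$"),
    ("Aug-15", "Alex", "15$"),
    ("Sep-15", "John", "5$"),
    ("Jul-15", "John", "10$"),
]

# Generate ID: attach a unique, 1-based id to every row
with_ids = list(enumerate(rows, start=1))

# Remove Duplicates (assumed to keep the first row per attribute combination)
seen = set()
deduplicated = []
for row_id, row in with_ids:
    if row not in seen:
        seen.add(row)
        deduplicated.append((row_id, row))

# Set Minus (assumed to match on id): drop every row that survived de-duplication
kept_ids = {row_id for row_id, _ in deduplicated}
removed_as_duplicates = [
    (row_id, row) for row_id, row in with_ids if row_id not in kept_ids
]

print(removed_as_duplicates)  # [(4, ('Jul-15', 'John', '10$'))]
```

If both copies are needed, as in the question, the rows returned here could be used as a lookup against the original table on Month, Name and Amount.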
