filter all duplicate examples

neginzneginz Member Posts: 17 Maven
edited December 2018 in Help

hi

I'm a newbie in RapidMiner. I want to filter all the examples that have a duplicate value. I use the process below, but if a name appears 5 times, the result shows only 4 of them. How can I filter all 5 and still have the other attributes in my result?

 

[Attachments: Capture.PNG, 2.PNG]

 

Best Answer

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Simple, this is just the complement of what I already posted.  Simply change the Filter Examples condition to count>1 rather than =1 and you will get ONLY the duplicates.  I thought you did NOT want the duplicates.
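
    A minimal sketch of that change, assuming the Aggregate step counts "Comment id" grouped by Author, as in the process posted further down this thread (so the count attribute is named count(Comment id)). Only the filter condition differs, with gt in place of eq; the operator class and layout values are copied from that 9.0 process and may differ in other versions:

    ```xml
    <!-- keep only the authors whose comment count is greater than 1 -->
    <operator activated="true" class="filter_examples" compatibility="9.0.000-BETA" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="count(Comment id).gt.1"/>
    </list>
    </operator>
    ```

    Joined back to the original data on Author, this should return every row of each author who appears more than once, with the other attributes intact.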

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @neginz,

     

    Can you share your dataset and your process?

    Can you also explain, with an example, what you get now and what you want to obtain?

     

    Regards,

     

    Lionel

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If I understand you correctly, you want to eliminate any records that have duplicates.  Here's a simple technique I have used to do this in the past.  First, use Aggregate to group by name (or whatever constitutes the unique key that defines a duplicate; note this can be more than one field) and take the count of name, which will give you a count of how many times each name appears.  Filter Examples on that set for any record that has a count equal to one, and then Join (using Inner Join) back to the original dataset.  Presto, you should then have only the records that appeared once!

     

  • neginzneginz Member Posts: 17 Maven

    hi @lionelderkrikor

    My data are customers' comments, and I want to extract the rows whose author has commented more than once. In the process I created, when we have, for example, 2 rows with the same author, the result shows only one of them (when the absolute count in the picture = 2). I think this is because the "Remove Duplicates" operator removes only the duplicate values, not all of the values that have duplicates: one of them always remains and is not removed.

    7.PNG (screenshot of data)

     

    3.png

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="free_memory" compatibility="8.2.001" expanded="true" height="68" name="Free Memory" width="90" x="782" y="646"/>
    <operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve tablet-300-eng-f" width="90" x="112" y="340">
    <parameter key="repository_entry" value="../../data/Digikala-Data/tablet-300-eng-f"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="112" y="34">
    <parameter key="invert_filter" value="true"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Author.contains.guest"/>
    </list>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="34"/>
    <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="313" y="289">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="Author"/>
    <parameter key="attributes" value="Comment id|Author|Content"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="Author"/>
    <parameter key="attributes" value="Comment id|Author|Content"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="remove_duplicates" compatibility="8.2.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="514" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Author"/>
    </operator>
    <operator activated="true" class="set_minus" compatibility="8.2.001" expanded="true" height="82" name="Set Minus" width="90" x="715" y="238"/>
    <connect from_op="Retrieve tablet-300-eng-f" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Minus" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
    <connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
    <connect from_op="Set Minus" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     

  • neginzneginz Member Posts: 17 Maven

    hi @Telcontar120

    Thanks for the help. I tried that before without the joining part, and the result has only 2 attributes: one of them is the count and the other is the attribute being counted. Could you please explain the joining part in more detail?

    6.PNG (result without inner join operator)

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If you post a small data sample, it would be easier to help you.

    Basically you want to take the output you are showing, but filter it for those records that only have a count of 1.

    Then you will use that to join back to the original full dataset that has all the duplicates, but the inner join will only keep the records that have a count of one.
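
    As a rough sketch of that join step, assuming the key attribute on both sides is called Author and that the Join operator exposes a join_type parameter (inner is believed to be its default, so that line is only there to make the choice explicit). The operator class is the one used in the 9.0 process in this thread and may be named differently in other versions:

    ```xml
    <!-- left input: the filtered counts; right input: the original data -->
    <operator activated="true" class="concurrency:join" compatibility="9.0.000-BETA" expanded="true" height="82" name="Join" width="90" x="514" y="85">
    <parameter key="join_type" value="inner"/>
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="Author" value="Author"/>
    </list>
    </operator>
    ```

    Because an inner join keeps only keys present on both sides, any author filtered out of the count table drops out of the joined result as well.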

     

  • neginzneginz Member Posts: 17 Maven

     @Telcontar120

    Sorry, but how can I post Excel data here? I get a file extension error even when I use .rar.

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Just post it as csv or txt

  • neginzneginz Member Posts: 17 Maven

    Thanks, and sorry, @Telcontar120

    This is a small sample of my data. I want my result to have the "Comment id" attribute.

    Sorry for my English.

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Here is a process that does what you describe in your original post.  It removes posts from authors that have more than one comment (i.e., it removes all items included in duplicate sets by author).  You should be able to adapt this to your needs very easily.  The first operator will need to have the path to your data file modified, of course.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="9.0.000-BETA" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
    <parameter key="csv_file" value="C:\Users\brian\Downloads\forum.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="skip_comments" value="true"/>
    <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
    <list key="annotations"/>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Comment id.true.polynominal.attribute"/>
    <parameter key="1" value="Author.true.polynominal.attribute"/>
    <parameter key="2" value="Title.true.polynominal.attribute"/>
    </list>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="9.0.000-BETA" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
    <list key="aggregation_attributes">
    <parameter key="Comment id" value="count"/>
    </list>
    <parameter key="group_by_attributes" value="Author"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="9.0.000-BETA" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="count(Comment id).eq.1"/>
    </list>
    </operator>
    <operator activated="true" class="concurrency:join" compatibility="9.0.000-BETA" expanded="true" height="82" name="Join" width="90" x="514" y="85">
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="Author" value="Author"/>
    </list>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
    <connect from_op="Join" from_port="join" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • neginzneginz Member Posts: 17 Maven

    @Telcontar120 Thanks for your help, but this is not what I wanted. I need a result like the one in the picture below. I hope it is clearer now.

     

    New Microsoft PowerPoint Presentation.jpg

  • neginzneginz Member Posts: 17 Maven

    @Telcontar120

    Yes, it works, thanks a lot for your help. My mistake was that I counted the Author instead of counting the Comment id.

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @Telcontar120,

     

    I will be severe:

    From an Ambassador and beta tester of RM 9, I expect you to accomplish this task with the new "Turbo Prep" tool: it is feasible!

    Dataset : 

    Remove_no_duplicates.png

    Result : 

    Remove_no_duplicates_2.png

     

    ...I'm joking, of course!

     

    Have a nice day and happy experimentations,

     

    Regards,

     

    Lionel

     
