filter all duplicate examples

neginz · July 2018

hi

I'm a newbie in RapidMiner. I want to filter all the examples that have a duplicate value. I use the process below, but if a name appears 5 times, the result shows it 4 times. How can I filter all 5 and still keep the other attributes in my result?

Telcontar120 · July 2018

Simple, this is just the complement of what I already posted. Simply change the Filter Examples condition to count>1 rather than =1 and you will get ONLY the duplicates. I thought you did NOT want the duplicates.
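For anyone who finds it easier to read as code, the count>1 variant can be sketched in plain Python outside RapidMiner. This only illustrates the logic and is not a RapidMiner API: the function name `keep_duplicated` and the sample rows are made up for the example, and the column names `Comment id` and `Author` are borrowed from the data discussed in this thread.

```python
from collections import Counter

def keep_duplicated(rows, key="Author"):
    """Keep every row whose key value appears more than once (count > 1)."""
    # Aggregate step: how many times does each key value occur?
    counts = Counter(row[key] for row in rows)
    # Filter + join step: keep the full rows whose key has a count above one.
    return [row for row in rows if counts[row[key]] > 1]

# Hypothetical sample data, not taken from the original dataset.
rows = [
    {"Comment id": 1, "Author": "ali"},
    {"Comment id": 2, "Author": "sara"},
    {"Comment id": 3, "Author": "ali"},
    {"Comment id": 4, "Author": "ali"},
]

duplicates = keep_duplicated(rows)
print([r["Comment id"] for r in duplicates])
```

All three rows by the repeated author come back, with every attribute intact, and the single-comment author is dropped.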

lionelderkrikor · July 2018

Hi @neginz,

Can you share your dataset and your process?

Can you also explain, with an example, what you get now and what you want to obtain?

Regards,

Lionel

Telcontar120 · July 2018

If I understand you correctly, you want to eliminate any records that have duplicates. Here's a simple technique I have used to do this in the past. First, use Aggregate to group by name (or whatever constitutes the unique key that defines a duplicate, and note this can be more than one field) and compute the count of name, which will give you a count of how many times each name appears. Then use Filter Examples on that set to keep any record that has a count equal to one, and Join (using Inner Join) back to the original dataset. Presto, you should then have only the records that appeared once!
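The three steps can be sketched in plain Python to make the data flow concrete. This is a minimal illustration of the idea, assuming the goal is to keep only records whose key occurs exactly once; `keep_unique`, `key_fields`, and the sample rows are hypothetical names and data, not part of RapidMiner.

```python
from collections import Counter

def keep_unique(rows, key_fields):
    """Drop every row whose key (one or more fields) appears more than once."""
    def key_of(row):
        return tuple(row[f] for f in key_fields)

    # Step 1 (Aggregate): count how often each key occurs.
    counts = Counter(key_of(row) for row in rows)
    # Step 2 (Filter Examples): keep only keys with a count of exactly one.
    unique_keys = {k for k, n in counts.items() if n == 1}
    # Step 3 (inner Join): match the surviving keys back to the full rows,
    # so all other attributes are preserved.
    return [row for row in rows if key_of(row) in unique_keys]

# Hypothetical sample data.
rows = [
    {"Comment id": 1, "Author": "ali", "Title": "good"},
    {"Comment id": 2, "Author": "sara", "Title": "ok"},
    {"Comment id": 3, "Author": "ali", "Title": "bad"},
]

result = keep_unique(rows, ["Author"])
print(result)
```

Passing more than one name in `key_fields` corresponds to grouping by several attributes in Aggregate.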

neginz · July 2018

hi @lionelderkrikor

My data are customer comments, and I want to extract the rows whose author commented more than once. In the process I created, when for example 2 rows have the same author, the result shows only one of them (where the absolute count in the picture is 2). I think it's because of the "Remove Duplicates" operator: it removes only the duplicate values, not all of the values that have duplicates, so one of them always remains and is not removed.

screenshot of data

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="free_memory" compatibility="8.2.001" expanded="true" height="68" name="Free Memory" width="90" x="782" y="646"/>
      <operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve tablet-300-eng-f" width="90" x="112" y="340">
        <parameter key="repository_entry" value="../../data/Digikala-Data/tablet-300-eng-f"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="112" y="34">
        <parameter key="invert_filter" value="true"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Author.contains.guest"/>
        </list>
      </operator>
      <operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="34"/>
      <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="313" y="289">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="Author"/>
        <parameter key="attributes" value="Comment id|Author|Content"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="Author"/>
        <parameter key="attributes" value="Comment id|Author|Content"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="remove_duplicates" compatibility="8.2.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="514" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Author"/>
      </operator>
      <operator activated="true" class="set_minus" compatibility="8.2.001" expanded="true" height="82" name="Set Minus" width="90" x="715" y="238"/>
      <connect from_op="Retrieve tablet-300-eng-f" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Minus" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
      <connect from_op="Set Minus" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
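One way to see why a name that appears 5 times still shows up 4 times with this process: Remove Duplicates keeps one example per Author, and Set Minus then subtracts those kept examples from the full set, so everything except that one row per author survives. The plain-Python sketch below mimics that behaviour under two assumptions: that the first occurrence is the one kept, and that rows are matched by `Comment id`. The function names and data are hypothetical and only stand in for the operators.

```python
def remove_duplicates(rows, key="Author"):
    """Keep one row per key value (assumed here: the first one seen)."""
    seen, kept = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

def set_minus(rows, subtrahend, id_field="Comment id"):
    """Return the rows whose id does not occur in the subtrahend."""
    drop = {row[id_field] for row in subtrahend}
    return [row for row in rows if row[id_field] not in drop]

# Hypothetical data: "ali" appears 5 times, "sara" once.
rows = [{"Comment id": i, "Author": "ali"} for i in range(1, 6)]
rows.append({"Comment id": 6, "Author": "sara"})

left_over = set_minus(rows, remove_duplicates(rows))
print(len(left_over))  # 4: the first "ali" row and the only "sara" row are gone
```

Counting per author and filtering on that count avoids the problem, because no row has to be singled out as "the original".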

neginz · July 2018

hi @Telcontar120

Thanks for the help. I tried that before without the joining part, and the result has only 2 attributes, one of which is a count. Could you please explain the joining part in more detail?

result without inner join operator

Telcontar120 · July 2018

If you post a small data sample, it would be easier to help you.

Basically you want to take the output you are showing, but filter it for those records that only have a count of 1.

Then you will use that to join back to the original full dataset that has all the duplicates, but the inner join will only keep the records that have a count of one.

neginz · July 2018

@Telcontar120

Sorry, but how can I post Excel data here? I get a file extension error even when I use .rar.

Telcontar120 · July 2018

Just post it as csv or txt

neginz · July 2018

Thanks, and sorry @Telcontar120.

Here is a small sample of my data. I want my result to have the "Comment id" attribute.

sorry for my English

Telcontar120 · July 2018

Here is a process that does what you describe in your original post. It removes posts from authors that have more than one comment (i.e., it removes all items included in duplicate sets by author). You should be able to adapt this to your needs very easily. The first operator will of course need the path to your data file modified.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="9.0.000-BETA" expanded="true" height="68" name="Read CSV" width="90" x="45" y="85">
        <parameter key="csv_file" value="C:\Users\brian\Downloads\forum.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="skip_comments" value="true"/>
        <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
        <list key="annotations"/>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Comment id.true.polynominal.attribute"/>
          <parameter key="1" value="Author.true.polynominal.attribute"/>
          <parameter key="2" value="Title.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="9.0.000-BETA" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
        <list key="aggregation_attributes">
          <parameter key="Comment id" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="Author"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.0.000-BETA" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="count(Comment id).eq.1"/>
        </list>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="9.0.000-BETA" expanded="true" height="82" name="Join" width="90" x="514" y="85">
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Author" value="Author"/>
        </list>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Join" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

neginz · July 2018

@Telcontar120 thanks for your help, but this is not what I wanted. I need a result like the one in the picture below. I hope it's clear now.

New Microsoft PowerPoint Presentation.jpg

neginz · July 2018

@Telcontar120

Yes, it works, thanks a lot for your help :smileyvery-happy:. My mistake was that I counted the Author instead of counting the Comment id.

lionelderkrikor · July 2018

Hi @Telcontar120,

I'm going to be strict:

From an Ambassador and beta tester of RM 9, I expect you to accomplish this task with the new "Turbo Prep" tool: it is feasible!

Dataset :

Result :

.....I'm joking of course !!!.....:catwink::catlol:

Have a nice day and happy experimentations,

Regards,

Lionel
