
How to compare each row with all other rows?

HaMu299 Member Posts: 1 Newbie
Hello everyone

I have a very large ExampleSet with more than 100,000 rows and two attributes: an id and a long string. I want to find duplicates by comparing each row with all other rows on the string attribute.
The strings do not have to match exactly to count as duplicates.

My idea is to use Cartesian Product to build a new ExampleSet with the attributes (id1, string1, id2, string2) and then generate a new attribute for the similarity. My problem is that the Cartesian Product operator cannot handle this much data: it displays an error saying the number of rows is limited.

Is there an alternative to using a Cartesian product? Also, what is the best way to measure the similarity between two texts?

Thank you 

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi,
    For exact duplicates: you can just do an inner join with a key on every column; only the duplicates remain.
    Otherwise: probably either Cross Distances or something with fuzzy matching. It depends a bit on how you define duplicates.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @HaMu299,

    You could use Loop Batches on the second table with a small batch size such as 10. Inside the loop, use Cartesian Product on the current batch and the entire large table (Remember/Recall gets the large table into the loop efficiently), then Generate Attributes to apply your match formula and Filter Examples to keep only the matches.

    Loop Batches doesn't have an output, so you could use a database for storing the current results, or CSV files with a counter you're incrementing in the loop, or Recall+Append+Remember for storing the results inside the RapidMiner process.

    Regards,
    Balázs
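
For readers who want to see the batch-wise idea outside RapidMiner, here is a minimal Python sketch of the same loop: split the table into small batches, compare each batch against the full table, and keep only the pairs above a similarity threshold. The function names, the threshold of 0.9, and the use of `difflib.SequenceMatcher` as the "match formula" are assumptions for illustration, not something either answer prescribes.

```python
from difflib import SequenceMatcher
from itertools import islice


def batches(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def find_near_duplicates(rows, threshold=0.9, batch_size=10):
    """Compare each batch against the full table and collect matching id pairs."""
    matches = []
    for batch in batches(rows, batch_size):
        for id1, s1 in batch:
            for id2, s2 in rows:
                # id1 < id2 skips self-comparisons and mirrored pairs
                if id1 < id2 and similarity(s1, s2) >= threshold:
                    matches.append((id1, id2))
    return matches


# Hypothetical sample data: (id, string) pairs
rows = [
    (1, "The quick brown fox jumps over the lazy dog"),
    (2, "The quick brown fox jumped over the lazy dog"),
    (3, "An entirely different sentence about databases"),
]
print(find_near_duplicates(rows))
```

Batching only limits how many pairs are held in memory at once; the total number of comparisons is still quadratic, roughly 5 billion pairs for 100,000 rows, so a cheaper pre-filter (for example, only comparing strings of similar length) may be worth adding.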