How to compare each row with all other rows?
I have a very large ExampleSet (more than 100,000 rows) with two attributes: an id and a long string. I want to find duplicates by comparing each row with all other rows on the string attribute.
The match does not have to be exact for two rows to count as duplicates.
My idea is to use the Cartesian Product operator to build a new ExampleSet with the attributes (id1, string1, id2, string2) and then generate a new attribute for the similarity. The problem is that the Cartesian Product operator cannot handle this much data: it shows an error saying the number of rows is limited.
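To make the scale concrete, here is a minimal Python sketch (outside RapidMiner, so only an illustration of the idea) that scores every unordered pair lazily instead of materializing the full product. The names `similar_pairs` and `pair_count`, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, not something RapidMiner provides.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n rows: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def similar_pairs(rows, threshold=0.9):
    """Yield (id1, id2, score) for each unordered pair scoring >= threshold.

    rows is an iterable of (id, string) tuples. Pairs are generated one at a
    time, so the (id1, string1, id2, string2) table is never held in memory.
    """
    for (id1, s1), (id2, s2) in combinations(rows, 2):
        # autojunk=False disables difflib's heuristic for long sequences,
        # which could otherwise distort scores on long strings.
        score = SequenceMatcher(None, s1, s2, autojunk=False).ratio()
        if score >= threshold:
            yield id1, id2, score


# Hypothetical sample data standing in for the real ExampleSet.
rows = [
    (1, "the quick brown fox"),
    (2, "the quick brown fux"),
    (3, "something else entirely"),
]
result = list(similar_pairs(rows))
```

For 100,000 rows, `pair_count(100_000)` is 4,999,950,000, so even without the memory problem a full pairwise scan is likely to be too slow in practice, which is why some way of narrowing down candidate pairs first is probably needed.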
Is there an alternative to using a Cartesian product? Also, what is the best way to measure the similarity between two texts?
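For illustration, one possible direction is sketched below in Python (again outside RapidMiner): Jaccard similarity on character 3-grams, with an inverted index so that a row is only compared against earlier rows that share at least one 3-gram with it. This is a sketch under assumptions, not a recommended or definitive method; the function names, the shingle size `k=3`, and the 0.8 threshold are all hypothetical choices.

```python
def shingles(text, k=3):
    """Return the set of character k-grams of text (lowercased, whitespace-normalized)."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(rows, threshold=0.8, k=3):
    """Return (earlier_id, later_id, score) for pairs with Jaccard >= threshold.

    rows is an iterable of (id, string) tuples with unique ids.
    """
    index = {}         # shingle -> set of ids of earlier rows containing it
    shingle_sets = {}  # id -> shingle set of that row
    found = []
    for row_id, text in rows:
        current = shingles(text, k)
        # Candidates: earlier rows sharing at least one shingle with this row.
        candidates = set()
        for s in current:
            candidates.update(index.get(s, ()))
        for other in candidates:
            score = jaccard(current, shingle_sets[other])
            if score >= threshold:
                found.append((other, row_id, score))
        # Register this row only after comparing, so each pair is scored once.
        shingle_sets[row_id] = current
        for s in current:
            index.setdefault(s, set()).add(row_id)
    return found


# Hypothetical sample data standing in for the real ExampleSet.
rows = [
    (1, "The quick brown fox jumps over the lazy dog"),
    (2, "the quick brown fox jumps over the lazy dog."),
    (3, "Completely unrelated text about databases"),
]
found = near_duplicates(rows)
```

Very common 3-grams can still make the candidate sets large, so on real data this would probably need further tuning, for example a larger `k` or ignoring the most frequent shingles.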