🎉 🎉. RAPIDMINER 9.8 IS OUT!!! 🎉 🎉

RapidMiner 9.8 continues to innovate in data science collaboration, connectivity and governance

CLICK HERE TO DOWNLOAD

I have two excel i want to find if row number 1 from one excel is present in my second exel or not.

pallavpallav Member Posts: 38 Contributor II
edited February 27 in Help
I have two excel i want to find if row number 1 from first excel is present in my second excel or not. If not completely then by what percent it is matching . some thing like fuzzy  matching in python .

Its is text "outlet name can be name of any outlet , address can be any address city state and zip .  how we can see if that row is present in other excel or not . if not completely then by what percent it is matching





Tagged:
Jasmine_

Best Answer

Answers

  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,957  Community Manager
    hi @pallav if you want a clear "is the row in the other sheet or not", I would simply do an inner join on the two tables on all the attributes and see if there are any examples coming out of the join. If you want a more fuzzy matching solution, I would concatenate all the text attributes into one attribute (using a space delimiter probably) and train a model to predict match/no match after text mining word vectorization.
    mschmitzJasmine_
  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,752  RM Data Scientist
    Hi,
    the other way would be to append the 2 data sets and use Remove Duplicates afterwards. Rows which are in both data sets are returned at the dup port.
    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
    sgenzerJasmine_[Deleted User]
  • yyhuangyyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 283  RM Data Scientist
    edited February 27
    Hi @pallav,

    How about the similarity analytics? You can try the "cross distances" on separate data sets or "data to similarity" if you append two tables together. For strings like address/city/state/outlet name or zipcode, you may want to normalize all upper case letters to lower case and then apply "nominal to numerical" before calculating the cross-distance between the examples. Of course, we can do the nominal measurements. But you have more various formulas with numerical measurement to quantitatively define the difference.

    operator doc
    https://docs.rapidminer.com/latest/studio/operators/modeling/similarities/cross_distances.html
    previous discussions/knowledge base 
    https://community.rapidminer.com/discussion/53879/cross-distance-how-is-it-calculated
    https://community.rapidminer.com/discussion/52140/two-documents-similarity-using-cross-distance

    HTH!
    YY
    Jasmine_
  • pallavpallav Member Posts: 38 Contributor II
    @yyhuang @mschmitz @sgenzer - The problem is the text might not match 100% in that case if i will do cross join remove  duplicate and other technique it may not match .. In that case we have to do somethiing like how we do in python . Either tokenizing  the word and later taking tfid vector of each word find the cosine similarity with other excel after doing same preprocessing , This we can do in python but how can we do same in rapidminer .
    Jasmine_
  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,752  RM Data Scientist
    Hi @pallav,

    i think we got three options here:
    1. Use Process Documents to get TF/IDF, Cross Distances to get the cosine similarity, filter and join the original data.
    2. Cross-Join the data, use Generate Levenshtein Distance
    3. Use the DataBase Envy extension to do a non equal join. (I am not sure if this supports complex joins on fuzzy stuff). @BalazsBarany can you give some feedback on this?

    Since you are also a customer, i would propose we do a quick call to walk you through? What times work better, European or East Coast work hours?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
    yyhuangJasmine_
  • pallavpallav Member Posts: 38 Contributor II
    @mschmitz -I am good with European timing. We can arrange call now.
    Jasmine_
Sign In or Register to comment.