
Fuzzy Match of Strings

prachi138 Member Posts: 6 Contributor II
edited December 2018 in Help

I'm working through a problem in RapidMiner: I need to find approximate matches for the strings of one dataset in another dataset. Is there a way to perform a fuzzy match on a string in RapidMiner? Any help will be appreciated!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist

    Hi,

     

    There is an operator to calculate the Levenshtein distance in the Operator Toolbox extension.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi @prachi138 - yes, I'd recommend trying the Levenshtein Distance operator as a good starting point. If you can post some examples of what you're doing, that will help us give you more guidance.

     

    FYI I'm moving this to the Studio help forum.

     

    SG

     

  • prachi138 Member Posts: 6 Contributor II

    Hi,

     

    Thank you for your prompt reply! I'm trying to use the Levenshtein distance in RapidMiner. However, I see that the second port requires a document input. When I use Data to Documents, it generates an IOObject Collection, which is not accepted as input. Can you let me know exactly which operator I should be using for the 'Process Documents' operator? Thank you once again!

  • kypexin RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @prachi138

     

    Which operator are you talking about?

    The 'Generate Levenshtein Distance' operator has only one input port, which takes an example set. You then need to select the first and second string attributes to compare, and the operator will calculate the distance for every example, adding it as a separate attribute in the output dataset:

     

    [Screenshots: Screenshot 2018-04-14 08.46.47.png, Screenshot 2018-04-14 08.46.34.png]
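For readers who want to see what a per-row Levenshtein comparison computes, here is a minimal sketch in plain Python. It is not the operator's implementation; the attribute names `name_a` and `name_b`, the output name `distance`, and the sample rows are all hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical example set: two string attributes per row.
rows = [
    {"name_a": "kitten", "name_b": "sitting"},
    {"name_a": "RapidMiner", "name_b": "Rapidminer"},
    {"name_a": "Altair", "name_b": "Altair"},
]

# Add the distance as a new attribute on every row.
for row in rows:
    row["distance"] = levenshtein(row["name_a"], row["name_b"])

distances = [row["distance"] for row in rows]
print(distances)
```

A distance of 0 means the two strings are identical; small values relative to the string length suggest a likely fuzzy match.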

     

  • prachi138 Member Posts: 6 Contributor II

    Is it possible to perform a many-to-many comparison across datasets with the fuzzy matching operator? Basically, I have 5 columns from 5 different data sources and want to see where they match. Currently, I'm comparing 2 attributes from 2 different data sources at a time. Looking forward to your input.
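As an illustration of the many-to-many idea raised here, the sketch below compares every value of one source against every value of another and keeps the pairs whose similarity clears a threshold. It is plain Python rather than a RapidMiner process; the function names, the length-normalized similarity score, the 0.5 threshold, and the sample values are all assumptions made for the example. With more than two sources, the same comparison could be repeated for each pair of sources.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance scaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_pairs(left, right, threshold=0.5):
    """Compare every left value with every right value (a cross join)
    and keep pairs whose case-insensitive similarity >= threshold."""
    matches = []
    for x, y in product(left, right):
        score = similarity(x.lower(), y.lower())
        if score >= threshold:
            matches.append((x, y, round(score, 2)))
    return matches


# Hypothetical values from two data sources.
source_a = ["Acme Corp", "Globex", "Initech"]
source_b = ["ACME Corp.", "Globex Inc", "Umbrella"]

matches = fuzzy_pairs(source_a, source_b, threshold=0.5)
for left_value, right_value, score in matches:
    print(left_value, "~", right_value, score)
```

Note that a full cross join grows with the product of the two list sizes, so for large datasets some form of pre-filtering (for example, only comparing values that share a first letter) may be needed.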

  • Robin1992 Member Posts: 5 Contributor I
    Hi Prachi, I have a similar case where I'm trying to compare different documents with each other to extract entities. May I see your final result (screenshot)?