🦉 🎤   RapidMiner Wisdom 2020 - CALL FOR SPEAKERS DEADLINE IS NOVEMBER 15   🦉 🎤


Daitch-Mokotoff Soundex for Word / Name Matching

sgenzer  Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator  Posts: 2,581  Community Manager
edited November 2018 in Knowledge Base

When doing text processing, I often need to match names or words that are not spelled exactly the same.  For example, "SCHWARZ" can be spelled "SCHWARTZ", and "JENNIE" can be spelled "JENNY" or "JENI".  A common technique for matching words that sound alike is a "Soundex" system, which converts each word to a code based on how it sounds.  In the second example, "JENNIE", "JENNY", and "JENI" would all have exactly the same Soundex code.
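
For readers new to the idea, here is a minimal Python sketch of the classic Russell / American Soundex. It is the simpler predecessor of the Daitch-Mokotoff system covered in this article, and it is not the RapidMiner process shared here. The names `soundex` and `CODES` are made up for illustration.

```python
# Classic (Russell / American) Soundex, NOT Daitch-Mokotoff.
# Consonants that sound alike share a digit; vowels and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character classic Soundex code, or "" for empty input."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                    # the first letter is kept as-is
    previous = CODES.get(letters[0], "")   # its digit still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                       # H and W do not separate equal digits
        code = CODES.get(ch, "")           # "" for vowels and Y
        if code and code != previous:
            result += code
        previous = code                    # a vowel resets the repeat check
    return (result + "000")[:4]            # pad or truncate to 4 characters


for name in ("JENNIE", "JENNY", "JENI", "SCHWARZ", "SCHWARTZ"):
    print(name, soundex(name))
```

Under this classic scheme the three "JENNIE" variants all map to J500, while "SCHWARZ" and "SCHWARTZ" come out as S620 and S632. Variations like that second pair are where a more detailed system can make a difference.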

 

Many Soundex systems have been used over the years; my preference is the Daitch-Mokotoff (D-M) Soundex, created by Randy Daitch and Gary Mokotoff in 1985 as a revised version of the Russell / NARA Soundex system developed in 1918.  It is the same Soundex system used to search for names in the famous Ellis Island Immigrant Database.

 

I have written the RapidMiner code needed to take a nominal attribute (which must contain only one word) and output its D-M Soundex code.  It is attached to this KB article as a .buildingblock file; if you use it as-is, you will need to name your attribute "att1".  I hope others find it as useful as I do.  There are probably ways to make this code more efficient, and I welcome contributions at any time.
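
The building block expects exactly one word per value and, as a comment further down points out, the letters need to be uppercase. If you clean your data outside RapidMiner first, a minimal Python sketch of that preparation step might look like the following. `prepare_name` is a hypothetical helper and is not part of the building block or of RapidMiner.

```python
def prepare_name(value: str) -> str:
    """Trim and uppercase a value, raising if it is not exactly one word."""
    words = value.split()  # splits on any run of whitespace
    if len(words) != 1:
        raise ValueError(f"expected exactly one word, got {value!r}")
    return words[0].upper()


print(prepare_name("  Schwartz "))
```

Values containing several words (for example a full name) would need to be split into separate attributes or rows before encoding.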

 

Scott

 


Comments

  • IngoRM  Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor  Posts: 1,682  RM Founder

    Hi Scott, this is REALLY great stuff! I finally found some time to give it a test run. It took me some time to figure out that all letters need to be uppercase but then it worked like a charm. Many thanks for sharing! Happy holidays, Ingo

    RapidMiner Wisdom 2020
    February 11th and 12th 2020 in Boston, MA, USA

  • sgenzer  Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator  Posts: 2,581  Community Manager

    Hello Ingo - Thank you for the nice note, and sorry for not specifying the uppercase requirement!  Glad it worked for you.  Happy holidays to everyone at RapidMiner.  Scott

  • robin  Member  Posts: 100  Guru
    @sgenzer what version of RapidMiner was this created on? It looks very interesting, but the XML structure has definitely changed since 2016.

    Do you have a version of this that would work with RapidMiner 9?
  • robin  Member  Posts: 100  Guru
    @sgenzer not to worry, I found the Generate Phonetic Encoding operator in the Toolbox.
  • sgenzer  Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator  Posts: 2,581  Community Manager
    Yes, exactly, @robin. Not long after I did that, our friend @mschmitz built an operator for it. 🙂 Glad to see you're using it!