Daitch-Mokotoff Soundex for Word / Name Matching

sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
edited November 2018 in Knowledge Base

When doing text processing, I often need to match names or words that are not spelled exactly the same.  For example, "SCHWARZ" can be spelled "SCHWARTZ", and "JENNIE" can be spelled "JENNY" or "JENI".  A common technique for matching words that sound the same is a "Soundex" system, which converts each word to a code based on how it sounds.  In the second example, "JENNIE", "JENNY", and "JENI" would all have the exact same Soundex code.

Many Soundex systems have been used over the years; my preference is the Daitch-Mokotoff Soundex (created by Randy Daitch and Gary Mokotoff in 1985 as a revised version of the Russell / NARA Soundex system developed in 1918).  It is the same Soundex system used to search for names in the famous Ellis Island Immigrant Database.
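
To make the idea of a sound-based code concrete, here is a minimal Python sketch of the older Russell / NARA (American) Soundex mentioned above.  It is not the Daitch-Mokotoff system and it is not the attached RapidMiner code; D-M uses a much larger rule table and six-digit codes.  The function name american_soundex is made up for this illustration.

```python
# Letter groups of the classic American (Russell / NARA) Soundex.
# NOTE: illustrative sketch only; this is NOT the Daitch-Mokotoff rule table.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def american_soundex(word):
    """Return the 4-character American Soundex code for a single word."""
    # Uppercase the input and keep only the letters A-Z.
    word = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if not word:
        return ""
    result = word[0]                 # the first letter is kept as-is
    prev = CODES.get(word[0], "")    # its digit still suppresses a repeat
    for ch in word[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        code = CODES.get(ch, "")     # vowels (and Y) map to ""
        if code and code != prev:
            result += code
        prev = code                  # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]      # pad with zeros, truncate to 4 characters


for name in ("JENNIE", "JENNY", "JENI", "SCHWARZ", "SCHWARTZ"):
    print(name, american_soundex(name))
# JENNIE J500
# JENNY J500
# JENI J500
# SCHWARZ S620
# SCHWARTZ S632
```

Note that this simpler system groups the three JENNIE variants together but gives SCHWARZ and SCHWARTZ different codes, which is the kind of mismatch a more detailed system such as D-M is meant to reduce.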

 

I have written the RapidMiner code needed to take a nominal attribute (each value must be a single word, in uppercase letters) and output its D-M Soundex code.  You can find it attached to this KB article as a .buildingblock file (if you use it as-is, you will need to name your attribute "att1").  I hope others find it as useful as I do.  There are probably ways to make this code more efficient; I welcome contributions any time.

Scott

Comments

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Hi Scott, this is REALLY great stuff! I finally found some time to give it a test run. It took me some time to figure out that all letters need to be uppercase but then it worked like a charm. Many thanks for sharing! Happy holidays, Ingo

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello Ingo - Thank you for the nice note, and sorry for not specifying the uppercase requirement!  Glad it worked for you.  Happy holidays to everyone at RapidMiner.  Scott

  • robin Member Posts: 100 Guru
    @sgenzer what version of RapidMiner was this created on? It looks very interesting, but the XML structure has definitely changed since 2016.

    Do you have this in something that would work for version 9 of RM?
  • robin Member Posts: 100 Guru
    @sgenzer not to worry, I found Generate Phonetic Encoding in the toolbox.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Yes, exactly @robin... not long after I did that, our friend @mschmitz built an operator for it. :smile: Glad to see you're using it!