RapidMiner

11-18-2016 05:00 AM

When doing text processing, I often need to match names or words that may not be spelled exactly the same. For example, "SCHWARZ" can be spelled "SCHWARTZ", and "JENNIE" can be spelled "JENNY" or "JENI". A common technique for matching words that sound alike is a "Soundex" system, which converts each word to a code based on how it sounds. In the second example, "JENNIE", "JENNY", and "JENI" would all have exactly the same Soundex code.
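
To make the idea concrete, here is a minimal Python sketch of the classic Russell / NARA Soundex. It is not the Daitch-Mokotoff variant and not the RapidMiner building block; the function name `soundex` and the lookup table `CODES` are illustrative names chosen for this example.

```python
# Classic (Russell / NARA) Soundex sketch. NOT Daitch-Mokotoff.
# Names below (CODES, soundex) are illustrative, not from the building block.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character classic Soundex code of a single word."""
    # Keep only the letters A-Z, upper-cased.
    word = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if not word:
        return ""
    result = [word[0]]                # the first letter is kept as-is
    prev = CODES.get(word[0], "")     # code of the previous relevant letter
    for ch in word[1:]:
        if ch in "HW":
            continue                  # H and W do not separate equal codes
        code = CODES.get(ch, "")      # vowels and Y have no code
        if code and code != prev:
            result.append(code)
        prev = code                   # a vowel resets prev to ""
    return ("".join(result) + "000")[:4]   # pad or truncate to 4 characters


for name in ("JENNIE", "JENNY", "JENI"):
    print(name, soundex(name))        # each prints the code J500
```

Under this simpler system, "SCHWARZ" and "SCHWARTZ" get different codes (S620 and S632), which hints at why a richer system can be preferable for names like these.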

 

Many Soundex systems have been used over the years; my preference is the Daitch-Mokotoff Soundex (created by Randy Daitch and Gary Mokotoff in 1985 as a revised version of the Russell / NARA Soundex system developed in 1918). It is the same Soundex system used to search for names in the famous Ellis Island Immigrant Database.

 

I have written the RapidMiner code needed to take a nominal attribute (it must contain only one word) and output its D-M Soundex code. You can find it attached to this KB article as a .buildingblock file (if you use it as-is, you will need to name your attribute "att1"). I hope others find it as useful as I do. There are probably ways to make this code more efficient; I welcome contributions any time.
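
If you prepare your data outside RapidMiner, a small check like the following Python sketch may help. It enforces the one-word requirement and upper-cases the value, since the comments on this article note that the letters need to be uppercase. The helper name `prepare_name` is hypothetical and is not part of the building block.

```python
def prepare_name(value: str) -> str:
    """Return the value upper-cased, or raise if it is not exactly one word.

    Hypothetical helper for illustration; not part of the building block.
    """
    cleaned = value.strip()
    if len(cleaned.split()) != 1:
        raise ValueError(f"expected exactly one word, got {value!r}")
    return cleaned.upper()


print(prepare_name("  Schwartz "))
```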

 

Scott

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
Comments
RM Staff

Hi Scott, this is REALLY great stuff! I finally found some time to give it a test run. It took me some time to figure out that all letters need to be uppercase but then it worked like a charm. Many thanks for sharing! Happy holidays, Ingo

Community Manager

Hello Ingo - Thank you for the nice note, and sorry for not specifying that the letters must be uppercase! Glad it worked for you. Happy holidays to everyone at RapidMiner. Scott