Daitch-Mokotoff Soundex for Word / Name Matching
When processing text, I often need to match names or words that are not spelled exactly the same way. For example, "SCHWARZ" may also be spelled "SCHWARTZ", and "JENNIE" may be spelled "JENNY" or "JENI". A common technique for matching words that sound alike is a "soundex" system, which converts each word to a code based on how it is pronounced. In the second example, "JENNIE", "JENNY", and "JENI" would all receive exactly the same Soundex code.
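To make the idea of "a code based on how the word sounds" concrete, here is a minimal Python sketch of the classic American (Russell / NARA) Soundex. It is not the Daitch-Mokotoff variant and not the code in the attached building block; the function name `american_soundex` is made up for this illustration. It keeps the first letter, maps the remaining consonants to digits, collapses adjacent repeats, and pads the result to four characters.

```python
# Illustrative sketch of classic American Soundex (NOT Daitch-Mokotoff).
# The function name and structure are assumptions made for this example.

# Consonant groups that share a digit; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def american_soundex(word: str) -> str:
    """Return the 4-character American Soundex code for a single word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code counts for repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a later repeat is coded again
    # Pad with zeros and truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


for name in ("JENNIE", "JENNY", "JENI", "SCHWARZ", "SCHWARTZ"):
    print(name, american_soundex(name))
```

With this sketch, "JENNIE", "JENNY", and "JENI" all map to J500. "SCHWARZ" and "SCHWARTZ", however, come out as S620 and S632, because this simpler scheme gives the extra "T" its own digit. This shows how much the choice of Soundex system matters.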
Many Soundex systems have been used over the years; my preference is the Daitch-Mokotoff Soundex (created by Randy Daitch and Gary Mokotoff in 1985 as a revised version of the Russell / NARA Soundex system developed in 1918). It is the same Soundex system used to search for names in the famous Ellis Island Immigrant Database.
I have written the RapidMiner code needed to take a nominal attribute (each value must contain only one word) and output its D-M Soundex code. You can find it attached to this KB article as a .buildingblock file; if you use it as-is, you will need to name your attribute "att1". I hope others find it as useful as I do. There are probably ways to make this code more efficient, and I welcome contributions at any time.