"Aggregate attributes despite spelling errors"

mattmatt Member Posts: 5 Contributor I
edited June 2019 in Help

I'm new to data mining and RapidMiner, so any help is appreciated. I have a database with a column {Company Name}, and I need a total company count. The problem is that this attribute contains spelling errors and inconsistent spellings, so simply removing duplicates doesn't work. I have around 15K results, but I'm guessing there are really only about 800 actual companies in my database. I'm trying to avoid cleaning them up manually in the CSV.

 

Example:

ABC Company

ABC Co.

ABC Company Inc.

ABC Company Inc

ABC Company, Inc

ABC

 

 

I'd want the above to be grouped into one group, since they are all the same company. I've only spent a few hours in RapidMiner, but figured I'd ask whether this is even possible before spending more time. Can I build a process that is smart enough to automatically aggregate or group these attribute values, so I get an idea of the total number of companies? It doesn't have to be 100% accurate.


Answers

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,327  RM Data Scientist

    Hi Matt,

     

    The key question is: is there a pattern to exploit? For example, a list of words that should be removed (Inc, Corp, ...)? If yes, you can do this with RapidMiner. Maybe you can even compute a cross distance on the character n-grams to find spelling mistakes.

     

    ~martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
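
Martin's two suggestions (dropping known filler words, then comparing character n-grams) can be sketched outside RapidMiner in plain Python. This is only an illustration, not a RapidMiner process: the `SUFFIXES` list, the function names, and the choice of Jaccard similarity on trigrams are all assumptions made for the example.

```python
import re

# Hypothetical list of legal-form words to drop; extend it for your own data.
SUFFIXES = {"inc", "co", "corp", "company", "ltd", "llc"}

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop legal-form words."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of the text, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets (1.0 means identical sets)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

names = ["ABC Company", "ABC Co.", "ABC Company Inc.",
         "ABC Company Inc", "ABC Company, Inc", "ABC"]
unique = {normalize(n) for n in names}
print(unique)  # all six variants normalize to the same string

# Remaining spelling variants can then be compared by n-gram overlap.
print(jaccard("johnson and son", "johnson and sons"))
```

Names whose normalized forms are identical collapse immediately; the n-gram score is only needed for the variants that survive normalization.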
  • mattmatt Member Posts: 5 Contributor I

    Hi Martin,

    Yes, removing words like Co and Inc would help, and I can do that in Excel, but then I'm still left with spelling variations.

     

    For example:

    Johnson and Son

    Johnson and Sons

    Johnson and Sons Construction

     

    Could examining the first two words in a row be a pattern to exploit? 
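
The "first two words" idea can be tried quickly as a grouping key. The sketch below is plain Python rather than a RapidMiner process, and the function names and the extra sample name "Johnson Controls" are made up for the illustration.

```python
from collections import defaultdict

def first_words_key(name: str, k: int = 2) -> str:
    """Grouping key: the first k whitespace-separated words, lowercased."""
    return " ".join(name.lower().split()[:k])

def group_by_prefix(names, k: int = 2) -> dict:
    """Map each key to the list of original names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[first_words_key(name, k)].append(name)
    return dict(groups)

names = [
    "Johnson and Son",
    "Johnson and Sons",
    "Johnson and Sons Construction",
    "Johnson Controls",  # hypothetical unrelated company
]
groups = group_by_prefix(names)
for key, members in groups.items():
    print(key, "->", members)
```

This key is coarse: it merges the three Johnson variants, but it would also merge genuinely different companies that happen to share their first two words, so it is probably best used as a first pass before a finer comparison.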

     

     

     

  • LaurenPlummerLaurenPlummer Member Posts: 4 Contributor I

    Hi Matt,

    My team has developed an extension to perform text analytics in RapidMiner - the Rosette Text Toolkit, which runs on our Rosette API. Our company has a lot of experience with name ambiguity (or name matching), and our extension has an operator called Match Names that produces a score that represents the likelihood that two names are the 'same'. It is designed to work cross-lingually (like when the name of a person in Russian is written in English, the way it's written in the English alphabet could vary a lot), but it's decent at English-English too.

     

    So, as an example, if you sign up for our free trial and try comparing 'Johnson and Son' and 'Johnson and Sons Construction' in our name-similarity endpoint here, the Rosette API would return a score of ~0.79 -- roughly a 79% chance that those two names are the same. Comparing just 'Johnson' with 'Johnson and Sons Construction' would return ~0.65 -- which you could say is not high enough to conclude that they are the same. In the same way, our RapidMiner Match Names operator takes two names in the same row of the two specified input attributes and returns the similarity score.

     

    We're starting to build a RapidMiner operator that really "de-duplicates" lists of names (versus just shows you a score) -- and we'd love some feedback if you're willing. Pairwise matching is a tricky computational challenge -- which is why my company has larger on-premise solutions for "indexing" names (like building custom databases of unique entities for enterprise customers), which is currently outside of RapidMiner.

     

    Please let me know if you need help or are interested in giving feedback for a name de-duplicating RapidMiner operator.

    Thanks,

    Lauren
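
To show how pairwise similarity scores can be turned into a de-duplicated list, here is a minimal sketch in plain Python. It does not call the Rosette API or the Match Names operator: the score comes from the standard library's `difflib.SequenceMatcher`, which is a stand-in and will not reproduce the scores quoted above. The 0.8 threshold and the greedy "compare against the first member of each cluster" strategy are assumptions chosen for simplicity.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; not the Rosette score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(names, threshold: float = 0.8):
    """Greedy clustering: put each name into the first cluster whose
    representative (its first member) scores at or above the threshold,
    otherwise start a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

clusters = dedupe(["Johnson and Son", "Johnson and Sons", "Acme Widgets"])
print(len(clusters), "estimated distinct companies")
for cluster in clusters:
    print(cluster)
```

Comparing each name only against cluster representatives keeps the number of comparisons well below full pairwise matching, at the cost of results that depend on input order and on the threshold chosen.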

     

  • ahootanhaahootanha Member Posts: 69 Contributor I
    Hello. I have words with repeated letters, such as Heeeellllloooo, which should be the word Hello, and I want to convert them to the correct word. How can I do this, and with which operators, in RapidMiner? Please give me some guidance. Many thanks.
  • asn4293asn4293 Member Posts: 24 Contributor I

    I think you can do it with the Replace operator, if that is what you were asking.
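
A regex-based replacement alone cannot tell whether a repeated letter should end up single or double (Hello needs "ll" but only one "e"), so a small vocabulary check helps. The sketch below is plain Python, not the RapidMiner Replace operator; `VOCAB` is a hypothetical word list and the function names are made up for the example.

```python
import itertools
import re

# Hypothetical vocabulary; in practice this would be a real word list.
VOCAB = {"hello", "good", "so"}

def candidates(word: str) -> list:
    """All spellings where each run of a repeated letter is cut to 1 or 2."""
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", word.lower())]
    options = [[r[0]] if len(r) == 1 else [r[0], r[0] * 2] for r in runs]
    return ["".join(p) for p in itertools.product(*options)]

def normalize_elongated(word: str, vocab=VOCAB) -> str:
    """Return the first candidate found in the vocabulary; otherwise fall
    back to collapsing runs of three or more letters down to two."""
    for cand in candidates(word):
        if cand in vocab:
            return cand
    return re.sub(r"(.)\1{2,}", r"\1\1", word.lower())

print(normalize_elongated("Heeeellllloooo"))
print(normalize_elongated("cooool"))  # not in VOCAB, uses the fallback
```

The fallback pattern `(.)\1{2,}` is the part that maps most directly onto a regular-expression replace; the vocabulary lookup is what resolves the single-versus-double ambiguity.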
