
Combining tokenized n-grammed name variations

ngaved91 Member Posts: 3 Learner III
edited December 2018 in Help

Hello :)

 

I have to count name occurrences in a text file. The problem is that the format of the names varies. For example:

 

a) Bratt (Forename) Pitt (Lastname)

b) Pitt Bratt

c) B. Pitt

d) Pitt B.

e) Pitt

 

The inclusion of the forename is also important in order to distinguish actors with the same lastname.

 

My process looks like this:

1) Main Process: Process Documents from files -> Data to documents -> Process documents

2) Process documents sub process:
    
Tokenize -> generate 2-grams -> replace tokens (to replace the "_" of the 2-grams with a space) -> filter tokens using an example set (an example set of actor names that contains the name variations described above).

 

The problem is that all variations are counted separately. Is there a way to aggregate the results so that the final result shows "Bratt Pitt  Total occurrences: xx"?

 

@sgenzer 

Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can of course use the "replace tokens" operator and enter the variations yourself, but this is very manual.

    The Aylien extension also does entity extraction for names, and that will handle common misspellings, etc.

    You might want to look into the extension from Namsor, they might have something for name grouping. 

    But I have a feeling you are not going to be able to find a solution that is 100% reliable.
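
The manual mapping idea can be sketched outside RapidMiner as a rough illustration of what the aggregation would need to do. The snippet below is only a sketch under assumptions: `build_variant_map` and `count_actors` are hypothetical helper names, the actor list is made up from the example in the question, and the input is assumed to be a list of matched name mentions in which each mention appears once (if both single tokens and 2-grams are kept, a bare last name inside a two-word match would be counted twice). Variants that could belong to more than one actor are left unmapped rather than guessed.

```python
from collections import Counter


def build_variant_map(actors):
    """Map lower-cased name variants to a canonical 'Forename Lastname'.

    `actors` is a hypothetical list of (forename, lastname) pairs.
    A variant shared by several actors is ambiguous and is not mapped.
    """
    owners = {}
    for forename, lastname in actors:
        canonical = f"{forename} {lastname}"
        initial = forename[0] + "."
        forms = {
            f"{forename} {lastname}",  # a) Bratt Pitt
            f"{lastname} {forename}",  # b) Pitt Bratt
            f"{initial} {lastname}",   # c) B. Pitt
            f"{lastname} {initial}",   # d) Pitt B.
            lastname,                  # e) Pitt
        }
        for form in forms:
            owners.setdefault(form.lower(), set()).add(canonical)
    # Keep only variants that point to exactly one actor.
    return {form: next(iter(names))
            for form, names in owners.items() if len(names) == 1}


def count_actors(mentions, variant_map):
    """Sum the counts of all variants under their canonical name."""
    counts = Counter()
    for mention in mentions:
        key = " ".join(mention.lower().split())  # normalise case and spacing
        canonical = variant_map.get(key)
        if canonical is not None:
            counts[canonical] += 1
    return counts


# Made-up example data for illustration only.
actors = [("Bratt", "Pitt"), ("Angelina", "Jolie")]
mentions = ["Bratt Pitt", "Pitt Bratt", "B. Pitt", "Pitt B.", "Pitt", "Jolie A."]
totals = count_actors(mentions, build_variant_map(actors))
for name, total in totals.items():
    print(f"{name}  Total occurrences: {total}")
```

If two actors share a last name, the bare last name (variant e) drops out of the map, so those mentions stay uncounted instead of being assigned to the wrong person. This is one reason such a mapping is unlikely to be fully reliable.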

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts