Combining tokenized n-grammed name variations
I have to count name occurrences in a text file. The problem is that the format of the names varies. For example:
a) Bratt (Forename) Pitt (Lastname)
b) Pitt Bratt
c) B. Pitt
d) Pitt B.
Including the forename is important in order to distinguish actors with the same last name.
My process looks like this:
1) Main Process: Process Documents from files -> Data to documents -> Process documents
2) Process documents subprocess:
Tokenize -> generate 2-grams -> replace tokens (to replace the "_" of the 2-grams with a space) -> Filter tokens using an example set (an example set of actor names that contains the name variations described above).
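
To make the subprocess concrete, here is a rough Python sketch of the same steps (tokenize, build 2-grams joined by a space, keep only the 2-grams found in a name list). It is not RapidMiner code; the function names, the tokenizing regex, and the sample sentence are assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical name list, standing in for the example set of name variations.
NAME_VARIANTS = {"Bratt Pitt", "Pitt Bratt", "B. Pitt", "Pitt B."}

def tokenize(text):
    # Keep a single-letter initial together with its period ("B."),
    # otherwise take plain runs of letters and drop punctuation.
    return re.findall(r"\b[A-Za-z]\.|[A-Za-z]+", text)

def bigrams(tokens):
    # Join neighbouring tokens with a space (instead of "_").
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def count_variants(text, variants=NAME_VARIANTS):
    # Count only the 2-grams that appear in the name list.
    return Counter(g for g in bigrams(tokenize(text)) if g in variants)

# Made-up sample text containing each variation once.
sample = "Bratt Pitt met B. Pitt. Later Pitt Bratt and Pitt B. left."
variant_counts = count_variants(sample)
```

With this sample, each of the four variations ends up with its own count of 1, which is exactly the behaviour described in the question.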
The problem is that all variations are counted separately. Is there a way to aggregate the results so that the final result shows, for example, "Bratt Pitt, total occurrences: xx"?
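
One way to picture the desired aggregation is a lookup that maps every variation to a canonical "Forename Lastname" key, followed by a sum per key. The Python sketch below works under that assumption; it is not RapidMiner code, and the helper names and the counts are made up for illustration.

```python
from collections import Counter

def variants(forename, lastname):
    # The four formats from the question: full name, reversed,
    # initial + last name, last name + initial.
    initial = forename[0] + "."
    return {
        f"{forename} {lastname}",
        f"{lastname} {forename}",
        f"{initial} {lastname}",
        f"{lastname} {initial}",
    }

def build_lookup(actors):
    # Map each variation to its canonical "Forename Lastname" key.
    lookup = {}
    for forename, lastname in actors:
        canonical = f"{forename} {lastname}"
        for v in variants(forename, lastname):
            lookup[v] = canonical
    return lookup

def aggregate(variant_counts, lookup):
    # Sum the per-variation counts under the canonical key.
    totals = Counter()
    for variant, n in variant_counts.items():
        canonical = lookup.get(variant)
        if canonical is not None:
            totals[canonical] += n
    return totals

# Made-up counts, as they might come out of the filtered word list.
counts = {"Bratt Pitt": 3, "Pitt Bratt": 1, "B. Pitt": 2, "Pitt B.": 1}
totals = aggregate(counts, build_lookup([("Bratt", "Pitt")]))
for name, n in totals.items():
    print(f"{name} Total occurrences: {n}")
```

Note that the initial-based formats are ambiguous when two actors share a last name and a first initial; in this sketch the later actor in the list silently wins, so such cases would need separate handling. Inside RapidMiner the same idea would presumably amount to joining the word list with a variation-to-canonical-name table and then grouping by the canonical name, for instance with the Aggregate operator.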