Concatenate words in comments
Hello there!
We are currently writing a research project on microtransactions using natural language processing.
We have an Excel file containing 450,000 comments.
To capture as many comments related to microtransactions as possible, we would like to concatenate some variations of the spelling into a single term, e.g.
Microtransactions = "micro transactions", "micro-transactions", "microtransact" etc...
We would very much like the result to contain all 450,000 comments, but with the words concatenated as explained above.
How do we best achieve this?
Thanks a lot!
Answers
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Have you tried the Levenshtein Distance operator from the Operator Toolbox extension? It could help you find similar strings.
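For readers unfamiliar with the metric, here is a minimal Python sketch of Levenshtein distance, i.e. the number of single-character insertions, deletions, or substitutions needed to turn one string into another. It is only an illustration of the idea, not the code behind the RapidMiner operator, and the function name `levenshtein` is my own choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for variant in ("micro-transactions", "microtransact"):
        print(variant, levenshtein("microtransactions", variant))
```

A small distance (for example 1 for "micro-transactions") suggests a spelling variant, while truncated forms such as "microtransact" score higher, so the cut-off has to be chosen with your own data in mind.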
Suppose you have processed the 450,000 comments with Tokenize inside a text mining operator such as "Process Documents"; you will then get a word list.
Next, convert the word list to data, generate pairs of keywords, and apply the Levenshtein distance to each keyword pair.
For a quick demo I simply lagged the word list to form pairs. In general, though, n keywords give n*(n-1)/2 pairs for the distance calculation. The Data to Similarity operator will help you expand the data into pairwise format quickly.
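Outside RapidMiner, the pairwise step can be sketched in Python roughly as follows. This is an illustration under my own assumptions: the helper names `edit_distance`, `keyword_pairs`, and `close_pairs`, the sample words, and the threshold of 2 are all hypothetical and not taken from the process described above.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via a rolling dynamic-programming row."""
    row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        new_row = [i]
        for j, ch_b in enumerate(b, start=1):
            new_row.append(min(row[j] + 1,                    # deletion
                               new_row[j - 1] + 1,            # insertion
                               row[j - 1] + (ch_a != ch_b)))  # substitution
        row = new_row
    return row[-1]


def keyword_pairs(words):
    """Return every unordered pair of distinct keywords: n*(n-1)/2 pairs."""
    return list(combinations(sorted(set(words)), 2))


def close_pairs(words, max_distance=2):
    """Keep only the pairs whose edit distance is within max_distance."""
    result = []
    for a, b in keyword_pairs(words):
        d = edit_distance(a, b)
        if d <= max_distance:
            result.append((a, b, d))
    return result


if __name__ == "__main__":
    wordlist = ["microtransactions", "micro-transactions",
                "microtransaction", "lootbox"]
    print(len(keyword_pairs(wordlist)))  # 4 keywords give 4*3/2 pairs
    for pair in close_pairs(wordlist):
        print(pair)
```

Note that the number of pairs grows quadratically, so for a word list built from 450,000 comments it is probably worth pruning rare tokens first or only comparing candidates against the few target terms you care about.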