Language filter to retain English only
Best Answers
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
In theory you could tokenize based on spaces, which would give you a set of "words" potentially drawn from multiple languages. You could then use the filter token with dictionary operator to retain only those tokens that appear in a given language dictionary (which you would need to supply as a txt file). This would be a kind of crude language filter using only native RapidMiner operators, but I think the accuracy would not be as high as you would like, because of ambiguous words and because you would still have to decide how to treat mixed-language texts.
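To make the logic of that dictionary approach concrete, here is a minimal sketch in plain Python rather than RapidMiner operators. The word list `ENGLISH_WORDS` and the function name `filter_tokens_by_dictionary` are hypothetical; in practice the dictionary would be loaded from the txt file mentioned above.

```python
import re

# Hypothetical, tiny English word list used only for illustration.
# A real dictionary would be read from a txt file, one word per line.
ENGLISH_WORDS = {"the", "service", "was", "very", "good", "and", "fast"}

def filter_tokens_by_dictionary(text, dictionary):
    """Split on whitespace and keep only tokens found in the dictionary."""
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation and lowercase before the lookup.
        word = re.sub(r"^\W+|\W+$", "", token).lower()
        if word in dictionary:
            kept.append(word)
    return kept

mixed = "The service was very good, aber das Essen war kalt."
print(filter_tokens_by_dictionary(mixed, ENGLISH_WORDS))
```

With this small word list, only the English tokens of the mixed sentence survive. Note that "war" is both a German verb form and an English noun, so a fuller English dictionary would keep it; that is the ambiguity problem described above.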
JamieLim Member Posts: 3 Learner I
I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. The filtered text was then passed into RapidMiner and tokenized. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
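The post does not say which Python library or method was used, so the following is only a rough sketch of the general idea: split a paragraph into sentences, then keep the sentences that look English. The function-word heuristic, the `ENGLISH_STOPWORDS` set, the `threshold` value, and all function names are assumptions made for illustration, not the poster's actual code.

```python
import re

# Small illustrative set of common English function words (an assumption,
# far from a complete list).
ENGLISH_STOPWORDS = {
    "the", "a", "an", "and", "or", "is", "are", "was", "were", "of",
    "to", "in", "on", "it", "this", "that", "for", "with", "i", "we",
}

def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def looks_english(sentence, threshold=0.2):
    """Return True if enough tokens are common English function words."""
    # Runs of letters only (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold

def keep_english(paragraph):
    """Rejoin only the sentences that pass the English heuristic."""
    return " ".join(s for s in split_sentences(paragraph) if looks_english(s))

paragraph = (
    "The room was clean and quiet. Das Zimmer war sehr sauber. "
    "We enjoyed the breakfast. Le petit déjeuner était excellent."
)
print(keep_english(paragraph))
```

Working at sentence level sidesteps much of the single-word ambiguity, but short sentences with few function words can still be misjudged, which matches the experience above of a few stray non-English words surviving.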
Answers
- Manually classify a set of documents and train a ML model to discriminate between them, then apply the model on all new documents.
- Use an external API such as Google Translate or AWS Translate to do this for you
Scott
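As a rough illustration of the first suggestion (hand-label some documents, train a model, apply it to new text), here is a small sketch using a naive Bayes classifier over character trigrams, written with the Python standard library only. The class name `NgramNaiveBayes`, the labels, and the six training sentences are made up for the example; a real model would need far more labelled data and could just as well be built with RapidMiner's own learners.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Lowercased, space-padded character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramNaiveBayes:
    """Multinomial naive Bayes over character trigrams (illustrative only)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> trigram counts
        self.doc_counts = Counter(labels)    # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Log prior plus Laplace-smoothed log likelihood of each trigram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counter[gram] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labelled training set.
texts = [
    "the hotel was very good and the staff were friendly",
    "we had a great time in the city",
    "this is the best place to stay",
    "das Hotel war sehr gut und das Personal war freundlich",
    "wir hatten eine tolle Zeit in der Stadt",
    "dies ist der beste Ort zum Übernachten",
]
labels = ["english", "english", "english", "other", "other", "other"]

model = NgramNaiveBayes().fit(texts, labels)
print(model.predict("the staff were great"))
print(model.predict("das Personal war sehr freundlich"))
```

Character n-grams are used here instead of whole words so that unseen words can still be scored by their spelling patterns.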
or in other words, sometimes there is no quick-and-dirty answer.
Scott