Language filter to retain English only

JamieLim
New Altair Community Member
Best Answers
-
In theory you could tokenize based on spaces, which would give you a set of "words" that could be in multiple languages. You could then use the filter token with dictionary operator to retain only those tokens that appear in a given language dictionary (which you would need to supply as a txt file). This would be a crude language filter using only native RapidMiner operators, but I think the accuracy would not be as high as you would like, both because some words are ambiguous across languages and because of how it would handle mixed-language texts.
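For illustration only, here is a minimal Python sketch of the same idea outside RapidMiner: split on whitespace and keep only tokens found in a word list. The function names, the one-word-per-line file layout, and the tiny sample dictionary are assumptions made for this example; they are not RapidMiner operators or anything from the original thread.

```python
import string


def load_dictionary(path):
    """Read a word list from a txt file (assumed layout: one word per line)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_tokens_by_dictionary(text, dictionary):
    """Keep only whitespace-separated tokens whose cleaned form is in the dictionary."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation and lowercase before the lookup.
        cleaned = token.strip(string.punctuation).lower()
        if cleaned in dictionary:
            kept.append(token)
    return " ".join(kept)


# Tiny hypothetical dictionary; a real one would come from load_dictionary().
english_words = {"the", "cat", "sat", "on", "mat"}
filtered = filter_tokens_by_dictionary(
    "The cat sat on the Matte und schlief.", english_words
)
print(filtered)
```

A word that exists in more than one language (for example "die" in English and German) passes this filter no matter which language the surrounding text is in, which is the ambiguity problem described above.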
-
I ended up using Python to split the paragraphs into sentences and then separated the English sentences from the non-English ones, which gave a pretty good filter. The filtered text is then passed into RapidMiner and tokenized. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
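The post does not say which Python tools were used, so the following is only a rough sketch of the sentence-level approach under stated assumptions: sentences are split with a simple regular expression, and a sentence is treated as English when at least 20% of its words are common English function words. The word list, the threshold, and all function names are hypothetical choices for this example, not the poster's actual code.

```python
import re

# Small hand-picked set of English function words (an assumption for this sketch).
ENGLISH_FUNCTION_WORDS = {
    "the", "a", "an", "and", "is", "are", "was", "were", "of", "to",
    "in", "on", "it", "this", "that", "for", "with", "not", "we", "they",
}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [part.strip() for part in parts if part.strip()]


def looks_english(sentence, threshold=0.2):
    """Heuristic: share of words that are English function words >= threshold."""
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_FUNCTION_WORDS)
    return hits / len(words) >= threshold


def keep_english(paragraph):
    """Return only the sentences that the heuristic classifies as English."""
    return " ".join(s for s in split_sentences(paragraph) if looks_english(s))


sample = (
    "This is a short test of the filter. "
    "Das Wetter heute sehr schön. "
    "We are happy with it."
)
english_only = keep_english(sample)
print(english_only)
```

This heuristic is crude: short sentences and words shared between languages (such as "in") can be misclassified, so a dedicated language-identification library would likely be more robust. Leftover foreign words can still be removed afterwards with a stopwords dictionary, as described above.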
Answers
-
Ah, interesting question. The short answer is "not easily".
In my mind you have two options:
- Manually classify a set of documents and train a ML model to discriminate between them, then apply the model on all new documents.
- Use an external API such as Google Translate or AWS Translate to do this for you
Scott