Language filter to retain English only

JamieLim Member Posts: 3 Newbie
edited July 2020 in Help
I have documents that contain English mixed with other languages. Can I filter them to retain only the English text, without going through every document to identify all the other languages I want to exclude?

Best Answers

  • JamieLim Member Posts: 3 Newbie
    Solution Accepted
    I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. The filtered text was then passed into RapidMiner and tokenized. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
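
The post does not say how the English sentences were identified, so the following is only a minimal sketch of the split-then-filter idea, not the poster's actual script. It assumes a naive regex sentence splitter and a hypothetical hint list of common English function words (`ENGLISH_HINTS`); the function names and the `threshold` value are also assumptions made for illustration.

```python
import re

# Hypothetical hint list: very common English function words.
# A real filter would need a larger list or a proper language detector.
ENGLISH_HINTS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "was",
    "were", "it", "this", "that", "for", "on", "with", "as", "be", "not",
    "i", "you", "we", "they", "have", "has", "can", "will",
}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def looks_english(sentence, threshold=0.2):
    """Heuristic: share of word tokens that appear in ENGLISH_HINTS."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH_HINTS)
    return hits / len(tokens) >= threshold


def keep_english(paragraph):
    """Return only the sentences that pass the English heuristic."""
    return " ".join(s for s in split_sentences(paragraph) if looks_english(s))


text = "The weather is nice today. Das Wetter ist heute schön. We will go to the park."
print(keep_english(text))
```

Short sentences and languages that share function words with English (for example "a" or "on") can slip through a heuristic like this, which fits the observation above that a few non-English words still had to be caught with a stopwords dictionary afterwards.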

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Ah, interesting question. The short answer is "not easily". :smile: In my mind you have two options:

    - Manually classify a set of documents and train an ML model to discriminate between them, then apply the model to all new documents.
    - Use an external API such as Google Translate or AWS Translate to do this for you.

    Scott
  • JamieLim Member Posts: 3 Newbie
    edited July 2020
    @sgenzer What if we just retain the alphanumeric characters and spaces in the text? Is there an easier way to achieve this?
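
For reference, the character filter asked about here could be sketched as follows. This is an illustration only; the function name `keep_ascii_alnum` is hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re


def keep_ascii_alnum(text):
    """Replace every run of characters other than A-Z, a-z, 0-9 and space
    with a single space, then collapse repeated spaces."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
    return re.sub(r" +", " ", cleaned).strip()


print(keep_ascii_alnum("Hello 世界, version 2!"))   # Hello version 2
print(keep_ascii_alnum("Das Wetter ist schön"))     # Das Wetter ist sch n
```

The second call shows the limit of this approach: text in other Latin-script languages passes through almost untouched, with only the accented letters removed, so the filter drops non-Latin scripts but does not isolate English.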
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @JamieLim to quote Euclid of Alexandria:

    There is no royal road to geometry.

    Or, in other words, sometimes there is no quick-and-dirty answer. :smile:

    Scott