Language filter to retain English only

JamieLim Member Posts: 3 Newbie
edited July 28 in Help
I have documents that contain English mixed with other languages. Can I filter them to retain only the English text, without going through every document to identify each language I want to exclude?

Best Answers

  • JamieLim Member Posts: 3 Newbie
    Solution Accepted
    I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. The filtered text was then passed into RapidMiner and tokenized. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
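
The post does not say how the English sentences were identified, so the following is only a minimal sketch of the general idea, using the Python standard library. The word list `ENGLISH_HINTS`, the 0.2 threshold, and the function names are all assumptions made for illustration; a dedicated language-identification library would likely be more reliable on short or mixed sentences.

```python
import re

# Small hand-picked set of common English function words. This list is an
# assumption for illustration; a real filter would need a larger one.
ENGLISH_HINTS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "was",
    "were", "it", "this", "that", "for", "on", "with", "as", "be", "at",
    "by", "not", "have", "has", "i", "you", "we", "they", "can", "will",
}


def split_sentences(paragraph):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def looks_english(sentence, threshold=0.2):
    """Return True if enough tokens are common English function words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH_HINTS)
    return hits / len(tokens) >= threshold


def keep_english(paragraph):
    """Keep only the sentences that pass the heuristic, in original order."""
    return " ".join(s for s in split_sentences(paragraph) if looks_english(s))


sample = (
    "This is a report on the results. "
    "Das Wetter ist heute sehr schön. "
    "We can see that it works."
)
print(keep_english(sample))
# prints: This is a report on the results. We can see that it works.
```

The filtered text can then be written to a file and read into RapidMiner for tokenizing. Sentences with few function words (headings, names, lists) may be dropped by this heuristic, so the threshold would need tuning on real data.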

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,952 Community Manager
    Ah, interesting question. The short answer is "not easily". :smile: In my mind you have two options:

    - Manually classify a set of documents and train a ML model to discriminate between them, then apply the model on all new documents.
    - Use an external API such as Google Translate or AWS Translate to detect each document's language for you.

    Scott
  • JamieLimJamieLim Member Posts: 3 Newbie
    edited July 28
    @sgenzer What if we just retain the alphanumeric characters and spaces in the text? Or is there an easier way to achieve this?
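
For reference, a minimal sketch of what that character-level filter could look like in Python (the function name `keep_ascii_alnum` is hypothetical). It assumes "alphanumeric" means ASCII letters and digits only, which is a narrower reading than the question states.

```python
import re


def keep_ascii_alnum(text):
    """Replace everything except ASCII letters, digits and whitespace
    with a space, then collapse runs of whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(keep_ascii_alnum("Hello, 世界! Das ist gut 123."))
# prints: Hello Das ist gut 123
```

As the output shows, this strips non-Latin scripts and punctuation but leaves words from other Latin-alphabet languages in place, and it also damages accented words that can appear in English text ("café" becomes "caf"). It is a character filter, not a language filter.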
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,952 Community Manager
    @JamieLim to quote Euclid of Alexandria:

    There is no royal road to geometry.

    or in other words, sometimes there is no quick-and-dirty answer. :smile:

    Scott