
Language filter to retain English only

JamieLim Member Posts: 3 Learner I
edited July 2020 in Help
I have documents that contain English mixed with other languages. Can I filter them to retain only the English text, without going through all the documents to identify every other language that I want to exclude?

Best Answers

  • JamieLim Member Posts: 3 Learner I
    Solution Accepted
    I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. I then passed the filtered text into RapidMiner and tokenized it. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
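The post does not say which Python library or method was used, so the following is only a rough sketch of the idea: split each paragraph into sentences, then keep a sentence when enough of its words are common English function words. The word list `ENGLISH_HINTS`, the 0.2 threshold, and all function names are assumptions made for illustration; a dedicated language-identification library would likely be more accurate.

```python
import re

# Hypothetical, deliberately small list of common English function words.
ENGLISH_HINTS = {
    "the", "and", "is", "are", "of", "to", "in", "a", "that", "it", "this",
    "for", "with", "was", "on", "not", "be", "have", "i", "we", "you",
}


def split_sentences(paragraph):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def looks_english(sentence, threshold=0.2):
    """Return True if enough words are common English function words."""
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    if not words:
        return False
    hits = sum(w in ENGLISH_HINTS for w in words)
    return hits / len(words) >= threshold


def keep_english(paragraph):
    """Keep only the sentences that look English and rejoin them."""
    return " ".join(s for s in split_sentences(paragraph) if looks_english(s))


text = ("This is the first sentence. Ceci n'est pas une phrase anglaise. "
        "We have a report for you.")
print(keep_english(text))
# prints: This is the first sentence. We have a report for you.
```

The filtered text can then be loaded into RapidMiner for tokenizing, with any leftover non-English words handled through a stopwords dictionary as described above.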

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Ah, interesting question. The short answer is "not easily". :smile: In my mind you have two options:

    - Manually classify a set of documents as English or non-English and train an ML model to discriminate between them, then apply the model to all new documents.
    - Use an external API such as Google Translate or AWS Translate to detect the language for you.
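As a rough illustration of the first option, here is a minimal character-trigram naive Bayes classifier in plain Python. It is a sketch, not a RapidMiner process: the class name `NgramNaiveBayes`, the labels `"en"` and `"other"`, and the toy training sentences are all hypothetical, and a usable model would need far more hand-labelled documents.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lower-cased, padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Naive Bayes over character n-grams, add-one smoothing, uniform priors."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-grams
        self.totals = {}    # label -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lb: sum(c.values()) for lb, c in self.counts.items()}
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        v = len(self.vocab) + 1  # +1 leaves room for unseen n-grams

        def score(label):
            c, total = self.counts[label], self.totals[label]
            return sum(math.log((c[g] + 1) / (total + v)) for g in grams)

        return max(self.counts, key=score)


# Toy, hand-labelled training set (hypothetical).
train_texts = [
    "the cat sat on the mat and looked at the dog",
    "this is a report about the weather in the city",
    "we have been working on the project for three weeks",
    "le chat est sur le tapis et regarde le chien",
    "das ist ein bericht über das wetter in der stadt",
    "hemos estado trabajando en el proyecto durante tres semanas",
]
train_labels = ["en", "en", "en", "other", "other", "other"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the dog looked at the cat"))
print(model.predict("le chien regarde le chat"))
```

Because the model only learns "English" versus "everything else", the other languages do not have to be identified individually; they just need to be represented among the labelled non-English examples.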

    Scott
  • JamieLim Member Posts: 3 Learner I
    edited July 2020
    @sgenzer What if we just retain the alphanumeric characters and spaces in the text? Is there an easier way to achieve this?
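For reference, the character filter itself is a one-line regular expression in Python. This sketch assumes "alphanumeric" means ASCII letters and digits, and the function name is hypothetical. It removes non-Latin scripts, but text in other languages that use the Latin alphabet passes through almost unchanged, so on its own it is not a language filter.

```python
import re


def keep_alnum_and_space(text):
    """Drop every character that is not an ASCII letter, digit, or space."""
    return re.sub(r"[^A-Za-z0-9 ]+", "", text)


print(repr(keep_alnum_and_space("Hello, 世界!")))
# prints: 'Hello '
print(repr(keep_alnum_and_space("Das ist ein Bericht.")))
# prints: 'Das ist ein Bericht'
```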
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @JamieLim to quote Euclid of Alexandria:

    There is no royal road to geometry.

    or in other words, sometimes there is no quick-and-dirty answer. :smile:

    Scott