Language filter to retain English only

JamieLim
JamieLim New Altair Community Member
edited November 2024 in Community Q&A
I have documents that include English and a mixture of other languages. Can I filter to retain only the English text without going through all the documents to identify every other language that I want to exclude?

Answers

  • sgenzer
    sgenzer
    Altair Employee
    Ah, interesting question. The short answer is "not easily". :smile: In my mind you have two options:

    - Manually classify a set of documents by language and train an ML model to discriminate between them, then apply the model to all new documents.
    - Use an external API such as Google Translate or Amazon Translate (AWS) to do this for you.

    Scott
  • JamieLim
    JamieLim New Altair Community Member
    edited July 2020
    @sgenzer What if we just retain alphanumeric characters and spaces in the text? Is there an easier way to achieve this?
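
A minimal sketch of that idea in Python, assuming "alphanumeric" means ASCII letters and digits only. The function name `keep_ascii_alnum` and the sample string are made up for illustration. Note the limits: this drops non-Latin scripts and punctuation, but words from other Latin-alphabet languages pass straight through, and accented letters are cut out of the middle of words.

```python
import re

# Anything that is not an ASCII letter, an ASCII digit, or whitespace.
_NOT_ASCII_ALNUM = re.compile(r"[^A-Za-z0-9\s]+")
_WHITESPACE_RUN = re.compile(r"\s+")


def keep_ascii_alnum(text: str) -> str:
    """Replace every non-ASCII-alphanumeric run with a space, then tidy spacing."""
    cleaned = _NOT_ASCII_ALNUM.sub(" ", text)
    return _WHITESPACE_RUN.sub(" ", cleaned).strip()


sample = "Hello world 你好 123, bonjour le monde!"
print(keep_ascii_alnum(sample))
# The Chinese characters and punctuation are gone, but the French words remain.
```

So a character filter is easy to write, but on its own it is a script filter rather than a language filter.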
  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓
    In theory you could tokenize on spaces, which would give you a set of "words" that could be in multiple languages. You could then use the filter token with dictionary operator to retain only those tokens that appear in a given language dictionary (which you would need to supply as a txt file). This would be a kind of crude language filter using only native RapidMiner operators, but I think the accuracy would not be as high as you would like, because of ambiguous words and because of how mixed-language texts would have to be treated.
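
The same tokenize-then-filter logic can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the idea, not the RapidMiner operator itself; `load_dictionary`, `filter_tokens`, and the six-word dictionary are hypothetical stand-ins for a real word list read from a txt file.

```python
import string


def load_dictionary(lines):
    """Build a lowercase word set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def filter_tokens(text, dictionary):
    """Split on whitespace and keep only tokens whose word is in the dictionary."""
    kept = []
    for token in text.split():
        # Compare without surrounding punctuation or case, but keep the token as written.
        word = token.strip(string.punctuation).lower()
        if word in dictionary:
            kept.append(token)
    return " ".join(kept)


# Toy dictionary; in practice this would come from open("english_words.txt").
english_words = load_dictionary(["the", "cat", "sat", "on", "mat", "chat"])

mixed = "The cat sat on the mat. Le chat est sur le tapis."
print(filter_tokens(mixed, english_words))
```

The French "chat" survives because it is also an English word, which is exactly the ambiguity problem mentioned above.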
  • sgenzer
    sgenzer
    Altair Employee
    @JamieLim to quote Euclid of Alexandria:

    There is no royal road to geometry.

    or in other words, sometimes there is no quick-and-dirty answer. :smile:

    Scott
  • JamieLim
    JamieLim New Altair Community Member
    Answer ✓
    I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. The filtered text was then passed into RapidMiner and tokenized. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
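
For anyone wanting to try the same approach, here is a rough sketch of the sentence-level step. The post does not say which language detector was used, so this substitutes a deliberately crude heuristic (the share of common English function words in each sentence). `ENGLISH_HINTS`, `looks_english`, and the `0.2` threshold are assumptions for illustration; a dedicated language-identification library would likely do much better.

```python
import re

# Tiny illustrative set of common English function words (not a real lexicon).
ENGLISH_HINTS = {
    "the", "a", "an", "is", "are", "was", "of", "and",
    "to", "in", "on", "it", "this", "that", "for", "with",
}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def looks_english(sentence, threshold=0.2):
    """Guess English when enough of the words are common English function words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_HINTS)
    return hits / len(words) >= threshold


def keep_english(paragraph):
    """Keep only the sentences that the heuristic labels as English."""
    return " ".join(s for s in split_sentences(paragraph) if looks_english(s))


text = "This is the first sentence. Ceci est une phrase en français. The report is attached."
print(keep_english(text))
```

Filtering at sentence level sidesteps most of the single-word ambiguity, and the handful of stray foreign words that remain can then be handled with a stopwords dictionary as described.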
