
"[SOLVED] Tokenize - Generate n-grams and Filters"

MarcosRL Member Posts: 53 Contributor II
edited June 2019 in Help
Hello, friends of the community. A question:
I need to perform the following procedure:
1) Read a text document
2) Tokenize
3) Generate compound words (n-grams)
4) Delete all compound words that are not in a given list.
I was able to tokenize and generate the compound words. With the "Text: Filter Tokens (by Content)" operator, I entered one compound word in the "string" parameter and it filtered correctly.
The problem is that I do not know how to add more than one word, so that I can filter on several compounds.

Thank you very much in advance.
Regards
Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hey,

    you marked this topic as solved - the community would be grateful if you posted your solution :)

    TIA

    ~Marius
  • MarcosRL Member Posts: 53 Contributor II
    It's a secret, I can't tell  :D  ;D

    The solution was to use the "Filter Tokens (by Content)" operator with the "condition" parameter set to "matches" and a "regular expression" containing all the words you want to filter, in the following format:
    word1|word2|wordN
    That is, separate the words with the alternation operator "|", without spaces.
    It took me four hours to find this solution.
    Regards
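
For anyone who wants to check such an expression outside RapidMiner, here is a minimal Python sketch of the same idea. It is not RapidMiner code: the function names are made up for illustration, the n-grams are assumed to be joined with an underscore, and "matches" is assumed to require the whole token to match (which is what `fullmatch` does here).

```python
import re

def ngrams(tokens, n=2, sep="_"):
    # Join each run of n consecutive tokens into one compound token.
    # The "_" separator is an assumption for this sketch.
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_filter(allowed):
    # Escape every term, then join them with "|" and no spaces.
    return re.compile("|".join(re.escape(term) for term in allowed))

def keep_matching(tokens, pattern):
    # Keep only tokens that the pattern matches in full.
    return [t for t in tokens if pattern.fullmatch(t)]

text = "text mining with rapid miner makes text mining easy"
tokens = text.lower().split()          # very simple whitespace tokenizer
bigrams = ngrams(tokens, n=2)

pattern = build_filter(["text_mining", "rapid_miner"])
kept = keep_matching(bigrams, pattern)
print(kept)

# Spaces around "|" become part of each alternative, so nothing matches in full.
spaced = re.compile("text_mining | rapid_miner")
print(keep_matching(bigrams, spaced))
```

The second pattern shows why the spaces matter: with "word1 | word2" each alternative includes a literal space, so a token such as text_mining no longer matches in full.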
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Thank you!