[SOLVED] The approach for filtering non-letter tokens

huaiyanggongzi Member Posts: 39 Contributor II
edited November 2018 in Help
In RapidMiner, I use the Tokenize operator to process a lot of documents. Some of these documents contain many non-letter characters, such as digits, %, $, and other symbols. Is there an operator that lets me filter out these tokens? Thanks.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869   Unicorn
    Hi,

    First of all, you have to configure the Tokenize operator to use a splitting pattern appropriate to your problem. By default it splits at "non-letters"; you could change it to split at all space characters instead, for example.

    Then, to filter, you can use the Filter Tokens operator with a customized pattern.

    If you have problems with the regular expressions, please post again.

    Happy Mining!
    ~Marius
  • huaiyanggongzi Member Posts: 39 Contributor II
    Marius, thanks.
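
For readers who want to see the two steps from Marius's answer in isolation (split on space characters, then filter tokens by a pattern), here is a minimal sketch in plain Python. It does not use RapidMiner or any of its APIs; the function names, the sample sentence, and the choice of `[A-Za-z]+` as the "letters only" pattern are assumptions made for illustration. That pattern should also be valid in Java-style regular expressions, but the exact operator parameters to put it in are not given in the thread.

```python
import re

# Assumption: a token is worth keeping only if it consists entirely of
# ASCII letters. Adjust the pattern if accented letters should count too.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def tokenize_on_whitespace(text):
    """Split on runs of whitespace (the 'split by space characters' step)."""
    return text.split()


def keep_letter_tokens(tokens):
    """Keep tokens that match the pattern in full (the filtering step)."""
    return [t for t in tokens if LETTERS_ONLY.fullmatch(t)]


# Hypothetical sample text containing digits, %, and $.
sample = "Revenue grew 12 % to $ 3.4 million in Q3"
tokens = tokenize_on_whitespace(sample)
filtered = keep_letter_tokens(tokens)
print(filtered)  # ['Revenue', 'grew', 'to', 'million', 'in']
```

Note that mixed tokens such as "Q3" are dropped as well, because the pattern must match the whole token. If such tokens should survive, a looser pattern (for example, one that only requires at least one letter) would be needed.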