"tokenize and keep words with dash"

johannesweber Member Posts: 1 Contributor I
edited June 2019 in Help
Hello,

Is there any way to tokenize into single words without splitting words that contain a dash?

For example, I want to keep the word "state-of-the-art" instead of having four words afterwards.

I saw the option to change the operator's mode to "specific characters"; however, I don't understand the syntax required.

I would much appreciate an answer.

Best regards

Johannes

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    The "specific characters" mode is fine: just list the characters that mark word boundaries, e.g. dot, comma, space, question mark: "!? ,.". Leave the dash out of the list so that hyphenated words stay in one piece. Check the results carefully so that you don't miss any important delimiters :)

    Best regards,
    Marius
  • HelenZ Member Posts: 3 Contributor I
    This is a really good suggestion and very helpful. I tried using "." to tokenize my document, but now a sentence containing, e.g., the word "u.s." is split right in the middle because "u.s." contains a dot. Likewise, a sentence containing the number "1.3%" is split.

    So is there a way to define exceptions in the "specific characters" mode, and if so, what regex term do I use? Or do I have to add another operator?


    Thank you for your great help. This is very much appreciated.


    Helen
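
To make the two behaviours discussed in this thread concrete, here is a minimal sketch in plain Python. It is not RapidMiner's Tokenize operator; the function names are hypothetical and only illustrate the idea. The first function splits on the listed characters "!? ,." (no dash, so "state-of-the-art" survives). The second treats a dot as a delimiter only when it is not directly followed by a letter or digit, which is one possible way to keep "u.s" and "1.3%" together. Whether the same pattern carries over unchanged to a regular-expression mode in RapidMiner is an assumption to verify there.

```python
import re

# Delimiter list quoted in the answer above; the dash is deliberately absent.
SPECIFIC_CHARACTERS = "!? ,."


def tokenize_specific_characters(text, delimiters=SPECIFIC_CHARACTERS):
    """Split on any run of the given delimiter characters (illustrative helper)."""
    pattern = "[" + re.escape(delimiters) + "]+"
    # re.split can yield empty strings at the edges; drop them.
    return [token for token in re.split(pattern, text) if token]


def tokenize_with_dot_exceptions(text):
    """Split like above, but keep dots that sit inside a word or number.

    A dot only counts as a delimiter when the next character is NOT a
    letter, digit, or underscore (negative lookahead). Illustrative helper.
    """
    pattern = r"(?:[!?,\s]|\.(?!\w))+"
    return [token for token in re.split(pattern, text) if token]


if __name__ == "__main__":
    print(tokenize_specific_characters("This is state-of-the-art, really!"))
    print(tokenize_specific_characters("The u.s. grew 1.3%."))
    print(tokenize_with_dot_exceptions("The u.s. grew 1.3%."))
```

With the second function, the trailing dot of "u.s." is still consumed as a delimiter, so the token comes out as "u.s". A sentence that ends without a space after the dot (e.g. "end.Next") would not be split, so the results should be checked on real data.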