Tokenize operator issue - help request

amitd · February 2022

I have to process some documents in which a double exclamation mark (!!) that follows a word should stay attached to that word as a single token (e.g., sentence!! as one token, not 'sentence' and '!!' separately). Similarly, the smiley : ) should be a separate token. When I use the non-letters mode in Tokenize, the words are extracted, but not the way I would like. When I set mode = regular expression with the expression [a-zA-Z!:)]+, it does not work at all. I tested the regular expression in the expression builder, and it matches as expected when each document text is tested in its preview. However, the output of the process ends up blank, and I have no clue why. I have attached the two processes. Can someone please help?

The expected output would be (counts not shown):
: ) (I have added a space between the colon and the parenthesis; otherwise the editor converts it to a smiley emoji)

a
all
another
here
is
last
new
of
sentence
sentence!!
sentences
this
yet

amitd · February 2022

I figured out the issue. In regular expression mode, the expression has to match the characters that separate tokens, not the tokens we want to keep. So the regular expression should be [ .,]+, and then it works fine.
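
For anyone who wants to see the difference outside RapidMiner, here is a minimal Python sketch of the same idea. It is only an analogy: it assumes the regular expression mode behaves like a split on the pattern (as described above), and the sample text is hypothetical because the original documents are not shown in this thread.

```python
import re

# Hypothetical sample text; not taken from the attached processes.
text = "This is a new sentence!! Yet another sentence, here :)"

# Separator-style pattern from the post: split on runs of space, period, comma.
# Empty strings (possible at the start or end of the text) are dropped.
tokens = [t.lower() for t in re.split(r"[ .,]+", text) if t]
print(tokens)

# Using the "keep" pattern as the split pattern removes exactly the characters
# we wanted, so only separator fragments (spaces, commas) are left over.
leftovers = [p for p in re.split(r"[a-zA-Z!:)]+", text) if p]
print(leftovers)
```

With the separator pattern, sentence!! and :) survive as whole tokens; with the "keep" pattern used for splitting, no letters remain at all, which is consistent with the blank output described in the question.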

Altair RapidMiner Community
