Tokenize operator issue - help request

amitdamitd Member, University Professor Posts: 49 Maven
edited February 2022 in Help
I have to process some documents in which a double exclamation mark (!!) that follows a word should stay attached to that word as a single token (e.g., sentence!! as one token, not 'sentence' and '!!' separately). Similarly, the smiley : ) is expected to be a separate token. When I use the non letters mode in Tokenize, the words are extracted, but not in the way I would like. When I set mode = regular expression with the expression [a-zA-Z!:)]+, it does not work at all. I tested the regular expression in the expression builder, and it works when each document's text is tested in its preview. However, the output of the process ends up blank, and I have no idea why. I have attached the two processes. Can someone please help?

The expected output would be (counts not shown):
: ) (I have added a space between the colon and the parenthesis because otherwise the editor converts it to a smiley emoji.)
a
all
another
here
is
last
new
of
sentence
sentence!! 
sentences
this
yet


Best Answer

  • amitdamitd Member, University Professor Posts: 49 Maven
    Solution Accepted
    I figured out the issue. In regular expression mode, the expression has to match the separators between tokens, not the tokens we want to keep. So the regular expression should be [ .,]+ and then it works fine.
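
For anyone who wants to see the difference outside RapidMiner, here is a minimal sketch using Python's re module as a stand-in for the Tokenize operator. It assumes the operator behaves like a split on the expression, as described in the answer above; the sample text and the tokenize helper are made up for illustration, and the smiley is written as :) without the space.

```python
import re

# Separator pattern from the accepted answer: runs of spaces, periods, commas.
SEPARATORS = re.compile(r"[ .,]+")

# The pattern from the question, which describes the tokens themselves.
TOKEN_CHARS = re.compile(r"[a-zA-Z!:)]+")


def tokenize(text):
    """Split text on separator runs and drop any empty strings."""
    return [tok for tok in SEPARATORS.split(text) if tok]


# Hypothetical sample document, not taken from the attached processes.
sample = "This is a sentence. Here is another sentence!! :)"

tokens = tokenize(sample)
print(tokens)

# Splitting on the token pattern instead leaves only the pieces between
# the words, i.e. whitespace and punctuation.
leftovers = TOKEN_CHARS.split(sample)
print(leftovers)
```

With the separator pattern, sentence!! and :) survive as tokens. With the token pattern used as the splitter, only spaces and a period are left over, which is probably why the process output looked blank.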