Tokenize operator issue - help request
I have to process some documents where a double exclamation mark (!!) attached to the end of a word should stay part of that word's token (e.g., sentence!! as a single token, not 'sentence' and '!!' separately). Similarly, the smiley character : ) is expected to be a separate token. When I use the non-letters mode in Tokenize, the words are extracted, but not the way I would like. When I use mode = regular expression with the expression [a-zA-Z!:)]+, it does not work at all. I tested the regular expression in the expression builder, and it matches correctly when each document's text is tested in the preview. However, the output of the process ends up blank, and I have no clue why. I have attached the two processes. Can someone please help?
The expected output would be (counts not shown).
: ) (I have added a space between the colon and the parenthesis; otherwise the editor converts it to a smiley emoji)
a
all
another
here
is
last
new
of
sentence
sentence!!
sentences
this
yet
Best Answer
amitd (Member, University Professor)
I figured out the issue. In regular expression mode, the expression has to describe the characters that separate tokens, not the characters we expect to keep. So the regular expression should be [ .,]+ and then it works fine.
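
For anyone who hits the same problem, the difference between "match what to keep" and "match what to split on" can be reproduced outside RapidMiner. The sketch below uses Python's re module as an analogy only; it is not how the Tokenize operator is implemented, and the sample text and the helper name tokenize_by_delimiter are made up for illustration. It suggests why the original expression could leave nothing useful behind: when [a-zA-Z!:)]+ is treated as the separator, every word is consumed and only spaces and punctuation remain.

```python
import re

# Hypothetical sample text; the original attached documents are not available.
text = "This is a sentence!! Here is another sentence, yet another :) last of all."


def tokenize_by_delimiter(text, delimiter_pattern):
    """Split text on the delimiter pattern and drop empty fragments."""
    return [tok for tok in re.split(delimiter_pattern, text) if tok]


# Pattern from the accepted answer: it describes the separators.
kept = tokenize_by_delimiter(text, r"[ .,]+")

# Pattern from the question, used as a separator: the words themselves
# are consumed, so only whitespace and punctuation fragments survive.
lost = tokenize_by_delimiter(text, r"[a-zA-Z!:)]+")

# The same pattern used as a matcher (as in the expression builder preview)
# finds the wanted tokens, which is why the preview looked correct.
matched = re.findall(r"[a-zA-Z!:)]+", text)

print(kept)
print(lost)
print(matched)
```

In this sketch, kept and matched contain the same tokens, including sentence!! and :), while lost contains no letters at all. That is consistent with the blank output described in the question, assuming the operator discards such fragments.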