Tokenization vs N-grams
HeikoeWin786
Member Posts: 64 Contributor I
in Help
Hello guys,
I am doing sentiment analysis in RapidMiner. While building the word vector, I see two approaches: tokenization (by non-letters) and generating n-grams. I am not sure of the main difference between these two operators and their best use cases. Can someone explain how the two work differently in RapidMiner? For sentiment analysis, which approach would you suggest: tokenization or n-grams?
Thanks and regards,
Heikoe
Best Answer
kayman Member Posts: 662 Unicorn
N-grams are successive tokens (words, in this case), so the two are related. Using n-grams never hurts an NLP workflow, so use them if your workflow can handle it. That way you have both your single tokens (words) and the n-grams available for training.
Bi-grams will do fine for sentiment; anything longer doesn't typically give much added value.
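To make the difference concrete, here is a minimal Python sketch (not the RapidMiner operators themselves) of what tokenizing on non-letters and then generating bi-grams produces; the example sentence and the underscore joining convention are assumptions for illustration only.

import re

text = "the movie was not good at all"

# Tokenize on non-letter characters (roughly what Tokenize in "non letters" mode does)
tokens = [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]
# ['the', 'movie', 'was', 'not', 'good', 'at', 'all']

# Generate bi-grams: each pair of successive tokens joined into one term
bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
# ['the_movie', 'movie_was', 'was_not', 'not_good', 'good_at', 'at_all']

# The word vector can then be built over both single tokens and bi-grams
features = tokens + bigrams
print(features)

Note how the bi-gram not_good keeps the negation attached to the word it modifies, which is exactly the kind of sentiment cue that plain single-token features lose.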
Answers
Thanks for your clarification here.
In other words, we use bi-grams as part of the data pre-processing, i.e. inside the Process Documents from Data operator we put Generate n-Grams together with Tokenize, Stem (Porter), etc.?
Thanks and regards,
Heikoe
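For what it's worth, here is a rough Python sketch of the pre-processing order described in the question above (tokenize, then stem, then generate n-grams), standing in conceptually for the operators nested inside Process Documents from Data; the example sentence and the use of NLTK's Porter stemmer are illustrative assumptions, not part of the original thread.

import re
from nltk.stem import PorterStemmer  # illustrative stand-in for the Stem (Porter) operator

stemmer = PorterStemmer()
text = "The acting was not convincing"

# 1. Tokenize (non letters)
tokens = [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]

# 2. Stem (Porter) each token
stems = [stemmer.stem(t) for t in tokens]

# 3. Generate n-grams (max length 2): keep the single stems and add bi-grams
bigrams = ["_".join(pair) for pair in zip(stems, stems[1:])]
features = stems + bigrams
print(features)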