
Tokenization vs N-grams

HeikoeWin786 Member Posts: 64 Contributor I
Hello guys,

I am doing sentiment analysis in RapidMiner. While building the word vector, I found that there are two approaches: tokenization (by non-letters) and generating n-grams. I am not sure what the main difference between these two operators is, or what their best use cases are. Can someone explain how the two work differently in RapidMiner? For sentiment analysis, which approach would you suggest: tokenization or n-grams?

Thanks and regards,
Heikoe

Best Answer

  • kayman Member Posts: 662 Unicorn
    Solution Accepted
    n-grams are sequences of successive tokens (words, in this case), so the two are related rather than alternatives: you tokenize first, then build n-grams from the tokens. Using n-grams never hurts an NLP workflow, so just use them if your workflow can handle it. You then have both your single tokens (words) and the n-grams available for training.

    Bi-grams will do fine for sentiment; anything longer typically doesn't add much value.
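
    The relationship described above can be illustrated outside RapidMiner with a minimal Python sketch. It is not RapidMiner's implementation: the function names `tokenize_non_letters` and `generate_ngrams` are hypothetical, the lowercasing step is an added assumption, and joining n-gram words with an underscore is an assumed convention chosen for readability.

```python
import re

def tokenize_non_letters(text):
    """Split text on any run of non-letter characters (ASCII letters only in this sketch)."""
    # Lowercasing is an assumption added here, not part of tokenization itself.
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]

def generate_ngrams(tokens, max_n=2):
    """Return the single tokens plus all n-grams of successive tokens up to max_n."""
    terms = list(tokens)  # keep the unigrams
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # The underscore joiner is an assumed convention for this sketch.
            terms.append("_".join(tokens[i:i + n]))
    return terms

tokens = tokenize_non_letters("The movie was not good")
terms = generate_ngrams(tokens, max_n=2)
print(tokens)
print(terms)
```

    With bi-grams, a term such as `not_good` becomes available as a feature alongside `not` and `good`, which is the kind of information single tokens alone cannot carry for sentiment.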

Answers

  • HeikoeWin786 Member Posts: 64 Contributor I
    @kayman

    Thanks for the clarification.
    So, in other words, we use bi-grams as part of data pre-processing?
    I.e., inside the process document to data operator, we add bi-grams as a pre-processing step together with tokenize, stem (Porter), etc.?

    Thanks and regards,
    Heikoe