
Generate n-Grams (Terms) operator

Fred12 Member Posts: 344 Unicorn
edited November 2018 in Help

hi,

I would like to know how the n-grams are generated. I noticed that some words are grouped together as n-grams (terms) while others stay as single words. How does the operator decide which terms to group together and which not? Many of the most frequently occurring terms have no n-gram groupings...

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    This is how n-grams work if you set it to 2: the operator makes the following combinations from the sentence "RapidMiner Studio is the best."

     

    RapidMiner_Studio

    Studio_is

    is_the

    the_best

     

    Assuming your corpus of documents is about RapidMiner Studio reviews and you have TF-IDF set as your word vector creation, it will likely give "is_the" a very low value and "RapidMiner_Studio" and "the_best" higher values. Of course, if you have stemming, filtering, and pruning set, it might just drop "is_the" out completely, and that's probably what's happening with your process.

  • Fred12Fred12 Member Posts: 344 Unicorn

    Well, inside the Process Documents operator I had the Tokenize, Stemming, Stopwords and n-gram operators, so this might have been the cause...
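
The effect discussed in this thread can be reproduced with a minimal Python sketch. It is not RapidMiner's implementation: the helper names `tokenize`, `remove_stopwords` and `bigrams`, and the two-word stopword list, are assumptions made only for illustration. Stemming and pruning are left out. The sketch pairs adjacent tokens the way the answer above describes, then shows what changes when stopwords are filtered before the n-gram step.

```python
import re

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"is", "the"}

def tokenize(text):
    """Split on runs of non-letter characters and drop empty pieces."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def bigrams(tokens):
    """Join every pair of adjacent tokens with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

tokens = tokenize("RapidMiner Studio is the best.")

# n-grams built directly on the tokens
plain = bigrams(tokens)

# n-grams built after stopword filtering, as in the process described above
filtered = bigrams(remove_stopwords(tokens))

print(plain)
print(filtered)
```

In this sketch, once "is" and "the" are removed before the n-gram step, "is_the" can never be formed, and "Studio" and "best" become neighbours. That is one plausible reason why frequent words end up without the n-gram groupings one would expect from the raw text.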
