Operator Generate n-Grams (Terms)

Fred12 Member Posts: 344 Unicorn
edited November 2018 in Help

hi,

I would like to know how the n-grams are generated. I noticed that some words are grouped together as n-grams (terms), while others are not and stay as single words. How does the operator decide which terms to group together and which not? Many of the most frequently occurring terms have no n-gram groupings...

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    The way n-grams work is like this: if you set the n-gram length to 2, it will make the following combinations from the sentence "RapidMiner Studio is the best."

     

    RapidMiner_Studio
    Studio_is
    is_the
    the_best

     

    Assuming your corpus of documents is about RapidMiner Studio reviews and you have TF-IDF set as your word vector creation, it will likely give "is_the" a very low value and give "RapidMiner_Studio" and "the_best" higher values. Of course, if you have stemming, filtering, and pruning set, it might drop "is_the" out completely, and that's probably what's happening with your process.
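    To make the idea concrete, here is a minimal Python sketch, not RapidMiner's own implementation. It builds the underscore-joined bigrams listed above and scores them with a textbook TF-IDF formula (term frequency times log of inverse document frequency). The three-review corpus and the function names `bigrams` and `tf_idf` are made up for illustration, and RapidMiner's exact weighting and normalization may differ.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Join each pair of neighbouring tokens with an underscore."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

# Hypothetical mini corpus of reviews (assumption, not from the thread).
reviews = [
    "RapidMiner Studio is the best",
    "this tool is the worst",
    "setup is the hard part",
]
docs = [bigrams(text.split()) for text in reviews]
scores = tf_idf(docs)

print(docs[0])
for term, score in scores[0].items():
    print(f"{term}: {score:.3f}")
```

    Because "is_the" occurs in every document of this toy corpus, its inverse document frequency is log(1) = 0, so its score is zero, while the bigrams that occur in only one review get a positive score.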

  • Fred12 Member Posts: 344 Unicorn

    Well, inside the Process Documents operator I had the tokenize, stemming, stopword filter, and n-gram operators, so this might have been the cause...
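    The effect of operator order can be sketched in a few lines of Python. This is an illustration under assumptions, not RapidMiner code: the stopword list is a hypothetical two-word set, stemming is left out, and the helper names `tokenize` and `bigrams` are invented here. It only shows that bigrams built after a stopword filter are formed from whatever tokens remain.

```python
import re

# Hypothetical stopword list; a real filter would use a much larger one.
STOPWORDS = {"is", "the"}

def tokenize(text):
    """Split on non-letters, keeping only alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def bigrams(tokens):
    """Join each pair of neighbouring tokens with an underscore."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = tokenize("RapidMiner Studio is the best.")
kept = [t for t in tokens if t.lower() not in STOPWORDS]

unfiltered = bigrams(tokens)  # n-grams straight after tokenizing
filtered = bigrams(kept)      # n-grams after the stopword filter

print(unfiltered)  # ['RapidMiner_Studio', 'Studio_is', 'is_the', 'the_best']
print(filtered)    # ['RapidMiner_Studio', 'Studio_best']
```

    In this sketch, every bigram containing a stopword disappears once the filter runs first, which is one way frequent words can end up without any n-gram groupings.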
