Term Occurrences and Frequency - I have to be missing something

btibert Member, University Professor Posts: 146 Guru
I am following along with this post to check my intuition, because I was seeing results that didn't make sense, to me anyway.

The only difference that I see in my process to start is that I am reading in my data from Excel and not creating it by hand.

Here are the term occurrences after lowercasing the text, removing stop words, tokenizing, and counting the tokens.

Just like the post, I am using very simple sentences to keep the vocabulary small.

Now, here is the exact same data. The only difference is that I am now using term frequency within the Process Documents operator.

Of course there is a very good chance that I am missing a setting along the way, but why is the first example .577 for each of the three words, when the basic sentence, unprocessed, was "I like turtles"?

Thanks in advance.

Best Answer


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Are you sure you aren't using TF-IDF?

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • btibert Member, University Professor Posts: 146 Guru
    See below, and my dataset/process attached. Entirely possible I am missing something obvious, just not sure what it could be.

  • btibert Member, University Professor Posts: 146 Guru
    Absolutely fantastic, thanks! I completely missed (as the title suggested) the normalized part. I just saw the output I expected and stopped reading like a dummy. Many thanks for the example process as well. I haven't had a chance to wrap my head around looping the way you did it, but it appears straightforward enough.
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    You're welcome, @btibert
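
For anyone landing here with the same question, the .577 figure is consistent with the "normalized part" mentioned above: three distinct terms that each occur once give 1/sqrt(3) ≈ 0.577 when the document vector is scaled to unit Euclidean length. The Python sketch below is only an illustration of that arithmetic, under the assumption that this is the normalization being applied. It is not RapidMiner's implementation, and `normalized_term_frequency` is a hypothetical helper name.

```python
import math
from collections import Counter


def normalized_term_frequency(tokens):
    """Scale raw term counts so the document vector has Euclidean length 1.

    Assumption: this mirrors the normalization discussed in the thread;
    it is an illustration, not RapidMiner code.
    """
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        # An empty document has no terms to weight.
        return {}
    return {term: c / norm for term, c in counts.items()}


# The example sentence from the question, lowercased and tokenized.
weights = normalized_term_frequency(["i", "like", "turtles"])
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

Each of the three terms prints 0.577. Dividing each count by the total number of tokens first (1/3 each) and then normalizing the vector gives the same values, so the sketch does not depend on which of the two is done first.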

