"Undoing the cosine normalization in 'Process Documents' operator"

unit01 Member Posts: 4 Contributor I
edited June 2019 in Help
Hello,

I have noticed that 'Process Documents' does not output raw term frequencies when the corresponding mode is selected. As stated in http://rapid-i.com/rapidforum/index.php/topic,3728.msg13943.html#msg13943, cosine normalization is applied to the raw frequencies before the result is output.

Is there a way to get raw term frequencies for each document, without normalization?
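
To illustrate why the normalized output alone is not enough to recover raw counts, here is a minimal Python sketch. It assumes that "cosine normalization" means dividing each document vector by its Euclidean length; the helper names (`term_counts`, `cosine_normalize`) and the toy documents are made up for the example and are not part of RapidMiner.

```python
import math
from collections import Counter


def term_counts(tokens, vocabulary):
    """Raw number of times each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine_normalize(vector):
    """Scale a vector to unit Euclidean length (assumed meaning of cosine normalization)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


vocabulary = ["data", "mining"]
doc_a = ["data", "mining", "mining"]
doc_b = doc_a * 2  # same text repeated twice

counts_a = term_counts(doc_a, vocabulary)  # [1, 2]
counts_b = term_counts(doc_b, vocabulary)  # [2, 4]

# Both documents end up with (numerically) the same unit-length vector,
# so the original counts cannot be read back from the normalized values.
unit_a = cosine_normalize(counts_a)
unit_b = cosine_normalize(counts_b)
print(counts_a, counts_b)
print(unit_a, unit_b)
```

Since different raw vectors map to the same normalized one, the counts have to come from the operator itself rather than from reversing the normalization.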

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Yes there is - I had to do the same thing.
    http://rapidminernotes.blogspot.co.uk/2011/11/normalizing-rows.html
    Basically, get the term occurrences then normalise the rows using the proportion transformation option.

    regards

    Andrew
  • unit01 Member Posts: 4 Contributor I
    Hello Andrew!

    Long story short - you have saved the day once again! To help other RapidMiner newbies, here is a more detailed description of what happened:

    1. I had tried using 'Term Occurrences' before, but concluded that it was not the number of times a specific term occurs in the document. The reason: when I manually counted the tokens in a document and compared that with the sum of the term frequency vector's counts, the two numbers did not match;

    2. At the same time, when using Andrew's sample process, the term frequency vector component sum was correct ???

    The problem turned out to be trivial. My RapidMiner process applied term pruning: any terms that occurred fewer than two times in the corpus were removed. However, the tokens output by RapidMiner still included the removed ones, which is why the results seemed unexpected. The example provided by Andrew did not apply pruning, so its results were consistent.

    Hope this helps someone :) As a side note, I would recommend that the RapidMiner team make the 'Process Documents' operator generate term vectors that are consistent with the token list, to avoid confusing dumb users like me :P

    Thanks, Andrew!
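
For anyone who wants to see the two points from this thread outside RapidMiner, here is a rough Python sketch. It assumes that pruning drops every term occurring fewer than two times in the whole corpus, and that the "proportion transformation" divides each row by its sum. The function names and the toy corpus are hypothetical and are not RapidMiner operators.

```python
from collections import Counter


def prune_vocabulary(documents, min_count=2):
    """Keep only terms that occur at least min_count times in the whole corpus."""
    corpus_counts = Counter(tok for doc in documents for tok in doc)
    return sorted(t for t, c in corpus_counts.items() if c >= min_count)


def occurrence_vector(tokens, vocabulary):
    """Raw term occurrences of one document, restricted to the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def proportion_normalize(vector):
    """Divide each entry by the row sum so that the row adds up to 1."""
    total = sum(vector)
    return [v / total for v in vector] if total else list(vector)


documents = [
    ["rapid", "miner", "text", "mining"],
    ["text", "mining", "rocks"],
]

# Only 'mining' and 'text' occur twice in the corpus; the rest are pruned.
vocabulary = prune_vocabulary(documents, min_count=2)
vectors = [occurrence_vector(doc, vocabulary) for doc in documents]

for doc, vec in zip(documents, vectors):
    # The token count includes pruned terms, the vector sum does not.
    print(len(doc), sum(vec), proportion_normalize(vec))
```

With pruning on, the token count per document (4 and 3 here) is larger than the sum of the occurrence vector (2 in both cases), which is the mismatch described above. With `min_count=1` nothing is pruned and the two numbers agree.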