"Undoing the cosine normalization in 'Process Documents' operator"

unit01 Member Posts: 4 Contributor I
edited June 2019 in Help
Hello,

I have noticed that 'Process Documents' does not output term frequencies when the corresponding mode is selected. As stated in http://rapid-i.com/rapidforum/index.php/topic,3728.msg13943.html#msg13943, cosine normalization is applied to the raw frequencies before the result is output.

Is there a way to get raw term frequencies for each document, without normalization?

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Yes there is - I had to do the same thing.
    http://rapidminernotes.blogspot.co.uk/2011/11/normalizing-rows.html
    Basically, get the term occurrences, then normalise the rows using the proportion transformation option.

    regards

    Andrew
  • unit01 Member Posts: 4 Contributor I
    Hello Andrew!

    Long story short - you have saved the day once again! To help other RapidMiner newbies, here is a more detailed description of what happened:

    1. I had tried using 'Term Occurrences' before, but thought that it was not the 'number of times a specific term occurs in the doc'. The reason: when I manually counted the tokens in a document and compared that number with the sum of the term frequency vector counts, the two did not match;

    2. At the same time, when using Andrew's sample process, the term frequency vector component sum was correct  ???

    The problem turned out to be trivial. My RapidMiner process applied term pruning: any terms that occurred fewer than two times in the corpus were removed. However, the tokens output by RapidMiner still included the removed ones, which is why the results seemed unexpected. The example provided by Andrew did not apply pruning, so its results were consistent.

    Hope this helps someone  :) As a side note, I would recommend that the RapidMiner team make the 'Process Documents' operator generate term vectors consistent with the token list, to avoid confusing dumb users like me :P

    Thanks, Andrew!
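
For readers who want to check the numbers outside RapidMiner, here is a minimal Python sketch of the ideas discussed in this thread: raw term occurrences, row-wise proportion normalization, cosine normalization, and the effect of pruning. It is not RapidMiner code. The function names, the whitespace tokenizer, and the two sample documents are made up for illustration, and pruning is modelled the way the post above describes it (dropping terms whose total count in the corpus is below a threshold), which may differ from how the operator's prune options actually work.

```python
import math
from collections import Counter


def term_occurrences(docs):
    """Count how often each whitespace-separated token occurs in each document."""
    return [Counter(doc.split()) for doc in docs]


def proportion_normalize(counts):
    """Divide each count by the row sum, so the row sums to 1."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def cosine_normalize(counts):
    """Divide each count by the row's Euclidean length, so the row has unit norm."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def prune_below(rows, min_count):
    """Drop terms whose total count across the corpus is below min_count (assumed rule)."""
    corpus = Counter()
    for row in rows:
        corpus.update(row)
    keep = {t for t, c in corpus.items() if c >= min_count}
    return [Counter({t: c for t, c in row.items() if t in keep}) for row in rows]


# Hypothetical sample corpus.
docs = ["apple banana apple", "banana cherry"]

rows = term_occurrences(docs)                  # raw term occurrences per document
tf = [proportion_normalize(r) for r in rows]   # relative term frequencies per document
cos = [cosine_normalize(r) for r in rows]      # unit-length vectors per document

# Compare manual token counts with the sums of the pruned vectors.
pruned = prune_below(rows, 2)
token_counts = [len(d.split()) for d in docs]
pruned_sums = [sum(r.values()) for r in pruned]

print(token_counts, pruned_sums)
```

With these sample documents, the second document has 2 tokens but its pruned vector sums to 1, because 'cherry' occurs only once in the corpus. That is the kind of mismatch described above.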