
Understanding TFIDF calculation

ram_nit05 Member Posts: 12 Contributor II
edited November 2018 in Help
Hi,

To provide a brief background to my exercise: my objective is to create an SVM classifier model that classifies a piece of feedback (the attribute) into one of the categories (the label) in my dependent variable. For this, I am trying to generate word vectors from the feedback verbatims, which I pass as attributes.

My question is as follows.

When I manually calculated the TFIDF values and compared them with those shown in the Data view of RapidMiner, they were very different. If anything, the sum of the squared TFIDF values in every row of the Data view seemed to add up to 1.

So far I had assumed that the following formula was used for the calculation:

TFIDF(term) = (number of occurrences of the term in that particular category / total number of occurrences of all terms across all categories) * log(number of all categories / number of categories where that particular term appears)
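For reference, the formula as stated above can be written out as a short Python sketch. This is only a transcription of the formula in this post, not RapidMiner's implementation. The function name, the nested-dictionary input, the example counts, and the choice of the natural logarithm (the post does not give a base) are all assumptions made for illustration.

```python
import math

def tfidf_as_stated(term, counts_by_category):
    """Score `term` per category using the formula from the post.

    counts_by_category (hypothetical layout): {category: {term: count}}.
    Assumes the natural logarithm; the post does not specify a base.
    """
    # Total number of occurrences of all terms across all categories.
    total_occurrences = sum(sum(c.values()) for c in counts_by_category.values())
    n_categories = len(counts_by_category)
    # Number of categories in which the term appears at least once.
    n_with_term = sum(1 for c in counts_by_category.values() if c.get(term, 0) > 0)
    if n_with_term == 0 or total_occurrences == 0:
        return {cat: 0.0 for cat in counts_by_category}
    idf = math.log(n_categories / n_with_term)
    return {
        cat: (c.get(term, 0) / total_occurrences) * idf
        for cat, c in counts_by_category.items()
    }

# Made-up example counts, for illustration only.
example_counts = {
    "billing": {"refund": 2, "late": 1},
    "delivery": {"late": 3},
}
refund_scores = tfidf_as_stated("refund", example_counts)
late_scores = tfidf_as_stated("late", example_counts)
```

Note that under this formula a term appearing in every category (here "late") scores 0 everywhere, because log(1) = 0.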

Please help me understand the reason for this difference.

Many thanks in advance,
Ram

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Ram,
    did you use a wordlist for text input? Besides the words themselves, the word list saves the number of occurrences. These counts are then used for the TFIDF calculation so that it stays consistent with the training set at apply time.

    Greetings,
      Sebastian
  • ram_nit05 Member Posts: 12 Contributor II
    Thanks for the help, Sebastian.

    I did use the wordlist, and it is being saved the way you describe. However, the actual TFIDF values produced by RapidMiner are quite different from the ones I calculated using the formula in my post. Is this because of some normalization that I had not accounted for?

    Thanks again,
    Ram
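
The observation in the original post, that the squared TFIDF values in each row sum to 1, is what one would see if each row vector were scaled to unit Euclidean (L2) length. Whether RapidMiner applies this scaling is not confirmed anywhere in this thread, so the sketch below is only an illustration of that assumption. The function name and the raw values are made up.

```python
import math

def l2_normalize(row):
    """Scale a row vector to unit Euclidean length.

    An all-zero row is returned unchanged to avoid dividing by zero.
    """
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0.0:
        return list(row)
    return [v / norm for v in row]

# Hypothetical raw TF-IDF values for one document (one row of the Data view).
raw_row = [0.231, 0.0, 0.462, 0.116]
normalized_row = l2_normalize(raw_row)

# After scaling, the squared values add up to 1 (up to floating-point error).
sum_of_squares = sum(v * v for v in normalized_row)
print(normalized_row, sum_of_squares)
```

If this is what happens, manually computed values would differ from the displayed ones by a per-row constant factor, so dividing each manually computed row by its own Euclidean norm is one way to check.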