# Understanding TFIDF calculation

Hi,

To provide a brief background to my exercise:

My objective is to create an SVM classifier model that classifies each piece of feedback (the attribute) into one of the categories (the label) in my dependent variable. For this, I am trying to generate word vectors from the feedback verbatims, which I pass as attributes.

My question is as follows:

When I manually calculated the TFIDF values and compared them with those shown in the Data view of RapidMiner, they were very different. If anything, the sum of the squared TFIDF values in every row of the Data view seemed to add up to 1.

So far I had assumed that the following formula was used for the calculation:

TFIDF(term) = (number of occurrences of the term in that particular category / total number of occurrences of all terms across all categories) * log(number of all categories / number of categories where that particular term appears)
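For anyone who wants to reproduce the manual calculation, here is a minimal Python sketch of the formula exactly as written above. The function name `tfidf_as_described` and the toy data are made up for illustration, and this is only the formula assumed in the post, not necessarily what RapidMiner computes.

```python
import math
from collections import Counter

def tfidf_as_described(docs):
    """Apply the formula from the post.

    docs: dict mapping a category name to its list of tokens (hypothetical input shape).
    Returns a dict: category -> {term: score}.
    """
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Total number of occurrences of all terms across all categories.
    total_terms = sum(sum(c.values()) for c in counts.values())
    n_categories = len(docs)
    vocab = sorted(set().union(*counts.values()))
    # Number of categories in which each term appears.
    df = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
    return {
        name: {t: (c[t] / total_terms) * math.log(n_categories / df[t]) for t in vocab}
        for name, c in counts.items()
    }

# Toy data, invented for this example.
toy = {
    "billing": ["late", "fee", "fee"],
    "support": ["late", "reply"],
}
scores = tfidf_as_described(toy)
```

Note that a term appearing in every category (here "late") gets a score of 0 under this formula, because log(1) = 0.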

Please help me understand the reason for this difference,

Many thanks in advance,

Ram

## Answers

Did you use a wordlist for text input? Besides the words themselves, the word list saves the number of occurrences. These are then used for the TFIDF calculation so that the values at apply time are consistent with the training set.

Greetings,

Sebastian

I did use the wordlist, and it is being saved the way you describe. However, the actual TFIDF values produced by RapidMiner are quite different from the ones I calculated using the formula mentioned in the post. Is this because of some normalization that I had not accounted for?

Thanks again,

Ram
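On the normalization question: the observation that the squared values in each row add up to 1 is what you would see if every row vector were scaled to unit Euclidean (L2) length. The thread does not confirm that this is what RapidMiner does, so the following is only a sketch of that one possibility. The helper name `l2_normalize` and the sample numbers are hypothetical.

```python
import math

def l2_normalize(row):
    """Scale a list of raw scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0.0:
        # An all-zero row cannot be scaled; return it unchanged.
        return list(row)
    return [v / norm for v in row]

# Made-up raw scores for one example (one row of the Data view).
raw_row = [3.0, 4.0, 0.0]
normalized = l2_normalize(raw_row)
sum_of_squares = sum(v * v for v in normalized)
```

If the manually computed values are divided by the row norm like this and then match the Data view, normalization would explain the difference. If they still differ, the term frequency or document frequency definition is probably not the one assumed in the post.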