Understanding TFIDF calculation
Hi,
To give a brief background to my exercise:
My objective is to create an SVM classifier model that classifies a piece of feedback (the attribute) into one of the categories (the label) in my dependent variable. To do this, I am trying to generate word vectors from the feedback verbatims, which I pass in as attributes.
Here is my question:
When I manually calculated the TFIDF values and compared them with those shown in the Data view of RapidMiner, they were very different. If anything, the squared TFIDF values in every row of the Data view seemed to add up to 1.
So far I had assumed that the following formula was used for the calculation:
TFIDF(term) = (number of occurrences of the term in that particular category / total number of occurrences of all terms across all categories) * log(number of all categories / number of categories where that particular term appears)
Please help me understand the reason for this difference.
Many thanks in advance,
Ram
Answers
Did you use a word list for text input? Besides the words themselves, the word list stores the number of occurrences. These counts are then used for the TFIDF calculation so that it stays consistent with the training set at apply time.
Greetings,
Sebastian
I did use the word list, and it is being saved the way you describe. However, the actual TFIDF values produced by RapidMiner are quite different from the ones I calculated using the formula in my post. Is this because of some normalization that I had not accounted for?
Thanks again,
Ram
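The observation that the squared values in each row add up to 1 is what one would see if every document vector were divided by its Euclidean length after the TFIDF weights are computed. The thread does not confirm that RapidMiner does this, so the following is only a minimal sketch under that assumption. The function name tfidf_rows and the sample documents are made up for illustration, and the term frequency here is taken per document, which differs from the formula in the post.

```python
import math

def tfidf_rows(docs):
    """Return (vocab, raw_rows, normalized_rows) for a list of tokenized documents.

    Hypothetical helper for illustration only; not RapidMiner's implementation.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}

    raw_rows = []
    for doc in docs:
        total = len(doc)
        # Plain TF * IDF, with TF taken relative to the document length.
        raw_rows.append(
            [(doc.count(term) / total) * math.log(n_docs / df[term]) for term in vocab]
        )

    normalized_rows = []
    for row in raw_rows:
        # Assumed step: scale each row to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in row))
        normalized_rows.append([v / norm if norm > 0 else 0.0 for v in row])

    return vocab, raw_rows, normalized_rows

docs = [
    "slow delivery bad service".split(),
    "great service fast delivery".split(),
    "bad product".split(),
]
vocab, raw_rows, normalized_rows = tfidf_rows(docs)
for raw, normalized in zip(raw_rows, normalized_rows):
    # The raw squared sums vary; the normalized ones should be 1 up to rounding.
    print(round(sum(v * v for v in raw), 6), round(sum(v * v for v in normalized), 6))
```

If this kind of normalization is applied, any constant factor in the term frequency (such as dividing by the total number of term occurrences across all categories) cancels out within a row, so hand-calculated values would match the Data view only after the same row-wise scaling.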