tf.idf problem
Hey guys,
Within RapidMiner I set up a process to compute tf.idf values for certain words of a text corpus. The problem is that whenever I try to use "tf.idf" as the vector creation method, all I get as a result is "term frequency" values. (I double-checked the values by choosing "term frequency" as the vector creation method, so it really is term frequency.)
When I am not using my word list, but choose to compute tf.idf values for every single word within the text corpus, the results seem to be tf.idf values. But I really want to calculate tf.idf values only for specific words (those included in my special word list).
Does anyone have an idea how to solve this problem?
Kind regards,
Philipp
Best Answer
bhupendra_patil Employee, Member Posts: 168 RM Data Scientist
Hello @philippw
I am not sure about your first issue, where changing the vector creation method to tf-idf is not giving the correct results.
What I can recommend is this: in the process view, select all, copy, then create a new process and paste into it. The problem may have something to do with compatibility; this has helped me in the past. Copy the process itself, not the XML.
Check that all the operators you are using are set to the correct compatibility version. You will notice a small link at the bottom of the Parameters window to change it.
Also, regarding tf-idf for specific words, I checked something.
Have you tried Filter Tokens (by Content) to keep only the tokens of interest? Doing so will calculate the tf-idf scores based only on the words of interest.
See the attached example
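To illustrate the idea outside RapidMiner, here is a minimal Python sketch of "filter the tokens first, then compute tf-idf". It is not RapidMiner code: the tiny corpus, the `keep` set, and the `tfidf` helper are all made up for illustration, and the weighting used (term frequency divided by document length, times the natural log of N/df) is an assumption that may differ from what RapidMiner computes internally.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)  # only used when the document has at least one token
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus and word list
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dog barked".split(),
]
keep = {"cat", "dog", "mat"}

# Step 1: drop every token that is not in the word list
filtered = [[tok for tok in doc if tok in keep] for doc in corpus]

# Step 2: compute tf-idf on what is left
scores_filtered = tfidf(filtered)

for doc_scores in scores_filtered:
    print(doc_scores)
```

Note that in this sketch the term frequencies are relative to the filtered document, so the scores only reflect the words of interest.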
Answers
What you do not actually want to do is add your own wordlist (which contains the normalisations for TFIDF on future processing). This would stop new TFIDF scores from being calculated on your data, as the wordlist input is mostly used for applying predictive models to new data.
See this post by Martin for more details on how to use the wordlist: Text Mining and the WordList
What you actually want to do is Filter Tokens, to remove all document tokens except for the ones that are in your file.
At first I thought, "That's easy, just use a Filter Tokens by Dictionary operator". Unfortunately, it seems there isn't one.
Here are your options:
1. Let all the TFIDFs calculate and then remove the columns you don't want at the end. (This is the easiest approach.) Add your wordlist as weights and use the Select by Weights operator to remove the unwanted columns.
2. Rewrite your wordlist creatively and use the Stem Dictionary operator. This approach would need a little work with regular expressions, but should be possible.
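For anyone who wants to see option 1 spelled out, here is a rough Python sketch of "compute everything, then keep only the wanted columns". It is an illustration only, not RapidMiner code: the corpus and `word_list` are hypothetical, the weighting (term frequency divided by document length, times the natural log of N/df) is an assumption, and the regular expression stands in for selecting attributes by name.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked",
]
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency of every term in the full vocabulary
df = Counter(term for doc in docs for term in set(doc))
vocab = sorted(df)

# Full table: one row per document, one column per vocabulary term
table = [
    {term: (doc.count(term) / len(doc)) * math.log(n_docs / df[term])
     for term in vocab}
    for doc in docs
]

# Hypothetical word list; build one pattern that matches exactly these words
word_list = ["cat", "dog", "mat"]
pattern = re.compile("|".join(re.escape(word) for word in word_list))

# Keep only the columns whose name fully matches the pattern
selected = [
    {term: value for term, value in row.items() if pattern.fullmatch(term)}
    for row in table
]

for row in selected:
    print(row)
```

In this sketch the term frequencies are still relative to the full document length, so the numbers can differ from those obtained by filtering the tokens before computing tf-idf. Which of the two you want depends on whether the other words in a document should count towards its length.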
Thanks for your advice! I used your first option. I was able to remove the columns I did not want using the Select Attributes operator and regular expressions.