tf.idf problem

philippw · July 2016

Hey guys,

Within RapidMiner I set up a process to compute tf.idf values for certain words of a text corpus. The problem is that whenever I try to use "tf.idf" as the vector creation method, all I get as a result is "term frequency" values (I double-checked the values by choosing "term frequency" as the vector creation method, so it really is term frequency).

When I am not using my word list, but choose to compute tf.idf values for every single word within the text corpus, the results seem to be tf.idf values. But I really want to calculate tf.idf values only for specific words (those included in my special word list).

Does anyone have an idea how to solve this problem?

Kind regards,

Philipp

bhupendra_patil · July 2016

Hello @philippw

I am not sure about your first issue, where changing the vector creation method to tf-idf is not giving the correct results.

What I can recommend is to do a select all in the process view, copy, then create a new process and paste it. It may have something to do with compatibility; it has helped me in the past. Do not copy the XML, but the process itself.

Check that all operators you are using have the correct compatibility version. You will notice a small link at the bottom of the parameters window to change it.

Also, for your tf-idf for specific words, I checked something:

Have you tried the Filter Tokens (by Content) operator to keep only the tokens of interest? Doing so will calculate the tf-idf score based only on the words of interest.

See the attached example

JEdward · July 2016

What you do not actually want to do is add your own wordlist (which contains the normalisations for TFIDF on future processing). This would stop new TFIDF scores being calculated on your data, as it is mostly used for applying predictive models on new data.

See this post by Martin for more details on how to use the wordlist: Text Mining and the WordList

What you actually want to do is Filter Tokens, to remove all document tokens except for the ones that are in your file.

At first I thought, "That's easy, just use a Filter Tokens By Dictionary operator". Unfortunately, it seems there isn't one.

Here are your options:

1. Let all the TFIDFs calculate and then remove the columns you don't want at the end (this is the easiest approach). Add your wordlist as weights and use the Select by Weights operator to remove the unwanted columns.

2. Rewrite your wordlist creatively and use the Stem Dictionary operator. This approach would need a little work with regular expressions, but should be possible.
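To illustrate the idea behind option 1 outside of RapidMiner, here is a minimal Python sketch: compute TF-IDF for every term first, then keep only the columns for the words you care about. It uses a plain textbook formula (relative term frequency times log(N / document frequency)), which is an assumption and not necessarily what RapidMiner computes internally; the names `tfidf_table`, `keep_columns`, `docs` and `wordlist` are made up for this example.

```python
import math
from collections import Counter

def tfidf_table(docs):
    """Return one dict per document mapping every corpus term to its TF-IDF.

    Assumed formula (textbook variant, not necessarily RapidMiner's):
    tf = count / tokens in document, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        rows.append({t: (counts[t] / total) * math.log(n_docs / df[t])
                     for t in vocab})
    return rows

def keep_columns(rows, wanted):
    """Drop every column (term) that is not in the wanted word list."""
    return [{t: v for t, v in row.items() if t in wanted} for row in rows]

# Hypothetical toy corpus and word list.
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
wordlist = {"cat", "dog"}

all_scores = tfidf_table(docs)               # TF-IDF over the whole vocabulary
selected = keep_columns(all_scores, wordlist)  # only the columns of interest
```

Because the scores are computed over the full corpus before anything is dropped, the remaining columns keep the same values they had in the full table; only the unwanted columns disappear.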

philippw · July 2016

Thanks for your advice! I used your first option. I was able to remove the columns I did not want using the "Select Attributes" operator and regular expressions.
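For anyone wanting to see the regular-expression selection in isolation, here is a small Python sketch of the same idea. It assumes that a column is kept only when the pattern matches its whole name; the column names, the word list and the helper `select_attributes` are hypothetical and only serve the illustration.

```python
import re

def select_attributes(columns, pattern):
    """Keep only the column names that the pattern matches completely."""
    regex = re.compile(pattern)
    return [c for c in columns if regex.fullmatch(c)]

# Hypothetical column names as they might come out of a TF-IDF table.
columns = ["cat", "cats", "dog", "mat", "the"]

# Build one alternation pattern from the word list; re.escape protects
# words that contain characters with a special meaning in regex.
wordlist = ["cat", "dog"]
pattern = "|".join(re.escape(w) for w in wordlist)

kept = select_attributes(columns, pattern)
```

Matching the whole name matters here: with a partial match, "cats" would also be kept because it contains "cat".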
