
tf.idf problem

philippw Member Posts: 3 Contributor I
edited November 2018 in Help

Hey guys,

 

Within RapidMiner I set up a process to compute tf.idf values for certain words of a text corpus. The problem is that whenever I try to use "tf.idf" as the vector creation method, all I get as a result is "term frequency" values (I double-checked by choosing "term frequency" as the vector creation method, so it really is term frequency).

When I am not using my word list, but instead compute tf.idf values for every single word within the text corpus, the results seem to be tf.idf values. But I really want to calculate tf.idf values only for specific words (those included in my special word list).

 

Does anyone have an idea how to solve this problem?

 

Kind regards,

Philipp

Best Answer

  • bhupendra_patil Employee, Member Posts: 168 RM Data Scientist
    Solution Accepted

    Hello @philippw

     

    I am not sure about your first issue, where changing the vector creation method to tf-idf is not giving the correct results.

    What I can recommend is this: in the process view, select all, copy, then create a new process and paste into it. It may have something to do with compatibility; this has helped me in the past. Do not copy the XML, but the process itself.

    Check that all operators you are using are set to the correct compatibility version. You will notice a small link at the bottom of the parameters window to change it.

     

    I also looked into computing tf-idf for specific words only.

    Have you tried the Filter Tokens (by Content) operator to keep only the tokens of interest? Doing so will calculate the tf-idf scores based only on the words of interest.

    See the attached example
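
For readers without the attachment, the idea of filtering tokens before scoring can be sketched outside RapidMiner in plain Python. This is only an illustration, not the attached process: the toy corpus, the helper names `filter_tokens` and `tf_idf`, and the weighting (tf = count / document length, idf = ln(N / df)) are all assumptions, and RapidMiner's own weighting and normalisation may differ.

```python
import math
from collections import Counter


def filter_tokens(docs, keep):
    """Keep only the tokens that appear in the word list `keep`."""
    return [[tok for tok in doc if tok in keep] for doc in docs]


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus and word list, for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dog barked".split(),
]
words_of_interest = {"cat", "dog", "mat"}

filtered = filter_tokens(corpus, words_of_interest)
vectors = tf_idf(filtered)
for vec in vectors:
    print(vec)
```

Because the filter runs first, both the term frequencies and the document frequencies in this sketch are computed from the words of interest alone.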

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    What you do not want to do is add your own wordlist (which contains the normalisations for TFIDF on future processing). This would stop new TFIDF scores being calculated on your data, as the wordlist input is mostly used for applying predictive models to new data.

     

    See this post by Martin for more details on how to use the wordlist: Text Mining and the WordList

     

    What you actually want to do is filter tokens, removing all document tokens except for the ones that are in your file.

    At first I thought, "That's easy, just use a Filter Tokens By Dictionary operator." Unfortunately, it seems there isn't one.

     

    Here are your options:

    1. Let all the TFIDFs calculate and then remove the columns you don't want at the end. (This is the easiest approach.) Add your wordlist as weights and use the Select by Weights operator to remove the unwanted columns.

    2. Rewrite your wordlist creatively and use the Stem Dictionary operator. This approach would take a little work with regular expressions, but should be possible.
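
Option 1 can be sketched in plain Python as "score everything, then select columns". This is a rough illustration rather than RapidMiner's implementation: the toy corpus, the variable names, and the weighting (tf = count / document length, idf = ln(N / df)) are assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus and word list, for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dog barked".split(),
]
wordlist = {"cat", "dog", "mat"}

n_docs = len(corpus)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in corpus for term in set(doc))

# Step 1: compute a score for every term in every document.
full = [
    {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in Counter(doc).items()
    }
    for doc in corpus
]

# Step 2: drop every column that is not in the word list.
selected = [
    {term: score for term, score in row.items() if term in wordlist}
    for row in full
]

for row in selected:
    print(row)
```

Under the weighting assumed here, the scores still depend on the full document length, so they are not necessarily the same numbers you would get by filtering tokens before scoring.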

     

     

  • philippw Member Posts: 3 Contributor I

    Thanks for your advice! I used your first option. I was able to remove the columns I did not want using the "Select Attributes" operator and regular expressions.
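
The regular expression itself is not shown in the thread. One plausible shape is an alternation of the wanted words, matched against the whole attribute name. The sketch below illustrates that in Python with hypothetical column names; how the Select Attributes operator anchors its pattern is an assumption here.

```python
import re

# Hypothetical attribute names produced by the text-processing step.
columns = ["cat", "cats", "dog", "mat", "the", "sat"]
wordlist = ["cat", "dog", "mat"]

# Build an alternation such as "cat|dog|mat", escaping any special characters.
pattern = re.compile("|".join(re.escape(word) for word in wordlist))

# fullmatch keeps "cat" but rejects "cats", i.e. the whole name must match.
kept = [name for name in columns if pattern.fullmatch(name)]
print(kept)
```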
