SVD Performance on large TF-IDF Matrices

benjaminbradleybenjaminbradley Member Posts: 7 Contributor I
edited November 2018 in Help

All - I have 25K relatively short survey responses (most < 255 words). I am trying to cluster them into similar groups. My plan was to run the TF-IDF matrix through SVD and then cluster the results. Unfortunately, the TF-IDF matrix is very large (25K x 140K). The TDM alone took 15 minutes to process on my machine, and SVD locks up after a few minutes of processing. This is an educational application, and I am considering running the SVD in the cloud w/ my 100 credits, but I fear this will not even come close to being enough. Has anyone got any ideas, suggestions, or alternatives? Thanks.
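(For anyone reading along outside RapidMiner, the pipeline described above can be sketched in Python with scikit-learn; the toy survey lines below are made up, and the parameter choices are illustrative only:)

```python
# Sketch of the same pipeline (TF-IDF -> truncated SVD -> k-means clustering)
# using scikit-learn instead of RapidMiner. The documents are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "the food was good and the staff was friendly",
    "friendly staff, good food, will come again",
    "parking was terrible and the wait was long",
    "long wait times and no parking available",
]

# Build the sparse TF-IDF matrix (n_docs x n_terms)
tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
X = tfidf.fit_transform(docs)

# TruncatedSVD works directly on sparse input, so the full dense
# matrix never has to be materialized
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)   # dense (n_docs x 2)

# Cluster in the reduced space
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)
print(km.labels_)
```

The key point for a 25K x 140K matrix is that `TruncatedSVD` consumes the sparse matrix directly, so memory is driven by the number of non-zeros rather than the full matrix size.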

Best Answer

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

Have you evaluated pruning too? The pruning method parameter on the Process Documents from Data operator can do wonders for a large TF-IDF set.
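(The same pruning idea, sketched in scikit-learn terms rather than RapidMiner's: `min_df` drops terms that appear in too few documents and `max_df` drops terms that appear in too many. The document strings and thresholds are made up for illustration.)

```python
# Pruning a TF-IDF wordlist by document frequency, analogous to the
# pruning options on Process Documents. Documents are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "service was slow but the food was great",
    "great food and great prices",
    "food arrived cold",
    "prices are too high for cold food",
]

unpruned = TfidfVectorizer()
# keep terms appearing in at least 2 docs and in fewer than 90% of docs
pruned = TfidfVectorizer(min_df=2, max_df=0.9)

n_all = len(unpruned.fit(docs).vocabulary_)
n_kept = len(pruned.fit(docs).vocabulary_)
print(n_all, n_kept)  # the pruned vocabulary is much smaller
```

On a real 140K-term wordlist, even a mild `min_df` cutoff tends to remove a large share of the columns, since most terms occur in only a handful of documents.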

Answers

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    It sounds like you should look at some text preprocessing to thin out your tokens.  Did you filter by stopwords, by token length, by part of speech, etc.?  Usually a raw wordlist can be reduced significantly using those methods.  Look at your current wordlist and think about what you would like to drop.  After that, if the matrix is still large, you might want to consider taking a sample and developing a wordlist from that, and then applying it to your larger dataset. 

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • benjaminbradleybenjaminbradley Member Posts: 7 Contributor I

Brian - Thanks for the response. Below is a list (and attached is a snip) of the text processing I've done within the Process Documents operator so far:

- Transform to lower case

- Tokenize using non-letters as the split criterion

- Filter English stopwords

- Stem (Porter)

- Filter tokens by length (remove tokens < 3 characters)

- Generate n-grams (max length = 2)
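(A rough pure-Python sketch of that chain, for readers without RapidMiner in front of them - the stopword list is a tiny made-up subset, Porter stemming is omitted, and the bigrams are built after filtering rather than before, so this is only an approximation of the operator chain above:)

```python
import re

# Tiny illustrative stopword set, not a real English stopword list
STOPWORDS = {"the", "and", "was", "were", "for", "with", "this", "that"}

def preprocess(text, min_len=3, ngram_max=2):
    # lowercase, then tokenize on runs of non-letters
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    # drop stopwords and short tokens
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]
    grams = list(tokens)
    # append bigrams of adjacent surviving tokens
    if ngram_max >= 2:
        grams += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return grams

print(preprocess("The staff was friendly and the food was great!"))
```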

     

You also mentioned part-of-speech filtering. I see the two operators that filter by POS Tags and POS Ratio. Do you recommend one over the other, or have suggestions on the settings? The help for these operators is not completely clear to me. For example, for the POS Ratio operator it says "min ratio of adjectives [verbs, nouns, etc.]" for each token to be kept. Does that mean that if I set a 0.3 ratio for adjectives, no adjectives will be kept if they make up less than 30% of an individual document (or is it the entire corpus)? And if the 0.3 is exceeded, then all of them will be kept, correct?

     

    Additionally, I think I understand how to take a smaller sample and develop a word list as you suggested, but I don't know how to tell RapidMiner to apply that word list to a larger corpus. Can you walk me thru that process?

     

    Thanks again for the response. It was very helpful.

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I am glad my first comments were helpful.  Here are a few additional comments in response to your questions:

    • It sounds like you are already doing most of the standard text preprocessing I would recommend.  Of course, adding n-grams of length 2 could expand your wordlist significantly, so you can look at your list and determine whether they are really necessary (e.g., you already have "good" and "food" so do you need "good food" as a separate token).
    • I don't really use the POS ratio operator, just the normal POS tags one.  Based on your specific use case, you may only be interested in keeping nouns and adjectives (often these are the key words for topical grouping, for example).
    • To reuse a wordlist, you simply store the wordlist from your sample text process (from the output port "wor") in the repository, and then retrieve that wordlist later and input it into the input "wor" port of your other process document operator.  This will force the operator to conform to the original wordlist supplied rather than generating a new wordlist.  It's just like using any other preprocessing model in RapidMiner.
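(The "store and reuse the wordlist" step has a close analogue in scikit-learn, shown here as an assumption about the workflow rather than RapidMiner's internals: fit a vectorizer on the sample, then hand its vocabulary to a second vectorizer so the larger corpus is scored against the same fixed term list instead of generating a new one.)

```python
# Fit the "wordlist" on a small sample, then conform a larger corpus to it.
# All document strings here are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer

sample = ["great food", "slow service", "great prices"]
full_corpus = [
    "great food and great service",
    "prices were high",
    "totally new words here",
]

fitted = TfidfVectorizer().fit(sample)            # builds the sample wordlist
reuse = TfidfVectorizer(vocabulary=fitted.vocabulary_)
X = reuse.fit_transform(full_corpus)              # columns fixed to sample terms

print(X.shape)  # column count = sample vocabulary size, not the full corpus's
```

Terms in the larger corpus that never appeared in the sample simply get no column, which is exactly why the sample needs to be representative.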


    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • benjaminbradleybenjaminbradley Member Posts: 7 Contributor I

    Awesome Brian! Thanks so much...that's exactly what I needed. I'll give it a whirl. Any idea what size matrix I need to be below so the SVD operator doesn't choke?

     

    Thanks Again

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    That I don't know--you'll probably have to do some testing to find out.  I'd be curious what you find out on that score, though, so if you can update the thread with your results it would be helpful!

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • benjaminbradleybenjaminbradley Member Posts: 7 Contributor I

Will do. I can tell you I ran a 1K x 1.5K matrix locally on my Surface Pro 3 the other day and it choked. That might have been the RAM available on my Surface. I haven't tried the cloud w/ higher RAM because I only have access to 1 processor w/ my educational account. I fear it will take so long I'll burn thru all my credits and never complete. I'll let you know what happens either way.

  • benjaminbradleybenjaminbradley Member Posts: 7 Contributor I

Thanks, I hadn't considered that... I'll give it a try.
