02-20-2017 03:34 PM
All - I have 25K relatively short survey responses (most < 255 words). I am trying to cluster them into similar groups. My plan was to run the TF-IDF matrix thru SVD and then cluster them. Unfortunately the TF-IDF is very large (25K x 140K). The TDM alone took 15 minutes to process on my machine. SVD locks up after a few minutes of processing. This is an educational application and I am considering running the SVD in the cloud w/ my 100 credits. I fear this will not even come close to being enough. Has anyone got any ideas, suggestions or alternatives? Thanks.
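For context, the pipeline I'm attempting looks roughly like the sketch below (Python/scikit-learn analogue, not my actual setup; document list, component count, and cluster count are placeholders). I gather TruncatedSVD can work on the sparse TF-IDF matrix directly, which may be relevant to my memory problem:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# stand-in for the 25K survey responses
docs = ["survey response one", "another short response", "more text here"]

# TF-IDF is kept as a sparse matrix, so a 25K x 140K shape stays manageable
tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD accepts sparse input directly -- no dense conversion needed
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

# cluster in the reduced space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
```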
Solved!
02-20-2017 05:12 PM
It sounds like you should look at some text preprocessing to thin out your tokens. Did you filter by stopwords, by token length, by part of speech, etc.? Usually a raw wordlist can be reduced significantly using those methods. Look at your current wordlist and think about what you would like to drop. After that, if the matrix is still large, you might want to consider taking a sample and developing a wordlist from that, and then applying it to your larger dataset.
02-21-2017 06:36 AM - edited 02-21-2017 06:39 AM
Brian - Thanks for the response. Below is a list and attached is a snip of the text processing I've done within the process documents operator so far:
-transform to lower case
-tokenize using non-letters as the splitting criterion
-filter English stopwords
-filter tokens by length (drop tokens < 3 characters)
-generate n-grams, max length = 2
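In case it helps to see the steps together: outside RapidMiner, I picture the chain working roughly like this (a scikit-learn sketch of the same ideas, not the actual operator chain; the sample sentences are made up):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

def tokenize(text):
    # lowercase, split on non-letters, drop stopwords and tokens < 3 chars
    tokens = re.split(r"[^a-zA-Z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in ENGLISH_STOP_WORDS]

# ngram_range=(1, 2) mirrors "generate n-grams, max length = 2"
vec = TfidfVectorizer(tokenizer=tokenize, token_pattern=None,
                      lowercase=False, ngram_range=(1, 2))
X = vec.fit_transform(["The quick brown fox", "a fox ran by"])
```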
You also mentioned part-of-speech filtering. I see the two operators that filter by POS Tags and POS Ratio. Do you recommend one over the other, or have suggestions on the settings? The help for these operators is not completely clear to me. For example, for the POS ratio it says "min ratio of adjectives [verbs, nouns, etc.] for each token to be kept." Does that mean that if I set a 0.3 ratio for adjectives, then no adjectives will be kept if they make up less than 30% of an individual document (or is it of the entire corpus)? And if the 0.3 is exceeded, then all of them will be kept, correct?
Additionally, I think I understand how to take a smaller sample and develop a word list as you suggested, but I don't know how to tell RapidMiner to apply that word list to a larger corpus. Can you walk me thru that process?
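To clarify what I'm asking: outside RapidMiner I imagine the sample-wordlist idea looking something like this (hypothetical Python sketch; the corpus and split are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["first survey response", "second survey response",
          "a totally different answer", "yet more responses"]
sample = corpus[:2]  # small sample used to build the wordlist

# learn the vocabulary (the "wordlist") from the sample only
wordlist = TfidfVectorizer().fit(sample).vocabulary_

# apply that fixed wordlist to the full corpus:
# columns are limited to the terms found in the sample
vec = TfidfVectorizer(vocabulary=wordlist)
X = vec.fit_transform(corpus)
```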
Thanks again for the response. It was very helpful.
02-21-2017 07:55 AM
I am glad my first comments were helpful. Here are a few additional comments in response to your questions:
02-21-2017 08:16 AM
That I don't know--you'll probably have to do some testing to find out. I'd be curious what you find out on that score, though, so if you can update the thread with your results it would be helpful!
02-21-2017 11:40 AM
Will do. I can tell you I ran a 1K x 1.5K matrix locally on my Surface Pro 3 the other day and it choked. That might have been the RAM available on my Surface. I haven't tried the cloud with higher RAM because I only have access to 1 processor with my educational account. I fear it will take so long that I'll burn through all my credits and never complete. I'll let you know what happens either way.
02-21-2017 11:41 AM
Has pruning been evaluated too? The pruning method parameter on the Process Documents from Data operator can do wonders for a large TF-IDF set.
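By way of illustration, pruning amounts to document-frequency cutoffs: drop terms that appear in too few or too many documents. A rough scikit-learn analogue (not the operator itself; the documents and thresholds are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great course content helpful",
    "great course helpful instructor",
    "boring course lectures",
    "great pacing overall",
]

# min_df=2 prunes terms appearing in fewer than 2 documents (too rare);
# max_df=0.5 prunes terms appearing in more than 50% of documents
# (too common) -- together they can shrink a vocabulary dramatically
vec = TfidfVectorizer(min_df=2, max_df=0.5)
X = vec.fit_transform(docs)
```

On this toy corpus only "helpful" survives: "great" and "course" are too common, and everything else appears in a single document.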