06-14-2017 05:13 AM
I have 100K text records, each with a problem description. I now want to achieve:
1) Get a count of "similar"-looking problem descriptions. (How do I achieve this?)
2) The main roadblock is that the process takes a lifetime to run.
Data import --> Select Attributes --> Process Documents to Data (tokenize, stop words, n-grams) --> k-Means/DBSCAN
How can I optimize this to run faster?
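In case it helps to reproduce the problem outside RapidMiner, here is a rough scikit-learn sketch of that same flow. The sample documents, the n-gram range, and k are all made-up placeholders, not my actual data or settings:

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the ~100K problem-description strings (hypothetical sample)
docs = [
    "server crashed after the nightly patch",
    "login page times out for all users",
    "nightly patch caused the server to crash",
    "users cannot log in, page keeps timing out",
]

# Tokenize, drop English stop words, build uni-/bi-grams, weight by TF-IDF
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Cluster the TF-IDF vectors; k=2 is an arbitrary guess and must be tuned
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# "Count of similar-looking descriptions" = size of each cluster
print(Counter(labels))
```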
06-14-2017 07:16 AM
After TF-IDF we still get a lot of attributes; even when I try removing stop words with a dictionary and pruning, this takes forever to run.
I want to know how exactly to proceed with clustering a large amount of text, and whether an enterprise version would solve my problem in any way.
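For concreteness, this is roughly what dictionary-based stop word removal plus pruning look like as vectorizer settings in scikit-learn terms. The thresholds here are illustrative assumptions to show the knobs, not tuned values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Prune rare and near-universal terms, and cap the vocabulary outright.
vectorizer = TfidfVectorizer(
    stop_words="english",  # dictionary-based stop word removal
    min_df=5,              # drop terms appearing in fewer than 5 documents
    max_df=0.5,            # drop terms appearing in more than 50% of documents
    max_features=900,      # hard cap: keep only the 900 strongest terms
    ngram_range=(1, 1),    # unigrams only; n-grams multiply the vocabulary
)
```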
06-14-2017 07:57 AM
That's simply still a lot for k-Means: 900 attributes x 100,000 records. Either you prune/stem/filter harder, or you can put PCA in front of k-Means.
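A minimal sketch of that dimensionality-reduction-before-clustering idea, assuming a sparse TF-IDF matrix. Note I'm using TruncatedSVD here as the usual sparse-friendly stand-in for PCA, since plain PCA requires a dense matrix; the component count and k are illustrative, and the random matrix just stands in for real TF-IDF data:

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Stand-in for a 100,000 x 900 TF-IDF matrix (smaller here so it runs quickly)
X = sparse_random(1000, 900, density=0.01, format="csr", random_state=42)

# Project the 900 TF-IDF attributes down to 100 components, then cluster.
svd = TruncatedSVD(n_components=100, random_state=42)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)

pipeline = make_pipeline(svd, kmeans)
labels = pipeline.fit_predict(X)
```

k-Means now only has to compute distances over 100 dense components instead of 900 sparse attributes, which is where the speedup comes from.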
06-14-2017 08:06 AM
But we are dealing with text documents here.
The attributes will all be text features (words, n-grams). I can try to bring that down further by at most 10%. How will PCA help me here? Please enlighten me.
And so, what is the process for clustering a large set of text documents?
06-14-2017 12:07 PM
Yes, but depending on whether you're creating bi- or tri-grams, you're blowing up the size of your data set, and that affects training time.
When you have all those columns (aka attributes), you are creating a high-dimensional data set that the clustering algorithm has to work hard on to group records together. The fewer attributes you have, the faster it will be; the sketch after this post shows how quickly n-grams inflate the attribute count.
You could use a larger machine with more memory or a RapidMiner Server on a large server, but the best option is to do what @mschmitz said and try to reduce the number of attributes by PCA, pruning, or reducing the number of n-grams. It's a trade-off that you have to think about carefully.
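To make the blow-up concrete, here is a small illustration of how the vocabulary grows with the n-gram range. The documents are made-up samples; the exact counts will vary with your corpus, but the growth pattern is the point:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "server crashed after the nightly patch",
    "login page times out for all users",
    "nightly patch caused the server to crash",
]

# Compare attribute counts for unigrams, uni+bi-grams, and uni+bi+tri-grams
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vocab = CountVectorizer(ngram_range=ngram_range).fit(docs).vocabulary_
    print(ngram_range, "->", len(vocab), "attributes")
```

Even on three short sentences the attribute count roughly triples from unigrams to tri-grams, and on 100K real descriptions the effect is far larger, which is exactly what the clustering step then has to pay for.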