RapidMiner

Text Mining - Document Similarity/Clustering

SOLVED
Contributor II


I have 100K text records, each with a problem description. I want to achieve the following:

1) Get a count of "similar"-looking problem descriptions. (How do I achieve this?)

2) The main roadblock is that the process takes a lifetime to run.

Steps:

Data import --> Select Attributes --> Process Documents to Data (tokenize, filter stopwords, generate n-grams) --> k-Means/DBSCAN

 

How can I optimize this to run faster?
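(Editorial aside for readers following along outside RapidMiner: a rough, hypothetical scikit-learn sketch of the same chain — tokenizing, stopword filtering, n-grams, TF-IDF, then k-means. The corpus and parameters are illustrative, not the poster's data.)

```python
# Hypothetical sketch of the described chain in scikit-learn:
# tokenize + stopword removal + n-grams + TF-IDF, then k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "printer does not print",
    "printer prints blank pages",
    "cannot connect to wifi",
    "wifi connection keeps dropping",
]

# TfidfVectorizer handles tokenizing, stopwords and n-grams in one step.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse (n_docs x vocab) TF-IDF matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # one cluster id per document
```

Counting "similar" descriptions then amounts to counting documents per cluster label.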

 


10 REPLIES
Moderator

Re: Text Mining - Document Similarity/Clustering

Hi,

 

What part of the analysis takes long, the clustering or the tokenizing? Do you use RM 7.2+?


~Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Contributor II

Re: Text Mining - Document Similarity/Clustering

I can make Process Documents to Data run faster, but the clustering takes more than a day to run.

 

I am using the Community version 7.5. Please suggest how I can decrease the run time. And will the Enterprise version solve the problem?

Moderator

Re: Text Mining - Document Similarity/Clustering

Hi,

 

I think the way to go is to reduce the number of attributes, e.g. by pruning.
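(Editorial aside: pruning here means dropping terms whose document frequency is too low or too high. A minimal stdlib Python sketch of the idea, with illustrative 5%/50% thresholds and a toy corpus:)

```python
# Minimal sketch of document-frequency pruning (illustrative, stdlib only):
# drop terms appearing in fewer than 5% or more than 50% of documents.
from collections import Counter

docs = [
    "printer jam error",
    "printer offline error",
    "wifi dropped error",
    "screen flicker issue",
]

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc.split()))

n = len(docs)
kept = {t for t, c in df.items() if 0.05 * n <= c <= 0.50 * n}
print(sorted(kept))  # "error" (3 of 4 docs) is pruned away
```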

 

~Martin

Contributor II

Re: Text Mining - Document Similarity/Clustering

After TF-IDF we still end up with a lot of attributes. Even if I remove stopwords with a dictionary and apply pruning, this takes forever to run.

I want to know how exactly to proceed with clustering a large amount of text, and whether the Enterprise version would solve my problem in any way.

Moderator

Re: Text Mining - Document Similarity/Clustering

Hi,

 

How many attributes do you have if you use percentual pruning with 5/50?
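(Editorial aside: percentual pruning with 5/50 keeps terms occurring in between 5% and 50% of the documents. A hedged scikit-learn analogue uses `min_df`/`max_df` — this mapping is an assumption for illustration, not RapidMiner's exact behavior:)

```python
# Rough scikit-learn analogue of percentual pruning 5/50 (toy corpus):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "disk error on boot",
    "boot error after update",
    "update error message shown",
    "screen stays black",
]

# keep terms occurring in >= 5% and <= 50% of documents
vec = TfidfVectorizer(min_df=0.05, max_df=0.50)
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # "error" (3 of 4 docs) is pruned
```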

 

Best,

Martin

Contributor II

Re: Text Mining - Document Similarity/Clustering

Approximately 900 regular attributes.

Moderator

Re: Text Mining - Document Similarity/Clustering

Hi,

 

That's simply still a lot for the k-means: 900 × 100,000. Either you prune/stem/filter harder, or you can put a PCA in front of the k-means.
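(Editorial aside: for the sparse TF-IDF matrices involved here, TruncatedSVD is the usual PCA analogue — that substitution is ours, not the poster's. A minimal scikit-learn sketch with an illustrative corpus and component count:)

```python
# Dimensionality reduction before k-means (illustrative sketch):
# TruncatedSVD works directly on sparse TF-IDF, unlike plain PCA.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "printer does not print",
    "printer prints blank pages",
    "cannot connect to wifi",
    "wifi connection keeps dropping",
    "blue screen on boot",
    "laptop restarts randomly",
]

X = TfidfVectorizer().fit_transform(docs)       # sparse, many columns
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)                # dense 6 x 3 matrix

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape, labels)
```

Clustering then runs on 3 columns instead of the full vocabulary, which is where the speed-up comes from.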

 

~Martin

Contributor II

Re: Text Mining - Document Similarity/Clustering

But we are dealing with text documents here.

The attributes will all be text (words, n-grams). I can try to cut them down further by at most 10%. How will PCA help me here? Please enlighten me.

And so, what is the process for clustering large text documents?

Community Manager

Re: Text Mining - Document Similarity/Clustering

Yes, but depending on whether you're creating bi- or tri-grams, you're blowing up the size of your data set, and that affects training time.
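(Editorial aside: a quick stdlib Python illustration of that blow-up — even on a toy two-document corpus, adding bi- and tri-grams multiplies the attribute count:)

```python
# Stdlib sketch: adding bi- and tri-grams inflates the attribute count.
docs = [
    "printer does not print at all",
    "wifi keeps dropping every hour",
]

def ngrams(tokens, n):
    # all contiguous n-grams of a token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for max_n in (1, 2, 3):
    vocab = set()
    for doc in docs:
        toks = doc.split()
        for n in range(1, max_n + 1):
            vocab.update(ngrams(toks, n))
    print(f"up to {max_n}-grams: {len(vocab)} attributes")
```

Here the vocabulary grows from 11 attributes (unigrams only) to 27 (up to tri-grams); on 100K real documents the growth is far steeper because longer n-grams rarely repeat.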

 

When you have all those columns (aka attributes), you are creating a high-dimensional data set that the clustering algorithm has to work hard on to group records together. The fewer attributes you have, the faster it will be.

 

You could use a larger machine with more memory, or RapidMiner Server on a large server, but the best option is to do what @mschmitz said and reduce the number of attributes via PCA, pruning, or reducing the # of n-grams. It's a trade-off that you have to think about carefully.

Regards,
Thomas
LinkedIn: Thomas Ott
Blog: Neural Market Trends