"Process Documents

marvinrj · March 2011

I was wondering whether there is a clustering technique that determines K (the number of clusters to use) automatically. I can't label the data by hand to confirm the result. Could anyone advise me on this?

awchisholm · March 2011

Hello

The DBSCAN clustering algorithm will find a value of k for you, but you still have to choose good values for its two parameters, epsilon and min_points. So there is, unfortunately, no free lunch. You can use RapidMiner to try combinations of these parameters and count the number of clusters found for each one. Regions of the search space that keep producing the same number of clusters are the ones worth a closer look.
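For anyone who wants to see the idea outside RapidMiner, here is a minimal Python sketch of that parameter sweep. It is not the RapidMiner process itself: the tiny `dbscan` function, the toy `points` and the parameter grids are all assumptions made up for illustration, and the brute-force neighbour search only suits small example sets.

```python
import math
from collections import Counter


def dbscan(points, eps, min_points):
    """Minimal DBSCAN. Returns one label per point; -1 means noise."""
    n = len(points)
    # Brute-force neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_points:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_points:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


def sweep(points, eps_values, min_points_values):
    """Count the clusters found for every (eps, min_points) combination."""
    results = {}
    for eps in eps_values:
        for min_points in min_points_values:
            labels = dbscan(points, eps, min_points)
            results[(eps, min_points)] = len(set(labels) - {-1})
    return results


# Toy data (hypothetical): three 3x3 grids of points, far apart from each other.
points = [
    (cx + 0.5 * dx, cy + 0.5 * dy)
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for dx in (-1, 0, 1)
    for dy in (-1, 0, 1)
]

results = sweep(points, [0.1, 0.6, 1.0, 2.0, 20.0], [3, 5])
for (eps, min_points), k in sorted(results.items()):
    print(f"eps={eps:<5} min_points={min_points}  clusters={k}")

# The cluster count that turns up for the most parameter combinations.
most_common_k = Counter(results.values()).most_common(1)[0][0]
print("most frequent cluster count:", most_common_k)
```

On this toy data a very small epsilon leaves everything as noise and a very large one merges everything into a single cluster, while the middle of the grid keeps reporting three clusters. That plateau is the kind of region to look for, though a person still has to judge whether the clusters mean anything.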

<shameless self promotion>
You could download an example that I made here: http://rapidminernotes.blogspot.com/2010/12/counting-clusters.html
</shameless self promotion>

Many other techniques exist for finding clusters. The key point is that they are all unsupervised, so a person always has to look at the results to judge whether they make sense.

regards

Andrew

marvinrj · March 2011

hi,

That would be the solution to my problem. But when I applied that clustering algorithm, the process ran for over 2 hours before I stopped it.
Is that normal?

Thanks, awchisholm.

awchisholm · March 2011

Hello

You will often find that the run time is excessive: the number of examples, the number of attributes and the algorithm all contribute, as does the brute-force nature of the parameter search. To see what run time to expect, you could reduce the example set with the Sample operator. Start with a very small fraction, such as 1% of the total, and see whether the clustering completes at all. Then increase to 2%, 5% and so on. From those timings you should be able to predict how long the full data set might take.
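Here is a rough Python sketch of that timing approach, again outside RapidMiner. It assumes the run time grows roughly as a power of the number of examples (time ≈ a · n^b), which is my assumption rather than anything guaranteed, so treat the extrapolation as a ballpark figure. The names `time_on_samples`, `fit_power_law` and `predict_seconds` are made up for this example, and `pairwise_work` is only a stand-in for a real clustering run.

```python
import math
import random
import time


def time_on_samples(examples, fractions, run_clustering, seed=0):
    """Time run_clustering on random samples of increasing size."""
    rng = random.Random(seed)
    sizes, seconds = [], []
    for fraction in fractions:
        size = max(2, round(len(examples) * fraction))
        sample = rng.sample(examples, size)
        start = time.perf_counter()
        run_clustering(sample)
        elapsed = time.perf_counter() - start
        sizes.append(size)
        seconds.append(max(elapsed, 1e-9))  # avoid log(0) in the fit
    return sizes, seconds


def fit_power_law(sizes, seconds):
    """Least-squares fit of log(seconds) = log(a) + b * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_seconds(a, b, size):
    """Extrapolate the fitted curve to a larger example set."""
    return a * size ** b


def pairwise_work(sample):
    """Stand-in for clustering: compares every pair of examples once."""
    total = 0.0
    for i, x in enumerate(sample):
        for y in sample[i + 1:]:
            total += abs(x - y)
    return total


if __name__ == "__main__":
    examples = [random.random() for _ in range(20000)]  # hypothetical data
    sizes, seconds = time_on_samples(examples, [0.01, 0.02, 0.05], pairwise_work)
    a, b = fit_power_law(sizes, seconds)
    estimate = predict_seconds(a, b, len(examples))
    print(f"fitted exponent b = {b:.2f}")
    print(f"estimated time for all {len(examples)} examples: {estimate:.1f} s")
```

Three timings give only a crude fit, so it is worth adding a larger sample or two (10%, 20%) if they finish in reasonable time, and checking that the fitted exponent stays about the same.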

Regards

Andrew
