"Process Documents
I was wondering whether there is a clustering technique that determines K (the number of clusters to use) automatically. I can't classify the data manually to confirm the result. Could anyone advise me on this?
Answers
The DBSCAN clustering algorithm will find a value of k, but you still have to choose good values for two parameters, namely epsilon and min_points. So there is, unfortunately, no free lunch. You can use RapidMiner to try combinations of these parameters and count the number of clusters each one finds; you can then spot regions of the search space that tend to produce the same number of clusters.
<shameless self promotion>
You could download an example that I made here http://rapidminernotes.blogspot.com/2010/12/counting-clusters.html.
</shameless self promotion>
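For anyone who wants to see the idea outside RapidMiner, here is a minimal Python sketch of the same parameter sweep. It is not the RapidMiner operator: the `dbscan` function below is a plain O(n^2) textbook version written only for illustration, and the names `count_clusters`, `points`, `eps_values` and `min_points_values` as well as the toy data are my own assumptions.

```python
import math
from collections import Counter

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    """Plain O(n^2) DBSCAN sketch; returns one label per point."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is a core point, keep growing
        cluster += 1
    return labels


def count_clusters(points, eps_values, min_points_values):
    """Map each (eps, min_points) pair to the number of clusters found."""
    results = {}
    for eps in eps_values:
        for min_points in min_points_values:
            labels = dbscan(points, eps, min_points)
            results[(eps, min_points)] = len(set(labels) - {NOISE})
    return results


# Toy data (assumed): two tight groups of four points and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
results = count_clusters(points, eps_values=[0.5, 1.5, 20.0],
                         min_points_values=[3, 5])

for (eps, min_points), k in sorted(results.items()):
    print(f"eps={eps:<5} min_points={min_points}  clusters={k}")

# Which cluster counts turn up most often across the grid?
print(Counter(results.values()).most_common())
```

Cluster counts that stay the same across a range of neighbouring parameter values are the ones worth a closer look; a person still has to judge whether they make sense.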
Many other techniques exist for finding clusters. The key point is that they are unsupervised, so a person always has to look at the answers to determine whether they are right or not.
regards
Andrew
That would be the solution to my problem. But when I applied that clustering algorithm, the process ran for over 2 hours, and then I stopped it.
Is that normal?
Thanks, awchisholm.
You will often find that the run time is excessive; the number of examples, the number of attributes, the algorithm itself and the brute-force nature of the parameter search all contribute. To see what time to expect, you could reduce the example set by using the Sample operator. Start with a very small number of examples, say 1% of the total, and see if the clustering completes at all. Then increase to 2%, 5% and so on. From those timings you should be able to predict how long the full data set might take.
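As a rough illustration of that extrapolation, here is a small Python sketch. It assumes the run time grows roughly as a power of the number of examples (seconds = a * size^b), which is only an approximation; the helper names `time_on_sample`, `fit_power_law` and `predict_seconds` are made up for this example, and `toy_pairwise` is just a stand-in for whatever clustering step you are actually timing.

```python
import math
import random
import time


def time_on_sample(cluster_fn, examples, fraction, seed=0):
    """Run cluster_fn on a random sample; return (sample_size, seconds)."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    sample = rng.sample(examples, k)
    start = time.perf_counter()
    cluster_fn(sample)
    return k, time.perf_counter() - start


def fit_power_law(sizes, seconds):
    """Least-squares fit of seconds = a * size**b on log-log axes."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_seconds(a, b, size):
    """Predicted run time for a data set with `size` examples."""
    return a * size ** b


def toy_pairwise(sample):
    """Stand-in for a clustering run (assumed): all pairwise distances."""
    total = 0.0
    for p in sample:
        for q in sample:
            total += math.dist(p, q)
    return total


def demo():
    rng = random.Random(42)
    examples = [(rng.random(), rng.random()) for _ in range(20000)]
    sizes, seconds = [], []
    for fraction in (0.01, 0.02, 0.05):  # 1%, 2%, 5% as suggested above
        k, t = time_on_sample(toy_pairwise, examples, fraction)
        sizes.append(k)
        seconds.append(t)
        print(f"{fraction:.0%}: {k} examples took {t:.3f} s")
    a, b = fit_power_law(sizes, seconds)
    full = predict_seconds(a, b, len(examples))
    print(f"fitted exponent b = {b:.2f}, predicted full run = {full:.1f} s")


if __name__ == "__main__":
    demo()
```

Treat the prediction as an order-of-magnitude estimate only; memory limits and the parameter search itself can make the full run slower than the fitted curve suggests.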
Regards
Andrew