RapidMiner

Cluster Performance DBScan and agglomerative Clustering

Contributor II

Hello,

I want to try different clustering algorithms, such as k-means, DBSCAN and agglomerative clustering, on my dataset and compare the results in order to select the "best" one. For validating centroid-based clustering I know there are the operators "Cluster Distance Performance" and "Cluster Density Performance". But what about performance evaluation for DBSCAN or agglomerative clustering? How can I do this?

Is there still something like the Global Silhouette Index, as used in "Rapid Miner - Data Mining Use Cases and Business Analytics Application", for this kind of problem?

Thanks for your help.

Elite III

Re: Cluster Performance DBScan and agglomerative Clustering

Good question. I don't know about the Global Silhouette Index, but in the meantime you do have a couple of other options. You could turn your clusters into labels and then try to predict them with predictive modeling algorithms, where "best" would presumably mean the clustering whose clusters are most easily separated by simple classifiers such as Naive Bayes or Decision Trees. Or, if you already have labels (separate from the clusters themselves), you could use "Map Clustering on Labels" and do something similar. Or run a predictive model using only the cluster attribute against your existing labels.

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II

Re: Cluster Performance DBScan and agglomerative Clustering

Thanks for your quick response.

Unfortunately I don't have any labels.

So your suggestion is to interpret the clusters as labels and then use, e.g., a Decision Tree with the clusters as the label attribute, right? But how exactly can I then see which one is the best cluster? I didn't get that yet.

Elite III

Re: Cluster Performance DBScan and agglomerative Clustering

Then I am not at all sure what you mean by "the best cluster" in this context. If you have some way of assigning values to individual clusters (e.g., you have some other label variable), then you can do what I suggested above. But if you don't have an external label, then you can only evaluate your clusters with respect to your (presumably many different) input attributes, which you can do by making your clusters the label and then looking for the patterns that distinguish one cluster from the others. I am not sure how you could decide which individual cluster was best in that scenario, because I don't know what it would mean for one cluster to be "better" than another. You could, however, evaluate different clustering methods as a whole against each other by seeing which ones produce the most distinct clusters (based on turning the clusters into labels and then evaluating the strength of the models used to predict the clusters).

Contributor II

Re: Cluster Performance DBScan and agglomerative Clustering

Yes, sorry, the phrase "best cluster" in my post was wrong. I meant that I want to evaluate different clustering methods and compare them, but I haven't yet understood how I can evaluate the strength of the models used to predict the clusters, e.g. with a Decision Tree as you suggested.

Elite III

Re: Cluster Performance DBScan and agglomerative Clustering

If you are using the clusters as labels, then once you build a few predictive models, you would simply use standard measures of model performance such as ROC AUC, accuracy, F1 score, etc. Take a look at the "Performance (Classification)" operator for more details and many different performance measure options.
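To make the clusters-as-labels idea concrete outside RapidMiner, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not how any RapidMiner operator works: a nearest-centroid classifier stands in for the Decision Tree or Naive Bayes mentioned above, and the function name, the toy points, and the two candidate clusterings are all made up for the example.

```python
import math
import random


def nearest_centroid_cv_accuracy(points, cluster_ids, folds=5, seed=0):
    """Cross-validated accuracy of a nearest-centroid classifier that
    predicts each point's cluster id from its attributes.
    (Hypothetical helper, a stand-in for a Decision Tree or Naive Bayes.)"""
    indices = list(range(len(points)))
    random.Random(seed).shuffle(indices)
    correct = 0
    for fold in range(folds):
        test = set(indices[fold::folds])
        # "Train": collect the training points of each cluster.
        members = {}
        for i in indices:
            if i not in test:
                members.setdefault(cluster_ids[i], []).append(points[i])
        centroids = {
            c: [sum(col) / len(m) for col in zip(*m)]
            for c, m in members.items()
        }
        # "Test": predict the cluster whose centroid is closest.
        for i in test:
            predicted = min(
                centroids, key=lambda c: math.dist(points[i], centroids[c])
            )
            correct += predicted == cluster_ids[i]
    return correct / len(points)


# Toy data (assumed): two compact, well separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
good_clustering = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # follows the groups
poor_clustering = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # ignores the groups

acc_good = nearest_centroid_cv_accuracy(points, good_clustering)
acc_poor = nearest_centroid_cv_accuracy(points, poor_clustering)
print(acc_good, acc_poor)
```

The clustering whose cluster ids are easier to predict from the input attributes gets the higher accuracy, which is the sense of "most distinct" used above. The score depends on the classifier chosen, so it is best read as a relative comparison between clusterings of the same data.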

Community Manager

Re: Cluster Performance DBScan and agglomerative Clustering

Thanks @Telcontar120, I was thinking along the same lines.

 

@hana1 you may also consider trying the Davies-Bouldin index as implemented in the Cluster Distance Performance operator, as this appears to me (?) to accomplish a similar goal.
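For reference, the textbook definition of the Davies-Bouldin index can be sketched in a few lines of plain Python. This is only a sketch of the standard formula, not a description of what the Cluster Distance Performance operator does internally; the function name and the toy data are assumptions. Each cluster's centroid is taken as the mean of its members, so the only inputs are the points and their cluster assignments.

```python
import math


def davies_bouldin(points, cluster_ids):
    """Davies-Bouldin index; lower values mean more compact,
    better separated clusters. (Hypothetical helper name.)"""
    clusters = {}
    for p, c in zip(points, cluster_ids):
        clusters.setdefault(c, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Centroid of a cluster = attribute-wise mean of its members.
    centroids = {
        c: [sum(col) / len(m) for col in zip(*m)]
        for c, m in clusters.items()
    }
    # Scatter of a cluster = mean distance of its members to the centroid.
    scatter = {
        c: sum(math.dist(p, centroids[c]) for p in m) / len(m)
        for c, m in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Note: this divides by zero if two centroids coincide.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)


# Toy data (assumed): two compact, well separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
db_good = davies_bouldin(points, [0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
db_poor = davies_bouldin(points, [0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
print(db_good, db_poor)
```

The assignment that follows the two groups gets a much lower index than the alternating one. Because the index summarises each cluster by its mean, it tends to favour roughly convex, blob-shaped clusters.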

 

I don't know the Global Silhouette Index either...always something new to learn about!

 

Scott

Contributor II

Re: Cluster Performance DBScan and agglomerative Clustering

But can I also use the Davies-Bouldin index for DBSCAN and agglomerative clustering? The documentation says that the distance performance is only for centroid-based clustering.
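One measure that does not depend on centroids at all is the silhouette coefficient mentioned in the original question: it needs only the cluster assignments and pairwise distances, so in principle it applies to the output of any clustering method. Below is a minimal plain-Python sketch of the mean silhouette, computed outside RapidMiner. The function name, the `noise_id` parameter, and the choice to leave noise points out of the score are assumptions made for this example, and the toy data is invented.

```python
import math


def mean_silhouette(points, cluster_ids, noise_id=None):
    """Mean silhouette coefficient over all non-noise points (range -1..1,
    higher is better). Points labelled `noise_id` are ignored, which is an
    assumption about how to treat DBSCAN noise, not a standard rule."""
    data = [(p, c) for p, c in zip(points, cluster_ids) if c != noise_id]
    clusters = {}
    for p, c in data:
        clusters.setdefault(c, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for p, c in data:
        own = clusters[c]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, hence the len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in m) / len(m)
            for k, m in clusters.items() if k != c
        )
        total += (b - a) / max(a, b)
    return total / len(data)


# Toy data (assumed): two compact groups plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
sil_good = mean_silhouette(points, [0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
sil_poor = mean_silhouette(points, [0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Same data with an outlier marked as noise (label -1 is an assumption).
sil_noise = mean_silhouette(
    points + [(50, -50)],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, -1],
    noise_id=-1,
)
print(sil_good, sil_poor, sil_noise)
```

Running the same function on the assignments produced by k-means, DBSCAN and agglomerative clustering gives one comparable number per method. How noise points are handled changes the result noticeably, so the same convention should be used for every method being compared.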