Cluster Performance for DBSCAN and Agglomerative Clustering

hana1 Member Posts: 6 Contributor II
edited December 2018 in Help

Hello,

I want to try different clustering algorithms, such as k-means, DBSCAN and agglomerative clustering, on my dataset and compare the results in order to select the "best" one. For validating centroid-based clustering, I know there are the operators "Cluster Distance Performance" and "Cluster Density Performance". But what about performance evaluation for DBSCAN or agglomerative clustering? How can I do this?

Is there still something like the Global Silhouette Index, as used in "RapidMiner: Data Mining Use Cases and Business Analytics Applications", for this kind of problem?
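For reference, the ordinary silhouette coefficient only needs the points and their cluster assignments, so it can be computed for the output of any clustering method. Below is a minimal sketch in plain Python, assuming Euclidean distance and numeric attributes. Whether this mean silhouette matches the book's "Global Silhouette Index" exactly is an assumption, and the function name `mean_silhouette` and its `noise_label` parameter are hypothetical names chosen for illustration, not RapidMiner operators.

```python
import math


def mean_silhouette(points, labels, noise_label=None):
    """Average silhouette width over all clustered points (Euclidean distance)."""
    # Optionally drop noise points, e.g. the outliers reported by DBSCAN.
    data = [(p, l) for p, l in zip(points, labels) if l != noise_label]

    clusters = {}
    for p, l in data:
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, l in data:
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != l
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, and values below 0 indicate points that sit closer to another cluster than to their own. Comparing this number across k-means, DBSCAN and agglomerative results on the same data is one possible label-free comparison.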

 

Thanks for your help.

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Good question. I don't know about the Global Silhouette Index, but in the meantime you do have a couple of other options. You could turn your clusters into labels and then attempt to diagnose them using predictive modeling algorithms, where "best" would presumably mean the clustering whose clusters are easiest to separate using simple classifiers such as Naive Bayes or Decision Trees. Or, if you already have labels (not the clusters themselves), you could use "Map Clustering on Labels" and do something similar. Or run a predictive model using only the cluster attribute against your existing labels.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • hana1 Member Posts: 6 Contributor II

    Thanks for your quick response.

    Unfortunately I don't have any labels.

    So your suggestion is to interpret the clusters as labels and then use, e.g., a Decision Tree with the clusters as the label attribute, right? But how exactly can I then see which one is the best cluster? I didn't get that yet.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Then I am not at all sure what you mean by "the best cluster" in this context. If you have some way of assigning values to individual clusters (e.g., you have some other label variable), then you can do what I suggested above. But if you don't have an external label, you can only evaluate your clusters with respect to your (presumably many different) input attributes, which you can do by making your clusters the label and then looking at the patterns that distinguish one cluster from the others. I am not sure how you could decide which individual cluster was best in that scenario, because I don't know what it would mean for one cluster to be "better" than another. You could, however, evaluate different clustering methods as a whole against each other by seeing which ones produce the most distinct clusters (by turning the clusters into labels and then evaluating the strength of the models used to predict the clusters).

  • hana1 Member Posts: 6 Contributor II

    Yes, sorry, the phrase "best cluster" in my post was wrong. I meant that I want to evaluate different clustering methods and compare them, but I don't yet understand how I can evaluate the strength of the models used to predict the clusters, e.g. with a Decision Tree as you suggested.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If you are using the clusters as labels, then once you build a few predictive models, you would simply use standard measures of model performance such as ROC AUC, accuracy, F1 score, etc. Take a look at the "Performance (Classification)" operator for more details and many different performance measure options.
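Outside RapidMiner, the idea of treating clusters as labels and scoring a classifier on them can be sketched in a few lines of plain Python. To keep the example dependency-free, it swaps in a leave-one-out 1-nearest-neighbour classifier in place of the Decision Tree or Naive Bayes mentioned above, and it reports plain accuracy. The function name `loo_1nn_accuracy` and the toy data are hypothetical and only there for illustration.

```python
import math


def loo_1nn_accuracy(points, cluster_labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that
    predicts each point's cluster from all the other points."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    correct = 0
    for i, p in enumerate(points):
        # Find the closest other point and predict its cluster label.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        if cluster_labels[nearest] == cluster_labels[i]:
            correct += 1
    return correct / len(points)


# Hypothetical toy data: the same points labelled by two clustering runs.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
run_a = [0, 0, 0, 1, 1, 1]  # labels from one clustering method
run_b = [0, 1, 0, 1, 0, 1]  # labels from another clustering method

acc_a = loo_1nn_accuracy(points, run_a)
acc_b = loo_1nn_accuracy(points, run_b)
print(f"run A: {acc_a:.2f}, run B: {acc_b:.2f}")
```

The clustering whose labels are easier to predict from the attributes gets the higher score. Because the clusters were built from the same attributes, the scores tend to be high for any reasonable clustering, so they are probably more useful as a relative comparison between methods than as an absolute measure.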

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Thanks @Telcontar120 - I was thinking along the same lines.  :)

    @hana1 you may also consider trying the Davies-Bouldin Index as implemented in the Cluster Distance Performance operator, as this appears to me (?) to accomplish a similar goal.
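As a rough illustration of what the Davies-Bouldin index measures, here is a minimal sketch in plain Python. It is not the implementation used by the Cluster Distance Performance operator. It assumes Euclidean distance and takes each cluster's centroid to be the mean of its members, so it only needs cluster assignments and can be evaluated for any partition. The function name `davies_bouldin` and the `noise_label` parameter are hypothetical.

```python
import math


def davies_bouldin(points, labels, noise_label=None):
    """Davies-Bouldin index computed from cluster assignments (lower is better)."""
    # Optionally drop noise points, e.g. the outliers reported by DBSCAN.
    data = [(p, l) for p, l in zip(points, labels) if l != noise_label]

    clusters = {}
    for p, l in data:
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of a cluster: per-dimension mean of its members.
    centroids = {
        l: tuple(sum(dim) / len(members) for dim in zip(*members))
        for l, members in clusters.items()
    }
    # Scatter of a cluster: mean distance of its members to the centroid.
    scatter = {
        l: sum(math.dist(p, centroids[l]) for p in members) / len(members)
        for l, members in clusters.items()
    }

    total = 0.0
    for i in clusters:
        worst = 0.0
        for j in clusters:
            if j == i:
                continue
            separation = math.dist(centroids[i], centroids[j])
            if separation == 0:
                return math.inf  # coinciding centroids, e.g. concentric clusters
            worst = max(worst, (scatter[i] + scatter[j]) / separation)
        total += worst
    return total / len(clusters)
```

Since the index rewards compact clusters with well-separated centroids, it may score elongated or nested clusters poorly even when a density-based method has found them correctly, which is worth keeping in mind when comparing it across methods.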

     

    I don't know the Global Silhouette Index either... always something new to learn about!

    Scott

  • hana1 Member Posts: 6 Contributor II

    But can I also use the Davies-Bouldin index for DBSCAN and agglomerative clustering? The documentation says that the distance performance is only for centroid-based clustering.

  • Muhammed_Fatih_ Member Posts: 93 Maven
    @hana1 very good question! Can the DB index be applied to density-based approaches like DBSCAN, @sgenzer?

    Thank you in advance for your support, community!