08-10-2017 08:34 AM
I want to try different clustering algorithms such as k-means, DBSCAN and agglomerative clustering on my dataset and compare the results in order to select the "best" one. For validating centroid-based clustering I know there are the operators "Cluster Distance Performance" and "Cluster Density Performance". But what about performance evaluation for DBSCAN or agglomerative clustering? How can I do this?
Is there still something like the Global Silhouette Index, as used in "Rapid Miner - Data Mining Use Cases and Business Analytics Application", for this kind of problem?
Thanks for your help.
08-10-2017 08:39 AM
Good question. I don't know about the Global Silhouette Index, but in the meantime you do have a couple of other options. You could turn your clusters into labels and then try to predict them with predictive modeling algorithms, where "best" would presumably mean the clustering whose clusters are easiest to separate using simple classifiers such as Naive Bayes or Decision Trees. Or, if you already have labels (separate from the clusters themselves), you could use "Map Clustering on Labels" and do something similar. Or run a predictive model using only the cluster attribute against your existing labels.
08-10-2017 08:50 AM
Thanks for your quick response.
Unfortunately I don't have any labels.
So your suggestion is to interpret the clusters as labels and then use, e.g., a Decision Tree with the clusters as the label attribute, right? But with this, how exactly can I see which one is the best cluster? I didn't get that yet.
08-10-2017 09:00 AM
Then I am not at all sure what you mean by "the best cluster" in this context. If you have some way of assigning values to individual clusters (e.g., you have some other label variable) then you can do what I suggested above. But if you don't have an external label, then you can only evaluate your clusters with respect to your (presumably many different) input attributes, which you can do by making your clusters the label and then looking for differences in the patterns of what distinguishes one cluster from the others. But I am not sure how you could decide which individual cluster was best under that kind of scenario because I don't know what it would mean for one cluster to be "better" than another. You could however evaluate different clustering methods as a whole against each other, by seeing which ones produce clusters that are most distinct (based on turning the clusters into labels and then evaluating the strength of the models used to predict the clusters).
08-10-2017 09:06 AM
Yes, sorry, the phrase "best cluster" in my post was wrong. I meant that I want to evaluate different clustering methods and compare them, but I don't yet understand how I can evaluate the strength of the models used to predict the clusters, e.g. with a Decision Tree as you suggested.
08-10-2017 09:08 AM
If you are using the clusters as labels, then once you build a few predictive models, you would simply use standard measures of model performance such as ROC AUC, accuracy, F1 score, etc. Take a look at the "Performance (classification)" operator for more details and many different performance measure options.
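As a rough illustration of the clusters-as-labels idea outside RapidMiner, here is a minimal Python sketch. It is not the "Performance (classification)" operator: a leave-one-out 1-nearest-neighbour classifier stands in for the Decision Tree or Naive Bayes mentioned above, and the function names, toy points and the two candidate label lists (`method_a`, `method_b`) are all made up for the example.

```python
import math

def loo_1nn_predictions(points, cluster_labels):
    """Predict each point's cluster from its nearest OTHER point (leave-one-out)."""
    preds = []
    for i, p in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        preds.append(cluster_labels[nearest])
    return preds

def accuracy(actual, predicted):
    """Fraction of points whose predicted cluster matches the assigned one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def macro_f1(actual, predicted):
    """Unweighted mean of the per-cluster F1 scores."""
    f1s = []
    for c in sorted(set(actual)):
        tp = sum(a == c and p == c for a, p in zip(actual, predicted))
        fp = sum(a != c and p == c for a, p in zip(actual, predicted))
        fn = sum(a == c and p != c for a, p in zip(actual, predicted))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Toy data: two well-separated groups of three points each (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Hypothetical outputs of two clustering methods on the same data.
candidates = {
    "method_a": [0, 0, 0, 1, 1, 1],  # follows the two groups
    "method_b": [0, 1, 0, 1, 0, 1],  # mixes the two groups
}

results = {}
for name, labels in candidates.items():
    preds = loo_1nn_predictions(points, labels)
    results[name] = (accuracy(labels, preds), macro_f1(labels, preds))
    print(name, "accuracy=%.2f macro_f1=%.2f" % results[name])
```

The clustering whose labels are easier to predict from the input attributes scores higher. Note that this only measures separability as seen by the chosen classifier, and a clustering that puts everything into a single cluster would score perfectly, so the number should be read alongside the cluster count.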
08-10-2017 12:15 PM
Thanks @Telcontar120 - I was thinking along the same lines.
I don't know the Global Silhouette Index either...always something new to learn about!
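For reference, a silhouette-style score needs no centroids, so it can be computed for DBSCAN or agglomerative results as well as k-means. Below is a minimal pure-Python sketch of the standard per-point silhouette, s = (b - a) / max(a, b), averaged over all clustered points. Whether the book's "Global Silhouette Index" averages over points or over clusters is not confirmed here, and the function names, the Euclidean distance, and the treatment of label -1 as DBSCAN-style noise are assumptions made for the example.

```python
import math

def silhouette_scores(points, labels, noise_label=-1):
    """Return one silhouette value per point; noise points get None."""
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != noise_label:
            clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        if lab == noise_label:
            scores.append(None)  # noise is excluded from the score
            continue
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # convention for singletons / a single cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

def mean_silhouette(points, labels, noise_label=-1):
    """Average silhouette over all non-noise points."""
    vals = [s for s in silhouette_scores(points, labels, noise_label) if s is not None]
    if not vals:
        raise ValueError("no clustered points to score")
    return sum(vals) / len(vals)

# Toy data: two well-separated groups (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]  # hypothetical result of one clustering method
bad = [0, 1, 0, 1, 0, 1]   # hypothetical result of another method

print("good:", round(mean_silhouette(points, good), 3))
print("bad: ", round(mean_silhouette(points, bad), 3))
```

Values near 1 indicate compact, well-separated clusters, and values near 0 or below indicate overlapping or misassigned points, so the same function can be used to compare the label vectors produced by different clustering methods on the same data.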