Compare clustering performance

ahootanha · April 2018

Hello
How can I compare two clustering algorithms, for example k-means and DBSCAN, and say which one is better on a given dataset? What criteria should I use?

sgenzer · April 2018

hello @ahootanha - there are several operators available to evaluate cluster performance:

[Screenshot: Screen Shot 2018-04-06 at 9.15.11 AM.png]

And if you go to any of these operators, there are tutorials on how to use them:

[Screenshot: Screen Shot 2018-04-06 at 9.15.41 AM.png]

Scott

ahootanha · April 2018

Hello, thank you very much for your guidance.
Yes, I know about these operators. But I do not know how, and by what criteria, to compare the two clustering methods k-means and DBSCAN and say which one is better.
Thanks

Telcontar120 · April 2018

What @sgenzer is suggesting is that there are multiple ways of comparing clusterings, and there is no single definition of which one is "better". This is even more true for clustering than for predictive models, because clustering is generally an unsupervised approach, so you don't know in advance what the outcome should look like. If you do any general reading about clustering performance, you will see a lot of discussion about what constitutes the "best" clustering solution for a given dataset and clustering method, and there is no universal agreement. So it depends on your use case and the goals of your project: what are you trying to accomplish with the clustering? For example, is it better if the observations in each cluster are more like each other, or is it better to have fewer clusters? No one on the forum can answer those questions for you; we can only point you to the tools in RapidMiner that will help you understand and evaluate your clusters using a number of widely used methods.
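
To make one such criterion concrete: the silhouette coefficient is a common internal measure that rewards clusters whose members are close to each other and far from the nearest other cluster. Below is a minimal plain-Python sketch, not a RapidMiner operator. The function name `silhouette`, the toy points and both labelings are made up for illustration; in practice you would pass in the labels produced by k-means and by DBSCAN on the same data.

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two obvious groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels_a = [0, 0, 0, 1, 1, 1]  # matches the two groups
labels_b = [0, 0, 1, 1, 1, 1]  # puts one point into the wrong group

score_a = silhouette(points, labels_a)
score_b = silhouette(points, labels_b)
print(score_a, score_b)
```

A higher value means more compact, better separated clusters under this one criterion only. DBSCAN can also mark points as noise, and you have to decide how to treat those before comparing; this sketch does not handle that.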

ahootanha · April 2018

hi

How, and according to what criteria, can I tell which one performs best on my data?

kypexin · April 2018

Hi @ahootanha

I will try to explain further what previous commenters have pointed out.

A clustering result is subjective in the sense that you need to know what result and what kind of cluster separation you are expecting, and this depends entirely on the domain and the type of dataset.

Have a look at the example plots below, where I performed clustering on the same dataset with different numbers of clusters (k=2, k=3 and k=4):

2 clusters.png

3 clusters.png

4 clusters.png

Technically, all three results are valid, as the data points are pretty well separated into clusters. You cannot say, just by looking at these plots, that one of them is 'better' than the others. You also need to understand what exactly this data represents and how exactly you want to cluster it, given its nature and domain.

But as soon as you know that this example is the Iris dataset, which we know beforehand contains 3 different species to distinguish between, the right number of clusters is 3. At the same time, clustering with only 2 clusters also makes sense, though it obviously reveals only one group of species that is significantly different from the other. What it does not reveal is the further differences within the second group.

That said, you really need to formulate the business (or scientific, or any other) problem before you do the clustering, and interpret the result with that particular question in mind.
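
When reference labels are available, as with the species in the Iris case, a clustering can also be checked against them with an external measure. As a rough illustration, here is a plain-Python sketch of the (unadjusted) Rand index: the fraction of point pairs that both labelings treat the same way, either together or apart. The function name `rand_index` and the tiny label lists are made up; they are not the real Iris data.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total

# Hypothetical reference labels: three groups of two points each.
species = ["a", "a", "b", "b", "c", "c"]
three_clusters = [0, 0, 1, 1, 2, 2]  # recovers all three groups
two_clusters = [0, 0, 1, 1, 1, 1]    # merges groups "b" and "c"

ri_three = rand_index(species, three_clusters)
ri_two = rand_index(species, two_clusters)
print(ri_three, ri_two)
```

The two-cluster labeling loses agreement only on the pairs that mix the merged groups, which matches the point above: it separates one group correctly but hides the differences inside the other. This kind of check only applies when you actually have trusted labels, which is usually not the case in clustering.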

jabra · April 2018

Hello
Is it possible to do this kind of clustering on text?
And is it possible to share a screenshot of the process, showing the operators used?
How do I use k-means with Map Clustering on Labels?

kypexin · April 2018

Hi @jabra

Sure, it is possible; however, I have never accomplished this task myself. Still, you can find quite a few posts in the community regarding text clustering: https://community.rapidminer.com/t5/forums/searchpage/tab/message?advanced=false&allow_punctuation=false&q=text%20clustering

jabra · April 2018

Thanks a lot
I searched a lot but did not find an answer. Can you help me?

kypexin · April 2018

Hi @jabra

I could probably only come up with some ideas, and only if you share your dataset and clearly describe the goal you want to achieve by clustering it.

pschlunder · April 2018

Hi,

Maybe another view on performance metrics for clustering: these methods are often based on descriptive statistics, or are simply a mapping from the data in a cluster to a number (data-based, inherent metrics). From the final number for a single cluster alone, you can rarely decide whether something is good or not. It is often the context that provides insight, e.g. comparing those numbers between different clustering techniques, settings or clusters.

A simple example would be the shortest distance between cluster borders. Just knowing that two clusters are a certain distance apart, it would be hard to decide whether they are sufficiently separated, because the distance depends on the metric and scale of the given attribute space. But knowing that other clusters are further apart would help you understand that separating those might be easier, due to the bigger gaps between clusters.
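
The border-distance example can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions: the helper names `min_gap` and `rescale` and the three toy clusters are hypothetical, and Euclidean distance is assumed.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def min_gap(cluster_a, cluster_b):
    """Shortest distance between any point of cluster_a and any point of cluster_b."""
    return min(dist(p, q) for p, q in product(cluster_a, cluster_b))

def rescale(cluster, factor):
    """Multiply every attribute by factor, e.g. to change the unit of measurement."""
    return [tuple(x * factor for x in p) for p in cluster]

# Hypothetical toy clusters lying on a line.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
c = [(10.0, 0.0), (11.0, 0.0)]

gap_ab = min_gap(a, b)  # gap between a and b
gap_bc = min_gap(b, c)  # gap between b and c

# The same data in a unit 1000 times smaller: every raw gap grows by 1000.
gap_ab_scaled = min_gap(rescale(a, 1000), rescale(b, 1000))
gap_bc_scaled = min_gap(rescale(b, 1000), rescale(c, 1000))

print(gap_ab, gap_bc, gap_bc / gap_ab)
print(gap_ab_scaled, gap_bc_scaled, gap_bc_scaled / gap_ab_scaled)
```

The raw gap between `a` and `b` says little on its own, since it changes with the unit. The comparison with the gap between `b` and `c` is what carries the information, and here that ratio stays the same after rescaling.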

Regards,

Philipp
