"Evaluating Cluster Performance"

Pinguicula Member Posts: 12 Contributor II
edited May 2019 in Help
Hi Everybody,

I'm ploughing through the more basic clustering literature and I am surprised that there seem to be no simple criteria to evaluate the overall quality of a clustering.
So far, the criteria I came across either demand a lot of computation (silhouette coefficient) or are not much help in deciding on the number of clusters when the number of data points is reasonably high (e.g. the Akaike and Schwarz information criteria, MDL). (Actually I don't want to reduce 10,000 data points to 200 clusters, but to a number I can still interpret ;)) The problem for me is that these criteria do not seem to penalize additional clusters severely enough, so they end up proposing the solution with more clusters.
I suspect I simply haven't hit the right literature so far; does anybody know a good reference?

During the last few days the idea came to my mind to compare the performance of a clustering of the real data with a given number of clusters (k) with the result that would be achieved with the same number of clusters on an artificial data set.
To be a reasonable candidate for the final number of clusters, k should perform significantly better on the real data set than on the white-noise one. This led me to the following formula, which yields the minimum share s of the dataset's variance that should be explained by a given number of clusters, if the white noise is generated by a uniform distribution.

s(D,k) = 1 - ( (SUM_d rn(d)^2)^0.5 / (k * D^0.5) )^(2/D)

s := minimum share of the dataset's variance to be explained
D := number of data dimensions
k := number of clusters
d := index running over the D dimensions
r(d) := range of dimension d
rm := maximum range over all considered dimensions
rn(d) := r(d)/rm [standardized range of dimension d]
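
To make the formula concrete, here is a small Java sketch of s(D,k) (class and method names are my own, purely for illustration):

    public class WhiteNoiseBenchmark {

        // s(D,k): minimum share of variance that k clusters should explain,
        // derived from a uniform white-noise benchmark; ranges[d] = r(d)
        public static double minimumExplainedShare(double[] ranges, int k) {
            int D = ranges.length;
            // rm: maximum range over all considered dimensions
            double rm = Double.NEGATIVE_INFINITY;
            for (double r : ranges) {
                rm = Math.max(rm, r);
            }
            // SUM_d rn(d)^2 with rn(d) = r(d) / rm
            double sumSquares = 0.0;
            for (double r : ranges) {
                double rn = r / rm;
                sumSquares += rn * rn;
            }
            double ratio = Math.sqrt(sumSquares) / (k * Math.sqrt(D));
            return 1.0 - Math.pow(ratio, 2.0 / D);
        }

        public static void main(String[] args) {
            double[] ranges = {1.0, 2.0, 4.0}; // ranges of three example dimensions
            for (int k = 2; k <= 5; k++) {
                System.out.printf("k=%d  s=%.4f%n", k, minimumExplainedShare(ranges, k));
            }
        }
    }

Running the main method shows how the white-noise benchmark s grows, with diminishing steps, as k increases.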

In the end, s(D,k) or its derivative could be used as a very conservative knock-out criterion to evaluate the quality of a clustering on real data, since its benchmark should always be exceeded (at least in the case of k-means).
In this context I have two questions:
A) Is my idea correct?
B) Does anybody know the generalization to a normal distribution?

Thank you in advance for any comment.

Best regards

Norbert

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Norbert,

    During the last few days the idea came to my mind to compare the performance of a clustering of the real data with a given number of clusters (k) with the result that would be achieved with the same number of clusters on an artificial data set. To be a reasonable candidate for the final number of clusters, k should perform significantly better on the real data set than on the white-noise one.
    But then the clustering would still get better with higher values for k, right?


    Did you have a look into the DB index? This criterion should take the number of clusters at least a bit into account. Another idea is a multi-objective optimization of k together with some quality criterion, where k should be minimized. You can have a look into my 2005/2006 literature on feature selection/construction for clustering if you are interested in those approaches.
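
    For reference, the usual textbook definition of the Davies-Bouldin index is (with S_i the average distance of the members of cluster i to its centroid c_i, and d(c_i, c_j) the distance between the centroids of clusters i and j):

        DB = (1/k) * SUM_{i=1..k} max_{j != i} (S_i + S_j) / d(c_i, c_j)

    Lower values indicate compact and well-separated clusters.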

    Cheers,
    Ingo
  • Pinguicula Member Posts: 12 Contributor II
    Hi Ingo,

    Thank you for your instant reply and your hints.

    Yes, s(D,k) increases with an increasing number of clusters.
    The idea behind this criterion is that s(D,k) depicts the increase in clustering quality that would be seen even in white-noise data.
    Therefore one could track the development of s(D,k) and compare it with the observed development of the clustering quality on the real data set. One could then reject additional clusters if they improve the quality only by a fraction that would also be observed in a white-noise setting.
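
    As a sketch, such a rejection rule could look like this (building on the hypothetical minimumExplainedShare method from my first post; the method name and the array convention are my own):

        // Hypothetical rejection rule: stop adding clusters once the observed
        // gain in explained variance no longer beats the white-noise gain.
        // observedShare[k] holds the share of variance explained by k clusters
        // on the real data; the array needs at least maxK + 1 entries
        // (index 0 is unused).
        public static int chooseNumberOfClusters(double[] observedShare, double[] ranges, int maxK) {
            int k = 1;
            while (k < maxK) {
                double realGain = observedShare[k + 1] - observedShare[k];
                double noiseGain = WhiteNoiseBenchmark.minimumExplainedShare(ranges, k + 1)
                                 - WhiteNoiseBenchmark.minimumExplainedShare(ranges, k);
                // the additional cluster must help more than it would on white noise
                if (realGain <= noiseGain) {
                    break;
                }
                k++;
            }
            return k;
        }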

    Looking at the DB index: which of the 18 variants have you implemented in RM?
    (see F. Azuaje, "A cluster validity framework for genome expression data,"
    Bioinformatics, vol. 18, pp. 319-320, 2002)

    Best regards

    Norbert

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Therefore one could track the development of s(D,k) and compare it with the observed development of the clustering quality on the real data set. One could then reject additional clusters if they improve the quality only by a fraction that would also be observed in a white-noise setting.
    I see. But it could still be that adding even more clusters again leads to better results, right? So this criterion might still not be fully appropriate for automatic optimization, but it would of course still be an interesting criterion. I am, however, not sure whether I have read about something like this before - at least I remember that Michael Wurst (the main author of the RM clustering parts) integrated a random clustering for a similar comparison.

    About the DB index: here is the relevant code (I hope, at least) from the class CentroidBasedEvaluator:

        private double getDaviesBouldin(CentroidBasedClusterModel cm, ExampleSet es) {
            double[] withinClusterDistance = new double[cm.getNumberOfClusters()];

            // S_i: average distance of each cluster's objects to its centroid
            for (int i = 0; i < cm.getNumberOfClusters(); i++) {
                int count = 0;
                double sum = 0.0;
                List<String> objs = new IterationArrayList<String>(cm.getClusterAt(i).getObjects());
                for (int j = 0; j < objs.size(); j++) {
                    String d = objs.get(j);
                    double v = cm.getDistanceFromCentroid(i, IdUtils.getExampleFromId(es, d));
                    sum = sum + v;
                    count++;
                }

                if (count > 0)
                    withinClusterDistance[i] = sum / count;
                else
                    withinClusterDistance[i] = 0.0;
            }

            // for each cluster i, find the worst (largest) ratio of summed
            // within-cluster distances to the centroid distance, then average
            double sum2 = 0.0;

            for (int i = 0; i < cm.getNumberOfClusters(); i++) {
                double max = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < cm.getNumberOfClusters(); j++)
                    if (i != j) {
                        double val = (withinClusterDistance[i] + withinClusterDistance[j]) / cm.getCentroidDistance(i, j);
                        if (val > max)
                            max = val;
                    }
                sum2 = sum2 + max;
            }
            return sum2 / cm.getNumberOfClusters();
        }
    Maybe this helps.

    Cheers,
    Ingo