🥳 RAPIDMINER 9.9 IS OUT!!! 🥳
The updates in 9.9 power advanced use cases and offer productivity enhancements for users who prefer to code.
"Evaluating Cluster Performance"
I'm ploughing my nose through the more basic clustering literature and I wonder that there is no basic criteria to evalute the overall quality of a clustering.
So far, the criteria I came across demand either a lot of calculation (silhouette coefficient) or are not a very good help in the decision on the number of cluster if the number of data points is reasonably high (e.g. Akaike, Schwarz Information Criterion, MDL ). (Actually I don't want to reduce 10.000 to 200 Clusters but to a number I can still interpret ). The problem far me is that these criteria didn't seem to punish additional cluster severly enough in order to propose the solution with more clusters.
Currently I just guess I didn't hit the right literature so far, has anybody a clue good reference.
During the last days the idea came to my mind to compare the performance of a clustering of the real data with a given number of clusters (k) with the result that would be achieved with the same number of clusters on an artificial set.
In order to be reasonable candidate for a the final number of clusters k should perform significantly better on the real data set than on the white noise one. This lead me to following formula, which yields the minimum share of the dataset's variance s that should be explained by a given number of clusters, if the white noise is generated by uniform distribution.
s(d,k) = 1 - ((SUM(rn(d)^2))^0.5/(k*D^0.5))^(2/D)
s:= explained minimum share of the dataset's variance
= number of data dimension
k:= number of clusters
d element of D
r(d):= range of dimension d
rm(d):= maximum range of all considered dimension
rn(d):= r(d)/rm(d) [standardized range of the dimension]
In the end s(d,k) or its derivation could be used as very conservative knock out criteria to evaluate the quality of the clustering on real data, since their benchmarks should always be exceeded (at least in the case of kmeans).
In this context I have to questions:
A) is my idea correct?
Does anbody know the generalization to a normal distribution?
Thank you in advance for any comment.