Cluster Distance Performance Davies Bouldin fails on certain data sets

land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
edited December 2018 in Product Feedback - Resolved

Hi,

The Davies Bouldin index calculation fails and returns infinite numbers. This happens when you have cluster centroids that are not assigned any examples. That is guaranteed to happen if your data set has fewer rows than your cluster model has centroids, but it also happens by chance that k-Means (strangely) delivers cluster centroids that are not used at all. (I would like that not to happen either. Probably bad initialization.)

 

The reason is in line 250 of CentroidBasedEvaluator:

    // averaging by cluster sizes and sum over all
    for (int i = 0; i < numberOfClusters; i++) {
        withinClusterDistance[i] /= clusterSizes[i];
    }

clusterSizes[i] is then 0 if no example has been assigned, causing the distance to become NaN. A simple check is all that is needed: not dividing in this situation would be fine.
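A minimal sketch of the proposed guard (the array and variable names follow the snippet above; the surrounding class is made up here just to make it runnable):

```java
import java.util.Arrays;

// Sketch of the proposed fix: skip the division for empty clusters,
// so their within-cluster distance stays 0.0 instead of becoming NaN.
public class CentroidDivisionGuard {
    public static void main(String[] args) {
        double[] withinClusterDistance = {4.0, 0.0, 9.0};
        int[] clusterSizes = {2, 0, 3}; // cluster 1 received no examples
        int numberOfClusters = clusterSizes.length;

        for (int i = 0; i < numberOfClusters; i++) {
            if (clusterSizes[i] > 0) {              // guard against 0/0 -> NaN
                withinClusterDistance[i] /= clusterSizes[i];
            }                                        // empty cluster stays 0.0
        }
        System.out.println(Arrays.toString(withinClusterDistance)); // [2.0, 0.0, 3.0]
    }
}
```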

 

Greetings,

 Sebastian


Fixed and Released · Last Updated

RM-3951

Comments

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi @land,

     

    shouldn't the expected result for a cluster with no associated examples be a missing value?

     

    BR,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi Martin,

    this is a performance measure, and the situation is comparable to a classification where one class is never predicted and your accuracy becomes infinite.

    I would also argue that the average intra-cluster distance is rather 0 than missing, but one can debate that. However, the Davies Bouldin index should still be computable.

    So: No, this is just a plain bug, I'm pretty sure :)

    Greetings,

     Sebastian

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi @land,

    I think I disagree for the construction of a single DB value. I mean, it's something like

    \sum_i db(i) \quad \text{for each } i \in \text{cluster}

    where i denotes an example. This is simply not defined if the cluster is an empty set. If a function value is not defined, then it should be coded as NaN.

     

    Another story is how we average 10 DB values where one is a NaN. I think there we do something "odd". NaNs should be excluded from the averages. The only question is whether you divide by 10 or by 9.
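For illustration, a NaN-excluding average could look like this (hypothetical helper, not the actual RapidMiner code; this version divides by the count of valid entries, i.e. by 9 in the example above):

```java
// Hypothetical sketch: average per-cluster DB values while skipping NaNs.
public class NanAverage {
    static double averageIgnoringNaN(double[] values) {
        double sum = 0.0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) { // exclude undefined cluster values
                sum += v;
                count++;
            }
        }
        return count > 0 ? sum / count : Double.NaN;
    }

    public static void main(String[] args) {
        double[] dbValues = {1.0, 2.0, Double.NaN, 3.0};
        System.out.println(averageIgnoringNaN(dbValues)); // 2.0
    }
}
```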

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi @mschmitz,

    there's no Davies Bouldin per example. The examples of a cluster are only used for the intra-cluster distance. The value per cluster, which is later averaged into the final Davies Bouldin index, is computed as the maximum, over all combinations with other clusters, of the ratio of the intra-cluster distance to the inter-cluster distance.
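Written out, the usual textbook form is consistent with this (assuming the standard definition; my notation: S_i is the average intra-cluster distance of cluster i, M_{ij} the distance between centroids i and j, k the number of clusters):

```latex
D_i = \max_{j \neq i} \frac{S_i + S_j}{M_{ij}},
\qquad
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} D_i
```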

     

    Hence the problem is that an entire cluster is never used (meaning the k-Means implementation has a problem with initialization). That causes the intra-cluster distance to become NaN, as it is 0/0. But then the max check fails. If a plain 0 intra-cluster distance were returned, the cluster would usually have no impact, as it is unlikely to be the max.
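The NaN propagation described here is easy to reproduce: once 0/0 enters the computation, Java's Math.max carries the NaN through, whereas a plain 0.0 would simply lose against any positive ratio.

```java
public class NanMaxDemo {
    public static void main(String[] args) {
        double intraEmpty = 0.0 / 0.0;                 // empty cluster: 0 distance / 0 examples
        System.out.println(intraEmpty);                // NaN
        // Math.max propagates NaN, so one empty cluster poisons the max check...
        System.out.println(Math.max(5.0, intraEmpty)); // NaN
        // ...whereas a plain 0.0 intra-cluster distance is harmless.
        System.out.println(Math.max(5.0, 0.0));        // 5.0
    }
}
```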

    Hence it would be fine to not change the average at the end at all.

     

    Greetings,

    Sebastian

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

    so the problem is the distance to the other clusters. Still fishy. The other cluster does not really exist. Maybe one should just add the distance to the cluster center and not the distance to the points? I just fear that counting it anyway (i.e. adding a 0) introduces a strong bias into the measure.

     

    BR,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany