Average silhouette vs sum of squares vs average within distance vs davies bouldi

mariozupan Member Posts: 15 Contributor II
edited November 2018 in Help
I am trying to find the optimal number of k-means clusters. I got the following values from several cluster performance operators:
Average silhouette (closer to 1 is better)
k     value
2     0.436629028918996
3     0.3082759533058591
4     0.28166001017015313
5     0.2642004909716735
6     0.2687266594105881
7     0.20684027606885227
8     0.20938717797555279
9     0.1989215746446572
10    0.2159248335388874
11    0.20862824967813512
12    0.21515776961871466
13    0.22229187379304438

Sum of squares (closer to 0 is better)
k     value
2     0.5833789973221948
3     0.37635053401019425
5     0.2637793240351113
6     0.22072765997043042
7     0.1775519277095977
9     0.13894604067369032
10    0.13183279742765275
11    0.11787512536057321
12    0.12043920141437744
14    0.11111340867029912
15    0.0978794677474385

Davies Bouldin (closer to 0 is better)
k     value
2     0.9380429190179411
3     1.2019021137767585
5     1.223643902662405
6     1.133405289202767
7     1.0968281280723653
9     1.1200633376736615
10    1.1979846345568537
11    1.1630894077266136
12    1.2048150524976373
14    1.120210017075379
15    1.1432560808642207

Average within distance (closer to 0 is better)
k     value
2     0.06534949797998725
3     0.05185423744778253
5     0.03845893742628533
6     0.03339595659274747
7     0.02958406174889975
9     0.02536301492397515
10    0.02418196109649237
11    0.022728641391481907
12    0.0218420365992699
14    0.019696264589330038
15    0.01864628535658701
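
As a reading aid for the tables above, here is a minimal Python sketch that picks the k each measure prefers. The values are copied from the question and rounded to four decimals; the variable names are only illustrative. One general caveat is built in as an assumption: sum of squares (and likewise average within distance) tends to shrink as k grows, so "closer to 0" alone cannot select k, and the sketch only reports the relative drop between consecutive tested k values.

```python
# Scores from the question, rounded to 4 decimals (k -> value).
silhouette = {
    2: 0.4366, 3: 0.3083, 4: 0.2817, 5: 0.2642, 6: 0.2687, 7: 0.2068,
    8: 0.2094, 9: 0.1989, 10: 0.2159, 11: 0.2086, 12: 0.2152, 13: 0.2223,
}
davies_bouldin = {
    2: 0.9380, 3: 1.2019, 5: 1.2236, 6: 1.1334, 7: 1.0968, 9: 1.1201,
    10: 1.1980, 11: 1.1631, 12: 1.2048, 14: 1.1202, 15: 1.1433,
}
sum_of_squares = {
    2: 0.5834, 3: 0.3764, 5: 0.2638, 6: 0.2207, 7: 0.1776, 9: 0.1389,
    10: 0.1318, 11: 0.1179, 12: 0.1204, 14: 0.1111, 15: 0.0979,
}

# Silhouette: higher is better. Davies Bouldin: lower is better.
best_k_silhouette = max(silhouette, key=silhouette.get)
best_k_davies_bouldin = min(davies_bouldin, key=davies_bouldin.get)
print("best k by silhouette:", best_k_silhouette)
print("best k by Davies Bouldin:", best_k_davies_bouldin)

# Relative drop in sum of squares from one tested k to the next.
# A large drop followed by small ones hints at an "elbow".
ks = sorted(sum_of_squares)
drops = {
    cur: 1 - sum_of_squares[cur] / sum_of_squares[prev]
    for prev, cur in zip(ks, ks[1:])
}
for k, drop in drops.items():
    print(f"up to k={k}: sum of squares changes by {-drop:+.1%}")
```

On these numbers both the silhouette and Davies Bouldin point to k = 2, while the two distance-based measures keep shrinking as k increases.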

None of my performance operators gives a good result for my data. I tried removing outliers, taking the logarithm of the attributes, and normalizing them to the range 0 to 1, and got the following results for 5 clusters:
attribute  cluster1             cluster2             cluster3             cluster4             cluster5
X222       0.832614470761885    0.6164551892773821   0.6682251804332917   0.5019367377913034   0.6709872198085056
X333       0.4813816731397629   0.8084517968969477   0.4073744166141768   0.4418416403356408   0.5815675749379681
X444       0.7072093106534784   0.6221056454535794   0.17922575220116604  0.10192647980428186  0.278179549313975
X111       0.7444156633161193   0.755888014090719    0.6086095238148184   0.3923249690067086   0.7476506411572069

How can I improve the performance? Would particular results of a Shapiro-Wilk test, ANOVA, or t-test give me better k-means clusters?
Could you please show me the way? I really need help.
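
For readers comparing the two normalizations that come up in this thread (range 0 to 1 in the question, Z-transformation in the answers below), here is a minimal Python sketch of both for a single attribute. The function names and the sample column are made up for illustration, and the Z-transformation here assumes the sample standard deviation; the RapidMiner Normalize operator may use a different convention.

```python
from statistics import mean, stdev


def min_max_scale(values):
    """Rescale one attribute to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: nothing to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_transform(values):
    """Shift one attribute to mean 0 and scale to unit sample std. deviation."""
    mu, sigma = mean(values), stdev(values)  # stdev needs at least 2 values
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up attribute with one extreme value.
column = [1.0, 2.0, 3.0, 100.0]
print(min_max_scale(column))
print(z_transform(column))
```

With the extreme value present, range scaling squeezes the other three values close to 0, which is one reason outlier handling and normalization are usually considered together before k-means.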

Answers

  • mariozupan Member Posts: 15 Contributor II
    I still can't find any tutorials on improving clustering performance. Pre-processing the data with log, ln, and the outlier operators gave me almost the same performance values. I got the same results after converting the data to deciles.

    What if I remove the examples with negative silhouette values, as I read somewhere?
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    removing outliers is certainly a good idea, and for k-means normalization is a must. I usually go for the Z-Transformation (see Normalize operator). The tests of course only measure the performance, they don't influence the result of the clustering.

    You could experiment with different distance measures in k-Means, sometimes they have quite an impact on the results.

    Best, Marius
  • mariozupan Member Posts: 15 Contributor II
    I tried the Z-transformation and distance measures other than Euclidean. I didn't notice any significant improvement in the average silhouette. Then I took the same dataset, preprocessed in RapidMiner, into R and ran the kmeans and silhouette procedures. Guess what: I got a silhouette of 0.81, while in RapidMiner I never got more than 0.49.
    Could you explain how the k-means operator in RapidMiner and kmeans in R can give such different average silhouette values? To repeat: this is the same dataset, preprocessed in RapidMiner.
  • septian_bagus Member Posts: 2 Contributor I
    edited January 14
    Hi, @mariozupan, can you let me know how you got those silhouette numbers using RapidMiner?
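
On the RapidMiner versus R discrepancy raised above, one tool-independent check is to export the examples together with their cluster labels and recompute the silhouette by hand. The sketch below follows the textbook definition s(i) = (b - a) / max(a, b) with Euclidean distance. Whether either tool uses exactly this definition and distance is not confirmed anywhere in this thread, and the function name and toy points are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_values(points, labels):
    """Return one silhouette value per point (textbook definition)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])
print(sum(good) / len(good))  # close to 1 for the sensible labelling
print(sum(bad) / len(bad))    # negative when the groups are mixed up
```

Running the same function on the labels exported from each tool would show whether the gap comes from the clusterings themselves or from how each tool computes the measure. The per-point values also make it easy to see which examples have a negative silhouette before deciding what to do with them.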