Clustering accuracy and how can I pick the proper number of clusters

marou_mal96 Member Posts: 6 Newbie
edited December 2020 in Help
Hello there! I have two questions about clustering. The first is about the number of clusters: I have only numerical attributes and I don't know what the best number of clusters is for my k-means clustering. The other question is whether there is any way to evaluate my clustering accuracy other than "Map clustering on labels".

Thanks in advance!

Best Answers

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Solution Accepted
    What you need is to set up an experiment. Use Optimize Parameters (Grid) to vary the number of clusters in k-means and log the cluster performance measures. Inside it you will need k-means and some cluster performance measure, typically Davies-Bouldin (closest to zero is best), which can be obtained from Cluster Distance Performance, or Sum of Squares from Cluster Distribution Performance. The DB measure works well when your attributes are numerical and smooth (and the clusters have a convex shape). Once you have collected a log of k vs DB performance, plot it and find the DB closest to zero, ideally in a smooth, stable segment of the plot; this will be around the optimum k. However, DB often fails that stability test. In that case the plot of k vs Sum of Squares (average distance from cluster centres) gives a nice informal method, called the elbow method: you look for the k beyond which the gain in performance (highest SOS) is no longer significant compared to the clustering complexity (k). It often looks like the tip of an elbow.
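For readers who want to see the experiment outside RapidMiner, here is a minimal sketch in plain Python of the same idea: loop over k, run k-means, and log Davies-Bouldin and the sum of squared distances for each k. Everything in it is an assumption made for illustration (the synthetic three-blob data, the function names, and the deterministic farthest-first start used so that repeated runs give the same curve); it is not how the RapidMiner operators are implemented. In this sketch DB is non-negative, so "closest to zero" simply means the smallest value.

```python
import math
import random

def farthest_first(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen centres."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    return centres

def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda i: math.dist(p, centres[i])) for p in points]

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns (centres, labels)."""
    centres = farthest_first(points, k)
    for _ in range(iters):
        labels = assign(points, centres)
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old centre if a cluster happens to lose all its members.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)) if members else centres[i])
        if updated == centres:
            break
        centres = updated
    return centres, assign(points, centres)

def sum_of_squares(points, centres, labels):
    """Sum of squared distances from each point to its own cluster centre."""
    return sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))

def davies_bouldin(points, centres, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / distance(centre_i, centre_j)."""
    k = len(centres)
    scatter = []
    for i in range(k):
        dists = [math.dist(p, centres[i]) for p, lab in zip(points, labels) if lab == i]
        scatter.append(sum(dists) / len(dists) if dists else 0.0)
    worst = [max((scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
                 for j in range(k) if j != i) for i in range(k)]
    return sum(worst) / k

# Synthetic data (an assumption for this sketch): three well-separated 2D blobs.
rng = random.Random(42)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blob_centres for _ in range(30)]

log = {}  # k -> (Davies-Bouldin, sum of squares)
for k in range(2, 7):
    centres, labels = kmeans(points, k)
    log[k] = (davies_bouldin(points, centres, labels), sum_of_squares(points, centres, labels))
    print(f"k={k}  DB={log[k][0]:.3f}  SSE={log[k][1]:.1f}")

best_k = min(log, key=lambda k: log[k][0])  # k with the DB value closest to zero
print("best k by DB:", best_k)
```

Plotting the logged values against k gives the two curves described above: the DB curve to look for a stable minimum, and the sum-of-squares curve to look for the elbow.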
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited December 2020 Solution Accepted
    I find mapping clusters on labels unreliable, especially when your clustering is not very good. One similar method is to combine k-means with k-NN to determine the cluster system's ability to "predict" the cluster based on the neighbour distances, and to measure the accuracy of this process. However, when you consider what is important in clustering, i.e. all similar data points should be close to each other (as well as to their cluster centroid) and far away from dissimilar ones (and from the centroids of other clusters), the other performance measures are more appropriate. It is also a good idea to use PCA to map your data into 2D and then plot your data coloured by cluster to determine whether the clusters are cohesive and well separated.
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Solution Accepted
    One more warning: when you plot cluster performance, make sure that no random effects are affecting the process, e.g. the clustering algorithm is influenced by the initial position of the cluster centroids. So set the random seed of any operator that has a random element. Otherwise you will not know whether a clustering improvement is due to the optimum k or to a random effect. Random effects usually show in your plot as an up-and-down zigzag.
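To illustrate the point about seeds, here is a small sketch in plain Python, assuming a basic k-means whose starting centroids are drawn at random from the data. The function name, the synthetic data and the seed values are all made up for the example. With the same seed the result is identical on every run; with different seeds the final sum of squares may differ, which is the zigzag source mentioned above.

```python
import math
import random

def kmeans_sse(points, k, seed, iters=100):
    """Lloyd's k-means with randomly drawn starting centres; returns the final sum of squares."""
    rng = random.Random(seed)          # all randomness comes from this seed
    centres = rng.sample(points, k)    # random initial centroids
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: math.dist(p, centres[i])) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old centre if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else centres[i])
        if updated == centres:
            break
        centres = updated
    return sum(min(math.dist(p, c) for c in centres) ** 2 for p in points)

# Synthetic data (an assumption for this sketch): three 2D blobs.
data_rng = random.Random(0)
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (6, 0), (3, 5)] for _ in range(40)]

same_a = kmeans_sse(points, 3, seed=7)
same_b = kmeans_sse(points, 3, seed=7)
print("same seed, same result:", same_a == same_b)

# Different seeds can end in different local optima, so the values may vary.
by_seed = {s: round(kmeans_sse(points, 3, seed=s), 2) for s in range(5)}
print(by_seed)
```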
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Solution Accepted
    Place the PCA after your clustering and apply it to the clustered examples (it can be a separate process), then scatter plot PC1 vs PC2 and use the cluster as colour. You can also extract the coordinates of the centroids from your cluster model using Extract Cluster Prototypes and plot them in the same PCA coordinate system as the rest of the data points (so simply apply that PCA model to the centroids and plot them separately). In this way you'll see whether the cluster centres are well separated.
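The "fit PCA on the examples, then apply the same model to the centroids" step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the data are synthetic 3D blobs, the centroids are simply the group means standing in for what a cluster model would provide, and the PCA is a hand-rolled power iteration with deflation (the names fit_pca_2d and project are hypothetical), not RapidMiner's PCA operator.

```python
import math
import random

def fit_pca_2d(rows, iters=500):
    """Fit a 2-component PCA by power iteration with deflation; returns (mean, components)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)] for a in range(d)]
    components = []
    for _component in range(2):
        v = [1.0] * d
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        # Deflate: remove the direction just found so the next pass finds the following one.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]
    return mean, components

def project(rows, mean, components):
    """Apply an already fitted PCA model (mean + components) to any rows."""
    return [tuple(sum((r[j] - mean[j]) * comp[j] for j in range(len(mean))) for comp in components)
            for r in rows]

# Synthetic clustered examples (an assumption): three 3D blobs of 30 points each.
rng = random.Random(1)
true_centres = [(0.0, 0.0, 0.0), (8.0, 1.0, 0.0), (4.0, 6.0, 1.0)]
points = [tuple(rng.gauss(c, 0.5) for c in centre) for centre in true_centres for _ in range(30)]
# Stand-in for the centroids extracted from a cluster model: the mean of each group.
centroids = [tuple(sum(col) / 30 for col in zip(*points[i * 30:(i + 1) * 30])) for i in range(3)]

mean, components = fit_pca_2d(points)                 # PCA built on all the examples
points_2d = project(points, mean, components)         # PC1/PC2 coordinates for the scatter plot
centroids_2d = project(centroids, mean, components)   # same model applied to the centroids

for i, (pc1, pc2) in enumerate(centroids_2d):
    print(f"centroid {i}: PC1={pc1:.2f}  PC2={pc2:.2f}")
```

Because the centroids go through the same mean and components as the examples, they land in the same PC1/PC2 coordinate system and can be drawn on top of the coloured scatter plot.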
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited December 2020 Solution Accepted
    One last piece of advice: keep your k practical. Often, rather than finding the global optimum for the number of clusters, you may prefer to find the best k within a range. For example, if you are conducting customer segmentation for a marketing campaign, you may not be able to afford more than 10 separate campaigns, so it is not useful if the best number of clusters is 76. It is practical, however, if the best number of clusters up to 10 is 5.
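As a tiny sketch of "best k within a range", assuming you already have a log of k vs DB from your experiment: the helper name best_k_within is hypothetical and the DB values below are made up purely to mirror the 76 vs 5 example, they are not real results.

```python
def best_k_within(db_by_k, k_max):
    """Return the k <= k_max whose Davies-Bouldin value is lowest (closest to zero)."""
    candidates = {k: db for k, db in db_by_k.items() if k <= k_max}
    if not candidates:
        raise ValueError("no logged k within the allowed range")
    return min(candidates, key=candidates.get)

# Made-up log of k -> DB, for illustration only.
db_by_k = {2: 0.91, 3: 0.74, 5: 0.52, 8: 0.66, 10: 0.70, 40: 0.49, 76: 0.41}

print(best_k_within(db_by_k, k_max=100))  # global optimum in this made-up log: 76
print(best_k_within(db_by_k, k_max=10))   # best practical choice up to 10: 5
```

In RapidMiner the same restriction is achieved more simply by limiting the range of k in the parameter grid.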
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Solution Accepted
    What I'd do is build the PCA using the clustered examples and then apply the resulting PCA model to the centroids extracted from the cluster model. This way the PCA is built on lots of data and will be more reliable.
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited December 2020 Solution Accepted
    I am not sure how urgent your project is. I am planning to continue recording my YouTube videos (check ironfrown) and can record a mini series on cluster analysis in RapidMiner in January. In the meantime, I strongly suggest getting the book by Vijay Kotu and Bala Deshpande, Data Science: Concepts and Practice, 2nd Edition, where chapter 7 describes cluster analysis in RapidMiner (yes, the whole book uses RapidMiner to explain different examples).

Answers

  • marou_mal96 Member Posts: 6 Newbie
    How can I use PCA?
    At the moment I have this process. Where can I put the PCA?


  • marou_mal96 Member Posts: 6 Newbie

    How do you like it?
  • marou_mal96 Member Posts: 6 Newbie
    3 clusters gives me the best DB value.
  • marou_mal96 Member Posts: 6 Newbie
    Is there any tutorial about this, or something else to help me build what you described? I am a beginner in RapidMiner and I cannot understand much of what you said. Thank you again for your time, sir!
  • marou_mal96 Member Posts: 6 Newbie
    Thank you very much sir! I appreciate it. Merry Christmas 🌲