Test & Validation Data - Unsupervised

Olus Member Posts: 16 Maven
Dear all, 
I would like to run an Unsupervised analysis on a Data Set, which can later be followed by a Supervised Analysis. 
I think I do not need to split my data into Training & Test sets for the Unsupervised part (Clustering, Association or regression).
What do you think? 

Best Answer

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited July 2020 Solution Accepted
    In many non-RM environments, a typical approach is to build a k-means clustering and then train a classifier, such as k-NN, on it to assign cluster values to new examples. It is also common practice to train a classification model on your cluster labels and then check the accuracy of that classification. However, this approach is not pure, because clustering and classification pursue different objectives, especially when they use different methods (e.g. density-based clustering and a decision tree classifier). So in theory you can cluster your entire data set, create a classifier based on those "labels", and even use the classifier to predict cluster membership and assess its performance. My advice would be to cluster your data, evaluate it with performance measures specific to clustering, and then keep the clustering model generated in the process so that it can be applied to new data (in exactly the same way as your classifiers).
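
For anyone working outside RM, the advice above (cluster everything, measure with a clustering-specific metric, then reuse the model on new data) could look roughly like the sketch below. It is only an illustration under my own assumptions: the function names (`init_centroids`, `fit_kmeans`, `nearest`, `within_cluster_sse`), the toy data, the farthest-first initialisation, and the choice of within-cluster sum of squares as the measure are all made up for this example and are not taken from any particular tool.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point` (this is the 'apply model' step)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def init_centroids(points, k):
    """Deterministic farthest-first start (an assumption, chosen for repeatability)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def fit_kmeans(points, k, iters=100):
    """Plain Lloyd iterations; the returned centroids are the clustering model."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Mean of each group; an empty group keeps its previous centroid
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def within_cluster_sse(points, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[nearest(p, centroids)]) ** 2 for p in points)


# Toy data invented for illustration: two well-separated groups
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

model = fit_kmeans(data, k=2)          # cluster the entire data set, no split
sse = within_cluster_sse(data, model)  # clustering-specific performance measure

# Later: apply the stored model to examples it has never seen
new_points = [(0.9, 1.0), (8.1, 8.1)]
new_labels = [nearest(p, model) for p in new_points]  # [0, 1]
```

Note that no classifier is trained here: the new examples are assigned by the clustering model itself, so there is no accuracy to report, only the clustering measure.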


  • Olus Member Posts: 16 Maven
    Thanks a lot! Very useful and detailed feedback.
