Test & Validation Data - Unsupervised

Olus · July 2020

Dear all,
I would like to run an unsupervised analysis on a data set, which may later be followed by a supervised analysis.
I think I do not need to split my data into training and test sets for the unsupervised part (clustering, association or regression).
What do you think?

jacobcybulski · July 2020

In many non-RapidMiner environments, a typical approach is to build a k-means clustering and then use it to train a classifier, such as k-NN, that assigns cluster labels to new examples. It is also common practice to train a classification model on your cluster labels and then check the accuracy of that classification. However, this approach is not clean, because clustering and classification pursue different objectives, especially when they use different methods (e.g. density-based clustering and a decision tree classifier). So in theory you can cluster your entire data set, train a classifier on those "labels", and even use the classifier to predict cluster membership and assess its performance. My advice would be to cluster your data, evaluate the result with performance measures specific to clustering, and then keep the clustering model generated in the process so that it can be applied to new data (in exactly the same way as your classifiers).
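
To make that advice concrete outside RapidMiner, here is a minimal sketch in plain Python (standard library only). It clusters the whole data set without a train/test split, scores the result with a clustering-specific measure (within-cluster sum of squares), and then applies the fitted model to new examples. The function names, the toy data and the deterministic farthest-point initialisation are all assumptions made for illustration; they are not how RapidMiner implements k-means.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def init_centroids(points, k):
    """Assumed deterministic init: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def fit_kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; the returned centroids are the cluster model."""
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # New centroid = mean of its group; keep the old one if the group is empty.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break  # assignments no longer change
        centroids = updated
    return centroids

def within_cluster_ss(points, centroids):
    """Clustering-specific measure: squared distance of each point to its centroid."""
    return sum(math.dist(p, centroids[nearest(p, centroids)]) ** 2 for p in points)

# Hypothetical toy data: two well-separated groups, no labels, no split.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]

model = fit_kmeans(data, k=2)                # cluster the entire data set
labels = [nearest(p, model) for p in data]   # cluster membership of each example
quality = within_cluster_ss(data, model)     # lower means tighter clusters

# Apply the stored cluster model to examples that arrive later.
new_points = [(1.1, 0.9), (8.1, 8.0)]
new_labels = [nearest(p, model) for p in new_points]
```

The model here is just the list of centroids, so assigning a new example means finding its nearest centroid. No separate classifier has to be trained on the cluster labels.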

Jacob

Olus · July 2020

Thanks a lot! Very useful and detailed feedback.

Altair RapidMiner Community
