HOW TO Validate k-means Clustering?

shredlegend88 · October 2016

It seems like a simple question. I have a dataset on which I am performing a k-means cluster analysis of consumers' bankruptcy tendency (k=2). I need to know the best way to validate my model's predictive accuracy. I have wasted about 5 hours trying and failing.

My text states that the easiest way is to generate a confusion/classification matrix, but for the life of me, I cannot figure out which setting, operator, or selection does this in RM!

All I get for my results is shown below. This is not good enough for me to know how well my model is performing against my testing/validation set. I am using a Cross Validation operator with my cluster model in the training section, and the Apply Model and Cluster Distance Performance operators in the testing section. All I get is this. Why so little information?

Avg. within centroid distance

Avg. within centroid distance: -6.053 +/- 0.279 (mikro: -6.053)

I have attached my dataset and xml of my process.

MartinLiebig · October 2016

Shredlegend88,

If you want to get a confusion matrix, you need to use a performance operator for a supervised classification problem. This requires a label. If you go purely unsupervised, you cannot define a confusion matrix.
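To make the requirement concrete, here is a minimal Python sketch (not RapidMiner code) of what a confusion matrix compares: one column of true labels and one column of predicted labels drawn from the same set of values. The toy data and the function names are hypothetical and are not taken from the attached dataset.

```python
from collections import Counter

def confusion_matrix(labels, predictions):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(labels, predictions))

def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical toy data: "yes" means bankrupt, "no" means not bankrupt.
labels      = ["yes", "no", "no", "yes", "no"]
predictions = ["yes", "no", "yes", "yes", "no"]

matrix = confusion_matrix(labels, predictions)
acc = accuracy(labels, predictions)

for (true_label, predicted_label), count in sorted(matrix.items()):
    print(f"true={true_label} predicted={predicted_label}: {count}")
print(f"accuracy: {acc}")
```

Without a prediction column there is nothing to compare the label against, which is why a purely unsupervised process cannot produce this table.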

~Martin

shredlegend88 · October 2016

My dataset has a label; however, when I try to use the performance operator, I get the error "Input ExampleSet does not have predicted label attribute".

What does this mean and how do I fix it? I have tried many approaches: adding dummy variables, changing my label's role/type, etc.

shredlegend88 · October 2016

Martin,

Good afternoon. I have successfully gotten a confusion matrix output through trial and error; however, the accuracy is zero percent. Could you take a look at my process and let me know if you can see why? I think it has something to do with roles (label vs. prediction) for my target variable (bankruptcy). I do not understand the criteria for having one or the other.

It seems that the Performance (Classification) operator requires a variable with a role of "prediction". Am I correct in assuming that the variable I am trying to isolate between my two clusters should be set to prediction?

When I change it from Label to Prediction, it performs the analysis, but the accuracy is zero and I don't understand why. All of the variables I selected are sufficiently correlated with my target variable (bankruptcy), yet the confusion matrix states an accuracy of zero. To further confuse things, there is a warning on the performance operator: "Input example set must have special attribute 'label'". My cluster model has "add as label" checked, which may be why it does not raise an error, but I am not sure.

When I select the Performance (Classification) operator, I see the main criterion, which is currently set to "accuracy". Maybe this is the culprit. I do not see anywhere that these criteria are documented. Can you point me in the right direction? I am new to this tool, I have spent days now trying to figure this out, and it is due tonight.

Thomas_Ott · November 2016

I already replied in your other thread. What Martin is getting at is that clustering is unsupervised learning. Essentially, you create statistical "blobs" (I know @mschmitz will groan at this) of similar data. You can easily see that Cluster 2 tends to have higher rates of bankruptcy based on your normalized data. If you want to predict and calculate a confusion matrix, you will need to create a "label" such as "default" and "no default." Then you would use Cross Validation, measure the Classification performance, and generate a confusion matrix.
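As a rough illustration of comparing clusters with a known label, the Python sketch below (not RapidMiner code) cross-tabulates cluster assignments against the label. It also shows one possible reason for the zero accuracy reported earlier in the thread: if cluster names such as "cluster_0" are compared directly with label values such as "yes" and "no", no row can ever match. That explanation is an assumption, since the process itself is not shown here, and the data and names below are hypothetical.

```python
from collections import Counter, defaultdict

def crosstab(clusters, labels):
    """Count label values inside each cluster."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return counts

def majority_mapping(counts):
    """Map each cluster to the label value that occurs most often in it."""
    return {cluster: label_counts.most_common(1)[0][0]
            for cluster, label_counts in counts.items()}

# Hypothetical toy data: cluster ids from k-means (k=2) and the true label.
clusters = ["cluster_0", "cluster_0", "cluster_1", "cluster_1", "cluster_1", "cluster_0"]
labels   = ["no",        "no",        "yes",       "yes",       "no",        "no"]

counts = crosstab(clusters, labels)
mapping = majority_mapping(counts)

# Comparing raw cluster names with label values: nothing can match.
raw_accuracy = sum(c == y for c, y in zip(clusters, labels)) / len(labels)

# Comparing after renaming each cluster to its majority label.
mapped_accuracy = sum(mapping[c] == y for c, y in zip(clusters, labels)) / len(labels)

print(dict(counts["cluster_1"]))
print(mapping)
print(raw_accuracy, mapped_accuracy)
```

The mapped figure is optimistic because the mapping is chosen on the same rows it is scored on. It describes how well the clusters line up with the label; it is not a substitute for cross-validating a supervised model.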

With Clustering, there are ways to measure the performance but the results will not generate a confusion matrix.
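One such measure is the average within-centroid distance shown in the original post. The Python sketch below computes it as the mean Euclidean distance from each point to its assigned centroid. This is an assumption about the general idea only: RapidMiner's operator may use a different distance (for example, a squared one), and the negative sign in its output appears to be a convention so that larger values are better. The points and centroids are hypothetical.

```python
import math

def avg_within_centroid_distance(points, assignments, centroids):
    """Mean Euclidean distance from each point to its assigned centroid."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        total += math.dist(point, centroids[cluster])
    return total / len(points)

# Hypothetical toy data: four 2-D points split into two clusters.
points      = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
assignments = [0, 0, 1, 1]
centroids   = [(0.0, 1.0), (10.0, 2.0)]

score = avg_within_centroid_distance(points, assignments, centroids)
print(score)
```

A smaller distance means tighter clusters, but it says nothing about whether the clusters correspond to bankruptcy, which is why this number alone cannot answer a predictive-accuracy question.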
