Interpreting k-means

iounas · April 2009

I have some sample data and I'm trying to understand how to interpret the numbers that k-means gives me for each cluster.
The data has both continuous and non-continuous (nominal) attributes, such as country.

Here is an example:

Attribute         Cluster 1   Cluster 2
age                   0.413       0.208
workclass             0.151       0.085
fnlwgt               -0.019      -0.048
education            -0.009      -0.033
education-num         0.591      -0.037
marital-status       -0.734       0.864
occupation           -0.076       0.108
relationship         -0.350       1.257
race                 -0.190       0.373
sex                   0.425      -1.080
capital-gain          0.471      -0.118
capital-loss         -0.216      -0.208
hours-per-week        0.412      -0.151
native-country       -0.184      -0.208
label                -1.775       0.497

The label is yes/no, and I need to figure out which attributes affect it and how.

IngoRM · April 2009

Hi,

Each cluster of a centroid-based cluster model, like that of k-means, is represented by a centroid, which can be interpreted as a prototypical point for the cluster. The numbers are the centroid's values in each dimension. For example, the examples in the first cluster have a (probably normalized) mean age of 0.413, whereas the examples in cluster 2 are younger on average, and so on.
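To make the comparison concrete, here is a minimal Python sketch that lines up the two centroids posted above and sorts the attributes by how far apart the clusters are. The function name `centroid_differences` is made up for this illustration, and the sketch assumes the values are normalized so that the attributes are on a comparable scale.

```python
# Centroid values copied from the two clusters posted above.
ATTRIBUTES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "label",
]
CLUSTER_1 = [0.413, 0.151, -0.019, -0.009, 0.591, -0.734, -0.076, -0.350,
             -0.190, 0.425, 0.471, -0.216, 0.412, -0.184, -1.775]
CLUSTER_2 = [0.208, 0.085, -0.048, -0.033, -0.037, 0.864, 0.108, 1.257,
             0.373, -1.080, -0.118, -0.208, -0.151, -0.208, 0.497]


def centroid_differences(names, first, second):
    """Return (name, first - second) pairs, largest absolute gap first."""
    gaps = [(name, a - b) for name, a, b in zip(names, first, second)]
    return sorted(gaps, key=lambda pair: abs(pair[1]), reverse=True)


ranked = centroid_differences(ATTRIBUTES, CLUSTER_1, CLUSTER_2)
for name, gap in ranked:
    print(f"{name:15s} {gap:+.3f}")
```

For these two centroids the largest gaps are in label, relationship, marital-status and sex. This only describes where the two prototypes differ; it is not by itself a measure of which attributes influence the label.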

Cheers,
Ingo

mihai · April 2009

Although the previous answer is more than complete, it helped me to read how the clustering algorithms work. They are easy to understand, and afterwards interpreting the results was easier.

As far as I know, k-means works like this:

1. Choose how many clusters (k) you want.
2. Randomly generate k points as cluster centers (or randomly generate k initial clusters from your data).
3. Assign each point to the nearest cluster center.
4. Recompute the centers of the clusters.

Repeat steps 3 and 4 until nothing changes anymore (or until a stopping criterion is met).
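The steps above can be sketched in plain Python as follows. This is only an illustration under some assumptions, not the code RapidMiner runs: the initial centers are sampled from the data points, distance is squared Euclidean, and the names `kmeans`, `squared_distance` and `mean_point` are invented for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Pick k of the data points as initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Stop when no point changes its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_point(members)
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

The returned centers are exactly the kind of numbers shown in the first post: one value per attribute, equal to the mean of the points assigned to that cluster.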

K-means may sometimes give different results from run to run (because of the random initialization), and the results also depend on the kind of data you have. There are optimizations and variants of this clustering method.
Take at least a short look at http://en.wikipedia.org/wiki/Cluster_analysis, if not at the relevant papers.

Even if my answer is trivial and the solution was already given, this helped me when I was in the same situation, so maybe it helps others as well.

iounas · May 2009

Hi, I forgot to mention that z-transform normalization was applied before k-means.
How does that change the interpretation of the results? Is it that the higher the value, the more influence the attribute has on the cluster, or something similar?
And what about nominal values with Nominal2Numerical? Does it assign 0, 1, 2, 3, ...?

IngoRM · May 2009

Hi,

"Is it the higher the value the more influence on the cluster or similar?"

No. In most cases, the magnitude of the value has nothing to do with the importance or influence of the attribute for the cluster (with the exception of TF-IDF-like features, as in text clustering). It is just the location of the cluster centroid in the corresponding dimension.
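With a z-transform, each value is (x - mean) / std, so a centroid coordinate of 0 sits at the overall mean of that attribute and the unit is one standard deviation. A small sketch of mapping a centroid coordinate back to the original units follows. The mean and standard deviation for age below are hypothetical placeholders, not taken from the data in this thread, and `to_original_units` is a made-up helper name.

```python
def to_original_units(z_value, mean, std):
    """Invert a z-transform: z = (x - mean) / std, so x = mean + z * std."""
    return mean + z_value * std


# HYPOTHETICAL statistics for "age"; replace them with the mean and
# standard deviation of your own data before normalization.
AGE_MEAN = 38.6
AGE_STD = 13.6

cluster_1_age = to_original_units(0.413, AGE_MEAN, AGE_STD)
cluster_2_age = to_original_units(0.208, AGE_MEAN, AGE_STD)
print(round(cluster_1_age, 1), round(cluster_2_age, 1))  # 44.2 41.4
```

Under those assumed statistics, both clusters are somewhat older than the overall average, and cluster 1 is the older of the two. The value says where the centroid is located, not how important the attribute is.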

For nominal attributes, I would suggest first using the Nominal2Binominal operator and then the Nominal2Numerical operator (dichotomization). This usually produces much better results.
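The idea behind dichotomization can be illustrated outside RapidMiner with a few lines of Python. This is not the operators' implementation, only a sketch of the resulting encoding; `dichotomize` and the sample country values are invented for the example.

```python
def dichotomize(values):
    """Turn a nominal column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical nominal column, for illustration only.
countries = ["US", "Mexico", "India", "US"]
categories, rows = dichotomize(countries)
print(categories)  # ['India', 'Mexico', 'US']
for row in rows:
    print(row)
```

If the categories were instead coded as 0, 1, 2, 3, a distance-based method like k-means would treat category 0 as closer to category 1 than to category 3, although a nominal attribute has no such order. With one 0/1 column per category, all distinct categories are equally far apart.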

Cheers,
Ingo

Student_ · June 2016

Can the coordinates of my centroids tell me which cluster an operator belongs to? If not, how can I identify which operator belongs to which cluster?

Product Type = DRO
Cluster 1: 3,082661241
Cluster 2: -0,125515749
Cluster 3: -0,038457373
Cluster 4: -0,125515749

Product Type is the operator
