Interpreting k-means
I have some sample data and I'm trying to understand how to interpret the numbers that k-means gives me for the clusters.
The data has continuous and non-continuous attributes, such as country.
Here is an example:
Cluster 1 age: 0.413 workclass: 0.151 fnlwgt: -0.019 education: -0.009 education-num: 0.591 marital-status: -0.734 occupation: -0.076 relationship: -0.350 race: -0.190 sex: 0.425 capital-gain: 0.471 capital-loss: -0.216 hours-per-week: 0.412 native-country: -0.184 label: -1.775 (Label is yes/no)
Cluster 2 age: 0.208 workclass: 0.085 fnlwgt: -0.048 education: -0.033 education-num: -0.037 marital-status: 0.864 occupation: 0.108 relationship: 1.257 race: 0.373 sex: -1.080 capital-gain: -0.118 capital-loss: -0.208 hours-per-week: -0.151 native-country: -0.208 label: 0.497
I need to figure out which attributes affect the clustering, and how.
Answers
Each cluster of a centroid-based cluster model, such as that of k-means, is represented by a centroid, which can be interpreted as a prototypical point for that cluster. The numbers are the values of each cluster centroid in the different dimensions. For example, the examples of the first cluster have a (probably normalized) mean age of 0.413, whereas the examples in cluster 2 are younger on average, and so on.
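To make this reading of a centroid concrete, here is a minimal Python sketch. It assumes z-score normalization (the thread only says "probably normalized"), and the ages and cluster assignments are made up for illustration; it is not RapidMiner's implementation.

```python
from statistics import mean, pstdev

# Hypothetical ages and cluster assignments, for illustration only.
ages = [25, 30, 35, 50, 55, 60]
cluster_of = [2, 2, 2, 1, 1, 1]

# Assumed normalization: z-score (subtract the mean, divide by the standard deviation).
mu = mean(ages)
sigma = pstdev(ages)
z_ages = [(a - mu) / sigma for a in ages]

def centroid_value(cluster_id):
    """Centroid coordinate = mean of the normalized values of the cluster's members."""
    members = [z for z, c in zip(z_ages, cluster_of) if c == cluster_id]
    return mean(members)

def to_original_scale(z):
    """Undo the z-score to read a centroid value in the original unit (years)."""
    return z * sigma + mu

age_c1 = centroid_value(1)
age_c2 = centroid_value(2)

print(f"cluster 1: normalized age {age_c1:.3f}, about {to_original_scale(age_c1):.1f} years")
print(f"cluster 2: normalized age {age_c2:.3f}, about {to_original_scale(age_c2):.1f} years")
```

Under this assumption, a positive centroid value means the cluster lies above the overall average for that attribute, and a negative value means it lies below.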
Cheers,
Ingo
As far as I know, k-means works like this:
1. Select how many clusters (k) you want.
2. Randomly generate k points as cluster centers (or randomly form k clusters from your data).
3. Assign each point to the nearest cluster center.
4. Recompute the centers of the clusters.
5. Repeat the last two steps until the assignments no longer change (or until a stopping criterion is met).
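The procedure described above can be sketched in a few lines of plain Python. This is only an illustration, not RapidMiner's k-means operator: the function names, the `seed` parameter, the choice of squared Euclidean distance, and the toy points are all assumptions.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of equally long numeric tuples."""
    rng = random.Random(seed)
    # Initialization: pick k of the data points at random as the first centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Stop as soon as no point changes its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Hypothetical toy data: two loose groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, assignment = kmeans(points, k=2, seed=0)
print(centers)
print(assignment)
```

Changing `seed` changes the initial centers, so on less clearly separated data different runs can end in different clusterings.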
k-means may give different results from run to run (because of the random initialization), and the results also depend on the kind of data you have. There are optimizations and variants of this clustering method.
Take at least a short look at http://en.wikipedia.org/wiki/Cluster_analysis, if not at the relevant papers.
Even if my answer is trivial and the solution was already given, answers like this helped me when I was in the same situation, so maybe it helps others as well.
How does that change the interpretation of the results? Is it that the higher the value, the more influence the attribute has on the cluster, or something similar?
And what about nominal values with Nominal2Numerical? Does it assign 0, 1, 2, 3, ...?
For nominal attributes, I would suggest first using the Nominal2Binominal operator and then the Nominal2Numerical operator (dichotomization). This usually produces much better results.
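To illustrate the difference, here is a small Python sketch that contrasts plain integer coding with dichotomization (one 0/1 column per distinct value). The helper names and the "attribute = value" column naming are assumptions made for this example; they are not the RapidMiner operators themselves.

```python
def integer_code(values):
    """Map each distinct nominal value to 0, 1, 2, ... (in sorted order)."""
    index = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

def dichotomize(name, values):
    """Create one 0/1 column per distinct nominal value."""
    return {
        f"{name} = {cat}": [1 if v == cat else 0 for v in values]
        for cat in sorted(set(values))
    }

# Hypothetical nominal attribute.
countries = ["US", "DE", "FR", "US"]

print(integer_code(countries))
# Integer codes imply an order and distances: here DE=0, FR=1, US=2,
# so "US" looks twice as far from "DE" as "FR" does.

for column, bits in dichotomize("country", countries).items():
    print(column, bits)
# With 0/1 columns, any two different countries are equally far apart.
```

Since k-means assigns points by distance, the artificial ordering introduced by integer codes can distort the clusters, whereas the 0/1 columns treat all values of the attribute alike.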
Cheers,
Ingo
Can the coordinates of my centroids tell me which cluster an operator belongs to? If not, how can I identify which operator belongs to which cluster?
Product Type = DRO
Cluster 1: 3,082661241
Cluster 2: -0,125515749
Cluster 3: -0,038457373
Cluster 4: -0,125515749
Product Type is the operator