Interpreting k-means

iounas Member Posts: 3 Contributor I
edited November 2018 in Help
I have a sample data set and I'm trying to understand how to interpret the numbers that k-means gives me for the clusters.
The data has continuous and non-continuous attributes, such as country.

Here is an example:

Attribute        Cluster 1   Cluster 2
age                  0.413       0.208
workclass            0.151       0.085
fnlwgt              -0.019      -0.048
education           -0.009      -0.033
education-num        0.591      -0.037
marital-status      -0.734       0.864
occupation          -0.076       0.108
relationship        -0.350       1.257
race                -0.190       0.373
sex                  0.425      -1.080
capital-gain         0.471      -0.118
capital-loss        -0.216      -0.208
hours-per-week       0.412      -0.151
native-country      -0.184      -0.208
label               -1.775       0.497
The label is yes/no, and I need to figure out which attributes affect it and how.

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Each cluster of a centroid-based cluster model like that of k-means is represented by a centroid, which can be interpreted as a prototypical point for this cluster. The numbers are the values of each cluster centroid in the different dimensions. For example, the examples of the first cluster have a (probably normalized) mean age of 0.413, whereas the examples in cluster 2 are younger on average, and so on.

    Cheers,
    Ingo
  • mihai Member Posts: 3 Contributor I
    Although the previous answer is more than complete, it helped me to read up on how the clustering algorithms work. They are easy to understand, and after that, interpreting the result was easier.

    As far as I know, k-means works like this:

    You select how many clusters (k) you want.
    You randomly generate k points as cluster centers (or randomly generate k clusters from your data).

    Assign each point to the nearest cluster center.
    Recompute the centers of the clusters.

    Repeat the last two steps until nothing changes anymore (or until a stopping criterion is met).
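
    The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not RapidMiner's implementation: the function name `kmeans`, its parameters, and the toy points are made up for this example, and it assumes squared Euclidean distance and initial centers drawn from the data.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance is an assumption of this sketch).
        new_assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Stopping criterion: no point changed its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each center becomes the per-dimension mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

    The returned centers are exactly the kind of numbers shown in the question: one mean value per attribute and cluster. Changing `seed` changes the starting centers, which is why repeated runs can end in different clusterings on less clearly separated data.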

    K-means may sometimes give different results (because of the random initialization at the beginning), and the outcome also depends on the kind of data you have. There are optimizations and variants of this clustering method.
    Take at least a short look at http://en.wikipedia.org/wiki/Cluster_analysis, if not at the relevant papers.

    Even if my answer is trivial and the solution was already given, this helped me when I was in the same situation, so maybe it helps others as well.
  • iounas Member Posts: 3 Contributor I
    Hi, I forgot to mention that z-transform normalization was applied before k-means.
    How does that change the interpretation of the results? Is it that the higher the value, the more influence on the cluster, or something similar?
    And what about nominal values with Nominal2Numerical? Does it assign 0, 1, 2, 3...?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    "Is it the higher the value the more influence on the cluster or similar?"
    No. In most cases, the magnitude of the value has nothing to do with the importance or influence of this attribute for the cluster (with the exception of TF-IDF-like features, as in text clustering etc.). It is just the location of the cluster centroid in the corresponding dimension.
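
    To make "location of the centroid" concrete for z-transformed data: a z-transform subtracts the attribute's mean and divides by its standard deviation, so a centroid coordinate says how many standard deviations the cluster mean lies above or below the overall mean. The sketch below converts the two age coordinates from the question back to years. The mean and standard deviation used here are hypothetical placeholders, since the thread does not give the real ones, and the helper names are made up for this example.

```python
def to_zscore(value, mean, std):
    """Z-transform: distance from the mean, measured in standard deviations."""
    return (value - mean) / std


def from_zscore(z, mean, std):
    """Invert the z-transform to get back to the original unit."""
    return z * std + mean


# Hypothetical statistics of the unnormalized age attribute (not from the thread).
AGE_MEAN = 40.0
AGE_STD = 10.0

# Age coordinates of the two centroids posted in the question.
for name, z in [("Cluster 1", 0.413), ("Cluster 2", 0.208)]:
    age = from_zscore(z, AGE_MEAN, AGE_STD)
    print(f"{name}: z = {z:+.3f} -> about {age:.1f} years")
```

    With the real mean and standard deviation from the data, the same conversion gives the actual average age per cluster. A positive coordinate means the cluster lies above the overall average for that attribute, a negative one below it.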

    For nominal attributes, I would suggest first using the Nominal2Binominal operator and then the Nominal2Numerical operator (dichotomization). It usually produces much better results.
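
    As a rough illustration of what dichotomization does (a generic sketch, not the code behind the RapidMiner operators): every category becomes its own 0/1 column, so no artificial order or distance between categories is introduced, which integer codes such as 0, 1, 2, 3 would do. The function name `dichotomize`, the column naming, and the example values are assumptions made for this sketch.

```python
def dichotomize(name, values):
    """Turn one nominal attribute into one 0/1 column per category."""
    categories = sorted(set(values))
    return [
        {f"{name} = {cat}": 1.0 if value == cat else 0.0 for cat in categories}
        for value in values
    ]


# Hypothetical values for the workclass attribute (not taken from the thread).
workclass = ["Private", "State-gov", "Private", "Self-emp"]
for row in dichotomize("workclass", workclass):
    print(row)
```

    After clustering such columns, the centroid value of a 0/1 column (before any further normalization) is simply the share of the cluster's examples that have this category.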

    Cheers,
    Ingo
  • Student_ Member Posts: 1 Contributor I

    Can the coordinates of my centroids tell me which cluster an operator belongs to? If not, how can I identify which operator belongs to which cluster?

    Product Type = DRO    Cluster 1: 3,082661241    Cluster 2: -0,125515749    Cluster 3: -0,038457373    Cluster 4: -0,125515749

    Product Type is the operator.