K means group centroid and visualisation options

timc03 Member Posts: 4 Contributor I
edited November 2018 in Help
I am running a k means clustering in v6.0.008.

I am looking to visualise the results of the clustering as shown here (k means clustering graph): http://en.wikipedia.org/wiki/K-means_clustering#mediaviewer/File:ClusterAnalysis_Mouse.svg

Any suggestions on how to achieve this? I would be happy to use PCA before K Means clustering if that helps.

Also, as an aside, where is the 'cluster centroid', i.e. the mean for each cluster? I have the centroid values for each attribute in each cluster in the Cluster Model's centroid table, but cannot find the cluster mean.

Thanks

Answers

  • Marco_Boeck Team Lead Software Engineering Moderator, Employee, Member, University Professor Posts: 1,800   RM Engineering
    Hi,

    I used the following process to import the mouse data taken from here: http://elki.dbs.ifi.lmu.de/wiki/DataSets

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.001-SNAPSHOT">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="6.1.001-SNAPSHOT" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="csv_file" value="C:\Users\boeck\Desktop\mouse.csv"/>
           <parameter key="column_separators" value="\s"/>
           <parameter key="skip_comments" value="true"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations"/>
           <parameter key="encoding" value="UTF-8"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="att1.true.real.attribute"/>
             <parameter key="1" value="att2.true.real.attribute"/>
             <parameter key="2" value="att3.true.polynominal.label"/>
           </list>
         </operator>
         <operator activated="true" class="k_means" compatibility="6.1.001-SNAPSHOT" expanded="true" height="76" name="Clustering" width="90" x="179" y="30">
           <parameter key="k" value="3"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Clustering" to_port="example set"/>
         <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
         <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    You can then simply use the Chart tab of the results to visualize this.

    image

    I'm not sure about your bonus question. I don't think there is an explicit option to see that, but I may be wrong there.

    Regards,
    Marco
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,078  RM Data Scientist
    Hello timc03!

    First, regarding the centroids: if you take a look at the model itself, it has a "centroid table" tab. There you can find your centroids.

    Furthermore, there is a way to display the borders of the clusters: apply the clustering model to random values in a given range. The result is the picture below:

    image

    I modified Marco's process a bit so that it creates this picture, and connected the model to the output:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.000">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="6.1.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="csv_file" value="C:\Users\Martin\Downloads\mouse.csv"/>
           <parameter key="column_separators" value="\s"/>
           <parameter key="skip_comments" value="true"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations"/>
           <parameter key="encoding" value="UTF-8"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="att1.true.real.attribute"/>
             <parameter key="1" value="att2.true.real.attribute"/>
             <parameter key="2" value="att3.true.polynominal.label"/>
           </list>
         </operator>
         <operator activated="true" class="k_means" compatibility="6.1.000" expanded="true" height="76" name="Clustering" width="90" x="380" y="30">
           <parameter key="k" value="3"/>
         </operator>
         <operator activated="true" class="generate_data" compatibility="6.1.000" expanded="true" height="60" name="Generate Data" width="90" x="514" y="255">
           <parameter key="number_examples" value="10000"/>
           <parameter key="attributes_lower_bound" value="0.0"/>
           <parameter key="attributes_upper_bound" value="1.0"/>
         </operator>
         <operator activated="true" class="multiply" compatibility="6.1.000" expanded="true" height="94" name="Multiply" width="90" x="514" y="120"/>
         <operator activated="true" class="apply_model" compatibility="6.1.000" expanded="true" height="76" name="Apply Model" width="90" x="715" y="165">
           <list key="application_parameters"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Clustering" to_port="example set"/>
         <connect from_op="Clustering" from_port="cluster model" to_op="Multiply" to_port="input"/>
         <connect from_op="Clustering" from_port="clustered set" to_port="result 1"/>
         <connect from_op="Generate Data" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
         <connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
         <connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="model"/>
         <connect from_op="Apply Model" from_port="labelled data" to_port="result 3"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
         <portSpacing port="sink_result 4" spacing="0"/>
       </process>
     </operator>
    </process>
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
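
The border idea above (label many random points with the trained model, then plot them coloured by cluster) can be sketched outside RapidMiner in a few lines of plain Python. This is a minimal sketch, assuming that applying a k-means model amounts to assigning each point to its nearest centroid by Euclidean distance. The centroid values and the helper names `nearest_centroid` and `label_random_points` are hypothetical and are not taken from the mouse data set or from RapidMiner.

```python
import random


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def label_random_points(centroids, n=10000, lower=0.0, upper=1.0, seed=0):
    """Draw n uniform random points in [lower, upper]^d and label each by its nearest centroid."""
    rng = random.Random(seed)
    dim = len(centroids[0])
    points = [tuple(rng.uniform(lower, upper) for _ in range(dim)) for _ in range(n)]
    return [(p, nearest_centroid(p, centroids)) for p in points]


# Hypothetical centroids for illustration only (NOT the real mouse centroids).
centroids = [(0.25, 0.75), (0.75, 0.75), (0.5, 0.3)]
labelled = label_random_points(centroids, n=1000)
```

Scatter-plotting `labelled` with one colour per cluster index fills the unit square, and the colour changes trace the cluster borders.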
  • timc03 Member Posts: 4 Contributor I
    Thanks for both answers. However, the mouse data set used has only 2 dimensions, i.e. it has already had its dimensions reduced by PCA or some other method. I am looking for a way to visualise k-means clustering results without dimension reduction.
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,078  RM Data Scientist
    Hi,

    What about a deviation plot? That way you could show in which attributes the clusters differ.
    For the sonar data set it would look like this:

    image

    I would recommend the local normalization option.

    Edit: There is a similar plot for the centroids in the model.
  • timc03 Member Posts: 4 Contributor I
    So maybe I should rephrase, this time using a text mining example. After k-means, every term belongs more or less to a cluster. I want to chart the relative position of each term to each cluster. This should be possible in a low-dimensional graphical space, given that each cluster has a mean centroid. I hope that helps.
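
One possible reading of "relative position of each term to each cluster" is the distance from each example to each centroid. That replaces the original n attributes with k coordinates, one per cluster, which can be charted directly when k is 2 or 3. The sketch below is plain Python rather than a RapidMiner process; the vectors, the centroids, and the helper name `distances_to_centroids` are made up for illustration.

```python
import math


def distances_to_centroids(point, centroids):
    """Return the Euclidean distance from `point` to every centroid, in centroid order."""
    return [math.dist(point, c) for c in centroids]


# Hypothetical 4-dimensional term vectors and k = 3 centroids (illustrative values only).
terms = {
    "alpha": (1.0, 0.0, 0.0, 0.0),
    "beta": (0.0, 1.0, 0.0, 1.0),
}
centroids = [
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 1.0),
]

# Each term is now described by 3 numbers instead of 4 attributes.
positions = {name: distances_to_centroids(vec, centroids) for name, vec in terms.items()}
```

With thousands of attributes the same function still returns only k numbers per term, so the chart size depends on the number of clusters and not on the vocabulary size.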
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,078  RM Data Scientist
    Each of your cluster centroids is given by an n-dimensional vector. In the case of text mining, the vector most likely has some thousand entries. I guess there is no way to show a 1000-dimensional vector.
  • timc03 Member Posts: 4 Contributor I
    "First, regarding the centroids: if you take a look at the model itself, it has a "centroid table" tab. There you can find your centroids."

    This table contains only values for each variable, not the mean group centroid. The mean group centroid is the value I am interested in. Any suggestions?
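
A note on terminology that may help here: in k-means, a cluster's centroid is itself the vector of per-attribute means of the cluster's members, so there is one value per attribute and no additional single number. Whether that is what "mean group centroid" refers to is an assumption, since the thread does not settle it. The plain-Python sketch below shows the computation; the function names and data are hypothetical.

```python
def centroid(points):
    """Return the per-attribute mean of a non-empty list of equal-length points."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / n for j in range(dim))


def centroids_by_cluster(points, labels):
    """Group points by cluster label and return {label: centroid of that cluster}."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: centroid(members) for lab, members in groups.items()}


# Hypothetical data: three 2-dimensional examples in two clusters.
points = [(0.0, 0.0), (2.0, 4.0), (10.0, 10.0)]
labels = [0, 0, 1]
result = centroids_by_cluster(points, labels)
```

Under this reading, each row of values per cluster in a centroid table corresponds to one entry of `result`.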