K means group centroid and visualisation options

timc03 Member Posts: 4 Contributor I
edited May 2020 in Help
I am running k-means clustering in RapidMiner v6.0.008.

I am looking to visualise the results of the clustering as shown here (k means clustering graph): http://en.wikipedia.org/wiki/K-means_clustering#mediaviewer/File:ClusterAnalysis_Mouse.svg

Any suggestions on how to achieve this? I would be happy to use PCA before K Means clustering if that helps.

Also, as an aside, where is the 'cluster centroid', i.e. the mean for each cluster? I have the centroid values for each attribute in each cluster in the Cluster Model's centroid table, but I cannot find the cluster mean.

Thanks

Answers

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,996 RM Engineering
    Hi,

    I used the following process to import the mouse data taken from here: http://elki.dbs.ifi.lmu.de/wiki/DataSets

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.001-SNAPSHOT">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="6.1.001-SNAPSHOT" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="csv_file" value="C:\Users\boeck\Desktop\mouse.csv"/>
           <parameter key="column_separators" value="\s"/>
           <parameter key="skip_comments" value="true"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations"/>
           <parameter key="encoding" value="UTF-8"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="att1.true.real.attribute"/>
             <parameter key="1" value="att2.true.real.attribute"/>
             <parameter key="2" value="att3.true.polynominal.label"/>
           </list>
         </operator>
         <operator activated="true" class="k_means" compatibility="6.1.001-SNAPSHOT" expanded="true" height="76" name="Clustering" width="90" x="179" y="30">
           <parameter key="k" value="3"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Clustering" to_port="example set"/>
         <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
         <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    You can then simply use the Chart tab of the results to visualize this.

    image

    I'm not sure about your bonus question. I don't think there is an explicit option to see that, but I may be wrong there.

    Regards,
    Marco
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Hello timc03!

    First, regarding the centroids: if you take a look at the model itself, it has a "centroid table" tab. There you can find your centroids.

    Furthermore, there is a way to display the "borders" of the clusters. To do so, you apply the clustering model to random values in a given range. The result is the picture below:

    image

    I modified Marco's process a bit so that it creates this picture and also connects the model to a result port:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.000">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="6.1.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="csv_file" value="C:\Users\Martin\Downloads\mouse.csv"/>
           <parameter key="column_separators" value="\s"/>
           <parameter key="skip_comments" value="true"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations"/>
           <parameter key="encoding" value="UTF-8"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="att1.true.real.attribute"/>
             <parameter key="1" value="att2.true.real.attribute"/>
             <parameter key="2" value="att3.true.polynominal.label"/>
           </list>
         </operator>
         <operator activated="true" class="k_means" compatibility="6.1.000" expanded="true" height="76" name="Clustering" width="90" x="380" y="30">
           <parameter key="k" value="3"/>
         </operator>
         <operator activated="true" class="generate_data" compatibility="6.1.000" expanded="true" height="60" name="Generate Data" width="90" x="514" y="255">
           <parameter key="number_examples" value="10000"/>
           <parameter key="attributes_lower_bound" value="0.0"/>
           <parameter key="attributes_upper_bound" value="1.0"/>
         </operator>
         <operator activated="true" class="multiply" compatibility="6.1.000" expanded="true" height="94" name="Multiply" width="90" x="514" y="120"/>
         <operator activated="true" class="apply_model" compatibility="6.1.000" expanded="true" height="76" name="Apply Model" width="90" x="715" y="165">
           <list key="application_parameters"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Clustering" to_port="example set"/>
         <connect from_op="Clustering" from_port="cluster model" to_op="Multiply" to_port="input"/>
         <connect from_op="Clustering" from_port="clustered set" to_port="result 1"/>
         <connect from_op="Generate Data" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
         <connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
         <connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="model"/>
         <connect from_op="Apply Model" from_port="labelled data" to_port="result 3"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
         <portSpacing port="sink_result 4" spacing="0"/>
       </process>
     </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
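
The "borders" idea above can be sketched outside RapidMiner in plain Python: draw uniformly random points in a fixed range and give each one the index of its nearest centroid, so that a scatter plot coloured by label shows the region each cluster covers. This is only a minimal sketch under stated assumptions. The centroid values are made up for illustration and are not the centroids of the mouse data set, the function names are hypothetical, and the nearest-centroid rule with Euclidean distance is an assumption about how the cluster model labels new points.

```python
import random


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((p - c) ** 2 for p, c in zip(point, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def label_random_points(centroids, n_points=10000, lower=0.0, upper=1.0, seed=0):
    """Generate uniform random points in [lower, upper] and label each by nearest centroid."""
    rng = random.Random(seed)
    dims = len(centroids[0])
    labelled = []
    for _ in range(n_points):
        point = tuple(rng.uniform(lower, upper) for _ in range(dims))
        labelled.append((point, nearest_centroid(point, centroids)))
    return labelled


# Hypothetical centroids for k = 3 in two dimensions (illustrative values only).
example_centroids = [(0.25, 0.75), (0.75, 0.75), (0.5, 0.4)]
labelled_points = label_random_points(example_centroids, n_points=1000)
```

Plotting `labelled_points` with one colour per label gives filled regions whose edges are the cluster borders. The defaults of 10000 points in the range 0.0 to 1.0 mirror the Generate Data parameters in the process above.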
  • timc03 Member Posts: 4 Contributor I
    Thanks for both answers. However, the mouse data set used has 2 dimensions, i.e. its dimensions have already been reduced by PCA or some other method. I am looking for a way to visualise k-means clustering results without dimension reduction.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Hi,

    What about a deviation plot? This way you could show in which attributes the clusters differ.
    That would look like this for the sonar data set:

    image

    I would recommend the local normalization option.

    Edit: There is a similar plot for the centroids in the model.
  • timc03 Member Posts: 4 Contributor I
    So maybe I should rephrase this using a text mining example. After k-means, every term belongs more or less to a cluster. I want to chart the position of each term relative to each cluster. This should be possible in a low-dimensional graphical space, given that each cluster has a mean centroid. I hope that helps.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Your cluster centroids are each given by an n-dimensional vector. In the case of text mining, the vector most likely has some thousand entries. I guess there is no way to show a 1000-dimensional vector.
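
One possible way around the dimensionality problem, sketched here in plain Python rather than as a RapidMiner feature, is to describe each term by its distance to each of the k centroids. That gives k coordinates per term no matter how many attributes the original vectors have, so with k = 3 the result fits in a scatter plot. The vectors and names below are hypothetical, and Euclidean distance is an assumption.

```python
import math


def distances_to_centroids(vector, centroids):
    """Return the Euclidean distance from vector to every centroid, in centroid order."""
    return [
        math.sqrt(sum((v - c) ** 2 for v, c in zip(vector, centroid)))
        for centroid in centroids
    ]


# Hypothetical 4-dimensional centroids and term vectors (illustrative values only).
centroids = [(0.0, 0.0, 0.0, 0.0), (3.0, 4.0, 0.0, 0.0)]
terms = {
    "term_a": (0.0, 0.0, 0.0, 0.0),
    "term_b": (3.0, 4.0, 0.0, 0.0),
}

# Each term becomes a point with one coordinate per cluster.
cluster_space = {name: distances_to_centroids(vec, centroids) for name, vec in terms.items()}
```

In this representation a term that sits close to one centroid and far from the others lies near the corresponding axis, which shows its position relative to each cluster. It does not preserve the original geometry the way a PCA projection would.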
  • timc03 Member Posts: 4 Contributor I
    "First, regarding the centroids: if you take a look at the model itself, it has a "centroid table" tab. There you can find your centroids."

    This table contains only values for each variable, not the mean group centroid - the mean group centroid is the value I am interested in.  Any suggestions?
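
For reference, a k-means centroid is itself the per-attribute mean of the examples in the cluster, so it is a vector with one value per variable rather than a single number. The plain-Python sketch below illustrates this on made-up data. The second function is only one possible single-number summary per cluster, the average distance of the members to their centroid, and it is an assumption about what "mean group centroid" might refer to. All names and values are hypothetical.

```python
import math


def centroid(members):
    """Return the per-attribute mean of the cluster members (the k-means centroid)."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]


def mean_distance_to_centroid(members):
    """Return the average Euclidean distance of the members to their centroid."""
    center = centroid(members)
    total = 0.0
    for member in members:
        total += math.sqrt(sum((m - c) ** 2 for m, c in zip(member, center)))
    return total / len(members)


# Hypothetical cluster with three examples and three attributes.
cluster_members = [(1.0, 2.0, 3.0), (3.0, 4.0, 5.0), (5.0, 6.0, 7.0)]
cluster_centroid = centroid(cluster_members)
cluster_spread = mean_distance_to_centroid(cluster_members)
```

Here `cluster_centroid` holds one mean per attribute, which corresponds to one column of a centroid table, while `cluster_spread` condenses the cluster into a single value.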