"Calculating K-means Performance"

DancingSheep · May 2011

Hello,

I have the following flow which uses a simple k-means and works perfectly (for my purpose at least!).

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="parallelize_main_process" value="false"/>
    <process expanded="true" height="521" width="681">
      <operator activated="true" class="read_csv" compatibility="5.1.006" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="/Users/GiO/Desktop/csv/AlmostFull.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="trim_lines" value="false"/>
        <parameter key="use_quotes" value="true"/>
        <parameter key="quotes_character" value="&quot;"/>
        <parameter key="escape_character_for_quotes" value="\"/>
        <parameter key="skip_comments" value="false"/>
        <parameter key="comment_characters" value="#"/>
        <parameter key="parse_numbers" value="true"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="date_format" value=""/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="encoding" value="SYSTEM"/>
        <list key="data_set_meta_data_information"/>
        <parameter key="read_not_matching_values_as_missings" value="true"/>
        <parameter key="datamanagement" value="double_array"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.006" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
        <parameter key="name" value="favgame"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.1.006" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="value_type"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="polynominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="5.1.006" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="30">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="no_missing_values"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default" value="zero"/>
        <list key="columns"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.006" expanded="true" height="76" name="Clustering" width="90" x="246" y="390">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="3"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="extract_prototypes" compatibility="5.1.006" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="380" y="300"/>
      <operator activated="true" class="sort" compatibility="5.1.006" expanded="true" height="76" name="Sort" width="90" x="380" y="390">
        <parameter key="attribute_name" value="cluster"/>
        <parameter key="sorting_direction" value="increasing"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Sort" to_port="example set input"/>
      <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
      <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="270"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="54"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Now I'd like to calculate its performance, but connecting Cluster Count Performance to the exa port of Extract Cluster Prototypes gave -0.000 for every cluster.
I have no idea whether I'm doing something wrong or whether that's the expected result. Do you have any suggestions?

Thanks

EDIT: I no longer care about sorting the result, so feel free to delete the Sort operator when working with my process.

awchisholm · May 2011

Hello

You can use the "Map Clustering on Labels" operator to see how closely the clusters match the existing labels. You can then feed its result to a Performance operator to get a confusion matrix. Cluster Count Performance should be connected to the cluster model output of the clustering operator, before Extract Cluster Prototypes, to count clusters; in this case, though, it will always return the value of k set in the k-means operator.

Here's an example using the iris data set.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <process expanded="true" height="611" width="748">
      <operator activated="false" class="read_csv" compatibility="5.1.006" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="/Users/GiO/Desktop/csv/AlmostFull.csv"/>
        <parameter key="column_separators" value=","/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="false" class="set_role" compatibility="5.1.006" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
        <parameter key="name" value="favgame"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="false" class="select_attributes" compatibility="5.1.006" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="value_type"/>
        <parameter key="value_type" value="polynominal"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <operator activated="false" class="replace_missing_values" compatibility="5.1.006" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="30">
        <parameter key="attribute_filter_type" value="no_missing_values"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="default" value="zero"/>
        <list key="columns"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="5.1.006" expanded="true" height="60" name="Retrieve" width="90" x="112" y="255">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.006" expanded="true" height="76" name="Clustering" width="90" x="246" y="255">
        <parameter key="k" value="3"/>
      </operator>
      <operator activated="true" class="cluster_count_performance" compatibility="5.1.006" expanded="true" height="76" name="Performance (2)" width="90" x="380" y="300"/>
      <operator activated="false" class="sort" compatibility="5.1.006" expanded="true" height="76" name="Sort" width="90" x="380" y="435">
        <parameter key="attribute_name" value="cluster"/>
      </operator>
      <operator activated="true" class="map_clustering_on_labels" compatibility="5.1.006" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="514" y="165"/>
      <operator activated="true" class="extract_prototypes" compatibility="5.1.006" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="648" y="165"/>
      <operator activated="true" class="performance" compatibility="5.1.006" expanded="true" height="76" name="Performance" width="90" x="648" y="30"/>
      <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Retrieve" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Performance (2)" to_port="cluster model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Map Clustering on Labels" to_port="example set"/>
      <connect from_op="Performance (2)" from_port="cluster model" to_op="Map Clustering on Labels" to_port="cluster model"/>
      <connect from_op="Performance (2)" from_port="performance" to_port="result 5"/>
      <connect from_op="Map Clustering on Labels" from_port="example set" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Map Clustering on Labels" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
      <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 3"/>
      <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 4"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
      <portSpacing port="sink_result 6" spacing="0"/>
    </process>
  </operator>
</process>

regards

Andrew
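
For readers who want to see the idea behind mapping clusters onto labels outside RapidMiner, here is a minimal Python sketch. It assumes a simple majority vote (each cluster is named after its most frequent label), which may not be the algorithm RapidMiner's operator actually uses; the function names and the toy data are made up for illustration.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(clusters, labels):
    """Name each cluster after the label that occurs most often among its members."""
    members = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        members[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Made-up toy data: cluster assignments and the known labels of eight examples.
clusters = ["cluster_0", "cluster_0", "cluster_0", "cluster_1",
            "cluster_1", "cluster_2", "cluster_2", "cluster_2"]
labels = ["setosa", "setosa", "versicolor", "versicolor",
          "versicolor", "virginica", "virginica", "versicolor"]

mapping = map_clusters_to_labels(clusters, labels)
predicted = [mapping[c] for c in clusters]
matrix = confusion_matrix(labels, predicted)
accuracy = sum(t == p for t, p in zip(labels, predicted)) / len(labels)

print(mapping)
print(matrix)
print(accuracy)
```

The mapped cluster names play the role of a prediction, which is why the result can be scored against the original label like a classifier's output.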

DancingSheep · May 2011

First of all thanks for the answer.

Now, I have an issue with Map Clustering on Labels: I have 7 possible labels and only k = 3 (I'd like to try 3 <= k <= 6). Any solution for this?

EDIT: Wait! I might have found what I was looking for... Time to check!
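
To illustrate why 7 labels with k = 3 is awkward, here is a small Python sketch. It assumes a majority-vote mapping (each cluster takes its most frequent label), which is only an assumption about how such a mapping could work and not a description of the RapidMiner operator; the label values and cluster assignments are invented.

```python
from collections import Counter, defaultdict


def majority_mapping(clusters, labels):
    """Map each cluster to its most frequent label (an assumed, simplified scheme)."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}


# Invented data: 7 distinct label values spread over k = 3 clusters.
labels = ["game_a", "game_a", "game_b",
          "game_c", "game_c", "game_c", "game_d",
          "game_e", "game_e", "game_f", "game_g"]
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

mapping = majority_mapping(clusters, labels)
predicted = [mapping[c] for c in clusters]

# With k clusters, at most k distinct labels can ever be predicted.
never_predicted = set(labels) - set(mapping.values())
correct = sum(t == p for t, p in zip(labels, predicted))

print(mapping)
print(sorted(never_predicted))
print(correct, "of", len(labels), "examples match their mapped label")
```

Under this scheme every example whose label is not the majority of some cluster counts as an error, so the confusion matrix has whole columns that stay empty whenever k is smaller than the number of label values.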

DancingSheep · May 2011

Ok, sorry for the double post, but I finally tried what I had in mind.
I need to check the performance of cluster prediction; here is what I made:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <process expanded="true" height="511" width="681">
      <operator activated="true" class="read_csv" compatibility="5.1.006" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="/Users/GiO/Desktop/csv/AlmostFull.csv"/>
        <parameter key="column_separators" value=","/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.006" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
        <parameter key="name" value="favgame"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.1.006" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="value_type"/>
        <parameter key="value_type" value="polynominal"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="5.1.006" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="30">
        <parameter key="attribute_filter_type" value="no_missing_values"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="default" value="zero"/>
        <list key="columns"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.006" expanded="true" height="76" name="Clustering" width="90" x="45" y="210">
        <parameter key="k" value="3"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.006" expanded="true" height="76" name="Set Role (2)" width="90" x="246" y="165">
        <parameter key="name" value="cluster"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="cluster_count_performance" compatibility="5.1.006" expanded="true" height="76" name="Performance (2)" width="90" x="246" y="300"/>
      <operator activated="true" class="map_clustering_on_labels" compatibility="5.1.006" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="380" y="165"/>
      <operator activated="true" class="extract_prototypes" compatibility="5.1.006" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="581" y="165"/>
      <operator activated="true" class="performance" compatibility="5.1.006" expanded="true" height="76" name="Performance" width="90" x="514" y="300"/>
      <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Performance (2)" to_port="cluster model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Map Clustering on Labels" to_port="example set"/>
      <connect from_op="Performance (2)" from_port="cluster model" to_op="Map Clustering on Labels" to_port="cluster model"/>
      <connect from_op="Performance (2)" from_port="performance" to_port="result 4"/>
      <connect from_op="Map Clustering on Labels" from_port="example set" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Map Clustering on Labels" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
      <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
      <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
      <connect from_op="Performance" from_port="example set" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="144"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="108"/>
      <portSpacing port="sink_result 4" spacing="36"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

It doesn't work, but I don't understand the error.

awchisholm · May 2011

Hello

It's difficult to tell, but it's probably because you are setting the role of the cluster attribute to label.

Delete the "Set Role (2)" operator.

regards

Andrew

DancingSheep · May 2011

But that's exactly what I'm trying to achieve: performance on cluster prediction. Any (other?) way I could make this work?

awchisholm · May 2011

Hello

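There is already a label: favgame. The clustering algorithm creates clusters, and the "Map Clustering on Labels" operator tries to map those clusters to the labels. To do that it needs an attribute with the role cluster. Your "Set Role (2)" operator changes the role of the cluster attribute to label, so no attribute with the role cluster is left and the mapping cannot work.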

regards

Andrew
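
The role problem described above can be pictured with a toy Python model. This is not RapidMiner code: the role table, the helper functions, and the `hours_played` attribute are hypothetical, and the model deliberately ignores what RapidMiner does with the previous label when a second attribute is given the label role.

```python
def set_role(roles, attribute, role):
    """Return a copy of the role table with `attribute` assigned to `role`."""
    updated = dict(roles)
    updated[attribute] = role
    return updated


def attributes_with_role(roles, role):
    """List the attributes that currently carry the given role."""
    return [name for name, current in roles.items() if current == role]


def can_map_clusters_to_labels(roles):
    """The mapping needs at least one cluster attribute and one label attribute."""
    return bool(attributes_with_role(roles, "cluster")) and bool(
        attributes_with_role(roles, "label")
    )


# Roles as they would look right after k-means has added its cluster attribute.
after_clustering = {"favgame": "label", "cluster": "cluster", "hours_played": "regular"}

# What "Set Role (2)" does in the failing process: cluster becomes a label.
after_set_role_2 = set_role(after_clustering, "cluster", "label")

print(can_map_clusters_to_labels(after_clustering))
print(can_map_clusters_to_labels(after_set_role_2))
```

Removing the extra role change leaves the cluster attribute with its cluster role, which is the state the mapping step expects.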
