"Cross validation on clustering (k-means)"

siamak_want · September 2012

Hi all,

I want to evaluate the performance of k-means clustering with the X-Validation operator. My data does not contain a label attribute, and RM says: "Input exampleset does not have a label attribute".
Is a label attribute necessary even for clustering? Am I using the cross validation operator incorrectly?

Any help would be appreciated.

Here is my process XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
    <process expanded="true" height="390" width="820">
      <operator activated="true" class="generate_data" compatibility="5.2.009" expanded="true" height="60" name="Generate Data" width="90" x="112" y="30">
        <parameter key="target_function" value="transactions dataset"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.2.009" expanded="true" height="112" name="Validation (3)" width="90" x="313" y="30">
        <description>A cross-validation evaluating a decision tree model.</description>
        <process expanded="true" height="408" width="383">
          <operator activated="true" class="k_means" compatibility="5.2.009" expanded="true" height="76" name="Clustering (3)" width="90" x="146" y="30">
            <parameter key="k" value="10"/>
          </operator>
          <connect from_port="training" to_op="Clustering (3)" to_port="example set"/>
          <connect from_op="Clustering (3)" from_port="cluster model" to_port="model"/>
          <connect from_op="Clustering (3)" from_port="clustered set" to_port="through 1"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
        <process expanded="true" height="408" width="383">
          <operator activated="true" class="cluster_distance_performance" compatibility="5.2.009" expanded="true" height="94" name="Performance (3)" width="90" x="146" y="39"/>
          <connect from_port="model" to_op="Performance (3)" to_port="cluster model"/>
          <connect from_port="test set" to_op="Performance (3)" to_port="example set"/>
          <connect from_op="Performance (3)" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="source_through 2" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Validation (3)" to_port="training"/>
      <connect from_op="Validation (3)" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

awchisholm · September 2012

Hello

Evaluating performance requires a label representing the expected value and a predicted value to compare with it. Clustering puts examples into clusters so that the examples in one cluster are similar to one another and different from examples in other clusters; it does not need a label. Clustering does not produce a guaranteed "correct" answer. What it does is highlight to an expert which clusters may be interesting, and it is up to the expert to decide whether this is right or not. There are various techniques to help identify clusterings that are potentially interesting, and these can be implemented in RapidMiner. For k-means, the value of k is important and must be varied to determine which clustering is better according to these techniques.

See http://rapidminernotes.blogspot.co.uk/2011/03/counting-clusters-part-ii.html for an example.

Hope that helps...

Andrew

siamak_want · September 2012

Hi awchisholm,

Thanks for your attention. I read the page you provided. As you mentioned, clustering does not need any label. For example, evaluating a clustering with Davies-Bouldin does not need a label. So why does cross validation in RM expect a label? Please run my process in RM to get a better understanding of my problem.

Thanks a lot.

awchisholm · September 2012

Hello

The generated data needs a label for cross validation to stop complaining. So you could just create a dummy one using "Generate Attributes" and "Set Role". The intent is not to compare the performance of the clustering against a label; it is to determine the inherent goodness of the clustering.
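To illustrate why a dummy label should be harmless, here is a minimal Python sketch (not RapidMiner code, and not RapidMiner's implementation of its cluster distance performance). The function name, the `dummy_label` column and the toy data are all made up for the illustration. It computes an average within-centroid distance from the feature columns only, so adding a constant label column cannot change the result.

```python
from math import dist

def avg_within_centroid_distance(rows, clusters, feature_keys):
    """Mean Euclidean distance from each example to its own cluster centroid.

    Only the columns named in feature_keys are read; any label column is ignored.
    """
    # Group the feature vectors by cluster id.
    groups = {}
    for row, cluster_id in zip(rows, clusters):
        groups.setdefault(cluster_id, []).append([row[k] for k in feature_keys])

    total = 0.0
    for points in groups.values():
        centroid = [sum(col) / len(points) for col in zip(*points)]
        total += sum(dist(p, centroid) for p in points)
    return total / len(rows)

# Toy data (hypothetical): two clusters of two examples each.
rows = [{"x": 0.0, "y": 0.0}, {"x": 0.0, "y": 2.0},
        {"x": 10.0, "y": 0.0}, {"x": 10.0, "y": 2.0}]
clusters = [0, 0, 1, 1]

before = avg_within_centroid_distance(rows, clusters, ["x", "y"])

# The workaround: add a constant dummy label to every example.
labelled = [dict(row, dummy_label="none") for row in rows]
after = avg_within_centroid_distance(labelled, clusters, ["x", "y"])

print(before, after)  # both values are identical
```

The same reasoning applies to Davies-Bouldin, which is also computed from distances between examples and centroids only.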

regards

Andrew

siamak_want · September 2012

Thank you Andrew,

I did exactly as you explained, i.e. I added a dummy label attribute and a Set Role operator. At first I was afraid that my dummy label would affect the performance calculation, but now I think it has no effect on the Davies-Bouldin value, and my process works fine. To be honest, I cannot understand why RM only accepts data sets with labels for cross validation. Maybe this is an issue that the RM developers could consider. Please correct me if I am wrong.

awchisholm · September 2012

Hello

I can't speak for the developers, but they might say that there is no need to implement an enhancement because a workaround exists.

I don't know the details of what you are doing but bear in mind that the cross validation is performing a clustering on 90% of the data, applying the model to a 10% test set and then calculating a performance measure using the 90% cluster model and 10% test set. This is repeated 10 times for different subsets of the data and an average is calculated. Each of the 10 cluster models is almost certain to have a different set of cluster prototypes thereby affecting the Davies-Bouldin calculation and in turn this will affect the results. The averaging will reduce this. The 11th iteration of the cross validation will produce a cluster model on all the data that will be different from each of the 10 and would therefore produce a different clustering when applied to unseen data. The average of the 10 iterations should be a good estimate of what would happen for the 11th iteration model.
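The run counts described above can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the simple interleaved split used here is an assumption, not the sampling RapidMiner actually applies. The point is just the arithmetic: k folds of 90%/10% (for k = 10) plus one final run on all the data.

```python
def cross_validation_runs(n_examples, n_folds):
    """Yield (train_indices, test_indices) for each fold, then one full-data run."""
    indices = list(range(n_examples))
    for fold in range(n_folds):
        # Interleaved split: every n_folds-th example goes to the test set.
        test = indices[fold::n_folds]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test
    # Final run: the delivered model is built on all examples, with no test set.
    yield indices, []

runs = list(cross_validation_runs(n_examples=100, n_folds=10))
sizes = [(len(train), len(test)) for train, test in runs]

print(len(runs))   # 11 runs for 10 folds
print(sizes[0])    # (90, 10) for each validation fold
print(sizes[-1])   # (100, 0) for the final model on all data
```

Only the first ten runs contribute to the averaged performance; the last one exists to produce the model that is delivered at the output.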

The trouble is that the clustering has had no human input and so it could be wrong. The cross validation estimate of performance on unseen data won't be valuable if the clustering itself has no meaning.

A single Davies-Bouldin measure by itself is of no value. The interesting thing is to compare it with the measure in different circumstances. The key thing that must be varied is k for k-means and the really interesting thing is to see how this measure varies for different values of k.
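As a rough sketch of that comparison, the Python below runs a plain k-means for several values of k and computes a Davies-Bouldin index for each. Everything here is an assumption made for illustration: the toy data, the helper names and the simple Lloyd's-algorithm implementation are not what RapidMiner does internally, and RapidMiner may scale or sign the measure differently. In this formulation lower values indicate more compact, better separated clusters.

```python
import random
from math import dist

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, k, seed=0, iterations=50):
    """Plain Lloyd's algorithm; returns a list of non-empty clusters."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
            clusters[nearest].append(p)
        clusters = [c for c in clusters if c]  # drop clusters that became empty
        new_centres = [centroid(c) for c in clusters]
        if new_centres == centres:
            break
        centres = new_centres
    return clusters

def davies_bouldin(clusters):
    """Davies-Bouldin index for two or more clusters with distinct centroids."""
    centres = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centres)]
    worst = []
    for i in range(len(clusters)):
        # Worst-case similarity of cluster i to any other cluster.
        worst.append(max((scatter[i] + scatter[j]) / dist(centres[i], centres[j])
                         for j in range(len(clusters)) if j != i))
    return sum(worst) / len(clusters)

# Toy data (hypothetical): three blobs of 20 points each.
rng = random.Random(42)
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(20)]

# The interesting part: how the measure changes as k is varied.
scores = {k: davies_bouldin(k_means(points, k)) for k in range(2, 7)}
for k, score in scores.items():
    print(k, round(score, 3))
```

Because k-means depends on its random starting centres, a single run per k can land in a poor local optimum; repeating each k with several seeds and keeping the best score is a common precaution.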

As I said, I don't know the details of what you want so it's up to you to decide what is best for your case.

regards

Andrew

siamak_want · September 2012

Hi Andrew,

Thanks for the clear and accurate explanation. As you mentioned, it is a really interesting topic. I want to use validation to examine different values of k in k-means. I have found that when I set the number of validations to 10, it runs 11 times; if I set it for 5-fold cross validation, it runs 6 times, and so on. This puzzled me for a while, but now I think you have given the answer: the last iteration runs on the whole data, so the output model is built on all (100%) of the data. Before this discussion, I thought that the output port of the validation just delivered the "best model" out of the ten incomplete (i.e. 90%) models. Now I have a better understanding of X-Validation. Please correct me if I am wrong.

Again thanks for this nice discussion.

awchisholm · September 2012

Hello

Cross validation estimates model performance on unseen data.

If you look at the following blog entry http://rapidminernotes.blogspot.co.uk/2011/01/what-does-x-validation-do.html, it gives an example.

Cross validation needs labels in order to produce a result that has meaning. With clustering, there are no labels so any result that is produced will not be comparable to anything. As a by-product of the way it works, the cross validation is producing an average of 10 performances but I am not convinced it is better than simply using all of the data.

regards

Andrew
