"universal clustering validation"

IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
edited May 23 in Help
Original messages from SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=2036214&forum_id=390413

Hi,
I have tried Validation -> Clustering in many ways. For example, ClusterCentroidEvaluator works
only for k-means (it needs a centroid-based learner), and almost all the others are for
hierarchical clustering or density estimators; I have not found a learner which
produces a FlatClusterModel. In any case, the best measure is the supervised kind, which compares the
clustered data with the labeled input data (as for classification). The core of the problem is
that these validations cannot be applied to two different unsupervised learners.
A supervised estimate of the error would give the best result, but can I do
that? I know it is possible in principle, but can it be done in RapidMiner? Is there some kind of
validation in RapidMiner that is applicable to every learner model and gives a good measure of clustering
validity? Or has somebody implemented it additionally in Java? I have read the manual and cannot find any information about this.

Thanks for your replies


Answer by ****:

Hello,

for supervised cluster evaluation you have at least two options:

1) compare the cluster names with a set of predefined labels.

One could of course ask why one should cluster data which is already labelled.

--> There is currently no such operator in RapidMiner, so you would have to implement something like this yourself. We have also added this operator to our todo list.

For a small number of clusters, there is a workaround that uses only existing operators and needs no coding. Find out which cluster number corresponds best to which label, and use the AttributeValueMapper to map each cluster number to the corresponding label. Then change the role of the cluster attribute to prediction with the ChangeAttributeRole operator, and use one of the performance evaluation operators to calculate the performance. The single operator mentioned above could do all of this automatically, which matters because the search for the best cluster-to-label mapping becomes cumbersome for larger numbers of clusters.


2) use a cross validation on a supervised learning scheme with the cluster as label and look how good it can be learned.

There is a lot of discussion about this evaluation method elsewhere (which I will not start here), but at least it can easily be done with the existing operators.
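Option 2 can be sketched outside RapidMiner as follows. This is only a minimal illustration, assuming a plain 1-nearest-neighbour learner and a simple interleaved k-fold split; the function names are hypothetical and do not correspond to RapidMiner operators. A high accuracy means the cluster assignment can be learned well from the attributes.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], query))
    return train_labels[best]


def cross_validated_accuracy(points, cluster_ids, folds=5):
    """Estimate how well the cluster ids can be learned from the points.

    Assumption: example i is assigned to fold i % folds, and the learner
    is a 1-nearest-neighbour classifier. Both are illustrative choices.
    """
    n = len(points)
    correct = 0
    for fold in range(folds):
        train_idx = [i for i in range(n) if i % folds != fold]
        test_idx = [i for i in range(n) if i % folds == fold]
        train_points = [points[i] for i in train_idx]
        train_labels = [cluster_ids[i] for i in train_idx]
        for i in test_idx:
            predicted = nearest_neighbor_predict(train_points, train_labels, points[i])
            if predicted == cluster_ids[i]:
                correct += 1
    return correct / n


if __name__ == "__main__":
    # Two well-separated groups, with the cluster id used as the label.
    points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.1),
              (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.3, 5.3), (5.1, 5.1)]
    cluster_ids = [0] * 5 + [1] * 5
    print(cross_validated_accuracy(points, cluster_ids))
```

Any other supervised learner could be substituted for the 1-NN step; the point is only that the cluster attribute takes the role of the label.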

Cheers,
Ingo


Answer by topic starter:

So, for example, I have the iris data with the labels iris-setosa, iris-versicolor, and iris-virginica, and I applied k-means, which gives me cluster IDs such as 2, 0, 1 (in no particular order). Cluster 2 may mistakenly be split between setosa and versicolor, and cluster 0 may contain only half of versicolor. If this can be done the way you described, what do I need to enter in the operators you mentioned?


Edit by topic starter:

The best solution would be if you could send some XML code, because I don't know how to set it up. First, I don't know what "attributes" and "replace what" mean in AttributeValueMapper. I think "replace what" could be iris-setosa, iris-virginica, iris-versicolor, and "replace by" could be 0, 1, 2. But what does "attributes" mean? And what does "name" mean in ChangeAttributeRole? It would be better to send an example; that will be quicker than a long explanation.


Answer by Ingo:

Hi,

here you go (although you really could try it first to find such a setup - you will learn quicker then ;-):

<operator name="Root" class="Process" expanded="yes">
    <!-- Generate 400 labelled examples from Gaussian mixture clusters. -->
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_examples" value="400"/>
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="target_function" value="gaussian mixture clusters"/>
    </operator>
    <!-- Cluster the data into 4 clusters. -->
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="4"/>
    </operator>
    <!-- Map each cluster number to the label value it corresponds to best. -->
    <operator name="AttributeValueMapper" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster3"/>
        <parameter key="replace_what" value="0"/>
    </operator>
    <operator name="AttributeValueMapper (2)" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster2"/>
        <parameter key="replace_what" value="1"/>
    </operator>
    <operator name="AttributeValueMapper (3)" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster0"/>
        <parameter key="replace_what" value="2"/>
    </operator>
    <operator name="AttributeValueMapper (4)" class="AttributeValueMapper">
        <parameter key="apply_to_special_features" value="true"/>
        <parameter key="attributes" value="cluster"/>
        <parameter key="replace_by" value="cluster1"/>
        <parameter key="replace_what" value="3"/>
    </operator>
    <!-- Copy the cluster attribute, since the ClusterModel depends on the original. -->
    <operator name="AttributeCopy" class="AttributeCopy">
        <parameter key="attribute_name" value="cluster"/>
        <parameter key="new_name" value="cluster_pred"/>
    </operator>
    <!-- Treat the copy as the prediction and evaluate it against the label. -->
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="cluster_pred"/>
        <parameter key="target_role" value="prediction"/>
    </operator>
    <operator name="Performance" class="Performance">
    </operator>
</operator>

Please note that the mappings would not be necessary if we added an operator that performs the search for the best mapping. The attribute copy ("attributes" are the same as "features", "variables", or often "columns" in RapidMiner) is necessary because the ClusterModel depends on the cluster attribute, so we are not allowed to simply change the role of the cluster attribute. Alternatively, you could copy the complete data set with an IOMultiplier (only the view is copied, not the data) or remove the cluster model with an IOConsumer. As you can see, there are often many options for achieving the same goal in RapidMiner.
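The search for the best mapping mentioned above could look roughly like the following sketch. It is not a RapidMiner operator: the function name is hypothetical, and it assumes an exhaustive search over all assignments of labels to clusters, which is only practical for a small number of clusters.

```python
from itertools import permutations


def best_cluster_to_label_mapping(cluster_ids, labels):
    """Try every one-to-one assignment of labels to clusters.

    Returns the mapping with the most agreements and the resulting accuracy.
    Assumption: there are at most as many clusters as distinct labels.
    """
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than labels; no one-to-one mapping exists")

    best_mapping, best_hits = None, -1
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(1 for c, y in zip(cluster_ids, labels) if mapping[c] == y)
        if hits > best_hits:
            best_mapping, best_hits = mapping, hits
    return best_mapping, best_hits / len(labels)


if __name__ == "__main__":
    # Toy data: cluster 2 mixes one versicolor example in with setosa.
    cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1]
    labels = ["setosa", "setosa", "versicolor", "versicolor",
              "versicolor", "virginica", "virginica", "virginica"]
    mapping, accuracy = best_cluster_to_label_mapping(cluster_ids, labels)
    print(mapping, accuracy)
```

The number of assignments grows factorially with the number of clusters, so for larger problems the same search is usually treated as an assignment problem and solved with, for example, the Hungarian algorithm.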

Cheers,
Ingo


Answer by John:

Many thanks. A nice, sophisticated way :). This finally helped. Dealing with 3 clusters is no problem. So I try to find the highest accuracy that I get from the Performance operator in the Validation group. The results seem rather poor: the best are k-medoids and k-means with 89%. I think this is very similar to the adjusted Rand criterion, since it uses the same table of how many objects from one class fall into another. Interestingly, k-medoids and k-means give better results when the data set is not normalized (with normalization both reach only 82%), whereas batch k-means (simple k-means from Weka) and x-means do about 1% better on normalized data. So what do you think? Is it better to use normalized data or not? I used to think that normalizing the data is important. Above all, why does normalization degrade the clustering quality of k-means and k-medoids?

Thanks for your help with the validation through that mapping.
Have a nice day
Regards, John


Answer by Anonymous:

hi. 

We have to deliver a paper about unsupervised clustering, and we don't know which operators we can use in RapidMiner.

Any help is important!

Thanks


Edit by Anonymous:

Sorry... I forgot to say that we used k-means and now have to validate it, and we don't know how.

Thanks


Answer by Ingo:

Hello,

> We have to deliver a paper about unsupervised clustering, and we don't know which operators we can use in RapidMiner.

You mean besides the ones discussed above? I would suggest that you first try the process specified above. There are also a lot of clustering examples in the sample directory delivered together with RapidMiner.

Cheers,
Ingo


Answer by Anonymous:

Yes. But those are all for supervised classification. We want an unsupervised learner. We found out that one validation measure is SSE; how can we compute it in RapidMiner? We tried ClusterCentroidEvaluator, but the result was DB = -0.56. What does that mean? Does anyone know?

Thanks.

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
    Hi,

    > Yes. But those are all for supervised classification. We want an unsupervised learner. We found out that one validation measure is SSE; how can we compute it in RapidMiner? We tried ClusterCentroidEvaluator, but the result was DB = -0.56. What does that mean? Does anyone know?

    DB is short for Davies-Bouldin index (google for it). The value usually has to be minimized, but since all optimization problems in RapidMiner are solved internally as maximization problems, we multiply the DB index by -1 to obtain a maximization problem instead. A reported value of -0.56 therefore corresponds to a DB index of 0.56.
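To make the two measures concrete, here is a minimal sketch of the textbook definitions of SSE and the Davies-Bouldin index, with the sign flipped at the end as described above. The function names are hypothetical, Euclidean distance is assumed, and RapidMiner's own implementation may differ in details.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equally long tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def sum_of_squared_errors(clusters):
    """SSE: squared distance of every point to its own cluster centroid."""
    total = 0.0
    for points in clusters.values():
        center = centroid(points)
        total += sum(math.dist(p, center) ** 2 for p in points)
    return total


def davies_bouldin(clusters):
    """Textbook Davies-Bouldin index; lower values mean better separation.

    Assumption: at least two clusters with distinct centroids.
    """
    ids = list(clusters)
    centers = {i: centroid(clusters[i]) for i in ids}
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = {i: sum(math.dist(p, centers[i]) for p in clusters[i]) / len(clusters[i])
               for i in ids}
    worst_ratios = []
    for i in ids:
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in ids if j != i))
    return sum(worst_ratios) / len(ids)


if __name__ == "__main__":
    clusters = {0: [(0.0, 0.0), (0.0, 2.0)], 1: [(10.0, 0.0), (10.0, 2.0)]}
    db = davies_bouldin(clusters)
    print("SSE:", sum_of_squared_errors(clusters))
    print("DB:", db, "value to maximize:", -db)
```

Because the index is negated before it is reported, values closer to zero indicate more compact, better separated clusters.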

    Cheers,
    Ingo
  • dafedafe Member Posts: 3 Contributor I
    Hi Ingo

    this post is a bit old (and I'm not sure this is the right forum for my question), but I've been trying to find some references for point 2) of your reply about clustering validation and couldn't really find anything related:
    **** wrote:

    > 2) use a cross validation on a supervised learning scheme with the cluster as label and look how good it can be learned.
    >
    > There is a lot of discussion about this evaluation method outside (which I will not start here) but at least this can easily be done with the existing operators.

    Could you point me to literature (papers), websites, forums, ... where this discussion is going on?

    Thanks a lot!
    damon
  • amyargamyarg Member Posts: 2 Contributor I
    Hi, I am from Argentina.
    I have a question: I want to validate clusterings produced by algorithms such as DBSCAN or k-medoids. My question is how to validate them in RapidMiner, which seems to offer performance analysis only for the k-means and x-means algorithms. I read that validators from R can be plugged in through the R extension for RapidMiner, but I don't know how. Is the Davies-Bouldin index usable only for k-means or k-medoids? Any suggestion on how I can tell whether these DBSCAN or k-medoids clustering results are good? Help please, and thanks!