"Cluster Number sorted after saving to a file"

vijaypshah · December 2009

Hello,

I am using k-means clustering. After running the whole process I write the results out to a file, but the cluster numbering changes once it is written. It seems the clusters are renumbered in order of their mean values.
For example, the centroid table showed cluster 13 with mean values 100 200 200 200 200. When I loaded the saved result file in other software to compute the means for cluster 13, they were different. I then saw that cluster 13 had been renamed to cluster 0 in the saved file (and the other cluster numbers had changed too).

Is it true that the cluster numbering is sorted by cluster mean value? I can send you the data file if you want to reproduce this with the same dataset.

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="C:\sources.aml"/>
        <parameter key="column_separators" value=";"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="15"/>
        <parameter key="max_runs" value="1000"/>
        <parameter key="max_optimization_steps" value="10000"/>
    </operator>
    <operator name="ResultWriter" class="ResultWriter">
        <parameter key="result_file" value="C:\cluster15_em_resultstat.res"/>
    </operator>
    <operator name="ItemDistributionEvaluator" class="ItemDistributionEvaluator">
        <parameter key="measure" value="SumOfSquares"/>
    </operator>
    <operator name="ClusterNumberEvaluator" class="ClusterNumberEvaluator">
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="cluster"/>
    </operator>
    <operator name="Nominal2Numerical" class="Nominal2Numerical">
    </operator>
    <operator name="DataStatistics" class="DataStatistics">
    </operator>
    <operator name="ResultWriter (2)" class="ResultWriter">
        <parameter key="result_file" value="C:\cluster15_em_stat.res"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file" value="C:\cluster15_em.dat"/>
        <parameter key="attribute_description_file" value="C:\cluster15_em.aml"/>
        <parameter key="format" value="special_format"/>
        <parameter key="special_format" value="$v[cluster]"/>
        <parameter key="overwrite_mode" value="overwrite"/>
    </operator>
    <operator name="ClusterModelWriter" class="ClusterModelWriter">
        <parameter key="cluster_model_file" value="C:\cluster15_em.clm"/>
    </operator>
</operator>

land · January 2010

Hi,
Which of the result files did you look at? The one written by the ResultWriter?

Greetings,
Sebastian

vijaypshah · January 2010

Yes, the file written by the ResultWriter.

However, I think I now understand the problem. The cluster numbers are stored as nominal values such as "cluster_0", "cluster_1", etc., so the ResultWriter reports cluster_0 as cluster 0 and so on. But when I apply the Nominal2Numerical filter, the numbering may change, i.e. cluster_0 might become 1 and cluster_1 might become 0.
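To illustrate the suspected mismatch, here is a small Python sketch (not RapidMiner code). It assumes that a nominal-to-numeric conversion hands out integer codes in the order the labels first appear in the data; that is an assumption about the behaviour, not something confirmed in this thread. The helper names `encode_by_first_appearance` and `suffix_number` are hypothetical, and the label sequence is made up.

```python
def encode_by_first_appearance(labels):
    """Assign integer codes in the order labels first appear (assumed behaviour)."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


def suffix_number(label):
    """Read the cluster number from a label such as 'cluster_13'."""
    return int(label.rsplit("_", 1)[1])


# Made-up sequence of cluster labels, in the order the examples occur.
labels = ["cluster_13", "cluster_0", "cluster_13", "cluster_1", "cluster_0"]
codes = encode_by_first_appearance(labels)

for label, code in codes.items():
    print(label, "-> code", code, "| suffix", suffix_number(label))
```

Under that assumption, the code given to a label depends only on where it first occurs in the data, so it need not match the number in the label, which would be consistent with cluster 13 showing up as 0.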

So, just to be safe, I recalculate the means from the attributes in another program, where I use the numeric cluster numbers.
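A rough sketch of that kind of cross-check in Python, assuming each exported row carries its nominal cluster label plus the attribute values. The function name `cluster_means` and the sample rows are hypothetical and only for illustration. It keys the means by the number inside the label, so the result does not depend on how the labels were converted to numbers.

```python
from collections import defaultdict


def cluster_means(rows):
    """rows: iterable of (cluster_label, values). Returns {cluster_number: [means]}."""
    sums = {}
    counts = defaultdict(int)
    for label, values in rows:
        k = int(label.rsplit("_", 1)[1])  # 'cluster_13' -> 13
        if k not in sums:
            sums[k] = [0.0] * len(values)
        for i, v in enumerate(values):
            sums[k][i] += v
        counts[k] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


# Made-up rows: (nominal cluster label, attribute values).
rows = [
    ("cluster_13", [100, 200]),
    ("cluster_13", [102, 198]),
    ("cluster_0", [10, 20]),
]
means = cluster_means(rows)
print(means)
```

The per-cluster means computed this way can then be compared with the centroid table to see which label corresponds to which centroid.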

Possibly this is a flaw in the way I designed the process.

Regards,
Vijay

Altair RapidMiner Community
