"Cluster Number sorted after saving to a file"

vijaypshah Member Posts: 30 Maven
edited June 2019 in Help
Hello,

I am using k-means clustering. After running the whole process, I write the results to a file. However, the cluster numbering changes once the results are written.
For example, the centroid table showed cluster 13 with mean values of 100 200 200 200 200. However, when I loaded the saved result file in other software to compute the means for cluster 13, they were different. Then I saw that cluster 13 had been renamed to cluster 0 in the saved file (and the other cluster numbers had changed as well).

It seems that the cluster numbers are sorted by cluster mean value. Is this true? I can send you the data file if you want to try this with the same dataset.

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="C:\sources.aml"/>
        <parameter key="column_separators" value=";"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="15"/>
        <parameter key="max_runs" value="1000"/>
        <parameter key="max_optimization_steps" value="10000"/>
    </operator>
    <operator name="ResultWriter" class="ResultWriter">
        <parameter key="result_file" value="C:\cluster15_em_resultstat.res"/>
    </operator>
    <operator name="ItemDistributionEvaluator" class="ItemDistributionEvaluator">
        <parameter key="measure" value="SumOfSquares"/>
    </operator>
    <operator name="ClusterNumberEvaluator" class="ClusterNumberEvaluator">
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="cluster"/>
    </operator>
    <operator name="Nominal2Numerical" class="Nominal2Numerical">
    </operator>
    <operator name="DataStatistics" class="DataStatistics">
    </operator>
    <operator name="ResultWriter (2)" class="ResultWriter">
        <parameter key="result_file" value="C:\cluster15_em_stat.res"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file" value="C:\cluster15_em.dat"/>
        <parameter key="attribute_description_file" value="C:\cluster15_em.aml"/>
        <parameter key="format" value="special_format"/>
        <parameter key="special_format" value="$v[cluster]"/>
        <parameter key="overwrite_mode" value="overwrite"/>
    </operator>
    <operator name="ClusterModelWriter" class="ClusterModelWriter">
        <parameter key="cluster_model_file" value="C:\cluster15_em.clm"/>
    </operator>
</operator>

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    which of the result files did you look at? The one written by the ResultWriter?

    Greetings,
      Sebastian
  • vijaypshah Member Posts: 30 Maven
    Yes, the file written by the ResultWriter.

    However, I think I now understand the problem. The cluster numbers are nominal values, like "cluster_0", "cluster_1", etc., so the ResultWriter reports cluster_0 as cluster 0 and so on. But when I apply the Nominal2Numerical filter, the cluster numbers may change, i.e. cluster_0 might become 1 and cluster_1 might become 0.

    So, to be safe, I recalculate the means from the attributes in another program, where I use the numeric cluster numbers.

    Possibly this is a flaw in the way I designed the process.

    Regards,
    Vijay
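
The mismatch described above can be sketched outside RapidMiner. The Python snippet below is only an illustration under stated assumptions: it assumes the labels look like "cluster_N", and it uses a hypothetical "number the values in order of first appearance" encoding to show how a nominal-to-numeric conversion could hand cluster_13 the code 0. Whether Nominal2Numerical actually assigns codes this way is not confirmed in this thread. The data rows and the helper names (`cluster_index`, `first_appearance_codes`, `cluster_means`) are made up for the example.

```python
import re
from collections import defaultdict


def cluster_index(label):
    """Return the integer N from a nominal label of the form 'cluster_N'."""
    match = re.fullmatch(r"cluster_(\d+)", label.strip())
    if match is None:
        raise ValueError(f"unexpected cluster label: {label!r}")
    return int(match.group(1))


def first_appearance_codes(labels):
    """Hypothetical encoding: number nominal values in order of first appearance."""
    codes = {}
    for label in labels:
        codes.setdefault(label, len(codes))
    return codes


def cluster_means(rows):
    """Compute the mean value per cluster, keyed by the number parsed from the label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, value in rows:
        idx = cluster_index(label)
        sums[idx] += value
        counts[idx] += 1
    return {idx: sums[idx] / counts[idx] for idx in sums}


# Made-up (label, value) rows, not taken from the original dataset.
rows = [
    ("cluster_13", 100.0),
    ("cluster_0", 5.0),
    ("cluster_13", 200.0),
    ("cluster_1", 7.0),
]

codes = first_appearance_codes(label for label, _ in rows)
means = cluster_means(rows)

print(codes)  # code assigned to each label under the assumed encoding
print(means)  # means keyed by the number embedded in the label
```

Parsing the number out of the label keeps the saved cluster numbers aligned with the centroid table regardless of how the nominal values are encoded internally.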