
KMeans - different output on the same data

ShubhaShubha Member Posts: 139 Maven
edited November 2018 in Help
Hi,

Please find the below code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="sum classification"/>
        <parameter key="number_examples" value="50"/>
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="KMeans" class="KMeans">
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object" value="ExampleSet"/>
        <parameter key="select_which" value="2"/>
    </operator>
    <operator name="KMeans (2)" class="KMeans">
    </operator>
</operator>
I have generated a dataset (as required for KMeans) and made a copy of it, so I now have two copies of the same dataset. I then apply the "KMeans" operator to both example sets, but the cluster centroids and the cluster groupings differ between the two. Why is this? Does it depend on some seed value?

If I re-run the code, the same example sets are generated, and the first and second example sets each get the same cluster centroids and groupings as in the previous run. Why is this? I then enabled the option "use_local_random_seed" in KMeans. This made the cluster centroids and groupings identical for both datasets.

Questions:
1. What does enabling "use_local_random_seed" actually do?

2. The cluster centroids and groupings of the first and second example sets are the same no matter how many times we run the process, but the two KMeans operators applied to the same data within a single run always give different results. Does this mean that RM always applies a seed "A" the first time it encounters a KMeans operator and a seed "B" the second time?

3. How do we choose a value for "use_local_random_seed"? What are its minimum and maximum values?

4. For simplicity, one can also consider the code below.
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="sum classification"/>
        <parameter key="number_examples" value="50"/>
    </operator>
    <operator name="KMeans" class="KMeans" breakpoints="before">
    </operator>
    <operator name="KMeans (2)" class="KMeans" breakpoints="before">
    </operator>
</operator>

Many Thanks,
Shubha.

Answers

  • haddockhaddock Member Posts: 849 Maven
    There has been quite a lot of discussion recently about random numbers, did you read it?
  • ShubhaShubha Member Posts: 139 Maven
    Thanks, Haddock. The thread at http://rapid-i.com/rapidforum/index.php/topic,2251.0.html was helpful.

    I encountered another problem with respect to KMedoids.

    The code below:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="sum classification"/>
            <parameter key="number_examples" value="50"/>
        </operator>
        <operator name="KMedoids" class="KMedoids" breakpoints="before">
        </operator>
        <operator name="KMedoids (2)" class="KMedoids" breakpoints="before">
        </operator>
        <operator name="KMedoids (3)" class="KMedoids" breakpoints="before">
        </operator>
        <operator name="KMedoids (4)" class="KMedoids" breakpoints="before">
        </operator>
    </operator>
    The operators "KMedoids", "KMedoids (2)", "KMedoids (3)" and "KMedoids (4)" all have the same options, yet "KMedoids" behaves differently from the other KMedoids operators.

    "KMedoids" has the following centroid values for Cluster 0:
    Cluster 0 att1:1.395    att2:5.951    att3:2.581    att4:-0.637    att5:-2.674
    "KMedoids (2)", "KMedoids (3)" and "KMedoids (4)" have these centroid values for Cluster 0:
    Cluster 0 att1:3.270    att2:-7.064    att3:-2.953    att4:-4.882    att5:6.055
    After "KMedoids (4)", I tried adding further KMedoids operators, but the centroid values of "KMedoids (2)" still prevail.

    So I then introduced "KMedoids (5)" by COPYING the "KMedoids" operator. To my surprise, its centroid values for Cluster 0 are the same as those of "KMedoids", not "KMedoids (2)"....

    Yet "KMedoids", "KMedoids (2)", "KMedoids (3)", "KMedoids (4)" and "KMedoids (5)" are all configured identically, with local random seed -1.

    Thanks,
    Shubha
  • haddockhaddock Member Posts: 849 Maven
    Is this 4.6 code?
  • ShubhaShubha Member Posts: 139 Maven
    No, it's actually 4.4. Due to some constraints, I cannot switch to the newer version at this stage.

  • ShubhaShubha Member Posts: 139 Maven
    But, is this dependent on the version?

    Thanks,
    Shubha.
  • haddockhaddock Member Posts: 849 Maven
    Well yes. 4.4 is no longer supported here by RM, so you need someone else with 4.4 to verify your problem. Good luck with that!!!
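
The behaviour described in the original question (different results for two KMeans operators within one run, but the same pair of results on every re-run) is what one would expect if operators without a local seed draw from a single shared generator that is seeded once per process run. The thread does not confirm that this is how RapidMiner 4.4 works, so the following is only a minimal sketch of that assumption in plain Python, not RapidMiner code. The function names and the seed values 2001 and 1992 are hypothetical.

```python
import random


def pick_initial_centroids(rng, n_points, k):
    """Pick k distinct point indices as starting centroids from the given generator."""
    return sorted(rng.sample(range(n_points), k))


def run_process(global_seed, local_seed=None):
    """Simulate one process run with two clustering steps on the same 50 points.

    Assumption: with no local seed, both steps consume numbers from one
    shared generator; with a local seed, each step builds its own generator.
    """
    shared = random.Random(global_seed)  # hypothetical process-wide generator
    results = []
    for _ in range(2):
        rng = random.Random(local_seed) if local_seed is not None else shared
        results.append(pick_initial_centroids(rng, n_points=50, k=2))
    return results


first_run = run_process(global_seed=2001)
second_run = run_process(global_seed=2001)
with_local = run_process(global_seed=2001, local_seed=1992)

print("run 1:", first_run)          # two steps, usually two different starts
print("run 2:", second_run)         # same pair as run 1
print("local seed:", with_local)    # both steps start from the same points
```

In this sketch the second step continues the shared sequence where the first step stopped, so the two starting configurations will typically differ, while re-running with the same global seed reproduces the same pair. Giving each step its own seeded generator makes both steps start identically, which matches what was observed after enabling the local seed.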
