"Clustering - how to determine

radone RapidMiner Certified Expert, Member Posts: 74 Guru
edited May 2019 in Help
Hello,
could anyone point me to a way of doing unsupervised clustering when I am not sure how many clusters are present in the data (i.e. how to determine k for, e.g., k-means)?
Or is the best approach to determine k visually (I have 13 attributes and the data might be quite noisy)?

Thanks for any suggestion,
radone

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Clustering always requires a human to look at and interpret the results, but the various cluster performance operators can lend a helping hand.

    Here's an example showing the Cluster Distance Performance operator producing measures for "average within centroid distance" and Davies-Bouldin as k is varied from 2 to 20 in a k-means clustering experiment. The example data in this case contains 1000 examples grouped into 8 neat clusters in a three-dimensional space. At the end of the experiment, look at the Log tab in the results and plot the two recorded measures as a function of k; you should see that something interesting happens at k = 8.

    Fortunately, this corresponds to the "correct" answer, but in real life it won't be as easy. Characteristics of the input data such as cluster shape, noise and data size will determine which clustering approach to use as well as which performance measure is appropriate. Guidance is hard to give because a) it depends on the data and b) I probably don't know :)

    regards,

    Andrew
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.004">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
        <parameter key="random_seed" value="-1"/>
        <process expanded="true" height="665" width="710">
          <operator activated="true" class="generate_data" compatibility="5.1.004" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="gaussian mixture clusters"/>
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="3"/>
          </operator>
          <operator activated="true" class="loop_parameters" compatibility="5.1.004" expanded="true" height="76" name="Loop Parameters" width="90" x="246" y="30">
            <list key="parameters">
              <parameter key="Clustering.k" value="[2.0;20;19;linear]"/>
            </list>
            <process expanded="true" height="665" width="710">
              <operator activated="true" class="k_means" compatibility="5.1.004" expanded="true" height="76" name="Clustering" width="90" x="45" y="30">
                <parameter key="k" value="20"/>
                <parameter key="max_runs" value="1000"/>
                <parameter key="use_local_random_seed" value="true"/>
                <parameter key="local_random_seed" value="2"/>
              </operator>
              <operator activated="true" class="cluster_distance_performance" compatibility="5.1.004" expanded="true" height="94" name="Performance" width="90" x="246" y="30">
                <parameter key="normalize" value="true"/>
              </operator>
              <operator activated="true" class="log" compatibility="5.1.004" expanded="true" height="76" name="Log" width="90" x="447" y="30">
                <list key="log">
                  <parameter key="DaviesBouldin" value="operator.Performance.value.DaviesBouldin"/>
                  <parameter key="avgWithinDistance" value="operator.Performance.value.avg_within_distance"/>
                  <parameter key="k" value="operator.Clustering.parameter.k"/>
                </list>
              </operator>
              <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
              <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
              <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
              <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
              <portSpacing port="sink_result 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
          <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
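
For readers without RapidMiner at hand, the same idea (run k-means for a range of k, log two quality measures, then inspect them as a function of k) can be sketched in plain Python. This is only an illustration under stated assumptions, not the RapidMiner implementation: the function names are made up for this sketch, the data is a small synthetic set of 4 Gaussian blobs rather than the 8 clusters and 1000 examples generated in the process above, the "average within distance" here is the plain mean Euclidean distance to the assigned centroid, and RapidMiner's own scaling and sign conventions for its measures may differ.

```python
import math
import random

def make_blobs(n_points, centers, spread, rng):
    """Draw n_points from isotropic Gaussians around the given centers (round-robin)."""
    return [[rng.gauss(c, spread) for c in centers[i % len(centers)]]
            for i in range(n_points)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(points, k, rng, max_iter=100):
    """Plain Lloyd's algorithm with random data points as initial centroids."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = mean(members)
    return centroids, labels

def avg_within_distance(points, centroids, labels):
    """Mean Euclidean distance from each point to its assigned centroid."""
    return sum(dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)

def davies_bouldin(points, centroids, labels):
    """Davies-Bouldin index: lower means tighter, better separated clusters."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        scatter.append(sum(dist(p, centroids[j]) for p in members) / len(members)
                       if members else 0.0)
    total = 0.0
    for i in range(k):
        ratios = [(scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                  for j in range(k)
                  if j != i and dist(centroids[i], centroids[j]) > 0]
        total += max(ratios, default=0.0)
    return total / k

def sweep_k(points, k_values, rng, n_runs=5):
    """For each k, keep the best of n_runs restarts and record both measures."""
    rows = []
    for k in k_values:
        best = None
        for _ in range(n_runs):
            centroids, labels = k_means(points, k, rng)
            within = avg_within_distance(points, centroids, labels)
            if best is None or within < best[0]:
                best = (within, centroids, labels)
        within, centroids, labels = best
        rows.append((k, within, davies_bouldin(points, centroids, labels)))
    return rows

if __name__ == "__main__":
    rng = random.Random(2)
    # Assumed toy data: 4 well separated blobs in three dimensions.
    centers = [(0, 0, 0), (10, 10, 0), (10, 0, 10), (0, 10, 10)]
    points = make_blobs(300, centers, 0.8, rng)
    print(" k  avg_within  davies_bouldin")
    for k, within, db in sweep_k(points, range(2, 9), rng):
        print(f"{k:2d}  {within:10.3f}  {db:14.3f}")
```

Plotting the two columns against k plays the same role as plotting the Log tab in RapidMiner: look for the k where the within-centroid distance stops dropping sharply and where Davies-Bouldin is low. Because k-means can get stuck in local optima, the sketch keeps the best of several restarts per k, loosely analogous to the max_runs parameter in the process above; on noisy real data the picture will be less clear-cut than on neat synthetic blobs.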