Dynamically determine number of clusters k-means

namachoco99 · June 2017

I have a CSV file containing approximately a million records and 3 features that will be used to determine which cluster each record belongs to. I want to cluster these records using the k-Means algorithm (with Euclidean distance), and I'll use the Davies Bouldin Index (DBI) to find the optimal number of clusters.

Is there a way to automate finding the optimal number of clusters by looping through the process, with the number of clusters k incrementing on each iteration? I'm new to RapidMiner, so I'm not yet sure how to implement this in the process XML.

Thanks for any help or suggestions!

MartinLiebig · June 2017

Hello,

there are two things you can do.

1. Use the X-Means operator. It runs k-means but internally uses heuristics (I think based on DB?) to determine k.

2. Put a loop around the clustering and run the algorithm with several values of k. You can then pick the best k. I did this in a blog post on Hearthstone a year ago: https://rapidminer.com/creative-use-hearthstone-cluster-analysis/ It also includes some Python scripts for charting, which you might not need.
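To make the idea in option 2 concrete outside RapidMiner, here is a minimal pure-Python sketch: run k-means for each k, score each run with the Davies-Bouldin index, and keep the k with the lowest score. Everything in it is an assumption for illustration, not the RapidMiner implementation: the function names (`farthest_first`, `kmeans`, `davies_bouldin`, `best_k_by_dbi`) are made up, the toy data is synthetic, and the deterministic farthest-first seeding is chosen only so that the result is reproducible.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding (an assumption for reproducibility): start at
    points[0], then keep adding the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iters=100):
    """Plain Lloyd iterations with Euclidean distance; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


def davies_bouldin(points, centroids, labels):
    """Textbook Davies-Bouldin index (lower is better).
    Assumes no two centroids coincide, otherwise the ratio divides by zero."""
    k = len(centroids)
    scatter = []
    for c in range(k):
        dists = [math.dist(p, centroids[c]) for p, lab in zip(points, labels) if lab == c]
        scatter.append(sum(dists) / len(dists) if dists else 0.0)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


def best_k_by_dbi(points, k_values):
    """Score every k and return (best k, {k: DBI})."""
    scores = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        scores[k] = davies_bouldin(points, centroids, labels)
    return min(scores, key=scores.get), scores


# Synthetic toy data: three tight 3-feature blobs, 27 points each.
offsets = (-0.5, 0.0, 0.5)
centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
toy = [
    (cx + dx, cy + dy, cz + dz)
    for cx, cy, cz in centers
    for dx in offsets
    for dy in offsets
    for dz in offsets
]

best, scores = best_k_by_dbi(toy, range(2, 7))
for k, dbi in scores.items():
    print(k, round(dbi, 4))
print("best k:", best)
```

On this toy data the lowest score lands at k = 3, matching the three blobs. Pure Python like this would be far too slow for a million records; it is only meant to show the shape of the loop.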

Best,

Martin

Telcontar120 · June 2017

I have had good results using the X-Means operator. It generally finds a sensible value of k in my experience.

yyhuang · June 2017

I have a process that optimizes clustering performance based on the DB index. It iterates over different values of k, and you then pick the best k for k-means.

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Ripley-Set" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Ripley-Set"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.5.001" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="313" y="34">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;20;19;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="k_means" compatibility="7.5.001" expanded="true" height="82" name="Clustering" width="90" x="112" y="34">
            <parameter key="k" value="20"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="7.5.001" expanded="true" height="103" name="Performance" width="90" x="447" y="34">
            <parameter key="main_criterion" value="Davies Bouldin"/>
            <parameter key="main_criterion_only" value="true"/>
          </operator>
          <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_port="performance"/>
          <connect from_op="Performance" from_port="example set" to_port="result 1"/>
          <connect from_op="Performance" from_port="cluster model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <description align="left" color="green" colored="true" height="173" resized="true" width="626" x="309" y="166">Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin. The Silhouette Index measures the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider that this is a good criterion, go for the silhouette index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good?. Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
        </process>
      </operator>
      <connect from_op="Ripley-Set" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 3"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>
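For reference, the Davies-Bouldin index used as the main criterion above is commonly defined as follows. This is the textbook form in my own notation ($C_i$ is cluster $i$, $c_i$ its centroid, $d$ the Euclidean distance); I am not claiming this is exactly how the Performance operator computes or signs the value it reports.

```latex
s_i = \frac{1}{|C_i|} \sum_{x \in C_i} d(x, c_i),
\qquad
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d(c_i, c_j)}
```

In this definition, smaller values indicate compact clusters that are far apart, so the best k is the one with the lowest index.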

Cheers,

YY

namachoco99 · July 2017

This one looks promising! Thanks for the suggestion!

I have two questions though:

1. Will this output some sort of table that displays the DBI value for each number of clusters k? I need to store all the results and create a graph from those values.

2. Additionally, do you have any idea how long this one runs? I found similar code elsewhere on this forum, but it takes somewhere between 6 and 12 hours per iteration, and my goal is to cover a range of 2-100 clusters, if possible.

Currently, my dataset is composed of about 1.9 million records with only 4 columns (a column label and 3 other columns which will be used for clustering, all normalized already).

Thanks again!

Telcontar120 · July 2017

In this thread (http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-reuse-preprocessing-results-in-a-range-of-k-means/m-p/40191) there is an example of a process that uses a loop to set parameters. If you run the sample process on the sonar data (as supplied), you can see that the output is a collection, where each element corresponds to the clustering for one of the k-values you supplied. If you want to do other things, such as calculating the DBI and storing it for each output, you'll need to add the appropriate operators from YY's process inside the loop. After that, you can use other operators to pull all that data together and append it into a single dataset from which you can graph the results.

As far as runtime is concerned, that can vary significantly depending on the hardware you are running on. 4 attributes is not a lot for clustering, but 1.9 million records is, so I am not surprised to hear it is taking a while. You might consider taking a smaller sample and doing your k-optimization on that, so that you only have to apply the single selected k-value to the entire dataset once.
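As a rough illustration of the sampling idea outside RapidMiner, here is a minimal Python sketch that draws a uniform random sample from a large CSV without loading the whole file, using reservoir sampling (Algorithm R). The function names (`reservoir_sample`, `sample_csv`), the column names, and the in-memory stand-in file are all assumptions for the example.

```python
import csv
import io
import random


def reservoir_sample(rows, n, seed=42):
    """Return n rows chosen uniformly at random from an iterable of unknown
    length (Algorithm R). If fewer than n rows exist, all are returned."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = row  # replace with probability n / (i + 1)
    return sample


def sample_csv(handle, n, seed=42):
    """Read the header, then sample n data rows from the rest of the file."""
    reader = csv.reader(handle)
    header = next(reader)
    return header, reservoir_sample(reader, n, seed)


# Stand-in for the real file: a label column plus 3 features (names are hypothetical).
fake_file = io.StringIO(
    "label,x,y,z\n" + "".join(f"r{i},{i},{i * 2},{i * 3}\n" for i in range(1000))
)
header, rows = sample_csv(fake_file, 100)
print(header, len(rows))
```

The sample can then be used for the k search, and the selected k applied once to the full dataset. For a real file you would pass an open file handle instead of the `io.StringIO` stand-in.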

bhupendra_patil · August 2017

You can also use the "Log" operator to see the results of every iteration.

Details are in this document:

http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Capture-intermediate-results-during-optimization/ta-p/32083

See the example below:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.6.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="85"/>
<operator activated="true" class="optimize_parameters_grid" compatibility="7.6.000" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="380" y="34">
<list key="parameters">
<parameter key="Clustering.k" value="[2.0;70;10;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="k_means" compatibility="7.6.000" expanded="true" height="82" name="Clustering" width="90" x="112" y="34">
<parameter key="k" value="70"/>
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<operator activated="true" class="cluster_distance_performance" compatibility="7.6.000" expanded="true" height="103" name="Performance" width="90" x="447" y="34">
<parameter key="main_criterion" value="Davies Bouldin"/>
<parameter key="main_criterion_only" value="true"/>
</operator>
<operator activated="true" class="log" compatibility="7.6.000" expanded="true" height="82" name="Log" width="90" x="715" y="85">
<list key="log">
<parameter key="DB" value="operator.Performance.value.DaviesBouldin"/>
<parameter key="k" value="operator.Clustering.parameter.k"/>
<parameter key="avgwithindistance" value="operator.Performance.value.avg_within_distance"/>
</list>
</operator>
<connect from_port="input 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
<connect from_op="Performance" from_port="performance" to_port="performance"/>
<connect from_op="Performance" from_port="example set" to_port="result 1"/>
<connect from_op="Performance" from_port="cluster model" to_op="Log" to_port="through 1"/>
<connect from_op="Log" from_port="through 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="left" color="green" colored="true" height="173" resized="true" width="626" x="309" y="166">Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin. The Silhouette Index measures the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider that this is a good criterion, go for the silhouette index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good?. Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 3"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
