RapidMiner

Dynamically determine number of clusters k-means

Contributor

I have a CSV file containing approximately a million records and 3 features that will be used to determine which cluster each record belongs to. I want to cluster these records with the k-Means algorithm (using Euclidean distance), and I'll use the Davies-Bouldin Index (DBI) to find the optimal number of clusters.

Is there a way to automate finding the optimal number of clusters by looping through the process, with the number of clusters k incrementing on each iteration? I'm new to RapidMiner, so I'm not yet sure how to implement this in the process XML.

 

Thanks for any help and suggestions!

6 REPLIES
RMStaff

Re: Dynamically determine number of clusters k-means

Hello,

 

There are two things you can do.

 

1. Use the X-Means operator. It runs k-means but internally uses a heuristic to determine k (as far as I know, the Bayesian Information Criterion rather than DB).

 

2. Put a loop around the clustering and run the algorithm with several values of k. You can then pick the best k. I did this in a blog post on Hearthstone a year ago: https://rapidminer.com/creative-use-hearthstone-cluster-analysis/ It also includes some Python scripts for charting, which you might not need.
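
For readers who want to see the idea of option 2 outside RapidMiner, here is a minimal pure-Python sketch of "run k-means for several k, score each run with the Davies-Bouldin index, keep the best". It is not the RapidMiner implementation: the function names, the deterministic farthest-first seeding, and the toy data are all assumptions made for illustration. It uses the standard DBI definition, where lower is better.

```python
import math
from itertools import product


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at points[0],
    then repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's k-means with Euclidean distance; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index: for each cluster take the worst
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j), then average."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    centroids = {c: tuple(sum(axis) / len(m) for axis in zip(*m)) for c, m in clusters.items()}
    scatter = {c: sum(math.dist(p, centroids[c]) for p in m) / len(m) for c, m in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


def best_k(points, k_values):
    """Run k-means once per k and return (k with the lowest DBI, {k: DBI})."""
    scores = {k: davies_bouldin(points, kmeans(points, k)) for k in k_values}
    return min(scores, key=scores.get), scores


# Toy data (illustration only): three tight groups of 8 points each, 3 features per point.
offsets = list(product((-0.5, 0.5), repeat=3))
centers = [(0, 0, 0), (100, 0, 0), (0, 100, 0)]
points = [tuple(c + o for c, o in zip(center, offset)) for center in centers for offset in offsets]

k_opt, scores = best_k(points, range(2, 7))
for k, score in scores.items():
    print(f"k={k}  DBI={score:.4f}")
print("best k:", k_opt)
```

On this toy data the loop selects k = 3, the number of groups that were generated. On real data the curve is rarely this clear-cut, so it is worth looking at the DBI values for all k rather than only the minimum.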

 

Best,

Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Elite III

Re: Dynamically determine number of clusters k-means

I have had good results using the X-Means operator.  It generally finds a sensible value of k in my experience.

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
RMStaff

Re: Dynamically determine number of clusters k-means

I have a process that optimizes clustering performance based on the DB index. It picks the best k for k-means after iterating over different values of k.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Ripley-Set" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Ripley-Set"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.5.001" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="313" y="34">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;20;19;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="k_means" compatibility="7.5.001" expanded="true" height="82" name="Clustering" width="90" x="112" y="34">
            <parameter key="k" value="20"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="7.5.001" expanded="true" height="103" name="Performance" width="90" x="447" y="34">
            <parameter key="main_criterion" value="Davies Bouldin"/>
            <parameter key="main_criterion_only" value="true"/>
          </operator>
          <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_port="performance"/>
          <connect from_op="Performance" from_port="example set" to_port="result 1"/>
          <connect from_op="Performance" from_port="cluster model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <description align="left" color="green" colored="true" height="173" resized="true" width="626" x="309" y="166">Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin. The Silhouette Index measure the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider that this is a good criterion, go for the silhouette index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good?. Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
        </process>
      </operator>
      <connect from_op="Ripley-Set" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 3"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Cheers,

YY

Contributor

Re: Dynamically determine number of clusters k-means

This one looks promising! Thanks for the suggestion!

I have two questions though:

1. Will this output some sort of table that displays the DBI value for each number of clusters k? I need to store all the results and create a graph from those values.

2. Additionally, do you have any idea how long this takes to run? I found similar code elsewhere in this forum, but it runs for somewhere between 6 and 12 hours per iteration, and my goal is a range of 2-100 clusters, if possible.

 

Currently, my dataset is composed of about 1.9 million records with only 4 columns (a label column and 3 other columns that will be used for clustering, all already normalized).

 

Thanks again!

Elite III

Re: Dynamically determine number of clusters k-means

In this thread (http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-reuse-preprocessing-results-in-a-r...) there is an example of a process that uses a loop to set parameters.  If you run the sample process on the sonar data (as supplied), you can see that the output is a collection in which each element corresponds to the clustering for one of the k-values you supplied.  If you want to do other things, like calculating the DBI and storing it for each output, you'll need to add the appropriate operators from YY's process inside the loop.  After that, you can use other operators to pull all that data together and append it into a single dataset that you can graph.
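
As a rough illustration of what "a single dataset where you can graph the results" could look like once it leaves RapidMiner, here is a small Python sketch that turns one DBI value per k into CSV text. The function name, the column names, and the numbers are hypothetical; the numbers are placeholders and not the output of any real run.

```python
import csv
import io


def results_to_csv(scores):
    """Serialize {k: DBI} into CSV text with one row per k, sorted by k."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["k", "davies_bouldin"])
    for k in sorted(scores):
        writer.writerow([k, f"{scores[k]:.4f}"])
    return buffer.getvalue()


# Placeholder numbers for illustration only; they are not results from any real run.
example_scores = {3: 0.41, 2: 0.75, 4: 0.58}
print(results_to_csv(example_scores), end="")
```

A two-column table like this can be opened in any charting tool, with k on the x-axis and the DBI on the y-axis.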

 

As far as runtime is concerned, that can vary significantly depending on the hardware you are running on.  4 attributes is not a lot for clustering, but 1.9 million records is, so I am not surprised to hear it is taking a while.  You might consider taking a smaller sample and doing your k-optimization on that, so you only have to apply the single selected k-value to the entire dataset once.
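
If you want to draw that sample outside RapidMiner, one possible approach is reservoir sampling, which picks a uniform random sample in a single pass without loading the whole CSV into memory. The sketch below is only an illustration under assumptions: the function name, the sample size of 500, and the in-memory stand-in for the file are all made up for the example.

```python
import csv
import io
import random


def reservoir_sample(rows, sample_size, seed=0):
    """Uniform random sample of `sample_size` rows from an iterable of unknown
    length, using one pass and O(sample_size) memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < sample_size:
            sample.append(row)  # fill the reservoir with the first rows
        else:
            # Keep the new row with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                sample[slot] = row
    return sample


# Stand-in for the real file: an in-memory CSV with a label column and three features.
# With a file on disk you would pass csv.reader(open(path, newline="")) instead,
# where `path` is your own file (skip the header row first if there is one).
fake_csv = io.StringIO("".join(f"row{i},{i % 7},{i % 11},{i % 13}\n" for i in range(10_000)))
sample = reservoir_sample(csv.reader(fake_csv), sample_size=500)
print(len(sample), "rows sampled")
```

Whether a k chosen on the sample carries over to the full dataset depends on how representative the sample is, so it is worth repeating the search on a second sample with a different seed as a sanity check.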

 

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts

Re: Dynamically determine number of clusters k-means

You can also use the "Log" operator to record the results of every iteration.

Details are in this document:

http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Capture-intermediate-results-dur...

 

See the example below:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="7.6.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="85"/>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.6.000" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="380" y="34">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;70;10;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="k_means" compatibility="7.6.000" expanded="true" height="82" name="Clustering" width="90" x="112" y="34">
            <parameter key="k" value="70"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="7.6.000" expanded="true" height="103" name="Performance" width="90" x="447" y="34">
            <parameter key="main_criterion" value="Davies Bouldin"/>
            <parameter key="main_criterion_only" value="true"/>
          </operator>
          <operator activated="true" class="log" compatibility="7.6.000" expanded="true" height="82" name="Log" width="90" x="715" y="85">
            <list key="log">
              <parameter key="DB" value="operator.Performance.value.DaviesBouldin"/>
              <parameter key="k" value="operator.Clustering.parameter.k"/>
              <parameter key="avgwithindistance" value="operator.Performance.value.avg_within_distance"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_port="performance"/>
          <connect from_op="Performance" from_port="example set" to_port="result 1"/>
          <connect from_op="Performance" from_port="cluster model" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <description align="left" color="green" colored="true" height="173" resized="true" width="626" x="309" y="166">Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin. The Silhouette Index measure the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider that this is a good criterion, go for the silhouette index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good?. Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 3"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>