Dynamically determine number of clusters k-means
namachoco99
Member Posts: 3 Contributor I
I have a CSV file with approximately one million records and 3 features, which will be used to determine the cluster each record belongs to. I want to cluster these records with the k-Means algorithm (using Euclidean distance) and use the Davies-Bouldin Index (DBI) to find the optimal number of clusters.
Is there a way to automate finding the optimal number of clusters by looping through the process, incrementing the number of clusters k on each iteration? I'm new to RapidMiner, so I'm not yet sure how to implement this as an XML process.
Thanks in advance for any help or suggestions!
Answers
Hello,
there are two things you can do.
1. Use the X-Means operator. It runs k-means and internally uses a heuristic to determine k (as far as I know, it is based on the Bayesian Information Criterion rather than on DB).
2. Put a loop around k-means and run the algorithm with several values of k. You can then pick the best k. I did this in a blog post on Hearthstone a year ago: https://rapidminer.com/creative-use-hearthstone-cluster-analysis/ It also includes some Python scripts for charting, which you might not need.
Best,
Martin
Dortmund, Germany
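The loop idea from option 2 can also be sketched outside RapidMiner. The following is a minimal illustration in plain Python, not the process from the blog post: `kmeans`, `davies_bouldin` and `dbi_by_k` are hypothetical helper names, the k-means is a bare Lloyd's iteration with random initialization, and the demo data is synthetic. It assumes Euclidean distance and that no two centroids coincide.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as starting centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


def davies_bouldin(points, centroids, labels):
    """Davies-Bouldin index (lower is better); assumes distinct centroids."""
    k = len(centroids)
    # Scatter of a cluster: mean distance of its members to its centroid.
    scatter = []
    for c in range(k):
        dists = [math.dist(p, centroids[c]) for p, lab in zip(points, labels) if lab == c]
        scatter.append(sum(dists) / len(dists) if dists else 0.0)
    # For each cluster, take its worst (largest) ratio against any other cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


def dbi_by_k(points, k_values, seed=0):
    """Run k-means once per k and return a list of (k, DBI) rows."""
    rows = []
    for k in k_values:
        centroids, labels = kmeans(points, k, seed=seed)
        rows.append((k, davies_bouldin(points, centroids, labels)))
    return rows


if __name__ == "__main__":
    # Synthetic demo data (an assumption for illustration): three 3-D blobs.
    rng = random.Random(42)
    centers = [(0, 0, 0), (10, 10, 10), (-10, 10, 0)]
    points = [
        tuple(c + rng.gauss(0, 1) for c in center)
        for center in centers
        for _ in range(100)
    ]
    table = dbi_by_k(points, range(2, 7))
    for k, dbi in table:
        print(k, round(dbi, 3))
    print("best k:", min(table, key=lambda row: row[1])[0])
```

Because k-means depends on its random start, a single run per k can land in a poor local optimum. Repeating each k with several seeds and keeping the best DBI would be a reasonable refinement.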
I have had good results using the X-Means operator. It generally finds a sensible value of k in my experience.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
I have a process that optimizes clustering performance based on the Davies-Bouldin index. It iterates over different values of k so that you can pick the best k for k-means.
Cheers,
YY
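For reference, the Davies-Bouldin index is commonly defined as below. The notation is an assumption made for this note and is not taken from the process: $k$ is the number of clusters, $C_i$ the set of points in cluster $i$, $c_i$ its centroid, and $d$ the Euclidean distance.

$$
S_i = \frac{1}{|C_i|} \sum_{x \in C_i} d(x, c_i),
\qquad
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{d(c_i, c_j)}
$$

Smaller values indicate compact, well-separated clusters, so under this definition the best k is the one with the lowest index.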
This one looks promising! Thanks for the suggestion!
I have two questions though:
1. Will this output a table that shows the DBI value for each number of clusters k? I need to store all the results and create a graph from those values.
2. Do you have any idea how long it takes to run? I found a similar process elsewhere in this forum, but it takes 6-12 hours per iteration, and my goal is a range of 2-100 clusters, if possible.
Currently, my dataset has about 1.9 million records with only 4 columns (a label column and 3 other columns used for clustering, all already normalized).
Thanks again!
In this thread (http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-reuse-preprocessing-results-in-a-range-of-k-means/m-p/40191) there is an example of a process that uses a loop to set parameters. If you run the sample process on the sonar data (as supplied), you can see that the output is a collection in which each element corresponds to the clustering for one of the k-values you supplied. If you want to do other things, like calculating the DBI and storing it for each output, you'll need to add the appropriate operators from YY's process inside the loop. After that, you can use other operators to pull all that data together and append it into a single dataset that you can graph.
As for runtime, it can vary significantly depending on the hardware you are running on. 4 attributes is not a lot for clustering, but 1.9 million records is, so I am not surprised to hear it is taking a while. You might consider taking a smaller sample and doing your k-optimization on that, so that you only have to apply the single selected k-value to the entire dataset once.
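If the sampling is done outside RapidMiner before import, one way to draw a uniform random sample from a large CSV without loading it all into memory is reservoir sampling. This is only a sketch: `reservoir_sample` and `sample_csv` are hypothetical helper names, and it assumes the file has a header row.

```python
import csv
import random


def reservoir_sample(rows, n, seed=0):
    """Return up to n rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n rows.
            sample.append(row)
        else:
            # Keep the new row with probability n / (i + 1) by overwriting a random slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = row
    return sample


def sample_csv(path, n, seed=0):
    """Read a CSV with a header row and return (header, n sampled data rows)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, reservoir_sample(reader, n, seed)
```

For example, `header, rows = sample_csv("records.csv", 100_000)` would keep 100,000 rows. Both the file name and the sample size are placeholders.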
You can also use the "Log" operator to see the results of every iteration. Details are in this document:
http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Capture-intermediate-results-during-optimization/ta-p/32083
See the example below.
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.6.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="85"/>
<operator activated="true" class="optimize_parameters_grid" compatibility="7.6.000" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="380" y="34">
<list key="parameters">
<parameter key="Clustering.k" value="[2.0;70;10;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="k_means" compatibility="7.6.000" expanded="true" height="82" name="Clustering" width="90" x="112" y="34">
<parameter key="k" value="70"/>
<parameter key="measure_types" value="NumericalMeasures"/>
</operator>
<operator activated="true" class="cluster_distance_performance" compatibility="7.6.000" expanded="true" height="103" name="Performance" width="90" x="447" y="34">
<parameter key="main_criterion" value="Davies Bouldin"/>
<parameter key="main_criterion_only" value="true"/>
</operator>
<operator activated="true" class="log" compatibility="7.6.000" expanded="true" height="82" name="Log" width="90" x="715" y="85">
<list key="log">
<parameter key="DB" value="operator.Performance.value.DaviesBouldin"/>
<parameter key="k" value="operator.Clustering.parameter.k"/>
<parameter key="avgwithindistance" value="operator.Performance.value.avg_within_distance"/>
</list>
</operator>
<connect from_port="input 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
<connect from_op="Performance" from_port="performance" to_port="performance"/>
<connect from_op="Performance" from_port="example set" to_port="result 1"/>
<connect from_op="Performance" from_port="cluster model" to_op="Log" to_port="through 1"/>
<connect from_op="Log" from_port="through 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="left" color="green" colored="true" height="173" resized="true" width="626" x="309" y="166">The Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin Index. The Silhouette Index measures the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider this a good criterion, go for the Silhouette Index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good? Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 3"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>