kMeans with Davies Bouldin Index as infinity?

namachoco99 · July 2017

Hi guys! I'm currently processing a dataset of 1.9 million records (all necessary values are normalized) using kMeans. The output of the process is supposed to be the DBI for the given number of clusters k. My problem, however, is that after a lengthy run I get DBI values of infinity.

Is there an explanation as to why this occurs and what a possible solution/fix for this could be?

Thanks!

Telcontar120 · July 2017

I think this is related to your other post about k-means as well. As a general principle when dealing with datasets this large, I strongly suggest starting on a much smaller sample of your records, which will allow you to work out the bugs in the process and also check the results to see whether they are consistent with your expectations, without having to wait hours to get the output. Once you have a process that you are happy with, you can then attempt to scale it up to your much larger dataset. I don't have any insight specifically related to the question of the DBI value going to infinity, though.
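
For anyone who wants to try the "start with a much smaller sample" advice outside RapidMiner, here is a minimal sketch in Python. It assumes the records can be read one at a time, and it uses reservoir sampling so the full 1.9 million rows never need to sit in memory. The function name `reservoir_sample` and its parameters are made up for illustration and are not part of any RapidMiner API.

```python
import random


def reservoir_sample(records, n, seed=0):
    """Return up to n records drawn uniformly at random from any iterable.

    Hypothetical helper for illustration (classic reservoir sampling).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    sample = []
    for index, record in enumerate(records):
        if index < n:
            # Fill the reservoir with the first n records.
            sample.append(record)
        else:
            # Keep the new record with probability n / (index + 1),
            # overwriting a uniformly chosen existing entry.
            slot = rng.randint(0, index)
            if slot < n:
                sample[slot] = record
    return sample
```

Calling `reservoir_sample(rows, 10000)` on an iterator over the full dataset would give a working set small enough to debug the clustering process quickly. The same seed can be reused later to check that results on the sample are reproducible.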

Thomas_Ott · July 2017

Here's a great RapidMiner helper to gauge how fast, slow, or memory-intensive something is. Just visit http://mod.rapidminer.com/#app

sangeet · August 2017

Any updates on this one, team?

Why does the DBI come up as infinity, and why does the average within centroid distance for cluster_x come up as UNKNOWN?

sgenzer · August 2017

hi @sangeet - could you post your process XML and dataset (or a sample) so we can take a look?

Scott

sangeet · November 2017

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve TrainingDataset" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Hareesh/Data/TrainingDataset"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
<parameter key="condition_class" value="no_missing_attributes"/>
<list key="filters_list"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Description"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="82" name="Multiply" width="90" x="581" y="238"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="238">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="10.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="581" y="34"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="581" y="136"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="store" compatibility="7.6.001" expanded="true" height="68" name="Store" width="90" x="782" y="340">
<parameter key="repository_entry" value="Data/TFIDF"/>
</operator>
<operator activated="true" class="loop_parameters" compatibility="7.6.001" expanded="true" height="103" name="Loop Parameters" width="90" x="916" y="238">
<list key="parameters">
<parameter key="KMeans.k" value="[50;200;10;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="k_means" compatibility="7.6.001" expanded="true" height="82" name="KMeans" width="90" x="45" y="34">
<parameter key="k" value="200"/>
<parameter key="max_optimization_steps" value="50"/>
</operator>
<operator activated="true" class="cluster_distance_performance" compatibility="7.6.001" expanded="true" height="103" name="Evaluation" width="90" x="179" y="34"/>
<operator activated="true" class="log" compatibility="7.6.001" expanded="true" height="124" name="Log" width="90" x="313" y="34">
<list key="log">
<parameter key="k" value="operator.KMeans.parameter.k"/>
<parameter key="DB" value="operator.Evaluation.value.DaviesBouldin"/>
<parameter key="w" value="operator.Evaluation.value.avg_within_distance"/>
</list>
</operator>
<connect from_port="input 1" to_op="KMeans" to_port="example set"/>
<connect from_op="KMeans" from_port="cluster model" to_op="Evaluation" to_port="cluster model"/>
<connect from_op="KMeans" from_port="clustered set" to_op="Evaluation" to_port="example set"/>
<connect from_op="Evaluation" from_port="performance" to_op="Log" to_port="through 1"/>
<connect from_op="Evaluation" from_port="example set" to_op="Log" to_port="through 2"/>
<connect from_op="Log" from_port="through 1" to_port="performance"/>
<connect from_op="Log" from_port="through 2" to_port="result 1"/>
<connect from_op="Log" from_port="through 3" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve TrainingDataset" from_port="output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_op="Loop Parameters" to_port="input 1"/>
<connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

sangeet · November 2017

Here is the XML. Team, I really want to understand why I get a DBI of negative infinity when clustering text data.

sgenzer · November 2017

Hello @sangeet - with the caveat that I do not have a PhD in data science, I will share some thoughts on this interesting problem. From what I understand, the DBI is a measure of cluster density where lower is better. So an infinitely low DBI would imply, to me, that you have a "perfect fit" of some kind. I played around with a text mining example (see attached) and can replicate an infinite DBI by creating an example set of 40 documents consisting of 20 distinct examples, each one repeated once. Hence if I "cluster" with k=20, I should (?) get a perfect fit. And sure enough, the DBI goes to infinity with k-medoids and just crashes with k-means (not a good thing, actually - it should do something).
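
To make the arithmetic behind this concrete, here is a minimal sketch of the textbook Davies-Bouldin index in plain Python. It assumes the usual definition (scatter is the mean Euclidean distance of members to their centroid, separation is the Euclidean distance between centroids); RapidMiner's own implementation and sign convention may well differ, so treat this as an illustration of where a division by zero can appear, not as a description of the operator. The function names and the toy points are made up for the example.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def davies_bouldin(clusters):
    """Textbook Davies-Bouldin index for a list of clusters (lists of points).

    Hypothetical helper for illustration; not RapidMiner's implementation.
    """
    centroids = [centroid(points) for points in clusters]
    # S_i: mean distance from each member of cluster i to its centroid.
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    total = 0.0
    for i in range(len(clusters)):
        worst = 0.0
        for j in range(len(clusters)):
            if i == j:
                continue
            separation = math.dist(centroids[i], centroids[j])
            combined = scatter[i] + scatter[j]
            if separation == 0.0:
                if combined == 0.0:
                    return math.nan  # 0 / 0: the index is undefined
                ratio = math.inf  # coinciding centroids, non-zero scatter
            else:
                ratio = combined / separation
            worst = max(worst, ratio)
        total += worst
    return total / len(clusters)


# Each document duplicated once, one cluster per distinct document.
duplicates = [[(0, 0), (0, 0)], [(1, 1), (1, 1)]]
# Two different clusters whose centroids both land on the origin.
same_centroid = [[(-1, 0), (1, 0)], [(0, -1), (0, 1)]]

print(davies_bouldin(duplicates))
print(davies_bouldin(same_centroid))
```

Under this definition the duplicated-documents case gives 0.0, because every cluster has zero scatter, while two clusters with the same centroid give `inf`, because the separation in the denominator is zero. If both scatter and separation are zero the ratio is 0/0 and the sketch returns NaN. So with this formula an infinite value points to centroids that coincide; whether that is what happens inside RapidMiner here is an assumption that would need checking against the actual cluster model.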

In general I like to set up small examples like this when I do not understand a problem. Standard engineering: make a process simpler until you understand it, then make it more complex. I guess that's why I'm an engineer and not a theorist.

Scott
