kMeans with Davies Bouldin Index as infinity?

namachoco99 Member Posts: 3 Contributor I
edited December 2018 in Help

Hi guys! I'm currently clustering a dataset of 1.9 million records (all necessary values are normalized) with k-means. The process is supposed to output the Davies-Bouldin Index (DBI) for each value of k. My problem, however, is that after a lengthy run I get DBI values of infinity.

 

Is there an explanation as to why this occurs and what a possible solution/fix for this could be?

 

Thanks!

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I think this is related to your other post about k-means as well.  As a general principle when dealing with datasets this large, I strongly suggest starting on a much smaller sample of your records, which will allow you to work out the bugs in the process and also check the results to see whether they are consistent with your expectations, without having to wait hours to get the output.  Once you have a process that you are happy with, you can then attempt to scale it up to your much larger dataset.  I don't have any insight specifically related to the question of the DBI value going to infinity, though.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
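
    Brian's prototype-on-a-sample advice can be sketched outside RapidMiner as well. The snippet below is a hypothetical stand-in using scikit-learn, with an invented random dataset and arbitrary sizes chosen purely for illustration:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200_000, 8))  # stand-in for the full dataset

    # Debug the workflow on a small random sample first...
    sample = X[rng.choice(len(X), size=10_000, replace=False)]
    km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(sample)
    print(davies_bouldin_score(sample, km.labels_))

    # ...and only scale up to all rows once the results look sane.
    ```

    Iterating on 10,000 rows takes seconds instead of hours, so sanity checks (finite DBI, plausible cluster sizes) come cheap before committing to the full 1.9 million records.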
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Here's a great RapidMiner helper to gauge how fast/slow/memory-intensive something is. Just visit http://mod.rapidminer.com/#app

  • sangeet Member Posts: 10 Contributor I

    Any updates on this one, team?

     

    Why does the DBI come up as infinity, and why does the average within centroid distance for cluster_x come up as UNKNOWN?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @sangeet - could you post your process XML and dataset (or a sample) so we can take a look?


    Scott

  • sangeet Member Posts: 10 Contributor I

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve TrainingDataset" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Hareesh/Data/TrainingDataset"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
    <parameter key="condition_class" value="no_missing_attributes"/>
    <list key="filters_list"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Description"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="82" name="Multiply" width="90" x="581" y="238"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="238">
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="10.0"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
    <operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="581" y="34"/>
    <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="581" y="136"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
    <connect from_op="Stem (Porter)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
    <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="store" compatibility="7.6.001" expanded="true" height="68" name="Store" width="90" x="782" y="340">
    <parameter key="repository_entry" value="Data/TFIDF"/>
    </operator>
    <operator activated="true" class="loop_parameters" compatibility="7.6.001" expanded="true" height="103" name="Loop Parameters" width="90" x="916" y="238">
    <list key="parameters">
    <parameter key="KMeans.k" value="[50;200;10;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="k_means" compatibility="7.6.001" expanded="true" height="82" name="KMeans" width="90" x="45" y="34">
    <parameter key="k" value="200"/>
    <parameter key="max_optimization_steps" value="50"/>
    </operator>
    <operator activated="true" class="cluster_distance_performance" compatibility="7.6.001" expanded="true" height="103" name="Evaluation" width="90" x="179" y="34"/>
    <operator activated="true" class="log" compatibility="7.6.001" expanded="true" height="124" name="Log" width="90" x="313" y="34">
    <list key="log">
    <parameter key="k" value="operator.KMeans.parameter.k"/>
    <parameter key="DB" value="operator.Evaluation.value.DaviesBouldin"/>
    <parameter key="w" value="operator.Evaluation.value.avg_within_distance"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="KMeans" to_port="example set"/>
    <connect from_op="KMeans" from_port="cluster model" to_op="Evaluation" to_port="cluster model"/>
    <connect from_op="KMeans" from_port="clustered set" to_op="Evaluation" to_port="example set"/>
    <connect from_op="Evaluation" from_port="performance" to_op="Log" to_port="through 1"/>
    <connect from_op="Evaluation" from_port="example set" to_op="Log" to_port="through 2"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <connect from_op="Log" from_port="through 2" to_port="result 1"/>
    <connect from_op="Log" from_port="through 3" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve TrainingDataset" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Store" to_port="input"/>
    <connect from_op="Store" from_port="through" to_op="Loop Parameters" to_port="input 1"/>
    <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

  • sangeet Member Posts: 10 Contributor I

    Here is the XML. Team, I really want to understand why I encounter a negative-infinity Davies-Bouldin Index when clustering text data.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @sangeet - with the caveat that I do not have a PhD in data science, I will share some thoughts on this interesting problem.  From what I understand, the DBI is a measure of cluster density where lower is better.  So an infinitely low DBI would imply, to me, that you have a "perfect fit" of some kind.  I played around with a text mining example (see attached) and can replicate an infinite DBI by creating an example set of 40 documents that are really just 20 examples - each one repeated once.  Hence if I "cluster" with k=20, I should (?) get a perfect fit.  And sure enough, the DBI goes to infinity with k-medoids and just crashes with k-means (not a good thing, actually - it should do something).

     

    In general I like to set up small examples like this when I do not understand a problem.  Standard engineering: make a process simpler until you understand it, then make it more complex.  I guess that's why I'm an engineer and not a theorist.  :)

     

    Scott
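
    The arithmetic behind Scott's observation can be seen directly in the standard DBI formula: the index divides summed within-cluster scatters by the distance between cluster centroids, so if the clusterer ever puts two identical examples (like Scott's duplicated documents) into different clusters, those centroids coincide, the denominator is zero, and the index blows up. A minimal sketch in plain NumPy (this is the textbook formula, not necessarily RapidMiner's exact implementation):

    ```python
    import numpy as np

    def davies_bouldin(X, labels):
        """Davies-Bouldin index: mean over clusters of the worst
        (scatter_i + scatter_j) / centroid_distance_ij ratio."""
        ks = np.unique(labels)
        centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
        # S_i: average distance of cluster members to their own centroid
        scatter = np.array([
            np.linalg.norm(X[labels == k] - c, axis=1).mean()
            for k, c in zip(ks, centroids)
        ])
        with np.errstate(divide="ignore"):
            worst = []
            for i in range(len(ks)):
                ratios = [
                    (scatter[i] + scatter[j])
                    / np.linalg.norm(centroids[i] - centroids[j])
                    for j in range(len(ks)) if j != i
                ]
                worst.append(max(ratios))
        return float(np.mean(worst))

    # Two clusters whose centroids coincide -> zero denominator -> infinity
    X = np.array([[0.0, 0.0], [2.0, 2.0], [0.0, 0.0], [2.0, 2.0]])
    labels = np.array([0, 0, 1, 1])
    print(davies_bouldin(X, labels))  # inf
    ```

    Sparse TF-IDF vectors make this failure mode likely in practice: with k=200 on heavily pruned text features, some clusters can easily land on identical (or near-identical) centroids.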

     

     
