Davies-Bouldin index

Shaimaa · March 2019

Hi, I am using the Davies-Bouldin index and got -2. When I changed the attributes, I got -4.
Which one is better, -2 or -4?

yyhuang · March 2019

Hi @shaimaa,

Great question! The D-B index is multiplied by -1 internally so that it can be maximized. This is something of a bug. You can ignore the negative sign in the performance output. So the clustering model with a D-B index of -2 (that is, 2) is better.

"clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm" -Wikipedia

The Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin.
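
To make "intra-cluster similarity and inter-cluster differences" concrete, here is a minimal sketch of the standard Davies-Bouldin definition in plain Python. It assumes Euclidean distance and uses the average point-to-centroid distance as the scatter; the function names and the toy data are made up for illustration and are not RapidMiner's implementation.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Lower is better, and the value is never negative.
    Assumes no two clusters share the same centroid.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(c) for c in clusters]
    # Scatter: average distance from each point to its own centroid.
    scatter = [
        sum(dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(len(clusters)):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)


# Toy data (invented for illustration): two tight, well-separated groups.
left = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
right = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]

good = davies_bouldin([left, right])
# Same points, but split so that both clusters straddle the gap.
bad = davies_bouldin([left[:2] + right[:2], left[2:] + right[2:]])
print(good < bad)
```

Compact, well-separated clusters give a small index, while clusters that overlap or sprawl give a large one, which is why the smallest value is preferred.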

My attached process runs an optimization to pick the best k for a k-means model; it returns k=3 as having the lowest D-B index. You can also try X-Means to get an optimized clustering.

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value="yhuang@rapidminer.com"/>
    <parameter key="process_duration_for_mail" value="1"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Ripley-Set" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Ripley-Set"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="279" y="34"/>
      <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.2.000" expanded="true" height="145" name="Optimize Parameters" width="90" x="514" y="34">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;20;19;linear]"/>
        </list>
        <parameter key="error_handling" value="fail on error"/>
        <parameter key="log_performance" value="true"/>
        <parameter key="log_all_criteria" value="false"/>
        <parameter key="synchronize" value="false"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="fast_k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" width="90" x="246" y="34">
            <parameter key="add_cluster_attribute" value="true"/>
            <parameter key="add_as_label" value="false"/>
            <parameter key="remove_unlabeled" value="false"/>
            <parameter key="k" value="2"/>
            <parameter key="determine_good_start_values" value="false"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
            <parameter key="max_runs" value="10"/>
            <parameter key="max_optimization_steps" value="100"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="9.2.000" expanded="true" height="103" name="Performance" width="90" x="648" y="34">
            <parameter key="main_criterion" value="Davies Bouldin"/>
            <parameter key="main_criterion_only" value="true"/>
            <parameter key="normalize" value="false"/>
            <parameter key="maximize" value="false"/>
          </operator>
          <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_port="performance"/>
          <connect from_op="Performance" from_port="example set" to_port="output 1"/>
          <connect from_op="Performance" from_port="cluster model" to_port="model"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <description align="left" color="green" colored="true" height="173" resized="false" width="626" x="109" y="164">Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin. The Silhouette Index measures the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider that this is a good criterion, go for the silhouette index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good? Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
        </process>
        <description align="center" color="transparent" colored="false" width="126">figure out the best k for k-means</description>
      </operator>
      <operator activated="true" class="x_means" compatibility="9.0.000" expanded="true" height="82" name="X-Means" width="90" x="514" y="289">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k_min" value="2"/>
        <parameter key="k_max" value="10"/>
        <parameter key="determine_good_start_values" value="false"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="clustering_algorithm" value="KMeans"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <description align="center" color="transparent" colored="false" width="126">run x-means for an optimized clustering</description>
      </operator>
      <connect from_op="Ripley-Set" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Optimize Parameters" to_port="input 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
      <connect from_op="Optimize Parameters" from_port="parameter set" to_port="result 1"/>
      <connect from_op="Optimize Parameters" from_port="output 1" to_port="result 2"/>
      <connect from_op="X-Means" from_port="clustered set" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="42"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="189"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

YY

Shaimaa · March 2019

Hi @yyhuang
Thanks for the reply.
But I saw comments on another post here asking the same question, and they got a different reply: we should take the minimum, and if the value is maximized (with the multiplication by -1 removed) we should take the greatest number. This is what confuses me.
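
The two rules can be reconciled once the sign flip is taken into account: "take the minimum" applies to the conventional, non-negative index, while "take the greatest" applies to the negated value that the performance output shows. A small sketch using the two numbers from this thread (the dictionary keys are labels chosen here for illustration):

```python
# Values as reported in the performance output (sign flipped).
reported = {"original attributes": -2.0, "changed attributes": -4.0}

# Undo the sign flip to recover the conventional Davies-Bouldin index.
db_index = {run: -value for run, value in reported.items()}

# Rule 1: on the conventional index, the smallest value wins.
best_by_index = min(db_index, key=db_index.get)

# Rule 2: on the raw (negated) output, the greatest value wins.
best_by_output = max(reported, key=reported.get)

print(best_by_index, best_by_output)
```

Both rules select the same run, the one reported as -2.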

SGolbert · March 2019

The -1 appears in several operators that are based on distances. It's quite annoying!

Telcontar120 · March 2019

Agreed, it would be very nice to convert these types of measures back to their "standard form" so that when we share output from RapidMiner it is comparable to the way the rest of the world expects them to work.

IngoRM · March 2019

Well, in fact that is what is supposed to happen anyway. We have a mechanism in all those performance criteria to show the value and also to deliver a fitness (which is always to be maximized, independent of what value is shown). Unfortunately, some of the criteria (or their developers ;-) are a bit lazy and do not implement this behavior correctly; they simply return a negative value for both. You can actually help us by pointing out those cases. The DB index is one; are there any others you have noticed and remember off the top of your head?

Thanks,
Ingo
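
A minimal sketch of the value/fitness split described above, assuming a hypothetical `Criterion` class. This is not RapidMiner's actual API; the class, its fields, and the run labels are invented for illustration. The displayed value stays in the measure's standard form, and only the fitness is negated for measures where lower is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical performance criterion; not RapidMiner's actual API."""

    name: str
    value: float           # shown to the user, in the measure's standard form
    lower_is_better: bool  # True for measures such as Davies-Bouldin

    @property
    def fitness(self) -> float:
        """Score handed to optimizers; always to be maximized."""
        return -self.value if self.lower_is_better else self.value


# The two results from this thread, stored with their conventional sign.
runs = {
    "original attributes": Criterion("Davies Bouldin", 2.0, lower_is_better=True),
    "changed attributes": Criterion("Davies Bouldin", 4.0, lower_is_better=True),
}

# An optimizer only ever compares fitness values.
best = max(runs, key=lambda label: runs[label].fitness)
print(best, runs[best].value)
```

With this separation, an optimizer can always maximize the fitness while the reported value remains comparable with other tools.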

Altair RapidMiner Community

Declined · Last Updated October 2019