Davies-Bouldin index

ShaimaaShaimaa Member Posts: 2 Learner I
Hi, I am using the Davies-Bouldin index and got -2. When I changed the attributes, I got -4.
Which one is better: -2 or -4?

Declined · Last Updated

No activity or votes since March 2019. Please comment and cc sgenzer if this should be reopened. RM-3972

Comments

  • yyhuangyyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @shaimaa,

    Great question! The D-B index is multiplied by -1 internally so that it can be maximized. It is kind of a bug. You can ignore the negative sign in the performance output. So the clustering model with a D-B index of -2 is better: its actual index is 2, which is lower than 4.

     "clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm" -Wikipedia

     The Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin.
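For reference, the usual textbook definition of the index is written out below. The symbols are my own notation, not RapidMiner's: k is the number of clusters, c_i is the centroid of cluster i, S_i is the average distance from the points of cluster i to c_i, and d(c_i, c_j) is the distance between two centroids.

```latex
\[
  \mathrm{DB} \;=\; \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
  \frac{S_i + S_j}{d(c_i, c_j)}
\]
```

Every term is a ratio of distances, so the index itself can never be negative, and tight, well-separated clusters push it towards 0. A reported -2 or -4 therefore corresponds to an actual index of 2 or 4.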

     My attached process runs a grid optimization to pick the best k for a k-means model; it returns k=3 as the value with the lowest D-B index. You can also try X-Means to get an optimized clustering.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value="yhuang@rapidminer.com"/>
        <parameter key="process_duration_for_mail" value="1"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Ripley-Set" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Ripley-Set"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="279" y="34"/>
          <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.2.000" expanded="true" height="145" name="Optimize Parameters" width="90" x="514" y="34">
            <list key="parameters">
              <parameter key="Clustering.k" value="[2.0;20;19;linear]"/>
            </list>
            <parameter key="error_handling" value="fail on error"/>
            <parameter key="log_performance" value="true"/>
            <parameter key="log_all_criteria" value="false"/>
            <parameter key="synchronize" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="fast_k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" width="90" x="246" y="34">
                <parameter key="add_cluster_attribute" value="true"/>
                <parameter key="add_as_label" value="false"/>
                <parameter key="remove_unlabeled" value="false"/>
                <parameter key="k" value="2"/>
                <parameter key="determine_good_start_values" value="false"/>
                <parameter key="measure_types" value="NumericalMeasures"/>
                <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
                <parameter key="nominal_measure" value="NominalDistance"/>
                <parameter key="numerical_measure" value="EuclideanDistance"/>
                <parameter key="divergence" value="GeneralizedIDivergence"/>
                <parameter key="kernel_type" value="radial"/>
                <parameter key="kernel_gamma" value="1.0"/>
                <parameter key="kernel_sigma1" value="1.0"/>
                <parameter key="kernel_sigma2" value="0.0"/>
                <parameter key="kernel_sigma3" value="2.0"/>
                <parameter key="kernel_degree" value="3.0"/>
                <parameter key="kernel_shift" value="1.0"/>
                <parameter key="kernel_a" value="1.0"/>
                <parameter key="kernel_b" value="0.0"/>
                <parameter key="max_runs" value="10"/>
                <parameter key="max_optimization_steps" value="100"/>
                <parameter key="use_local_random_seed" value="false"/>
                <parameter key="local_random_seed" value="1992"/>
              </operator>
              <operator activated="true" class="cluster_distance_performance" compatibility="9.2.000" expanded="true" height="103" name="Performance" width="90" x="648" y="34">
                <parameter key="main_criterion" value="Davies Bouldin"/>
                <parameter key="main_criterion_only" value="true"/>
                <parameter key="normalize" value="false"/>
                <parameter key="maximize" value="false"/>
              </operator>
              <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
              <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
              <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
              <connect from_op="Performance" from_port="performance" to_port="performance"/>
              <connect from_op="Performance" from_port="example set" to_port="output 1"/>
              <connect from_op="Performance" from_port="cluster model" to_port="model"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
              <description align="left" color="green" colored="true" height="173" resized="false" width="626" x="109" y="164">Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin. The Silhouette Index measures the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. If you consider that this is a good criterion, go for the silhouette index.&lt;br&gt;&lt;br&gt;How can we say that a clustering quality measure is good? Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
            </process>
            <description align="center" color="transparent" colored="false" width="126">figure out the best k for k-means</description>
          </operator>
          <operator activated="true" class="x_means" compatibility="9.0.000" expanded="true" height="82" name="X-Means" width="90" x="514" y="289">
            <parameter key="add_cluster_attribute" value="true"/>
            <parameter key="add_as_label" value="false"/>
            <parameter key="remove_unlabeled" value="false"/>
            <parameter key="k_min" value="2"/>
            <parameter key="k_max" value="10"/>
            <parameter key="determine_good_start_values" value="false"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
            <parameter key="clustering_algorithm" value="KMeans"/>
            <parameter key="max_runs" value="10"/>
            <parameter key="max_optimization_steps" value="100"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <description align="center" color="transparent" colored="false" width="126">run x-means for an optimized clustering</description>
          </operator>
          <connect from_op="Ripley-Set" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Optimize Parameters" to_port="input 1"/>
          <connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
          <connect from_op="Optimize Parameters" from_port="parameter set" to_port="result 1"/>
          <connect from_op="Optimize Parameters" from_port="output 1" to_port="result 2"/>
          <connect from_op="X-Means" from_port="clustered set" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="42"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="189"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
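For readers without RapidMiner at hand, the idea behind the process above (try several values of k, score each clustering with the Davies-Bouldin index, keep the k with the lowest score) can be sketched in plain Python. This is only an illustration under stated assumptions, not what the RapidMiner operators do internally: the toy data, the deterministic farthest-first seeding, and all function names (`k_means`, `davies_bouldin`, `best_k`) are made up for this example.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def mean_point(pts):
    """Centroid of a non-empty list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def k_means(points, k, max_steps=100):
    """Plain Lloyd's k-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_steps):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[j] = mean_point(members)
    return labels

def davies_bouldin(points, labels):
    """Standard (non-negative) Davies-Bouldin index; lower is better."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    cents = {l: mean_point(m) for l, m in clusters.items()}
    scatter = {l: sum(dist(p, cents[l]) for p in m) / len(m) for l, m in clusters.items()}
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j]) for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)

def best_k(points, k_values):
    """Grid search: return the k with the lowest Davies-Bouldin index, plus all scores."""
    scores = {k: davies_bouldin(points, k_means(points, k)) for k in k_values}
    return min(scores, key=scores.get), scores

# Toy data (hypothetical): three tight blobs of four points each.
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers
          for dx in (-0.5, 0.5) for dy in (-0.5, 0.5)]

k_star, scores = best_k(points, range(2, 6))
print("best k:", k_star)
print({k: round(v, 3) for k, v in scores.items()})
```

Note that the sketch minimizes the raw index. If a tool reports the negated value instead, the best model is the one whose reported number is closest to zero.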
    


    YY

  • ShaimaaShaimaa Member Posts: 2 Learner I
    Hi @yyhuang
    Thanks for the reply
    But I saw other comments here on another post asking the same question, and it got a different reply: we should take the minimum, and if it is maximized (removing the multiplication by -1) we should take the greatest number. This is what confuses me.
  • SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    The -1 appears in several operators that are based on distances. It's quite annoying!
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Agreed, it would be very nice to convert these types of measures back to their "standard form" so when we share output from RapidMiner it is comparable to the way the rest of the world expects them to work :smiley:
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Well, in fact that is what is supposed to happen anyway. We have a mechanism in all those performance criteria to show the value and also to deliver a fitness (which is always to be maximized, independent of what value is shown). Unfortunately, some of the criteria (or their developers ;-) are a bit lazy and do not correctly implement this behavior and simply return a negative value for both instead... You can actually help us by pointing out those cases. The DB index is one; are there any others you have noticed and remember off the top of your head?
    Thanks,
    Ingo
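The split between a displayed value and a fitness that is always maximized can be sketched roughly as follows. This is not RapidMiner's actual class design: `Criterion`, `lower_is_better` and `recover_value` are hypothetical names chosen for illustration, and the only facts taken from the thread are that the fitness is always maximized and that the D-B index currently shows up negated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    """Hypothetical performance criterion: a displayed value plus a fitness to maximize."""
    name: str
    value: float           # what the user should see, in the measure's standard form
    lower_is_better: bool  # True for Davies-Bouldin, False for e.g. accuracy

    @property
    def fitness(self) -> float:
        # Optimizers always maximize fitness, so the sign is flipped only here
        # and never leaks into the displayed value.
        return -self.value if self.lower_is_better else self.value

def recover_value(reported: float) -> float:
    """Undo the sign flip for a criterion that shows its fitness as its value.

    Only valid for measures that are non-negative by definition, such as D-B.
    """
    return abs(reported)

# The two results from the original question, reported as -2 and -4.
a = Criterion("Davies Bouldin", recover_value(-2.0), lower_is_better=True)
b = Criterion("Davies Bouldin", recover_value(-4.0), lower_is_better=True)

best = max([a, b], key=lambda c: c.fitness)
print(best.value)  # 2.0
```

With this separation, the model selection still maximizes one number, while the value shown to the user stays comparable with what other tools report.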