Optimize Parameters fails on F-measure

HeikoPaulheim Member Posts: 13 Contributor II
edited November 2018 in Help
Hi,

I am trying to optimize parameters towards the F-measure. There may be cases where the F-measure is undefined (when there are no true positives), but I know that some configurations exist where the F-measure is defined (i.e., at least one true positive).

The Optimize Parameters (Grid) operator, however, always returns a configuration where the F-measure is undefined.

Is there any way to circumvent that behavior?

Best,
Heiko

Answers

  • fras Member Posts: 93 Contributor II
    Are you really optimizing your model with respect to the F-measure? Please post your process here so we can check:

  • HeikoPaulheim Member Posts: 13 Contributor II
    There it is. It yields an F-measure of 0. If I change the main measure to AUC, it yields an F-measure of ~37%, so it is technically possible to get a higher value here.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="csv_file" value="C:\Users\Heiko\Documents\Forschung\DBpediaDebugging\redirects\training_features.csv"/>
           <parameter key="column_separators" value="&#9;"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <parameter key="encoding" value="windows-1252"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="Original.true.polynominal.id"/>
             <parameter key="1" value="Replaced.true.polynominal.batch"/>
             <parameter key="2" value="Correct.true.binominal.label"/>
             <parameter key="3" value="Plausible.true.integer.attribute"/>
             <parameter key="4" value="Distribution.true.real.attribute"/>
             <parameter key="5" value="Levenstein.true.integer.attribute"/>
             <parameter key="6" value="Levenstein (relative).true.real.attribute"/>
             <parameter key="7" value="Jaccard.true.real.attribute"/>
             <parameter key="8" value="Jaro.true.real.attribute"/>
             <parameter key="9" value="JaroWinkler.true.real.attribute"/>
             <parameter key="10" value="Prefix.true.real.attribute"/>
             <parameter key="11" value="Prefix2.true.real.attribute"/>
             <parameter key="12" value="Substring1.true.real.attribute"/>
             <parameter key="13" value="Substring2.true.real.attribute"/>
             <parameter key="14" value="Redirects.true.integer.attribute"/>
             <parameter key="15" value="Disambiguations.true.integer.attribute"/>
           </list>
         </operator>
         <operator activated="true" class="optimize_parameters_grid" compatibility="5.3.015" expanded="true" height="94" name="Optimize Parameters (2)" width="90" x="246" y="30">
           <list key="parameters">
             <parameter key="SVM (4).gamma" value="[0.0000001;1000000;13;logarithmic]"/>
             <parameter key="SVM (4).C" value="[0.0000001;1000000;13;logarithmic]"/>
           </list>
           <process expanded="true">
             <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="Validation (3)" width="90" x="246" y="30">
               <description>A cross-validation evaluating a decision tree model.</description>
               <process expanded="true">
                 <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.3.015" expanded="true" height="76" name="SVM (4)" width="90" x="90" y="30">
                   <parameter key="gamma" value="1.0000000000000003E-4"/>
                   <parameter key="C" value="1000000.0"/>
                   <list key="class_weights">
                     <parameter key="0" value="20.0"/>
                     <parameter key="1" value="1.0"/>
                   </list>
                 </operator>
                 <connect from_port="training" to_op="SVM (4)" to_port="training set"/>
                 <connect from_op="SVM (4)" from_port="model" to_port="model"/>
                 <portSpacing port="source_training" spacing="0"/>
                 <portSpacing port="sink_model" spacing="0"/>
                 <portSpacing port="sink_through 1" spacing="0"/>
               </process>
               <process expanded="true">
                 <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model (5)" width="90" x="45" y="30">
                   <list key="application_parameters"/>
                 </operator>
                 <operator activated="true" class="performance_binominal_classification" compatibility="5.3.015" expanded="true" height="76" name="Performance (5)" width="90" x="179" y="30">
                   <parameter key="f_measure" value="true"/>
                 </operator>
                 <connect from_port="model" to_op="Apply Model (5)" to_port="model"/>
                 <connect from_port="test set" to_op="Apply Model (5)" to_port="unlabelled data"/>
                 <connect from_op="Apply Model (5)" from_port="labelled data" to_op="Performance (5)" to_port="labelled data"/>
                 <connect from_op="Performance (5)" from_port="performance" to_port="averagable 1"/>
                 <portSpacing port="source_model" spacing="0"/>
                 <portSpacing port="source_test set" spacing="0"/>
                 <portSpacing port="source_through 1" spacing="0"/>
                 <portSpacing port="sink_averagable 1" spacing="0"/>
                 <portSpacing port="sink_averagable 2" spacing="0"/>
               </process>
             </operator>
             <connect from_port="input 1" to_op="Validation (3)" to_port="training"/>
             <connect from_op="Validation (3)" from_port="averagable 1" to_port="performance"/>
             <portSpacing port="source_input 1" spacing="0"/>
             <portSpacing port="source_input 2" spacing="0"/>
             <portSpacing port="sink_performance" spacing="0"/>
             <portSpacing port="sink_result 1" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Optimize Parameters (2)" to_port="input 1"/>
         <connect from_op="Optimize Parameters (2)" from_port="parameter" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
  • HeikoPaulheim Member Posts: 13 Contributor II
    If I may make a guess at the cause here: I think RM internally computes the F-measure without checking for tp = 0. Dividing a positive double by zero in Java yields positive infinity, which compares greater than any finite double:

    double d1 = 1.0;
    double d2 = 1.0 / 0.0;        // Double.POSITIVE_INFINITY, no exception is thrown
    System.out.println(d1 > d2);  // false
    System.out.println(d2 > d1);  // true
    Thus, if not handled separately, a configuration that produces zero true positives (i.e., both recall and precision are 0) will always be favored over any other configuration, since the F-measure formula then has 0 in its denominator. By convention, F1 is defined as 0 when tp = 0, even though the formula itself is undefined in that case.
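    To illustrate the guard described above, here is a minimal, hypothetical sketch in plain Java (not RapidMiner's actual internal code) that applies the tp = 0 convention before evaluating the formula:

    ```java
    public class FMeasure {

        // Hypothetical helper: F1 from confusion-matrix counts, with an
        // explicit guard for the tp == 0 case discussed above.
        static double f1(int tp, int fp, int fn) {
            if (tp == 0) {
                return 0.0; // conventional value; the formula itself is 0/0 here
            }
            double precision = (double) tp / (tp + fp);
            double recall = (double) tp / (tp + fn);
            return 2.0 * precision * recall / (precision + recall);
        }

        public static void main(String[] args) {
            System.out.println(f1(0, 5, 3)); // 0.0 instead of an undefined value
            System.out.println(f1(8, 2, 4)); // equivalent to 2*tp / (2*tp + fp + fn)
        }
    }
    ```

    With such a guard in place, the optimizer would compare 0 against the defined F-measures of other configurations and could no longer prefer a degenerate one.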