Auto-find precision/recall breakeven point

blubbblubb Member Posts: 3 Contributor I
edited November 2018 in Help
Hi there,

I am quite new to RapidMiner and to classification in general. In short, my problem is the following:
  • text classification
  • OHSUMED-91 dataset (~13,000 medical abstracts)
  • I want to compare my results with those of someone else, who measured his classifier with the precision-recall breakeven point
To reach this breakeven point I tried:
  • the MetaCost operator
  • ThresholdFinder / ThresholdApplier
But by adjusting the costs manually I can't seem to find the breakeven point. Is there a way to find the right threshold automatically? Or is my whole approach inappropriate?
Any help would be appreciated.

Thanks in advance,
Klaus

My process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Root">
    <description>&lt;p&gt;This process demonstrates how a threshold can be obtained from a soft classifier and applied to an independent test set.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The learner used in this process makes soft predictions instead of crisp classifications.  The prediction confidences delivered by all learners in RapidMiner which are able to handle nominal labels (classification) will be used as soft predictions. &lt;br&gt;&lt;icon&gt;groups/24/learner&lt;/icon&gt;&lt;/li&gt;&lt;li&gt;The ThresholdFinder is used to determine the best threshold with respect to class weights. In this case, a wrong classification of the first class (negative) will cause costs five times bigger than the other error. &lt;br&gt;&lt;icon&gt;groups/24/postprocessing&lt;/icon&gt;&lt;/li&gt;&lt;li&gt;Please note that a ModelApplier must be performed on the test set before a threshold can be found. Since this model must be applied again later, the model applier keeps the input model. &lt;br&gt;&lt;icon&gt;operators/24/model_applier&lt;/icon&gt;&lt;/li&gt;&lt;li&gt;The IOConsumer ensures that the prediction is made on the correct data set.  &lt;br&gt;&lt;icon&gt;operators/24/io_consumer&lt;/icon&gt;&lt;/li&gt;&lt;li&gt;The last steps apply the model and the threshold on the data set at hand. &lt;br&gt;&lt;icon&gt;groups/24/validation&lt;/icon&gt;&lt;/li&gt;&lt;/ol&gt;</description>
    <parameter key="logverbosity" value="status"/>
    <parameter key="random_seed" value="1903"/>
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="ProcessTrainingSet" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="C23_less" value="C:\stuff\Dokumentenklassifikation\Datasets\Ohsumed_91\ohsumed-first-20000-divided-C23_notC23\training\C23_less"/>
          <parameter key="notC23_less" value="C:\stuff\Dokumentenklassifikation\Datasets\Ohsumed_91\ohsumed-first-20000-divided-C23_notC23\training\notC23_less"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="false" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.3.002" expanded="true" height="60" name="Stem (Snowball)" width="90" x="315" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="623" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.3.015" expanded="true" height="76" name="SVM (2)" width="90" x="246" y="30">
        <parameter key="gamma" value="1.0"/>
        <parameter key="C" value="1.0"/>
        <parameter key="nu" value="0.4"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="ProcessTestSet" width="90" x="112" y="165">
        <list key="text_directories">
          <parameter key="C23_less" value="C:\stuff\Dokumentenklassifikation\Datasets\Ohsumed_91\ohsumed-first-20000-divided-C23_notC23\test\C23_less"/>
          <parameter key="notC23_less" value="C:\stuff\Dokumentenklassifikation\Datasets\Ohsumed_91\ohsumed-first-20000-divided-C23_notC23\test\notC23_less"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
          <operator activated="false" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="380" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="648" y="30"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="TestModelApplier" width="90" x="380" y="120">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="find_threshold" compatibility="5.3.015" expanded="true" height="76" name="ThresholdFinder" width="90" x="514" y="120">
        <parameter key="misclassification_costs_first" value="1.335"/>
        <parameter key="use_example_weights" value="false"/>
      </operator>
      <operator activated="true" class="apply_threshold" compatibility="5.3.015" expanded="true" height="76" name="ThresholdApplier" width="90" x="648" y="120"/>
      <operator activated="true" class="performance_classification" compatibility="5.3.015" expanded="true" height="76" name="Performance (3)" width="90" x="782" y="120">
        <parameter key="classification_error" value="true"/>
        <parameter key="kappa" value="true"/>
        <parameter key="weighted_mean_recall" value="true"/>
        <parameter key="weighted_mean_precision" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="true"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="true"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="root_relative_squared_error" value="true"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="cross-entropy" value="true"/>
        <parameter key="margin" value="true"/>
        <parameter key="soft_margin_loss" value="true"/>
        <parameter key="logistic_loss" value="true"/>
        <list key="class_weights"/>
      </operator>
      <connect from_op="ProcessTrainingSet" from_port="example set" to_op="SVM (2)" to_port="training set"/>
      <connect from_op="ProcessTrainingSet" from_port="word list" to_op="ProcessTestSet" to_port="word list"/>
      <connect from_op="SVM (2)" from_port="model" to_op="TestModelApplier" to_port="model"/>
      <connect from_op="ProcessTestSet" from_port="example set" to_op="TestModelApplier" to_port="unlabelled data"/>
      <connect from_op="TestModelApplier" from_port="labelled data" to_op="ThresholdFinder" to_port="example set"/>
      <connect from_op="ThresholdFinder" from_port="example set" to_op="ThresholdApplier" to_port="example set"/>
      <connect from_op="ThresholdFinder" from_port="threshold" to_op="ThresholdApplier" to_port="threshold"/>
      <connect from_op="ThresholdApplier" from_port="example set" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Performance (3)" from_port="performance" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="36"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • fras Member Posts: 93 Contributor II
    Hi Klaus,
    to focus more on the threshold problem, I built a new process with
    the Sonar data and a Naive Bayes learner.
    First of all: do you want few false positives (high precision),
    or is it more important to have few false negatives (high recall)?
    You usually cannot maximize both at once.
    Especially if you are dealing with mines and rocks, as in the example process,
    it is highly recommended for a submarine that _no_ mine is
    predicted as a rock...
    Please study the operator "Select Recall" used in the following process.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Root">
        <parameter key="logverbosity" value="status"/>
        <parameter key="random_seed" value="1903"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" height="60" name="Retrieve Sonar" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" compatibility="6.0.003" expanded="true" height="94" name="Nominal to Binominal" width="90" x="45" y="75"/>
          <operator activated="true" class="split_data" compatibility="6.0.003" expanded="true" height="94" name="Split Data" width="90" x="45" y="210">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.3"/>
              <parameter key="ratio" value="0.7"/>
            </enumeration>
            <parameter key="sampling_type" value="stratified sampling"/>
          </operator>
          <operator activated="true" class="naive_bayes" compatibility="6.0.003" expanded="true" height="76" name="Naive Bayes" width="90" x="179" y="165"/>
          <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="TestModelApplier" width="90" x="179" y="255">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.0.003" expanded="true" height="94" name="Multiply" width="90" x="313" y="210"/>
          <operator activated="true" class="select_recall" compatibility="6.0.003" expanded="true" height="76" name="Select Recall" width="90" x="447" y="165">
            <parameter key="min_recall" value="0.6"/>
            <parameter key="use_example_weights" value="false"/>
          </operator>
          <operator activated="true" class="apply_threshold" compatibility="6.0.003" expanded="true" height="76" name="Apply Threshold" width="90" x="581" y="165"/>
          <operator activated="true" class="performance_binominal_classification" compatibility="6.0.003" expanded="true" height="76" name="Perf 0.6" width="90" x="715" y="165">
            <parameter key="AUC (pessimistic)" value="true"/>
            <parameter key="precision" value="true"/>
            <parameter key="recall" value="true"/>
          </operator>
          <operator activated="true" class="select_recall" compatibility="6.0.003" expanded="true" height="76" name="Select Recall (2)" width="90" x="447" y="255">
            <parameter key="min_recall" value="0.9"/>
            <parameter key="use_example_weights" value="false"/>
          </operator>
          <operator activated="true" class="apply_threshold" compatibility="6.0.003" expanded="true" height="76" name="Apply Threshold (2)" width="90" x="581" y="255"/>
          <operator activated="true" class="performance_binominal_classification" compatibility="6.0.003" expanded="true" height="76" name="Perf 0.9" width="90" x="715" y="255">
            <parameter key="AUC (pessimistic)" value="true"/>
            <parameter key="precision" value="true"/>
            <parameter key="recall" value="true"/>
          </operator>
          <connect from_op="Retrieve Sonar" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="TestModelApplier" to_port="unlabelled data"/>
          <connect from_op="Naive Bayes" from_port="model" to_op="TestModelApplier" to_port="model"/>
          <connect from_op="TestModelApplier" from_port="labelled data" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Select Recall" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Select Recall (2)" to_port="example set"/>
          <connect from_op="Select Recall" from_port="example set" to_op="Apply Threshold" to_port="example set"/>
          <connect from_op="Select Recall" from_port="threshold" to_op="Apply Threshold" to_port="threshold"/>
          <connect from_op="Apply Threshold" from_port="example set" to_op="Perf 0.6" to_port="labelled data"/>
          <connect from_op="Perf 0.6" from_port="performance" to_port="result 1"/>
          <connect from_op="Select Recall (2)" from_port="example set" to_op="Apply Threshold (2)" to_port="example set"/>
          <connect from_op="Select Recall (2)" from_port="threshold" to_op="Apply Threshold (2)" to_port="threshold"/>
          <connect from_op="Apply Threshold (2)" from_port="example set" to_op="Perf 0.9" to_port="labelled data"/>
          <connect from_op="Perf 0.9" from_port="performance" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="36"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
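
To make the idea behind choosing a threshold for a minimum recall concrete, here is a minimal Python sketch. It is not the implementation of RapidMiner's Select Recall operator; the function name `threshold_for_min_recall` and the toy confidences are assumptions made purely for illustration. The sketch returns the highest confidence threshold at which the recall of the positive class still reaches the requested minimum.

```python
from typing import Sequence


def threshold_for_min_recall(scores: Sequence[float],
                             labels: Sequence[bool],
                             min_recall: float) -> float:
    """Return the highest threshold whose recall is >= min_recall.

    scores: confidence for the positive class, one per example
    labels: True where the example really is positive
    An example is predicted positive when its score >= threshold.
    """
    if not 0.0 <= min_recall <= 1.0:
        raise ValueError("min_recall must lie in [0, 1]")
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    # Try candidate thresholds from strict to lenient; recall only grows.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        if tp / total_pos >= min_recall:
            return t
    # Not reached: the lowest score always yields recall 1.0.
    return min(scores)


# Hypothetical toy data (not taken from the Sonar set).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [True, True, False, True, False, False]

print(threshold_for_min_recall(scores, labels, 0.6))  # 0.8
print(threshold_for_min_recall(scores, labels, 0.9))  # 0.6
```

Raising the minimum recall pushes the threshold down, which usually admits more false positives and therefore lowers precision.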

  • blubbblubb Member Posts: 3 Contributor I
    Hi fras,

    first of all, thank you very much for providing this example process. With the "Select Recall" operator I now have another method for adjusting precision and recall, but it is still not exactly what I am looking for.

    As mentioned above, I want to compare my classifier's results with those of someone else (Thorsten Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features).
    He measured his classifier with the precision-recall breakeven point,
    which is defined as the point where precision and recall are equal.

    I know that this is a very old and obsolete method for measuring a classifier, and that other values like FOIL's information gain or the likelihood ratio are much better for defining the quality of a classifier; I will use those as well. But in order to compare my results with those of Thorsten Joachims I need to find the precision-recall breakeven point. Is there a way in RapidMiner to find the needed threshold automatically, or do I have to find it manually by trying different thresholds?
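
Independent of the tool, the breakeven point can be located by sweeping the threshold over the prediction confidences and keeping the one where precision and recall are closest. The Python sketch below illustrates this under stated assumptions and is not a RapidMiner feature: the function names and the toy data are made up for illustration. Since precision is tp/(tp+fp) and recall is tp/(tp+fn), the two are equal exactly when fp = fn (given tp > 0), i.e. when the number of predicted positives equals the number of actual positives. If tied confidences prevent an exact match, the sketch simply reports the closest pair.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when score >= threshold means 'positive'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def breakeven_point(scores: Sequence[float],
                    labels: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (threshold, precision, recall) with minimal |precision - recall|."""
    if not any(labels):
        raise ValueError("need at least one positive example")
    best = None
    # Every distinct confidence is a candidate threshold.
    for t in sorted(set(scores), reverse=True):
        p, r = precision_recall(scores, labels, t)
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, t, p, r)
    _, t, p, r = best
    return t, p, r


# Hypothetical confidences for the positive class and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [True, True, False, True, False, False]

t, p, r = breakeven_point(scores, labels)
print(f"threshold={t}, precision={p:.3f}, recall={r:.3f}")
```

On this toy data there are three actual positives, so the breakeven lands on the third-highest confidence (0.7), where precision and recall are both 2/3.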

    ---

    Another question: in your example, when I type "Mine" or "Rock" into the "positive label" setting of the Select Recall operator, it seems to have no effect. "Mine" is always the positive label, even when I type in "Rock". I also tried the Remap Binominals operator, with no success in changing the positive label:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Root">
       <parameter key="logverbosity" value="status"/>
       <parameter key="random_seed" value="1903"/>
       <process expanded="true">
         <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve Sonar" width="90" x="45" y="30">
           <parameter key="repository_entry" value="//Samples/data/Sonar"/>
         </operator>
         <operator activated="true" class="nominal_to_binominal" compatibility="5.3.015" expanded="true" height="94" name="Nominal to Binominal" width="90" x="179" y="30"/>
         <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="94" name="Multiply" width="90" x="313" y="30"/>
         <operator activated="true" class="remap_binominals" compatibility="5.3.015" expanded="true" height="76" name="Remap Binominals" width="90" x="45" y="210">
           <parameter key="negative_value" value="Mine"/>
           <parameter key="positive_value" value="Rock"/>
         </operator>
         <operator activated="true" class="split_data" compatibility="5.3.015" expanded="true" height="94" name="Split Data (2)" width="90" x="179" y="210">
           <enumeration key="partitions">
             <parameter key="ratio" value="0.3"/>
             <parameter key="ratio" value="0.7"/>
           </enumeration>
           <parameter key="sampling_type" value="stratified sampling"/>
         </operator>
         <operator activated="true" class="naive_bayes" compatibility="5.3.015" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="313" y="210"/>
         <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="TestModelApplier (2)" width="90" x="447" y="210">
           <list key="application_parameters"/>
         </operator>
         <operator activated="true" class="select_recall" compatibility="5.3.015" expanded="true" height="76" name="Select Recall (2)" width="90" x="581" y="210">
           <parameter key="min_recall" value="0.9"/>
           <parameter key="use_example_weights" value="false"/>
         </operator>
         <operator activated="true" class="apply_threshold" compatibility="5.3.015" expanded="true" height="76" name="Apply Threshold (2)" width="90" x="715" y="210"/>
         <operator activated="true" class="performance_binominal_classification" compatibility="5.3.015" expanded="true" height="76" name="Perf 2" width="90" x="849" y="210">
           <parameter key="AUC (pessimistic)" value="true"/>
           <parameter key="precision" value="true"/>
           <parameter key="recall" value="true"/>
         </operator>
         <operator activated="true" class="remap_binominals" compatibility="5.3.015" expanded="true" height="76" name="Remap Binominals (2)" width="90" x="45" y="345">
           <parameter key="negative_value" value="Rock"/>
           <parameter key="positive_value" value="Mine"/>
         </operator>
         <operator activated="true" class="split_data" compatibility="5.3.015" expanded="true" height="94" name="Split Data" width="90" x="179" y="345">
           <enumeration key="partitions">
             <parameter key="ratio" value="0.3"/>
             <parameter key="ratio" value="0.7"/>
           </enumeration>
           <parameter key="sampling_type" value="stratified sampling"/>
         </operator>
         <operator activated="true" class="naive_bayes" compatibility="5.3.015" expanded="true" height="76" name="Naive Bayes" width="90" x="313" y="345"/>
         <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="TestModelApplier" width="90" x="447" y="345">
           <list key="application_parameters"/>
         </operator>
         <operator activated="true" class="select_recall" compatibility="5.3.015" expanded="true" height="76" name="Select Recall" width="90" x="581" y="345">
           <parameter key="min_recall" value="0.9"/>
           <parameter key="use_example_weights" value="false"/>
         </operator>
         <operator activated="true" class="apply_threshold" compatibility="5.3.015" expanded="true" height="76" name="Apply Threshold" width="90" x="715" y="345"/>
         <operator activated="true" class="performance_binominal_classification" compatibility="5.3.015" expanded="true" height="76" name="Perf 1" width="90" x="849" y="345">
           <parameter key="AUC (pessimistic)" value="true"/>
           <parameter key="precision" value="true"/>
           <parameter key="recall" value="true"/>
         </operator>
         <connect from_op="Retrieve Sonar" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
         <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Multiply" to_port="input"/>
         <connect from_op="Multiply" from_port="output 1" to_op="Remap Binominals (2)" to_port="example set input"/>
         <connect from_op="Multiply" from_port="output 2" to_op="Remap Binominals" to_port="example set input"/>
         <connect from_op="Remap Binominals" from_port="original" to_op="Split Data (2)" to_port="example set"/>
         <connect from_op="Split Data (2)" from_port="partition 1" to_op="Naive Bayes (2)" to_port="training set"/>
         <connect from_op="Split Data (2)" from_port="partition 2" to_op="TestModelApplier (2)" to_port="unlabelled data"/>
         <connect from_op="Naive Bayes (2)" from_port="model" to_op="TestModelApplier (2)" to_port="model"/>
         <connect from_op="TestModelApplier (2)" from_port="labelled data" to_op="Select Recall (2)" to_port="example set"/>
         <connect from_op="Select Recall (2)" from_port="example set" to_op="Apply Threshold (2)" to_port="example set"/>
         <connect from_op="Select Recall (2)" from_port="threshold" to_op="Apply Threshold (2)" to_port="threshold"/>
         <connect from_op="Apply Threshold (2)" from_port="example set" to_op="Perf 2" to_port="labelled data"/>
         <connect from_op="Perf 2" from_port="performance" to_port="result 2"/>
         <connect from_op="Remap Binominals (2)" from_port="original" to_op="Split Data" to_port="example set"/>
         <connect from_op="Split Data" from_port="partition 1" to_op="Naive Bayes" to_port="training set"/>
         <connect from_op="Split Data" from_port="partition 2" to_op="TestModelApplier" to_port="unlabelled data"/>
         <connect from_op="Naive Bayes" from_port="model" to_op="TestModelApplier" to_port="model"/>
         <connect from_op="TestModelApplier" from_port="labelled data" to_op="Select Recall" to_port="example set"/>
         <connect from_op="Select Recall" from_port="example set" to_op="Apply Threshold" to_port="example set"/>
         <connect from_op="Select Recall" from_port="threshold" to_op="Apply Threshold" to_port="threshold"/>
         <connect from_op="Apply Threshold" from_port="example set" to_op="Perf 1" to_port="labelled data"/>
         <connect from_op="Perf 1" from_port="performance" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="36"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
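
One detail in the process above may be worth checking: both Remap Binominals operators are connected through their `original` port, which in RapidMiner normally passes the input through unchanged, so the remapped data may never reach Split Data. This is only an observation from the XML, not a verified fix. Why the choice of positive class matters at all is shown by the small Python sketch below, which computes precision and recall from the same predictions with each class treated as positive in turn. The helper name `precision_recall_for` and the toy labels are assumptions for illustration.

```python
from typing import Sequence, Tuple


def precision_recall_for(positive: str,
                         actual: Sequence[str],
                         predicted: Sequence[str]) -> Tuple[float, float]:
    """Precision and recall with `positive` treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: one Rock is wrongly predicted as Mine.
actual = ["Mine", "Mine", "Mine", "Rock", "Rock"]
predicted = ["Mine", "Mine", "Mine", "Mine", "Rock"]

print(precision_recall_for("Mine", actual, predicted))  # (0.75, 1.0)
print(precision_recall_for("Rock", actual, predicted))  # (1.0, 0.5)
```

The same predictions give different precision and recall depending on the positive class, so a threshold tuned for one class will generally not be the breakeven threshold for the other.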
  • blubbblubb Member Posts: 3 Contributor I
    nobody got any suggestions for my problem?  :'(