What is the underlying algorithm of "Find threshold"

johnny5550822 · November 2016

I understand that the "Find threshold" operator uses ROC to determine the best threshold. But which algorithm does it use to select the threshold? For example, does it (1) optimize precision and recall, (2) use something like this: http://stats.stackexchange.com/questions/29719/how-to-determine-best-cutoff-point-and-its-confidence-interval-using-roc-curve-i, or (3) do something else?
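To make options (1) and (2) concrete, here is a minimal sketch of two common selection criteria: maximum F1 (a precision/recall trade-off) and Youden's J (TPR minus FPR, one of the ROC-based cutoffs). It only illustrates the alternatives being asked about; nothing in it is taken from the RapidMiner source, and the function names and toy data are made up for the example.

```python
def confusion_at(scores, labels, threshold):
    """Count (tp, fp, fn, tn), predicting positive (label 1) when score >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        if s >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def youden_j(tp, fp, fn, tn):
    # Youden's J = TPR - FPR; requires at least one example of each class.
    return tp / (tp + fn) - fp / (fp + tn)


def f1(tp, fp, fn, tn):
    # F1 is the harmonic mean of precision and recall, written in count form.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels, criterion):
    # Candidate thresholds are the distinct scores; ties keep the highest threshold.
    candidates = sorted(set(scores), reverse=True)
    return max(candidates, key=lambda t: criterion(*confusion_at(scores, labels, t)))


# Toy data (hypothetical): confidence for the positive class and the true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 1, 0, 0]

print(best_threshold(scores, labels, youden_j))
print(best_threshold(scores, labels, f1))
```

On this toy data both criteria happen to pick the same cutoff; on real data they can disagree, which is why it matters which one the operator uses.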

Thanks!

JEdward · November 2016

Hi Johnny,

You should be able to track it down on GitHub. RapidMiner GitHub

Try here: Find threshold & ROC helper class

johnny5550822 · November 2016

Great, thanks. Let me take a look!

johnny5550822 · November 2016

I tried to understand the code in the method "public ROCData createROCData", but I don't quite understand which method it uses to determine the best threshold. Is it based on a paper?

The code is in:

"https://github.com/rapidminer/rapidminer-studio/blob/85d3bee36c026a70580075092ed85ac517369e8e/src/main/java/com/rapidminer/tools/math/ROCDataGenerator.java"

SGolbert · April 2019

Hi,

I am reviving this thread because I had to use Find Threshold and it doesn't perform as I expected. After I gave it both misclassification costs, it didn't return an optimal threshold; I had to multiply one of the costs by 10 to obtain a better one. I have looked at the Java code, but as usual it is very time consuming to work out what is going on. So I am only giving a warning that something may be wrong. In general, "take a look at the Java code" is not much of an answer; we need better references for each of the methods.

Here is the process I used:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve 2. Data + new features" width="90" x="112" y="34">
        <parameter key="repository_entry" value="../data/2. Data + new features"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="166" name="Validation" width="90" x="447" y="34">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="10"/>
        <parameter key="sampling_type" value="stratified sampling"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="45" y="34">
            <parameter key="family" value="AUTO"/>
            <parameter key="link" value="family_default"/>
            <parameter key="solver" value="AUTO"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_regularization" value="true"/>
            <parameter key="lambda_search" value="false"/>
            <parameter key="number_of_lambdas" value="0"/>
            <parameter key="lambda_min_ratio" value="0.0"/>
            <parameter key="early_stopping" value="true"/>
            <parameter key="stopping_rounds" value="3"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="standardize" value="true"/>
            <parameter key="non-negative_coefficients" value="false"/>
            <parameter key="add_intercept" value="true"/>
            <parameter key="compute_p-values" value="false"/>
            <parameter key="remove_collinear_columns" value="false"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_iterations" value="0"/>
            <parameter key="specify_beta_constraints" value="false"/>
            <list key="beta_constraints"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <operator activated="false" class="h2o:gradient_boosted_trees" compatibility="9.2.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="246" y="442">
            <parameter key="number_of_trees" value="100"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="maximal_depth" value="10"/>
            <parameter key="min_rows" value="10.0"/>
            <parameter key="min_split_improvement" value="0.0"/>
            <parameter key="number_of_bins" value="20"/>
            <parameter key="learning_rate" value="0.01"/>
            <parameter key="sample_rate" value="1.0"/>
            <parameter key="distribution" value="AUTO"/>
            <parameter key="early_stopping" value="false"/>
            <parameter key="stopping_rounds" value="1"/>
            <parameter key="stopping_metric" value="AUTO"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <operator activated="false" class="h2o:deep_learning" compatibility="9.2.000" expanded="true" height="82" name="Deep Learning" width="90" x="380" y="442">
            <parameter key="activation" value="Rectifier"/>
            <enumeration key="hidden_layer_sizes">
              <parameter key="hidden_layer_sizes" value="50"/>
              <parameter key="hidden_layer_sizes" value="50"/>
            </enumeration>
            <enumeration key="hidden_dropout_ratios"/>
            <parameter key="reproducible_(uses_1_thread)" value="false"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="epochs" value="10.0"/>
            <parameter key="compute_variable_importances" value="false"/>
            <parameter key="train_samples_per_iteration" value="-2"/>
            <parameter key="adaptive_rate" value="true"/>
            <parameter key="epsilon" value="1.0E-8"/>
            <parameter key="rho" value="0.99"/>
            <parameter key="learning_rate" value="0.005"/>
            <parameter key="learning_rate_annealing" value="1.0E-6"/>
            <parameter key="learning_rate_decay" value="1.0"/>
            <parameter key="momentum_start" value="0.0"/>
            <parameter key="momentum_ramp" value="1000000.0"/>
            <parameter key="momentum_stable" value="0.0"/>
            <parameter key="nesterov_accelerated_gradient" value="true"/>
            <parameter key="standardize" value="true"/>
            <parameter key="L1" value="1.0E-5"/>
            <parameter key="L2" value="0.0"/>
            <parameter key="max_w2" value="10.0"/>
            <parameter key="loss_function" value="Automatic"/>
            <parameter key="distribution_function" value="AUTO"/>
            <parameter key="early_stopping" value="false"/>
            <parameter key="stopping_rounds" value="1"/>
            <parameter key="stopping_metric" value="AUTO"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
            <list key="expert_parameters_"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="34"/>
          <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="313" y="136">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="find_threshold" compatibility="9.2.001" expanded="true" height="82" name="Find Threshold" width="90" x="447" y="136">
            <parameter key="define_labels" value="false"/>
            <parameter key="misclassification_costs_first" value="25.0"/>
            <parameter key="misclassification_costs_second" value="10.0"/>
            <parameter key="show_roc_plot" value="false"/>
            <parameter key="use_example_weights" value="true"/>
            <parameter key="roc_bias" value="optimistic"/>
          </operator>
          <connect from_port="training set" to_op="Generalized Linear Model" to_port="training set"/>
          <connect from_op="Generalized Linear Model" from_port="model" to_op="Multiply" to_port="input"/>
          <connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Multiply" from_port="output 1" to_port="model"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Find Threshold" to_port="example set"/>
          <connect from_op="Find Threshold" from_port="threshold" to_port="through 1"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="apply_threshold" compatibility="9.2.001" expanded="true" height="82" name="Apply Threshold" width="90" x="179" y="34"/>
          <operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply (2)" width="90" x="313" y="34"/>
          <operator activated="true" class="performance_costs" compatibility="9.2.001" expanded="true" height="82" name="Performance" width="90" x="514" y="34">
            <parameter key="keep_exampleSet" value="false"/>
            <parameter key="cost_matrix" value="[0.0 10.0;25.0 0.0]"/>
            <enumeration key="class_order_definition"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" compatibility="9.2.001" expanded="true" height="82" name="Performance (2)" width="90" x="514" y="136">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="true"/>
            <parameter key="kappa" value="false"/>
            <parameter key="AUC (optimistic)" value="false"/>
            <parameter key="AUC" value="true"/>
            <parameter key="AUC (pessimistic)" value="false"/>
            <parameter key="precision" value="true"/>
            <parameter key="recall" value="true"/>
            <parameter key="lift" value="false"/>
            <parameter key="fallout" value="false"/>
            <parameter key="f_measure" value="true"/>
            <parameter key="false_positive" value="false"/>
            <parameter key="false_negative" value="false"/>
            <parameter key="true_positive" value="false"/>
            <parameter key="true_negative" value="false"/>
            <parameter key="sensitivity" value="true"/>
            <parameter key="specificity" value="false"/>
            <parameter key="youden" value="false"/>
            <parameter key="positive_predictive_value" value="false"/>
            <parameter key="negative_predictive_value" value="false"/>
            <parameter key="psep" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_port="through 1" to_op="Apply Threshold" to_port="threshold"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Apply Threshold" to_port="example set"/>
          <connect from_op="Apply Threshold" from_port="example set" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_op="Performance" to_port="example set"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="performance 2"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="source_through 2" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
          <portSpacing port="sink_performance 3" spacing="0"/>
          <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
        </process>
        <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
      </operator>
      <connect from_op="Retrieve 2. Data + new features" from_port="output" to_op="Validation" to_port="example set"/>
      <connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
      <connect from_op="Validation" from_port="performance 2" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
</operator>
</process>

I have also tried to optimize the threshold without cross validation, and it doesn't give the best one even for the training data!
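For comparison, a brute-force check of the cost-minimizing threshold on the training data can be sketched outside RapidMiner as follows. This is not the operator's algorithm, only a reference to compare its output against. It assumes the scores are the confidences for the positive class, that "score >= threshold" means "predict positive", and that correct predictions cost nothing; the function names and the toy data are hypothetical.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total misclassification cost when score >= threshold predicts positive (label 1)."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn


def best_cost_threshold(scores, labels, cost_fp, cost_fn):
    """Try every distinct score (plus 'predict nothing positive') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    costs = [(t, total_cost(scores, labels, t, cost_fp, cost_fn)) for t in candidates]
    return min(costs, key=lambda tc: tc[1])  # ties keep the lowest threshold


# Toy data (hypothetical): confidence for the positive class and the true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 1, 0, 0]

print(best_cost_threshold(scores, labels, cost_fp=25, cost_fn=10))
print(best_cost_threshold(scores, labels, cost_fp=10, cost_fn=25))
```

Running it with the two costs swapped shows how much the result depends on which error type each cost is attached to, so it is worth checking that the costs in Find Threshold and in the cost matrix refer to the same kind of error.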

On a slightly different topic, the Optimize Threshold operator from the Operator Toolbox could be enhanced to accept other performance measures or misclassification costs.

Regards,

Sebastian

lionelderkrikor · April 2019

Hi @SGolbert,

I'm not able to reproduce what you observe with the "Titanic" dataset.
Could you share your data and specify:
- what you observe.
- what you expect.

Thank you,

Regards,

Lionel

MartinLiebig · April 2019

Hi @SGolbert,
w.r.t. the toolbox one: Noted. I planned to add a version with a subprocess where you can supply your custom performance measure. But time, as usual...

BR,
Martin

SGolbert · May 2019

Hi all,

Sorry for the delayed reply. I have found the process that I used and the data:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
<context>
    <input/>
    <output/>
    <macros>
      <macro>
        <key>cost_first</key>
        <value>25</value>
      </macro>
      <macro>
        <key>cost_second</key>
        <value>10</value>
      </macro>
    </macros>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve 2. Data + new features" width="90" x="112" y="34">
        <parameter key="repository_entry" value="../data/2. Data + new features"/>
      </operator>
      <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="313" y="34">
        <parameter key="family" value="AUTO"/>
        <parameter key="link" value="family_default"/>
        <parameter key="solver" value="AUTO"/>
        <parameter key="reproducible" value="false"/>
        <parameter key="maximum_number_of_threads" value="4"/>
        <parameter key="use_regularization" value="true"/>
        <parameter key="lambda_search" value="false"/>
        <parameter key="number_of_lambdas" value="0"/>
        <parameter key="lambda_min_ratio" value="0.0"/>
        <parameter key="early_stopping" value="true"/>
        <parameter key="stopping_rounds" value="3"/>
        <parameter key="stopping_tolerance" value="0.001"/>
        <parameter key="standardize" value="true"/>
        <parameter key="non-negative_coefficients" value="false"/>
        <parameter key="add_intercept" value="true"/>
        <parameter key="compute_p-values" value="false"/>
        <parameter key="remove_collinear_columns" value="false"/>
        <parameter key="missing_values_handling" value="MeanImputation"/>
        <parameter key="max_iterations" value="0"/>
        <parameter key="specify_beta_constraints" value="false"/>
        <list key="beta_constraints"/>
        <parameter key="max_runtime_seconds" value="0"/>
        <list key="expert_parameters"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve XX. Test Data + new features" width="90" x="715" y="85">
        <parameter key="repository_entry" value="../data/XX. Test Data + new features"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply (3)" width="90" x="514" y="34"/>
      <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="849" y="34">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="514" y="187">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="find_threshold" compatibility="9.2.001" expanded="true" height="82" name="Find Threshold (2)" width="90" x="715" y="187">
        <parameter key="define_labels" value="false"/>
        <parameter key="misclassification_costs_first" value="%{cost_first}"/>
        <parameter key="misclassification_costs_second" value="%{cost_second}"/>
        <parameter key="show_roc_plot" value="false"/>
        <parameter key="use_example_weights" value="true"/>
        <parameter key="roc_bias" value="optimistic"/>
        <description align="center" color="transparent" colored="false" width="126">define costs here</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="124" name="Multiply (4)" width="90" x="916" y="131"/>
      <operator activated="true" class="apply_threshold" compatibility="9.2.001" expanded="true" height="82" name="Apply Threshold (3)" width="90" x="1050" y="187"/>
      <operator activated="true" class="apply_threshold" compatibility="9.2.001" expanded="true" height="82" name="Apply Threshold (2)" width="90" x="1050" y="34"/>
      <operator activated="true" class="sort" compatibility="9.2.001" expanded="true" height="82" name="Sort" width="90" x="1184" y="34">
        <parameter key="attribute_name" value="prediction(fraud)"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply (5)" width="90" x="1184" y="187"/>
      <operator activated="true" class="performance_binominal_classification" compatibility="9.2.001" expanded="true" height="82" name="Performance (3)" width="90" x="1318" y="187">
        <parameter key="main_criterion" value="first"/>
        <parameter key="accuracy" value="true"/>
        <parameter key="classification_error" value="false"/>
        <parameter key="kappa" value="false"/>
        <parameter key="AUC (optimistic)" value="false"/>
        <parameter key="AUC" value="false"/>
        <parameter key="AUC (pessimistic)" value="false"/>
        <parameter key="precision" value="false"/>
        <parameter key="recall" value="false"/>
        <parameter key="lift" value="false"/>
        <parameter key="fallout" value="false"/>
        <parameter key="f_measure" value="false"/>
        <parameter key="false_positive" value="false"/>
        <parameter key="false_negative" value="false"/>
        <parameter key="true_positive" value="false"/>
        <parameter key="true_negative" value="false"/>
        <parameter key="sensitivity" value="false"/>
        <parameter key="specificity" value="false"/>
        <parameter key="youden" value="false"/>
        <parameter key="positive_predictive_value" value="false"/>
        <parameter key="negative_predictive_value" value="false"/>
        <parameter key="psep" value="false"/>
        <parameter key="skip_undefined_labels" value="true"/>
        <parameter key="use_example_weights" value="true"/>
      </operator>
      <operator activated="true" class="performance_costs" compatibility="9.2.001" expanded="true" height="82" name="Performance (4)" width="90" x="1318" y="340">
        <parameter key="keep_exampleSet" value="false"/>
        <parameter key="cost_matrix" value="[0.0 10.0;25.0 0.0]"/>
        <enumeration key="class_order_definition"/>
        <description align="center" color="transparent" colored="false" width="126">evaluate here with original costs</description>
      </operator>
      <connect from_op="Retrieve 2. Data + new features" from_port="output" to_op="Generalized Linear Model (2)" to_port="training set"/>
      <connect from_op="Generalized Linear Model (2)" from_port="model" to_op="Multiply (3)" to_port="input"/>
      <connect from_op="Generalized Linear Model (2)" from_port="exampleSet" to_op="Apply Model (3)" to_port="unlabelled data"/>
      <connect from_op="Retrieve XX. Test Data + new features" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Multiply (3)" from_port="output 1" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Multiply (3)" from_port="output 2" to_op="Apply Model (3)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Threshold (2)" to_port="example set"/>
      <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Find Threshold (2)" to_port="example set"/>
      <connect from_op="Find Threshold (2)" from_port="example set" to_op="Apply Threshold (3)" to_port="example set"/>
      <connect from_op="Find Threshold (2)" from_port="threshold" to_op="Multiply (4)" to_port="input"/>
      <connect from_op="Multiply (4)" from_port="output 1" to_op="Apply Threshold (2)" to_port="threshold"/>
      <connect from_op="Multiply (4)" from_port="output 2" to_port="result 2"/>
      <connect from_op="Multiply (4)" from_port="output 3" to_op="Apply Threshold (3)" to_port="threshold"/>
      <connect from_op="Apply Threshold (3)" from_port="example set" to_op="Multiply (5)" to_port="input"/>
      <connect from_op="Apply Threshold (2)" from_port="example set" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
      <connect from_op="Multiply (5)" from_port="output 1" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Multiply (5)" from_port="output 2" to_op="Performance (4)" to_port="example set"/>
      <connect from_op="Performance (3)" from_port="performance" to_port="result 3"/>
      <connect from_op="Performance (4)" from_port="performance" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
      <description align="center" color="yellow" colored="false" height="108" resized="false" width="180" x="692" y="400">Try costs for miss classification first: 25 (original) and 250<br/><br/>Defined in macro cost_first</description>
    </process>
</operator>
</process>

This process doesn't use cross-validation, but the cross-validated result is the same (in that case the unexpected behaviour could be caused by applying a model to unseen data, so I am testing on the training set to catch the bug).

The problem is simple: I have misclassification costs of 25 (no fraud) and 10 (fraud). It is actually more expensive to misclassify a loyal customer than a fraudulent one. I define these costs in the Find Threshold operator and then evaluate the results with Performance (Costs).

The problem is that I get better results when I use cost1 = 250 in Find Threshold instead of cost1 = 25. If you can explain why that is, I would really appreciate it!
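As a point of reference for what to expect, standard decision theory gives a closed-form cutoff when the model's confidences are well-calibrated probabilities and correct predictions cost nothing: predict positive when p >= cost_fp / (cost_fp + cost_fn). The sketch below computes that value for the costs above. It assumes "fraud" is the positive class, so that misclassifying a loyal customer is the false positive; the function name is hypothetical, and this is not a description of what Find Threshold does internally.

```python
def theoretical_threshold(cost_fp, cost_fn):
    """Cost-minimizing cutoff on a calibrated positive-class probability p.

    Predicting positive costs (1 - p) * cost_fp in expectation, predicting
    negative costs p * cost_fn, so positive is cheaper once
    p >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp < 0 or cost_fn < 0 or cost_fp + cost_fn == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return cost_fp / (cost_fp + cost_fn)


# Assumption: false positive = loyal customer flagged as fraud, false negative = missed fraud.
print(round(theoretical_threshold(25, 10), 4))   # original costs
print(round(theoretical_threshold(250, 10), 4))  # first cost multiplied by 10
print(round(theoretical_threshold(10, 25), 4))   # costs attached to the opposite errors
```

If the threshold that works best in practice is far from the first value, possible explanations include poorly calibrated confidences or the two costs being attached to the opposite error types in one of the operators.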

Kind regards,

Sebastian
