What is the underlying algorithm of "Find threshold"
johnny5550822
Member Posts: 3 Contributor I
I understand that the "Find threshold" operator uses the ROC curve to determine the best threshold. But what algorithm does it use to select that threshold? For example, does it (1) optimize precision and recall, (2) do something like this: http://stats.stackexchange.com/questions/29719/how-to-determine-best-cutoff-point-and-its-confidence-interval-using-roc-curve-i, or (3) something else?
Thanks!
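For orientation, here is a minimal sketch of one common cost-based way to pick a cutoff: try every distinct confidence as a threshold and keep the one with the lowest total misclassification cost. This is an illustration only and is not taken from the RapidMiner source. The function name, the parameter names, the toy data, and the ">= threshold means positive" convention are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def best_cost_threshold(
    confidences: Sequence[float],
    labels: Sequence[bool],
    cost_false_positive: float,
    cost_false_negative: float,
) -> Tuple[float, float]:
    """Return (threshold, total_cost) with the lowest misclassification cost.

    Assumption: an example is predicted positive when its confidence
    is greater than or equal to the threshold.
    """
    # Candidates: every distinct confidence, plus +inf (nothing predicted positive).
    candidates = sorted(set(confidences)) + [float("inf")]
    best_threshold, best_cost = candidates[0], float("inf")
    for threshold in candidates:
        # False positives: negatives at or above the threshold.
        fp = sum(1 for c, y in zip(confidences, labels) if c >= threshold and not y)
        # False negatives: positives below the threshold.
        fn = sum(1 for c, y in zip(confidences, labels) if c < threshold and y)
        cost = cost_false_positive * fp + cost_false_negative * fn
        # Strict comparison: on ties the lowest threshold wins.
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical toy data: confidence for the positive class and the true label.
toy_confidences = [0.1, 0.4, 0.35, 0.8]
toy_labels = [False, False, True, True]
print(best_cost_threshold(toy_confidences, toy_labels, 1.0, 1.0))
print(best_cost_threshold(toy_confidences, toy_labels, 10.0, 1.0))
```

Raising the cost of a false positive pushes the chosen threshold upwards in this sketch, because predicting "positive" becomes riskier. Whether "Find threshold" behaves the same way has to be checked against its source code.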
Best Answer
JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
Hi Johnny,
You should be able to track it down on GitHub: RapidMiner GitHub
Try here: Find threshold & ROC helper class
Answers
Great, thanks. Let me take a look!
I tried to understand the code in the method "public ROCData createROCData", but I don't quite understand which method it uses to determine the best threshold. Is there a paper that it is based on?
The code is in:
"https://github.com/rapidminer/rapidminer-studio/blob/85d3bee36c026a70580075092ed85ac517369e8e/src/main/java/com/rapidminer/tools/math/ROCDataGenerator.java"
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve 2. Data + new features" width="90" x="112" y="34">
<parameter key="repository_entry" value="../data/2. Data + new features"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="166" name="Validation" width="90" x="447" y="34">
<parameter key="split_on_batch_attribute" value="false"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_folds" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="45" y="34">
<parameter key="family" value="AUTO"/>
<parameter key="link" value="family_default"/>
<parameter key="solver" value="AUTO"/>
<parameter key="reproducible" value="false"/>
<parameter key="maximum_number_of_threads" value="4"/>
<parameter key="use_regularization" value="true"/>
<parameter key="lambda_search" value="false"/>
<parameter key="number_of_lambdas" value="0"/>
<parameter key="lambda_min_ratio" value="0.0"/>
<parameter key="early_stopping" value="true"/>
<parameter key="stopping_rounds" value="3"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="standardize" value="true"/>
<parameter key="non-negative_coefficients" value="false"/>
<parameter key="add_intercept" value="true"/>
<parameter key="compute_p-values" value="false"/>
<parameter key="remove_collinear_columns" value="false"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_iterations" value="0"/>
<parameter key="specify_beta_constraints" value="false"/>
<list key="beta_constraints"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
</operator>
<operator activated="false" class="h2o:gradient_boosted_trees" compatibility="9.2.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="246" y="442">
<parameter key="number_of_trees" value="100"/>
<parameter key="reproducible" value="false"/>
<parameter key="maximum_number_of_threads" value="4"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="min_rows" value="10.0"/>
<parameter key="min_split_improvement" value="0.0"/>
<parameter key="number_of_bins" value="20"/>
<parameter key="learning_rate" value="0.01"/>
<parameter key="sample_rate" value="1.0"/>
<parameter key="distribution" value="AUTO"/>
<parameter key="early_stopping" value="false"/>
<parameter key="stopping_rounds" value="1"/>
<parameter key="stopping_metric" value="AUTO"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
</operator>
<operator activated="false" class="h2o:deep_learning" compatibility="9.2.000" expanded="true" height="82" name="Deep Learning" width="90" x="380" y="442">
<parameter key="activation" value="Rectifier"/>
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="50"/>
<parameter key="hidden_layer_sizes" value="50"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<parameter key="reproducible_(uses_1_thread)" value="false"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="epochs" value="10.0"/>
<parameter key="compute_variable_importances" value="false"/>
<parameter key="train_samples_per_iteration" value="-2"/>
<parameter key="adaptive_rate" value="true"/>
<parameter key="epsilon" value="1.0E-8"/>
<parameter key="rho" value="0.99"/>
<parameter key="learning_rate" value="0.005"/>
<parameter key="learning_rate_annealing" value="1.0E-6"/>
<parameter key="learning_rate_decay" value="1.0"/>
<parameter key="momentum_start" value="0.0"/>
<parameter key="momentum_ramp" value="1000000.0"/>
<parameter key="momentum_stable" value="0.0"/>
<parameter key="nesterov_accelerated_gradient" value="true"/>
<parameter key="standardize" value="true"/>
<parameter key="L1" value="1.0E-5"/>
<parameter key="L2" value="0.0"/>
<parameter key="max_w2" value="10.0"/>
<parameter key="loss_function" value="Automatic"/>
<parameter key="distribution_function" value="AUTO"/>
<parameter key="early_stopping" value="false"/>
<parameter key="stopping_rounds" value="1"/>
<parameter key="stopping_metric" value="AUTO"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="34"/>
<operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="313" y="136">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="find_threshold" compatibility="9.2.001" expanded="true" height="82" name="Find Threshold" width="90" x="447" y="136">
<parameter key="define_labels" value="false"/>
<parameter key="misclassification_costs_first" value="25.0"/>
<parameter key="misclassification_costs_second" value="10.0"/>
<parameter key="show_roc_plot" value="false"/>
<parameter key="use_example_weights" value="true"/>
<parameter key="roc_bias" value="optimistic"/>
</operator>
<connect from_port="training set" to_op="Generalized Linear Model" to_port="training set"/>
<connect from_op="Generalized Linear Model" from_port="model" to_op="Multiply" to_port="input"/>
<connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Multiply" from_port="output 1" to_port="model"/>
<connect from_op="Multiply" from_port="output 2" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Find Threshold" to_port="example set"/>
<connect from_op="Find Threshold" from_port="threshold" to_port="through 1"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<portSpacing port="sink_through 2" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="apply_threshold" compatibility="9.2.001" expanded="true" height="82" name="Apply Threshold" width="90" x="179" y="34"/>
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply (2)" width="90" x="313" y="34"/>
<operator activated="true" class="performance_costs" compatibility="9.2.001" expanded="true" height="82" name="Performance" width="90" x="514" y="34">
<parameter key="keep_exampleSet" value="false"/>
<parameter key="cost_matrix" value="[0.0 10.0;25.0 0.0]"/>
<enumeration key="class_order_definition"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="9.2.001" expanded="true" height="82" name="Performance (2)" width="90" x="514" y="136">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="true"/>
<parameter key="kappa" value="false"/>
<parameter key="AUC (optimistic)" value="false"/>
<parameter key="AUC" value="true"/>
<parameter key="AUC (pessimistic)" value="false"/>
<parameter key="precision" value="true"/>
<parameter key="recall" value="true"/>
<parameter key="lift" value="false"/>
<parameter key="fallout" value="false"/>
<parameter key="f_measure" value="true"/>
<parameter key="false_positive" value="false"/>
<parameter key="false_negative" value="false"/>
<parameter key="true_positive" value="false"/>
<parameter key="true_negative" value="false"/>
<parameter key="sensitivity" value="true"/>
<parameter key="specificity" value="false"/>
<parameter key="youden" value="false"/>
<parameter key="positive_predictive_value" value="false"/>
<parameter key="negative_predictive_value" value="false"/>
<parameter key="psep" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_port="through 1" to_op="Apply Threshold" to_port="threshold"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Apply Threshold" to_port="example set"/>
<connect from_op="Apply Threshold" from_port="example set" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Performance" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 2"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="source_through 2" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<portSpacing port="sink_performance 3" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
</process>
<description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
</operator>
<connect from_op="Retrieve 2. Data + new features" from_port="output" to_op="Validation" to_port="example set"/>
<connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
<connect from_op="Validation" from_port="performance 2" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
I'm not able to reproduce what you observe with the "Titanic" dataset.
Could you share your data, specifying:
- what you observe.
- what you expect.
Thank you,
Regards,
Lionel
W.r.t. the toolbox one: noted. I planned to add a version with a subprocess where you can supply your own custom performance measure. But time is short, as usual.
BR,
Martin
Dortmund, Germany
<context>
<input/>
<output/>
<macros>
<macro>
<key>cost_first</key>
<value>25</value>
</macro>
<macro>
<key>cost_second</key>
<value>10</value>
</macro>
</macros>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve 2. Data + new features" width="90" x="112" y="34">
<parameter key="repository_entry" value="../data/2. Data + new features"/>
</operator>
<operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="313" y="34">
<parameter key="family" value="AUTO"/>
<parameter key="link" value="family_default"/>
<parameter key="solver" value="AUTO"/>
<parameter key="reproducible" value="false"/>
<parameter key="maximum_number_of_threads" value="4"/>
<parameter key="use_regularization" value="true"/>
<parameter key="lambda_search" value="false"/>
<parameter key="number_of_lambdas" value="0"/>
<parameter key="lambda_min_ratio" value="0.0"/>
<parameter key="early_stopping" value="true"/>
<parameter key="stopping_rounds" value="3"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="standardize" value="true"/>
<parameter key="non-negative_coefficients" value="false"/>
<parameter key="add_intercept" value="true"/>
<parameter key="compute_p-values" value="false"/>
<parameter key="remove_collinear_columns" value="false"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_iterations" value="0"/>
<parameter key="specify_beta_constraints" value="false"/>
<list key="beta_constraints"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
</operator>
<operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve XX. Test Data + new features" width="90" x="715" y="85">
<parameter key="repository_entry" value="../data/XX. Test Data + new features"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply (3)" width="90" x="514" y="34"/>
<operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="849" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="514" y="187">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="find_threshold" compatibility="9.2.001" expanded="true" height="82" name="Find Threshold (2)" width="90" x="715" y="187">
<parameter key="define_labels" value="false"/>
<parameter key="misclassification_costs_first" value="%{cost_first}"/>
<parameter key="misclassification_costs_second" value="%{cost_second}"/>
<parameter key="show_roc_plot" value="false"/>
<parameter key="use_example_weights" value="true"/>
<parameter key="roc_bias" value="optimistic"/>
<description align="center" color="transparent" colored="false" width="126">define costs here</description>
</operator>
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="124" name="Multiply (4)" width="90" x="916" y="131"/>
<operator activated="true" class="apply_threshold" compatibility="9.2.001" expanded="true" height="82" name="Apply Threshold (3)" width="90" x="1050" y="187"/>
<operator activated="true" class="apply_threshold" compatibility="9.2.001" expanded="true" height="82" name="Apply Threshold (2)" width="90" x="1050" y="34"/>
<operator activated="true" class="sort" compatibility="9.2.001" expanded="true" height="82" name="Sort" width="90" x="1184" y="34">
<parameter key="attribute_name" value="prediction(fraud)"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply (5)" width="90" x="1184" y="187"/>
<operator activated="true" class="performance_binominal_classification" compatibility="9.2.001" expanded="true" height="82" name="Performance (3)" width="90" x="1318" y="187">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="AUC (optimistic)" value="false"/>
<parameter key="AUC" value="false"/>
<parameter key="AUC (pessimistic)" value="false"/>
<parameter key="precision" value="false"/>
<parameter key="recall" value="false"/>
<parameter key="lift" value="false"/>
<parameter key="fallout" value="false"/>
<parameter key="f_measure" value="false"/>
<parameter key="false_positive" value="false"/>
<parameter key="false_negative" value="false"/>
<parameter key="true_positive" value="false"/>
<parameter key="true_negative" value="false"/>
<parameter key="sensitivity" value="false"/>
<parameter key="specificity" value="false"/>
<parameter key="youden" value="false"/>
<parameter key="positive_predictive_value" value="false"/>
<parameter key="negative_predictive_value" value="false"/>
<parameter key="psep" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
</operator>
<operator activated="true" class="performance_costs" compatibility="9.2.001" expanded="true" height="82" name="Performance (4)" width="90" x="1318" y="340">
<parameter key="keep_exampleSet" value="false"/>
<parameter key="cost_matrix" value="[0.0 10.0;25.0 0.0]"/>
<enumeration key="class_order_definition"/>
<description align="center" color="transparent" colored="false" width="126">evaluate here with original costs</description>
</operator>
<connect from_op="Retrieve 2. Data + new features" from_port="output" to_op="Generalized Linear Model (2)" to_port="training set"/>
<connect from_op="Generalized Linear Model (2)" from_port="model" to_op="Multiply (3)" to_port="input"/>
<connect from_op="Generalized Linear Model (2)" from_port="exampleSet" to_op="Apply Model (3)" to_port="unlabelled data"/>
<connect from_op="Retrieve XX. Test Data + new features" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Multiply (3)" from_port="output 1" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Multiply (3)" from_port="output 2" to_op="Apply Model (3)" to_port="model"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Threshold (2)" to_port="example set"/>
<connect from_op="Apply Model (3)" from_port="labelled data" to_op="Find Threshold (2)" to_port="example set"/>
<connect from_op="Find Threshold (2)" from_port="example set" to_op="Apply Threshold (3)" to_port="example set"/>
<connect from_op="Find Threshold (2)" from_port="threshold" to_op="Multiply (4)" to_port="input"/>
<connect from_op="Multiply (4)" from_port="output 1" to_op="Apply Threshold (2)" to_port="threshold"/>
<connect from_op="Multiply (4)" from_port="output 2" to_port="result 2"/>
<connect from_op="Multiply (4)" from_port="output 3" to_op="Apply Threshold (3)" to_port="threshold"/>
<connect from_op="Apply Threshold (3)" from_port="example set" to_op="Multiply (5)" to_port="input"/>
<connect from_op="Apply Threshold (2)" from_port="example set" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_port="result 1"/>
<connect from_op="Multiply (5)" from_port="output 1" to_op="Performance (3)" to_port="labelled data"/>
<connect from_op="Multiply (5)" from_port="output 2" to_op="Performance (4)" to_port="example set"/>
<connect from_op="Performance (3)" from_port="performance" to_port="result 3"/>
<connect from_op="Performance (4)" from_port="performance" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<description align="center" color="yellow" colored="false" height="108" resized="false" width="180" x="692" y="400">Try costs for miss classification first: 25 (original) and 250<br/><br/>Defined in macro cost_first</description>
</process>
</operator>
</process>
This process doesn't use cross-validation, but the cross-validated result is the same. (The unexpected behaviour could in principle be caused by applying a model to unseen data, so I am testing on the training set to isolate the bug.)
The problem is simple: I have misclassification costs of 25 (no fraud) and 10 (fraud), so it is actually more expensive to misclassify a loyal customer than a fraudulent one. I define these costs in the Find Threshold operator and then evaluate the results with Performance (Costs).
The problem is that I get better results when I use cost1 = 250 in Find Threshold instead of cost1 = 25. If you could explain why this happens, I would really appreciate it!
Kind regards,
Sebastian
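One thing that may be worth ruling out is the class order. The process above evaluates with the cost matrix [0.0 10.0;25.0 0.0] while leaving class_order_definition empty and define_labels set to false, so which class counts as "first" is decided implicitly. The sketch below only shows how much the class order can matter when a cost matrix is applied. It is not a diagnosis and not RapidMiner's implementation. The function name, the toy labels, and the convention that rows index the predicted class and columns the true class are assumptions for illustration.

```python
from typing import Sequence


def total_cost(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    classes: Sequence[str],
    cost_matrix: Sequence[Sequence[float]],
) -> float:
    """Sum cost_matrix[predicted][true] over all examples.

    Assumption: rows index the predicted class and columns the true class,
    both in the order given by `classes`.
    """
    index = {name: i for i, name in enumerate(classes)}
    return sum(
        cost_matrix[index[p]][index[t]]
        for t, p in zip(true_labels, predicted_labels)
    )


# Cost matrix values as written in the process; the labels are hypothetical toy data.
costs = [[0.0, 10.0], [25.0, 0.0]]
truth = ["fraud", "fraud", "no fraud", "no fraud"]
predicted = ["no fraud", "no fraud", "fraud", "no fraud"]

# Same predictions, same matrix, two different class orders.
print(total_cost(truth, predicted, ["fraud", "no fraud"], costs))
print(total_cost(truth, predicted, ["no fraud", "fraud"], costs))
```

If the two operators disagree about which class is first, the threshold is optimized for one cost assignment and scored with the other. Setting the class order explicitly in both operators would remove that ambiguity.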