Logistic regression threshold

bernardo_pagnon Member, University Professor Posts: 60 University Professor
edited November 2018 in Help

Hello all,

I am doing a simple logistic regression exercise (no SVM, just plain logistic regression) and I cannot understand how RapidMiner defines the threshold for classifying instances as "yes". Similar posts mention that it automatically chooses 0.5, but that is not the case here. I downloaded all the "yes" predictions and sorted them in ascending order: the threshold is 0.3108. Why?

 

I am using the "Default" data set from the ISLR library (https://cran.r-project.org/web/packages/ISLR/index.html).

 

Thanks in advance,

Bernardo

 

 

Best Answer

  • phellinger Employee, Member Posts: 103 RM Engineering
    Solution Accepted

    Hi Bernardo,

     

    Logistic Regression also uses 0.5 as threshold value starting from version 7.6, see https://docs.rapidminer.com/7.6/studio/releases/7.6/changes-7.6.0.html ("Logistic Regression and Generalized Linear Model learners now use 0.5 as the threshold as other binominal learners").
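    As an illustration (a minimal plain-Python sketch with made-up confidence values, not RapidMiner code), this is what a fixed threshold does to the confidence(yes) column:

```python
# Sketch of how a binominal learner maps confidences to class predictions.
# The confidence values below are made up for illustration.
def classify(confidences, threshold=0.5):
    """Predict "yes" when confidence(yes) reaches the threshold."""
    return ["yes" if c >= threshold else "no" for c in confidences]

conf_yes = [0.12, 0.31, 0.48, 0.52, 0.97]
print(classify(conf_yes))                  # fixed 0.5 threshold (7.6 and later)
# → ['no', 'no', 'no', 'yes', 'yes']
print(classify(conf_yes, threshold=0.31))  # a data-driven threshold (pre-7.6 style)
# → ['no', 'yes', 'yes', 'yes', 'yes']
```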

    The old behaviour is kept for backward compatibility reasons. You can easily change the operator's behaviour by increasing its compatibility level. (For whatever reason, it is set to 7.5.000 in your process.)

     

    logreg_threshold.png

     

    The reason for the old behaviour was that one can optimize for maximal F-measure by choosing a different threshold, but this can be confusing. That's why this alternative threshold is now only provided on a "threshold" output port, and 0.5 is used otherwise.
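    To see how a non-0.5 threshold such as 0.3108 can fall out of training, here is a small plain-Python sketch (made-up scores and labels, not RapidMiner's actual implementation) of choosing the cut-off that maximizes the F-measure on the training predictions:

```python
# Sketch: scan candidate thresholds and keep the F-measure maximizer.
# Scores and labels below are invented for illustration; 1 means "yes".
def f_measure(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    # Try each observed score as a cut-off; keep the one with maximal F-measure.
    return max(sorted(set(scores)), key=lambda t: f_measure(scores, labels, t))

scores = [0.05, 0.20, 0.31, 0.45, 0.60, 0.85]   # hypothetical confidences
labels = [0, 0, 1, 0, 1, 1]
print(best_threshold(scores, labels))  # → 0.31
```

    With a threshold chosen this way, the effective cut-off depends on the training data rather than being a fixed 0.5, which matches the 0.3108 you observed.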

     

    Best,

    Peter

     

Answers

  • bernardo_pagnon Member, University Professor Posts: 60 University Professor

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="8.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve DefaultFull" width="90" x="112" y="187">
            <parameter key="repository_entry" value="//Local Repository/data/DefaultFull"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
            <parameter key="attribute_name" value="default"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="124" name="Logistic Regression" width="90" x="380" y="34"/>
          <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="34">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Retrieve DefaultFull" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
          <connect from_op="Logistic Regression" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Logistic Regression" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @bernardo_pagnon

     

    The way you have built the process is wrong: you train a model on the whole data set and then apply the trained model to the same, already labelled data, which can produce unexpected output:

     

    Screenshot 2018-06-18 21.02.47.png

     

    In the simplest case you should split the data before training, so the model is trained on, say, 80% of the data and then applied to the other 20% of the examples:

     

    Screenshot 2018-06-18 21.06.50.png
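    The 80/20 idea can be sketched in plain Python (toy index data; a shuffle-then-cut holdout split, roughly what a Split Data operator does):

```python
import random

# Toy sketch of an 80/20 holdout split: shuffle, cut, train on one part,
# and evaluate on the part the model has never seen.
random.seed(0)

rows = list(range(100))        # stand-ins for the data set's examples
random.shuffle(rows)

cut = int(0.8 * len(rows))
train_rows, test_rows = rows[:cut], rows[cut:]

print(len(train_rows), len(test_rows))       # → 80 20
assert not set(train_rows) & set(test_rows)  # no leakage between partitions
```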

  • bernardo_pagnon Member, University Professor Posts: 60 University Professor

    Dear Peter,

     

    You nailed it! I would never have figured it out by myself. I updated the compatibility level to 8.2 and now 0.5 is the default threshold! Thank you so much!

     

    Best,

    Bernardo

  • bernardo_pagnon Member, University Professor Posts: 60 University Professor

    Dear Vladimir,

     

    Thank you for your reply. I agree that testing the model on the training data is not good practice, but it is not wrong. After splitting the data I obtained the same error, so that was not the cause.

     

    Best,

    Bernardo

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @bernardo_pagnon

     

    Okay, my guess about the regression thresholds was not correct, but I'm glad @phellinger has provided this nice solution :)

     

    Though I should still warn you about applying the model to the training set: it is technically possible, but it does not make much sense, because if you then measure performance you will end up with a perfectly overfit model, for example:

     

    Screenshot 2018-06-19 14.06.52.png

     

    Screenshot 2018-06-19 14.06.32.png
