"Logistic Regression predicts too high probabilities"

confuzio · February 2011

I am currently predicting creditor defaults using, among other methods, Kernel Logistic Regression (KLR). Here's my problem: KLR predicts probabilities ("confidences") that are quite a bit higher than those given by a fairly accurate Generalized Additive Model, which leads to much worse average performance (% deviance explained).
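For readers unfamiliar with the metric, here is a minimal sketch of "% deviance explained" for a binary label, assuming the usual definition: one minus the ratio of model deviance to the deviance of a null model that always predicts the base rate. The function names and the four example rows are made up for illustration and are not taken from the data in this thread.

```python
import math

def binomial_deviance(y, p, eps=1e-12):
    """-2 * log-likelihood of 0/1 labels y under predicted probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -2.0 * total

def deviance_explained(y, p):
    """1 - D_model / D_null, where the null model predicts the base rate."""
    base_rate = sum(y) / len(y)
    null_deviance = binomial_deviance(y, [base_rate] * len(y))
    return 1.0 - binomial_deviance(y, p) / null_deviance

# Hypothetical toy data: one default among four creditors.
y = [0, 0, 0, 1]
p_calibrated = [0.10, 0.10, 0.20, 0.70]
p_inflated = [0.23, 0.23, 0.40, 0.86]  # same ranking, uniformly higher

print(deviance_explained(y, p_calibrated))
print(deviance_explained(y, p_inflated))
```

Both probability vectors rank the creditors identically, so a ranking measure such as AUC would not separate them, but the inflated one scores lower on deviance explained because the non-defaulters are penalised more heavily.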

Comparing the logits of the predicted probabilities, it looks as if just a constant might be missing, but I'm not sure whether that is where the problem comes from.
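One way to test the "missing constant" idea is to look at the per-example difference of the logits from the two models: if the mean difference is clearly non-zero while its spread is close to zero, the two sets of probabilities differ by roughly a constant on the logit scale. The sketch below assumes both models were scored on the same rows; the helper names and the numbers are hypothetical, with the "KLR" values constructed by shifting the "GAM" values by 0.8 purely for demonstration.

```python
import math
import statistics

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit_offset(p_a, p_b):
    """Mean and population std. dev. of logit(p_a) - logit(p_b), row by row."""
    diffs = [logit(a) - logit(b) for a, b in zip(p_a, p_b)]
    return statistics.mean(diffs), statistics.pstdev(diffs)

def shift_probabilities(p, offset):
    """Subtract a constant from the logits and map back to probabilities."""
    return [sigmoid(logit(pi) - offset) for pi in p]

# Hypothetical predictions for the same four rows.
p_gam = [0.05, 0.10, 0.30, 0.60]
p_klr = [sigmoid(logit(pi) + 0.8) for pi in p_gam]  # constructed constant shift

offset, spread = logit_offset(p_klr, p_gam)
p_klr_corrected = shift_probabilities(p_klr, offset)
print(offset, spread)
print(p_klr_corrected)
```

On real predictions the spread will not be exactly zero; a spread that is large relative to the offset would suggest the two models disagree in more than just an intercept.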

Below is my simple (not) working example. I'd really appreciate your help!


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
    <process expanded="true" height="415" width="882">
      <operator activated="true" class="read_csv" compatibility="5.1.001" expanded="true" height="60" name="Read CSV" width="90" x="45" y="120">
        <parameter key="csv_file" value="D:\DiplomModellierung\workspace\VSP35.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value=".false.integer.attribute"/>
          <parameter key="1" value="default.true.binominal.label"/>
          <parameter key="2" value="score.true.real.attribute"/>
          <parameter key="3" value="d\.score.true.real.attribute"/>
          <parameter key="4" value="d2\.score.true.real.attribute"/>
          <parameter key="5" value="NoND.true.real.attribute"/>
          <parameter key="6" value="WVP.true.real.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="logistic_regression" compatibility="5.1.001" expanded="true" height="94" name="Logistic Regression" width="90" x="246" y="120">
        <parameter key="kernel_gamma" value="0.52"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="5.1.001" expanded="true" height="60" name="Read CSV (2)" width="90" x="246" y="255">
        <parameter key="csv_file" value="D:\DiplomModellierung\workspace\TSP35.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value=".false.integer.attribute"/>
          <parameter key="1" value="default.true.binominal.label"/>
          <parameter key="2" value="score.true.real.attribute"/>
          <parameter key="3" value="d\.score.true.real.attribute"/>
          <parameter key="4" value="d2\.score.true.real.attribute"/>
          <parameter key="5" value="NoND.true.real.attribute"/>
          <parameter key="6" value="WVP.true.real.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.1.001" expanded="true" height="76" name="Apply Model" width="90" x="380" y="165">
        <list key="application_parameters"/>
        <parameter key="create_view" value="true"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="5.1.001" expanded="true" height="76" name="Performance" width="90" x="514" y="165">
        <parameter key="accuracy" value="false"/>
        <parameter key="cross-entropy" value="true"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="5.1.001" expanded="true" height="60" name="Write CSV" width="90" x="648" y="210">
        <parameter key="csv_file" value="C:\Users\Richard\Desktop\test.csv"/>
      </operator>
      <operator activated="true" class="log" compatibility="5.1.001" expanded="true" height="76" name="Log" width="90" x="648" y="75">
        <parameter key="filename" value="C:\Users\Richard\Desktop\loganwendung40.log"/>
        <list key="log">
          <parameter key="crossentropy" value="operator.Performance.value.cross-entropy"/>
        </list>
        <parameter key="persistent" value="true"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Logistic Regression" to_port="training set"/>
      <connect from_op="Logistic Regression" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Read CSV (2)" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
      <connect from_op="Performance" from_port="example set" to_op="Write CSV" to_port="input"/>
      <connect from_op="Log" from_port="through 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

land · February 2011

Hi,
First of all: if you want us to be able to draw conclusions from your process, you will have to provide one that does not depend on .csv files stored on your local hard disk.

If a Kernel Logistic Regression model performs worse than another modeling technique, this is not necessarily a bug. Maybe it just doesn't work well on your data?
If you have reasons to believe otherwise, please explain them in more detail. You can be assured that we will listen carefully.

Greetings,
Sebastian
