AUPRC with imbalanced classes

kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
edited December 2018 in Help

Hi, it seems I am not getting the expected results when using Performance (AUPRC) with a highly imbalanced dataset.

 

The relationship between recall and precision of the positive class seems pretty intuitive, but I still get AUPRC = 0.010 no matter what:

 

[Screenshots: Screenshot 2018-04-25 23.28.32.png, Screenshot 2018-04-25 23.28.14.png]

I am using the imbalanced credit card fraud dataset here.

 

At the same time, when I artificially balance the data, AUPRC shows the expected 'normal' values:

 

[Screenshots: Screenshot 2018-04-25 23.35.06.png, Screenshot 2018-04-25 23.34.59.png]

Process attached:

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve creditcard" width="90" x="45" y="34">
<parameter key="repository_entry" value="../data/creditcard"/>
</operator>
<operator activated="true" class="sample" compatibility="8.1.003" expanded="true" height="82" name="equalize classes" width="90" x="179" y="34">
<parameter key="balance_data" value="true"/>
<list key="sample_size_per_class">
<parameter key="1" value="492"/>
<parameter key="0" value="492"/>
</list>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
</operator>
<operator activated="false" class="sample_stratified" compatibility="8.1.003" expanded="true" height="82" name="sample 50k" width="90" x="45" y="340">
<parameter key="sample_size" value="50000"/>
</operator>
<operator activated="false" class="create_threshold" compatibility="8.1.003" expanded="true" height="68" name="Create Threshold" width="90" x="581" y="391">
<parameter key="threshold" value="0.09"/>
<parameter key="first_class" value="0"/>
<parameter key="second_class" value="1"/>
</operator>
<operator activated="true" class="split_data" compatibility="8.1.003" expanded="true" height="103" name="Split Data" width="90" x="246" y="136">
<enumeration key="partitions">
<parameter key="ratio" value="0.8"/>
<parameter key="ratio" value="0.2"/>
</enumeration>
<parameter key="sampling_type" value="stratified sampling"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.1.003" expanded="true" height="145" name="Validation" width="90" x="380" y="34">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true">
<operator activated="false" class="concurrency:parallel_decision_tree" compatibility="8.1.003" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="136">
<parameter key="apply_pruning" value="false"/>
<parameter key="apply_prepruning" value="false"/>
</operator>
<operator activated="true" class="h2o:generalized_linear_model" compatibility="7.2.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="246" y="34">
<list key="beta_constraints"/>
<list key="expert_parameters"/>
</operator>
<operator activated="false" class="h2o:deep_learning" compatibility="7.6.001" expanded="true" height="82" name="Deep Learning" width="90" x="380" y="136">
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="50"/>
<parameter key="hidden_layer_sizes" value="50"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<operator activated="false" class="stacking" compatibility="8.1.003" expanded="true" height="68" name="Stacking" width="90" x="179" y="289">
<process expanded="true">
<operator activated="true" class="h2o:generalized_linear_model" compatibility="7.2.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="179" y="187">
<list key="beta_constraints"/>
<list key="expert_parameters"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="8.1.003" expanded="true" height="103" name="Decision Tree (2)" width="90" x="112" y="34">
<parameter key="apply_pruning" value="false"/>
<parameter key="apply_prepruning" value="false"/>
</operator>
<operator activated="true" class="h2o:deep_learning" compatibility="7.6.001" expanded="true" height="82" name="Deep Learning (2)" width="90" x="112" y="340">
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="20"/>
<parameter key="hidden_layer_sizes" value="20"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<connect from_port="training set 1" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_port="training set 2" to_op="Generalized Linear Model (2)" to_port="training set"/>
<connect from_port="training set 3" to_op="Deep Learning (2)" to_port="training set"/>
<connect from_op="Generalized Linear Model (2)" from_port="model" to_port="base model 2"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="base model 1"/>
<connect from_op="Deep Learning (2)" from_port="model" to_port="base model 3"/>
<portSpacing port="source_training set 1" spacing="0"/>
<portSpacing port="source_training set 2" spacing="0"/>
<portSpacing port="source_training set 3" spacing="0"/>
<portSpacing port="source_training set 4" spacing="0"/>
<portSpacing port="sink_base model 1" spacing="0"/>
<portSpacing port="sink_base model 2" spacing="0"/>
<portSpacing port="sink_base model 3" spacing="0"/>
<portSpacing port="sink_base model 4" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="h2o:generalized_linear_model" compatibility="7.6.001" expanded="true" height="124" name="Generalized Linear Model (3)" width="90" x="45" y="34">
<list key="beta_constraints"/>
<list key="expert_parameters"/>
</operator>
<connect from_port="stacking examples" to_op="Generalized Linear Model (3)" to_port="training set"/>
<connect from_op="Generalized Linear Model (3)" from_port="model" to_port="stacking model"/>
<portSpacing port="source_stacking examples" spacing="0"/>
<portSpacing port="sink_stacking model" spacing="0"/>
</process>
</operator>
<connect from_port="training set" to_op="Generalized Linear Model" to_port="training set"/>
<connect from_op="Generalized Linear Model" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="apply on train" width="90" x="45" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="operator_toolbox:performance_auprc" compatibility="1.0.000" expanded="true" height="82" name="perf train" width="90" x="246" y="34">
<parameter key="main_criterion" value="AUPRC"/>
<parameter key="AUC" value="true"/>
<parameter key="AUPRC" value="true"/>
</operator>
<connect from_port="model" to_op="apply on train" to_port="model"/>
<connect from_port="test set" to_op="apply on train" to_port="unlabelled data"/>
<connect from_op="apply on train" from_port="labelled data" to_op="perf train" to_port="labelled data"/>
<connect from_op="perf train" from_port="performance" to_port="performance 1"/>
<connect from_op="perf train" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="apply on test" width="90" x="581" y="136">
<list key="application_parameters"/>
</operator>
<operator activated="false" class="select_recall" compatibility="8.1.003" expanded="true" height="82" name="Select Recall" width="90" x="581" y="289">
<parameter key="min_recall" value="0.8"/>
<parameter key="positive_label" value="1"/>
</operator>
<operator activated="false" class="apply_threshold" compatibility="8.1.003" expanded="true" height="82" name="Apply Threshold" width="90" x="715" y="289"/>
<operator activated="true" class="performance" compatibility="8.1.003" expanded="true" height="82" name="perf test" width="90" x="715" y="136"/>
<operator activated="true" class="operator_toolbox:performance_auprc" compatibility="1.0.000" expanded="true" height="82" name="perf test (2)" width="90" x="849" y="136">
<parameter key="main_criterion" value="AUPRC"/>
<parameter key="accuracy" value="false"/>
<parameter key="AUPRC" value="true"/>
</operator>
<connect from_op="Retrieve creditcard" from_port="output" to_op="equalize classes" to_port="example set input"/>
<connect from_op="equalize classes" from_port="example set output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Validation" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="apply on test" to_port="unlabelled data"/>
<connect from_op="Validation" from_port="model" to_op="apply on test" to_port="model"/>
<connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
<connect from_op="apply on test" from_port="labelled data" to_op="perf test" to_port="labelled data"/>
<connect from_op="Select Recall" from_port="example set" to_op="Apply Threshold" to_port="example set"/>
<connect from_op="Select Recall" from_port="threshold" to_op="Apply Threshold" to_port="threshold"/>
<connect from_op="perf test" from_port="performance" to_op="perf test (2)" to_port="performance"/>
<connect from_op="perf test" from_port="example set" to_op="perf test (2)" to_port="labelled data"/>
<connect from_op="perf test (2)" from_port="performance" to_port="result 2"/>
<connect from_op="perf test (2)" from_port="example set" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>

 

 

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi @kypexin,

    Isn't that exactly what you would expect? AUPRC is NOT independent of class balance. If you add more and more examples of one class, the precision for the other class goes down. The curve therefore becomes flatter and the integral smaller, so 0.5 is no longer the lower bound.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
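
To make the "0.5 is no longer the floor" point concrete: the expected precision of a purely random ranking equals the fraction of positive examples at every recall level, so the chance-level AUPRC is roughly that fraction. A minimal Python sketch follows; the function name and the class counts are made up for illustration and are not taken from the operator or from the data in this thread.

```python
def chance_level_precision(n_pos: int, n_neg: int) -> float:
    """Expected precision of a classifier that ranks examples at random.

    A random ranking picks positives and negatives in proportion to how
    often they occur, so its precision equals the share of positives.
    """
    return n_pos / (n_pos + n_neg)


# Hypothetical class counts, chosen only to illustrate the effect.
for n_pos, n_neg in [(500, 500), (500, 5_000), (500, 50_000)]:
    baseline = chance_level_precision(n_pos, n_neg)
    print(f"{n_pos}:{n_neg} -> chance-level precision {baseline:.4f}")
```

With balanced classes the baseline is 0.5, but it shrinks towards 0 as negatives are added, so an AUPRC value has to be judged against the positive fraction of the data it was computed on.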
  • kypexin

    Hi @mschmitz

     

    Honestly, no, I expected exactly the opposite.

    If we assume that the curve shows precision against the recall of the same positive class (in our case '1'), then varying recall of positive class gives the following:

     

    Low recall, high precision (6/100)

     

    Screenshot 2018-04-26 09.38.56.png

     

    High recall, low precision (93/6)

     

    Screenshot 2018-04-26 09.39.54.png

     

    Around optimum (80/80)

     

    Screenshot 2018-04-26 09.41.25.png

     

    Or am I interpreting AUPRC completely wrong? :) (I have never used it in practice before.)

     

     

     

     

  • kypexin

    PS @mschmitz, to give you more intuition, this is the PR curve I am getting on my data (at least what I understand to be that curve):

     

    Screenshot 2018-04-26 10.35.55.png

  • MartinLiebig

    Hi @kypexin,

    What happens if you switch the class balance? It should go down, right?

     

    Best,

    Martin

  • kypexin

    Not sure if I got you right, @mschmitz

    If I just remap classes, I will get AUPRC = 0.999 and also this (obviously for majority class it will be really close to 1):

     

    Screenshot 2018-04-26 10.47.59.png

    Screenshot 2018-04-26 10.49.17.png

    However, this still does not give me an intuition for why AUPRC = 0.010 in the first case, when by my logic it should not be.

     

  • MartinLiebig

    Hey @kypexin,

    Here is how I see this. If you have a different class balance, you transform the space. Essentially, recall for your positive class stays the same, but the precision at a given recall point changes. It may look like this:

    [Image: 2018-04-26 11.14.35.jpg. Upper: normal PR curve; lower: PR curve with a different class ratio]

    If you have a look at the math, you can see Precision as a function of recall like this:

     

    2018-04-26 11.09.38.jpg

     

    Adding more negative examples will lead to more FP (false positives) and thus lower precision. So AUPRC naturally drops as the class balance changes (if the classifier does not counter this).
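
The formula image is not recoverable here, but a standard way to write precision as a function of recall follows directly from the definitions. This is a sketch in my own notation, not necessarily the form shown in the image: P and N are the numbers of positive and negative examples, r is recall (TP / P) and f is the false positive rate (FP / N) at the same threshold.

```latex
\[
\text{precision}
  = \frac{TP}{TP + FP}
  = \frac{r\,P}{r\,P + f\,N}
  = \frac{r}{r + f\,\dfrac{N}{P}}
\]
```

If the score distributions within each class stay the same, r and f at a given threshold do not change, so a larger N/P can only make the denominator bigger and the precision smaller.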

  • kypexin

    Hey @mschmitz

     

    I totally agree that adding more negative examples lowers precision, so AUPRC naturally drops as the class balance changes. But at the same time, I observe that the influence of class imbalance on AUPRC is really lower than we would expect.

     

    I ran tests on datasets with different imbalance ratios: 1:1, 1:10, 1:100 and 1:500. Below are the PR curves for those cases. As we can see, AUPRC drops as the imbalance increases, but not by much.

     

    [Image: Screenshot 2018-04-26 15.01.17.png, class ratio 1:1]
    [Image: Screenshot 2018-04-26 15.01.50.png, class ratio 1:10]
    [Image: Screenshot 2018-04-26 15.02.24.png, class ratio 1:100]
    [Image: Screenshot 2018-04-26 15.03.12.png, class ratio 1:500]

     

    So the question is why the operator itself reports AUPRC values that do not match these plots, unless of course I am making some serious mistake.

     

    I attach the process I used to estimate these curves, plus my labelled test dataset, from which different ratios can be sampled.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve scored data" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Local Repository/kaggle - fraud/data/scored data 500 - 250000"/>
    </operator>
    <operator activated="true" class="concurrency:loop_parameters" compatibility="8.1.003" expanded="true" height="103" name="Loop Parameters" width="90" x="313" y="34">
    <list key="parameters">
    <parameter key="Select Recall.min_recall" value="[0.0;1.0;100;linear]"/>
    </list>
    <parameter key="log_all_criteria" value="true"/>
    <process expanded="true">
    <operator activated="true" class="select_recall" compatibility="8.1.003" expanded="true" height="82" name="Select Recall" width="90" x="45" y="34">
    <parameter key="min_recall" value="0.8"/>
    <parameter key="positive_label" value="1"/>
    </operator>
    <operator activated="true" class="apply_threshold" compatibility="8.1.003" expanded="true" height="82" name="Apply Threshold" width="90" x="179" y="34"/>
    <operator activated="true" class="performance" compatibility="8.1.003" expanded="true" height="82" name="perf test" width="90" x="313" y="34"/>
    <operator activated="true" class="operator_toolbox:performance_auprc" compatibility="1.0.000" expanded="true" height="82" name="perf test (2)" width="90" x="447" y="34">
    <parameter key="main_criterion" value="AUPRC"/>
    <parameter key="accuracy" value="false"/>
    <parameter key="AUPRC" value="true"/>
    </operator>
    <operator activated="true" class="performance_to_data" compatibility="8.1.003" expanded="true" height="82" name="Performance to Data" width="90" x="581" y="34"/>
    <connect from_port="input 1" to_op="Select Recall" to_port="example set"/>
    <connect from_op="Select Recall" from_port="example set" to_op="Apply Threshold" to_port="example set"/>
    <connect from_op="Select Recall" from_port="threshold" to_op="Apply Threshold" to_port="threshold"/>
    <connect from_op="Apply Threshold" from_port="example set" to_op="perf test" to_port="labelled data"/>
    <connect from_op="perf test" from_port="performance" to_op="perf test (2)" to_port="performance"/>
    <connect from_op="perf test" from_port="example set" to_op="perf test (2)" to_port="labelled data"/>
    <connect from_op="perf test (2)" from_port="performance" to_op="Performance to Data" to_port="performance vector"/>
    <connect from_op="Performance to Data" from_port="example set" to_port="output 1"/>
    <connect from_op="Performance to Data" from_port="performance vector" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve scored data" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
    <connect from_op="Loop Parameters" from_port="output 1" to_port="result 1"/>
    <connect from_op="Loop Parameters" from_port="output 2" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
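
For the class ratios tested above (1:1, 1:10, 1:100 and 1:500), a rough feel for how much precision should move comes from precision = TP / (TP + FP), written per positive example. The Python sketch below is only an illustration: the operating point (80% recall at a 1% false positive rate) is hypothetical, the function name is invented, and it assumes the classifier's scores within each class do not change when more negatives are sampled.

```python
def precision_at(recall: float, fpr: float, neg_per_pos: float) -> float:
    """Precision implied by recall, false positive rate and the N:P ratio."""
    true_pos = recall              # true positives per positive example
    false_pos = fpr * neg_per_pos  # false positives per positive example
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: 80% recall at a 1% false positive rate.
RECALL, FPR = 0.80, 0.01

for ratio in (1, 10, 100, 500):
    p = precision_at(RECALL, FPR, ratio)
    print(f"1:{ratio:<3} precision = {p:.3f}")
```

At this made-up operating point precision falls from about 0.99 at 1:1 to about 0.89, 0.44 and 0.14 at 1:10, 1:100 and 1:500. How far the whole curve drops depends on how small the false positive rate is over the recall range, which is why the effect can look mild for a very clean classifier.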

     

  • kypexin

    Hey @mschmitz - could you please elaborate regarding my latest plots / messages in this thread? 

    This issue still does not seem clear to me.

  • MartinLiebig

    @kypexin,

    I've done some tests. Attached is my project on your data. For me the AUPRC drops heavily, as expected.

     

    Negatives   AUPRC
    100         0.6869897886710802
    200         0.540999353555299
    300         0.453043673642775
    400         0.39372295554318193
    500         0.3493142261965152

    The left column is the number of negative examples and the right one is the AUPRC. There is also a way to visualize the AUPRC exactly as the operator computes it. I think one good question is how to handle missings in the integral. Since I copied most of the code from AUC, the handling is the same.

     

    BR,

    Martin

  • kypexin

    @mschmitz -- please look.

     

    If we take each sample size separately (I did it for the value of 100 for example) and then visualize precision against recall, we can get two meaningful (to my understanding) charts:

     

    [Image: Screenshot 2018-06-18 14.03.59.png. Precision vs. recall, as series]

    Here we see that as recall goes from 0 to 1, precision slowly declines from 1 to 0.5. Correct?

     

    Screenshot 2018-06-18 14.04.36.png

    In a scatter plot, we basically see the same, just from a different perspective. 

     

    Now, my question is: can you please point out which part of this plot exactly counts as the area under the curve? If we connect all the points, we basically get a precision-recall curve, right? So what is the area under it?

     

    PS same plots for sample size = 500

     

    [Screenshots: Screenshot 2018-06-18 14.14.28.png, Screenshot 2018-06-18 14.14.15.png]

     

    Sorry, my brain has started to exhaust smokes already :))

     

  • MartinLiebig

    Hey @kypexin,

    To be honest, I've only adapted our AUC performance measure and copied all of the code :) I only changed TPR/FPR to precision/recall, so the Java code is fairly similar to the one for AUC.

     

    #1 Generate these points

    These are the same as for AUC. That's why we can use Extract ROC Curve.

     

    #2

    For each point in rocData:

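    // x-coordinate: false positives divided by all negatives
    double fpDivN = point.getFalsePositives() / rocData.getTotalNegatives();
    // y-coordinate: true positives divided by all predicted positives
    double tpDivP = point.getTruePositives() / (point.getTruePositives() + point.getFalsePositives());
    // 0/0 gives NaN when nothing is predicted positive yet; treat it as 0
    if (Double.isNaN(tpDivP)) {
        tpDivP = 0;
    }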

    This is Recall and Precision.  Then we do the "summation"

    // trapezoid between the previous point (last) and the current point
    double width = fpDivN - last[0];
    double leftHeight = last[1];
    double rightHeight = tpDivP;
    Double aux = leftHeight * width + (rightHeight - leftHeight) * width / 2;
    if (!aux.isNaN()) {
        aucSum += aux;
    }

     and store the last value:

    last = new double[] { fpDivN, tpDivP };

     That makes a lot of sense to me..?

     

    Cheers,

    Martin
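
For comparison, here is how the same trapezoid summation might look in Python when the x-axis is the recall of the positive class (TP divided by all positives) and the y-axis is its precision. This is only a sketch, not the operator's code: the function names, the handling of tied scores and the choice to start the curve at recall 0 with the first point's precision are my own assumptions, and it expects 0/1 labels with at least one positive.

```python
from typing import List, Sequence, Tuple


def pr_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs, one per distinct score threshold."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all examples tied at this score are counted.
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def trapezoid_area(points: List[Tuple[float, float]]) -> float:
    """Sum trapezoids between consecutive points, as in the snippet above."""
    area = 0.0
    # Assumption: start the curve at recall 0 with the first point's precision.
    last_x, last_y = 0.0, points[0][1]
    for x, y in points:
        width = x - last_x
        area += last_y * width + (y - last_y) * width / 2
        last_x, last_y = x, y
    return area


# Toy data, invented for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(trapezoid_area(pr_points(labels, scores)))
```

On this toy data the area comes out at 65/72, about 0.903. Points that share the same recall contribute zero width, so only the segments where recall actually increases add to the sum.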

  • kypexin

    Well @mschmitz, in the case of a ROC curve it is clear what the area under it is. Looking at the visualizations I made for the PRC, it is not really clear, because I literally cannot see where and why AUPRC = 0.35 for sample size 500, and this is the problem here :) A curve with an area under it lower than 0.5 would hang below the diagonal line, wouldn't it?

  • MartinLiebig

    Hi,

    There was at least one bug. For some crazy reason the recall calculation was for the negative class, while the precision was for the positive class. It's fixed now.

     

    Do you know a good way to check if it's working as expected?

     

    BR,

    Martin
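
One possible check is to feed the operator tiny hand-made data sets whose answer is known in advance and compare the result with an independent calculation. The Python sketch below uses average precision (the mean of the precision measured at each positive's rank), a common step-wise estimate of the area under the PR curve, so it may differ slightly from a trapezoid-based value on real data. The function name and the toy data are invented, and the sketch assumes distinct scores and at least one positive.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Case 1: every positive outranks every negative, so the area should be 1.
perfect = average_precision([1, 1, 0, 0, 0], [0.9, 0.8, 0.3, 0.2, 0.1])

# Case 2: every negative outranks every positive, so the area should be small.
worst = average_precision([1, 1, 0, 0, 0], [0.1, 0.2, 0.7, 0.8, 0.9])

print(perfect, worst)
```

The first case gives exactly 1.0 and the second gives (1/4 + 2/5) / 2 = 0.325. If the operator reports 1.0 for a perfectly separated set and a clearly lower value for the reversed ranking, recall and precision are at least being taken from the same class.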

  • kypexin

    Did you update the operator itself? I could test it as soon as it is available.

    But still, another really important thing to consider in the future is a visualization of the curve, because, as we saw, the number by itself often does not give much intuition.

  • MartinLiebig

    The operator is updated and will be released in the next release of the toolbox. I've taken the class' recall..

     

    ~Martin

  • kypexin

    Thanks Martin! :) truly appreciate your help.

  • RNarayan Member Posts: 4 Contributor I
    Hello @mschmitz

    Can you please advise where I can get hold of the said operator with PRC curve visualisation?

    Thanks
    Narayan
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You need to install the Operator Toolbox (a free extension); it contains an operator called Performance (AUPRC).
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts