Performance Measures for Imbalanced Data

ozgeozyazar Member Posts: 21  Maven
edited June 12 in Help

Hi All !

My question is not directly about the program, but I know there are many valuable data miners in this community, and I believe I can get the correct answer here. I am doing decision tree classification and measuring both classification and binomial performance using different parameter combinations. I need to select one of the well-performing models to create a decision tree for detecting disease risk factors. I have read an article that says, "Any performance metric that uses values from both columns will be inherently sensitive to class skews". To me, this means that if I have imbalanced data, I should not use those metrics. Could you please confirm my understanding?



Answers

  • varunm1 Member Posts: 506   Unicorn
    edited May 23
    Hello @ozgeozyazar

    Actually, it is not that you shouldn't use them, but these measures change when there is a class imbalance and can be misleading. Accuracy is one example.
    Regards,
    Varun
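To illustrate why accuracy can mislead on skewed classes, here is a minimal sketch in plain Python. The data is hypothetical (95 healthy cases, 5 diseased cases), and the "model" is a stand-in that always predicts the majority class. It is not taken from the original poster's process.

```python
# Hypothetical labels: 95 healthy (0) and 5 diseased (1) cases.
y_true = [0] * 95 + [1] * 5
# A trivial stand-in "model" that always predicts the majority class.
y_pred = [0] * 100

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Recall on the positive class; guarded in case there are no positives.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy looks high here even though the model never finds a single diseased case, which is the kind of distortion described above.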
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,049  RM Data Scientist
    @varunm1 I would highly recommend having a look at AUPRC.
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • varunm1 Member Posts: 506   Unicorn
    edited May 23
    Thanks, @mschmitz. I was not aware that we have AUPRC. Generally, I look at the trade-off between AUC and kappa, but this is good to know.
    Regards,
    Varun
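For readers unfamiliar with kappa, the following is a minimal sketch of Cohen's kappa computed from a binary confusion matrix. The function name `cohen_kappa` and both sets of counts are hypothetical, chosen only for illustration. Kappa compares the observed agreement with the agreement expected by chance given the class proportions, so a majority-class predictor scores zero despite its high accuracy.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a binary confusion matrix (illustrative helper)."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement: product of the marginals, summed over both classes.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present everywhere
    return (observed - expected) / (1.0 - expected)

# Hypothetical majority-class predictor on 95 negatives and 5 positives.
kappa_majority = cohen_kappa(tp=0, tn=95, fp=0, fn=5)
# Hypothetical model that finds 3 of the 5 positives with 5 false alarms.
kappa_model = cohen_kappa(tp=3, tn=90, fp=5, fn=2)

print(f"majority: {kappa_majority:.3f}  model: {kappa_model:.3f}")
```

The second model has lower accuracy (0.93 versus 0.95) but a clearly higher kappa, because it does better than chance on the rare class.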
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 725   Unicorn
    Hi @ozgeozyazar,

    To add to @mschmitz's post, here is a Kaggle article that advises favoring AUPRC (Area Under the Precision-Recall Curve) as the performance metric of a model when the dataset is very imbalanced:

    https://www.kaggle.com/lct14558/imbalanced-data-why-you-should-not-use-roc-curve

    If you want to use the AUPRC (performance) operator in your process in RapidMiner, you have to install the free Operator Toolbox extension.

    Regards,

    Lionel
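As a rough illustration of how the two areas can diverge on imbalanced data, here is a minimal sketch in plain Python. It uses average precision as one common estimator of AUPRC; this is an assumption, and the AUPRC operator in the Operator Toolbox may compute the area differently. The ranking is hypothetical: 5 positives among 100 examples, with 10 negatives scored above every positive.

```python
def roc_auc(y, s):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [si for yi, si in zip(y, s) if yi == 1]
    neg = [si for yi, si in zip(y, s) if yi == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y, s):
    """Average precision: mean of the precision at the rank of each positive."""
    order = sorted(range(len(y)), key=lambda i: s[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical ranking, best score first: 10 negatives, 5 positives, 85 negatives.
labels = [0] * 10 + [1] * 5 + [0] * 85
scores = [1.0 - i / 100 for i in range(100)]  # strictly decreasing, no ties

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC AUC={auc:.3f}  average precision={ap:.3f}")
```

On this ranking the ROC AUC is close to 0.9 while the average precision stays near 0.22, because the few false positives at the top of the list weigh heavily on precision but barely move the false positive rate.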
  • varunm1 Member Posts: 506   Unicorn
    Thanks, @lionelderkrikor for sharing this.
    Regards,
    Varun
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,049  RM Data Scientist

    Now that I am on my work PC: have a look at this paper: https://www.biostat.wisc.edu/~page/rocpr.pdf . I discovered it while working with Sven. It proves that a curve dominates in ROC space if and only if it dominates in PR space, but it also shows that optimizing AUC does not guarantee optimizing AUPRC. Besides the usual problems I talk about regarding correlation to business value, I would thus prefer AUPRC if I know the class balance.

    Best,
    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,150   Unicorn
    Interesting readings.  Just remember that which metric is "better" (AUC vs AUPRC) is very much a function of business needs since they are optimizing different things.  As @mschmitz has noted in the past, if you can actually assign a cost to your different classification outcomes (TP, TN, FP, FN) then the best approach is to use the Performance Costs operator and optimize directly for that.  These other curves are simply approaches based on other useful metrics. The other thing to be aware of is that AUPRC is probably less well known, so you might have some difficulties in explaining it even to other data scientists, never mind business users. 
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
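The cost-based idea above can be sketched as follows. This is not the implementation of the Performance Costs operator. The cost values (50 per missed case, 1 per false alarm) and both confusion matrices are hypothetical and only show how assigning costs can reverse a ranking based on accuracy.

```python
# Hypothetical cost per outcome: a missed disease case (FN) is assumed to be
# far more expensive than a false alarm (FP); correct predictions cost nothing.
COSTS = {"tp": 0, "tn": 0, "fp": 1, "fn": 50}

def total_cost(tp, tn, fp, fn, costs=COSTS):
    """Total misclassification cost for a binary confusion matrix."""
    return (tp * costs["tp"] + tn * costs["tn"]
            + fp * costs["fp"] + fn * costs["fn"])

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Model A: always predicts the majority class (hypothetical counts).
model_a = dict(tp=0, tn=95, fp=0, fn=5)
# Model B: finds 4 of 5 positives but raises 20 false alarms (hypothetical).
model_b = dict(tp=4, tn=75, fp=20, fn=1)

cost_a, cost_b = total_cost(**model_a), total_cost(**model_b)
acc_a, acc_b = accuracy(**model_a), accuracy(**model_b)
print(f"A: accuracy={acc_a:.2f} cost={cost_a}")
print(f"B: accuracy={acc_b:.2f} cost={cost_b}")
```

Under these assumed costs, model B is preferable even though its accuracy is lower, which is why optimizing directly for cost is recommended when the costs are known.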