Performance Measures for Imbalanced Data

ozgeozyazar Member Posts: 21 Maven
edited June 2019 in Help

Hi all!

My question is not directly about the program, but I know there are many valuable data miners in this community, and I believe I can reach the correct answer easily here. I am doing decision tree classification and measuring both classification and binomial performance using different parameter combinations. I need to select one of the well-performing models to create a decision tree for detecting disease risk factors. I have read an article that says "Any performance metric that uses values from both columns will be inherently sensitive to class skews". I take this to mean that if I have imbalanced data, I should not use those metrics. Could you please confirm my understanding?
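
To make the quoted statement concrete, here is a small sketch in plain Python (not RapidMiner). The confusion-matrix counts are made up for illustration, and the "columns" are assumed to be the actual-positive and actual-negative classes. Rates computed within a single column (TPR, FPR) stay the same when the negative class becomes ten times larger, while metrics that mix both columns (precision, accuracy) change.

```python
def metrics(tp, fn, fp, tn):
    """Return common metrics from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),        # uses the actual-positive column only
        "fpr": fp / (fp + tn),        # uses the actual-negative column only
        "precision": tp / (tp + fp),  # mixes both columns
        "accuracy": (tp + tn) / (tp + fn + fp + tn),  # mixes both columns
    }

# Hypothetical balanced test set: 100 positives, 100 negatives.
balanced = metrics(tp=80, fn=20, fp=10, tn=90)

# Same classifier behaviour, but ten times more negatives (1000 instead of 100).
skewed = metrics(tp=80, fn=20, fp=100, tn=900)

for name in balanced:
    print(f"{name:9s} balanced={balanced[name]:.3f} skewed={skewed[name]:.3f}")
```

In this toy case precision drops from about 0.89 to about 0.44 purely because of the class skew, even though the classifier treats each class exactly as before.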


Answers

  • varunm1 Member Posts: 1,207 Unicorn
    edited May 2019
    Hello @ozgeozyazar

    Actually, it is not that you shouldn't use them, but these measures change when there is class imbalance and can be misleading. Accuracy is one example.
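
As a rough illustration of how accuracy can mislead, the sketch below uses a made-up disease dataset (950 healthy, 50 diseased) and a "model" that always predicts the majority class. The data and the labels 0/1 are assumptions for the example only.

```python
# Hypothetical test set: 950 healthy (0) and 50 diseased (1) patients.
actual = [0] * 950 + [1] * 50

# A useless "model" that always predicts the majority class.
predicted = [0] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)               # sensitivity on the diseased class
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2

print(accuracy, recall, balanced_accuracy)
```

Accuracy looks excellent at 0.95, yet recall on the diseased class is 0 and balanced accuracy is only 0.5, which is what a coin flip would achieve.
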
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    @varunm1 I would highly recommend having a look at AUPRC.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • varunm1 Member Posts: 1,207 Unicorn
    edited May 2019
    Thanks, @mschmitz. I was not aware that we have AUPRC. Generally, I look at the trade-off between AUC and kappa, but this is good to know.
    Regards,
    Varun

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @ozgeozyazar,

    To add to @mschmitz's post, here is a Kaggle article that advises favoring AUPRC (Area Under the Precision-Recall Curve) as the performance metric of a model when the dataset is very imbalanced:

    https://www.kaggle.com/lct14558/imbalanced-data-why-you-should-not-use-roc-curve

    If you want to use the AUPRC (performance) operator in your process in RapidMiner, you have to install the free Operator Toolbox extension.
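
For readers who want to see what the number means outside RapidMiner, here is a minimal sketch of average precision, one common step-wise estimate of AUPRC. It is not the Operator Toolbox implementation; the function name `average_precision`, the toy labels, and the scores are assumptions for illustration, and tied scores are not handled.

```python
def average_precision(labels, scores):
    """Average precision: a step-wise estimate of the area under the PR curve.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model confidence for the positive class (assumed to have no ties).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive example")
    # Rank examples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos

# Toy imbalanced example: 2 positives among 10 examples.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

Unlike AUC, whose chance level is 0.5, the chance level for AUPRC is roughly the fraction of positives (0.2 in this toy data), so the value should always be read against the class balance.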

    Regards,

    Lionel
  • varunm1 Member Posts: 1,207 Unicorn
    Thanks, @lionelderkrikor, for sharing this.
    Regards,
    Varun

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist

    Now that I am on my work PC: have a look at this paper: https://www.biostat.wisc.edu/~page/rocpr.pdf . I discovered it while working with Sven. It proves that a curve dominates in ROC space if and only if it dominates in PR space, but it also notes that an algorithm that optimizes AUC is not guaranteed to optimize AUPRC. Besides the usual problems I talk about regarding correlation with business value, I would thus prefer AUPRC if I know the class balance.
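
The connection between the two spaces, and the reason the class balance matters, can be written down directly. As a sketch in notation chosen here (not taken from the paper), let $\pi$ be the fraction of positive examples. A ROC point (FPR, TPR) with TPR > 0 then maps to a PR point as follows:

```latex
\[
\text{recall} = \text{TPR}, \qquad
\text{precision} = \frac{\pi \cdot \text{TPR}}{\pi \cdot \text{TPR} + (1 - \pi) \cdot \text{FPR}}
\]

% Worked example with TPR = 0.8 and FPR = 0.1:
\[
\pi = 0.5:\ \ \text{precision} = \frac{0.40}{0.40 + 0.05} \approx 0.889,
\qquad
\pi = 0.1:\ \ \text{precision} = \frac{0.08}{0.08 + 0.09} \approx 0.471
\]
```

The same ROC point therefore gives very different precision depending on $\pi$, which is why a PR curve is only interpretable for a known class balance.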

    Best,
    Martin
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Interesting readings.  Just remember that which metric is "better" (AUC vs AUPRC) is very much a function of business needs since they are optimizing different things.  As @mschmitz has noted in the past, if you can actually assign a cost to your different classification outcomes (TP, TN, FP, FN) then the best approach is to use the Performance Costs operator and optimize directly for that.  These other curves are simply approaches based on other useful metrics. The other thing to be aware of is that AUPRC is probably less well known, so you might have some difficulties in explaining it even to other data scientists, never mind business users. 
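
The cost-based idea can be sketched in plain Python as well. This is not the Performance Costs operator itself; the helper `total_cost`, the toy data, and the cost values (a missed case assumed to be ten times as expensive as a false alarm) are all hypothetical.

```python
def total_cost(labels, scores, threshold, cost_fp, cost_fn):
    """Total misclassification cost when predicting positive at score >= threshold."""
    cost = 0.0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += cost_fp   # false positive
        elif predicted == 0 and label == 1:
            cost += cost_fn   # false negative
    return cost

labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

COST_FP = 1.0    # hypothetical cost of a false alarm
COST_FN = 10.0   # hypothetical cost of a missed case

# Try every observed score as a candidate threshold and keep the cheapest one.
candidates = sorted(set(scores))
best = min(candidates, key=lambda t: total_cost(labels, scores, t, COST_FP, COST_FN))
best_cost = total_cost(labels, scores, best, COST_FP, COST_FN)

# With equal costs, a different (higher) threshold becomes the cheapest.
best_equal = min(candidates, key=lambda t: total_cost(labels, scores, t, 1.0, 1.0))

print(best, best_cost, best_equal)
```

With these made-up costs the cheapest threshold is 0.4, whereas equal costs favor 0.7, so the "best" model setting depends on the costs rather than on AUC or AUPRC alone.
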
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts