Model selection for imbalanced training dataset

phivu Member Posts: 34  Guru
edited November 2018 in Help

Hi RapidMiner,

I'm doing model selection for an SVM using the "Optimize Parameters (Grid)" operator. My training dataset is imbalanced/skewed (782 positive examples and 2048 negative examples), so accuracy (= (TP+TN)/(TP+TN+FP+FN)) is not a suitable score for model selection: a predictor that labels everything as negative already reaches 2048/(2048+782) ≈ 72.3% accuracy. Is there a way to choose precision and recall, or a combined function of them such as the F1 score, instead of accuracy? I looked through the parameter list of the Performance operator but could not find those scores. Or is there another way to deal with an imbalanced dataset like this?
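For anyone reading along, the accuracy trap described above is easy to reproduce. The sketch below (plain Python, with the class counts taken from the post; the all-negative predictor is the hypothetical baseline, not the actual process) shows how accuracy stays high while recall and F1 expose the useless model:

```python
# Class counts from the post: 782 positive, 2048 negative examples.
n_pos, n_neg = 782, 2048

# Hypothetical baseline: predict every example as negative.
tp, fp = 0, 0          # no positive predictions at all
fn, tn = n_pos, n_neg  # every positive missed, every negative correct

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision is undefined when nothing is predicted positive;
# treat it (and therefore F1) as 0.0, as most toolkits do.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"accuracy = {accuracy:.3f}")  # ~0.724 despite learning nothing
print(f"recall   = {recall:.3f}")    # 0.000: every positive is missed
print(f"f1       = {f1:.3f}")        # 0.000: flags the useless model
```

This is exactly why a recall-sensitive score (or F1) is the safer optimization target here.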

I attach my process file here. In this process, I use the "Optimize Parameters (Grid)" operator to find the SVM hyper-parameters that give the best cross-validation performance. The process works very well on a balanced training dataset; now I wonder how to modify it for an imbalanced one. Thank you very much for your help!


Best Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751  RM Founder
    Solution Accepted

    Hi,


    Sure - all of those measures (precision, recall, F1, and many more) are available as parameters of the "Performance (Binominal Classification)" operator.


    Hope this helps,

    Ingo

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,625   Unicorn
    Solution Accepted

    Another option is to add weights to balance the classes, since the SVM operator accepts weights. But in either case you may want to look at AUC as a performance metric as well; it is my preferred metric for classification problems, since it does not depend on a single arbitrary cutoff threshold.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
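To make the two suggestions above concrete, here is a small Python sketch. The class counts come from the original post, but the confidence scores are made-up illustrative values, not output from the actual process. It derives "balanced" per-class weights of the kind you could attach as example weights, and computes AUC directly from its rank definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count half).

```python
from itertools import product

# Class counts from the post.
n_pos, n_neg = 782, 2048
n_total = n_pos + n_neg

# "Balanced" weighting: each class contributes equally in total.
# (Same scheme as scikit-learn's class_weight="balanced".)
w_pos = n_total / (2 * n_pos)   # ~1.809 per positive example
w_neg = n_total / (2 * n_neg)   # ~0.691 per negative example

def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos_scores, neg_scores))
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up confidence scores, for illustration only.
pos = [0.9, 0.8, 0.4]
neg = [0.5, 0.3, 0.2]
print(f"w_pos={w_pos:.3f}  w_neg={w_neg:.3f}  AUC={auc(pos, neg):.3f}")
```

Note that AUC depends only on the ranking of the confidence scores, which is why it is insensitive to the cutoff threshold Brian mentions.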

Answers

  • phivu Member Posts: 34  Guru

    Thank you Ingo,

    I've found the scores in the "Performance (Binominal Classification)" operator!
