Best way to handle imbalanced data

GeezerDoc Member Posts: 5 Contributor I
I would very much appreciate some guidance. Because RMS combines an unusual validation method with data sampling, the test set ends up small, even with hundreds of cases, when the class of interest is underrepresented. I have read all of the posts about balancing data in the Community Forum and watched 3 videos on the subject. The data set I am currently interested in has 91 patients with dementia and 242 controls. When I loaded the dataset into WEKA and added SMOTE, the AUC increased from 0.725 to 0.898, a substantial improvement. I used WEKA simply because SMOTE was an easy filter to add. Several RM forum postings suggest that upsampling techniques will not affect performance, which was not my experience.

I'm limited to smaller medical datasets for demonstrating and teaching binary classification to clinical students, which results in very small counts in the confusion matrix. What do you recommend: sampling, bootstrapping, SMOTE, etc.? Truthfully, I have spent most of my time with TurboPrep and AutoModel, so I have not been able to figure out how to add SMOTE to the process pipeline. I would appreciate your thoughts.
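
For readers who want to see what SMOTE-style oversampling does outside of any particular tool, here is a minimal sketch in plain Python. It is not the WEKA filter or a RapidMiner operator; the helper names `smote` and `stratified_split`, the 30% holdout, the two Gaussian toy features, and `k=5` are all assumptions made for illustration. Only the class sizes (91 cases, 242 controls) come from the post above. The point it tries to show is that synthetic cases are added to the training split only, so the held-out data stays untouched.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def stratified_split(rows, labels, test_fraction=0.3, seed=0):
    """Hold out the same fraction of each class for testing."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test += [(r, cls) for r in members[:n_test]]
        train += [(r, cls) for r in members[n_test:]]
    return train, test


# Toy data (assumed): two features per patient, class sizes from the post.
data_rng = random.Random(42)
cases = [[data_rng.gauss(1.0, 1.0), data_rng.gauss(1.0, 1.0)] for _ in range(91)]
controls = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(242)]
rows = cases + controls
labels = [1] * 91 + [0] * 242

train, test = stratified_split(rows, labels)

# Oversample the minority class in the TRAINING split only.
train_cases = [r for r, y in train if y == 1]
train_controls = [r for r, y in train if y == 0]
synthetic = smote(train_cases, n_new=len(train_controls) - len(train_cases))
balanced_train = train + [(r, 1) for r in synthetic]

print("training cases / controls / synthetic:",
      len(train_cases), len(train_controls), len(synthetic))
print("test cases / controls:",
      sum(1 for _, y in test if y == 1), sum(1 for _, y in test if y == 0))
```

A model would then be fitted on `balanced_train` and scored on `test`, which still has the original class ratio. In a cross-validation setting the same idea applies per fold: oversample inside each training fold, never before the folds are created.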

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
    I personally would also opt for some SMOTE-based analysis, although you need to be a bit careful not to trick your validation: oversample only the training data, never the data used for testing.

    @yyhuang recently did quite a lot with it; maybe she can jump in with some best practices?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • GeezerDoc Member Posts: 5 Contributor I
    Thanks for that insight. It seems to me that, for biomedical datasets, imbalanced data remains a huge challenge. There does not seem to be a magic bullet or a clear consensus on the right approach. When teaching machine learning basics, we can warn students that accuracy is misleading on imbalanced data and that precision-recall curves may be more informative than ROC AUC. What else should we be telling those new to machine learning?
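
The warning about accuracy can be made concrete with the class sizes from this thread. The sketch below is plain Python and purely illustrative: `confusion_metrics` and `pr_curve` are hypothetical helper names, and the four scores in the small ranking example are made up. It shows that a "classifier" that labels every patient as a control reaches about 73% accuracy on 91 cases and 242 controls while finding no cases at all, and how precision-recall points can be read off a ranked list of scores.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.
    Ratios with a zero denominator are reported as 0.0."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def pr_curve(scores, labels):
    """(recall, precision) after each example, ranked by descending score.
    Simplification: tied scores are not grouped."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    tp = fp = 0
    points = []
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / positives, tp / (tp + fp)))
    return points


# Predict "control" for everyone: 91 dementia cases missed, 242 controls right.
baseline = confusion_metrics(tp=0, fp=0, fn=91, tn=242)
print(baseline)

# Made-up scores for four patients, two of them true cases (label 1).
points = pr_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(points)
```

Because precision and recall ignore the true negatives, they are not inflated by the large control group, which is one reason they are often suggested for teaching with imbalanced clinical data.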