SMOTE

norita Member Posts: 29 Contributor I
Hi


My binary classification problem is imbalanced: the outcome occurs in only 5% of the cases.
I used SMOTE for both the variable selection and the training of the model.

SMOTE derives from the paper in the link above.

The paper states: "a combination with the method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance than only under-sampling the majority class."

My question now is: is applying SMOTE alone sufficient to address the imbalance, or do I additionally need to add an operator for "under-sampling the majority class"?



Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @norita,

    If there are too many samples in the majority class, you can add down-sampling (with the "sample" operator) before SMOTE. You can also use similarity analysis to identify similar data points in the majority class and reduce this population with simple filters. Some R/Python libraries offer more sophisticated under-sampling algorithms, e.g. Edited Nearest Neighbor Rule, Condensed Nearest Neighbor Rule, Tomek Links, One-Sided Selection, Neighborhood Cleaning Rule, ...
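To illustrate the idea of down-sampling the majority class before oversampling the minority class, here is a minimal standard-library sketch. It is not RapidMiner's SMOTE operator and not the reference implementation from the paper: the helper names `undersample` and `smote_like`, the toy Gaussian data, and the target size of 300 per class are all assumptions made for illustration.

```python
import math
import random


def undersample(majority, n_keep, rng):
    """Randomly keep n_keep majority rows (sampling without replacement)."""
    return rng.sample(majority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


rng = random.Random(0)
# Hypothetical toy data at the 5% / 95% split from the question.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(950)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(50)]

# Step 1: down-sample the majority class.
majority_small = undersample(majority, 300, rng)
# Step 2: add synthetic minority points until the classes match.
minority_big = minority + smote_like(minority, 250, 5, rng)

print(len(majority_small), len(minority_big))
```

Both classes end up with 300 rows. How far to down-sample and how much to oversample are tuning choices; the 300/300 split here is arbitrary.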

    Note that the ROC curve can be misleading on imbalanced data. TPR depends only on the positives and FPR only on the negatives, so when the negatives vastly outnumber the positives, even a large number of false positives barely moves the FPR. AUC does not place more emphasis on one class than the other, so it does not reflect performance on the minority class well. Try the precision-recall curve on imbalanced data.
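A small worked example may make the ROC versus precision-recall point concrete. The confusion-matrix counts below are made up for illustration (1000 cases at 5% prevalence), and `rates` is a hypothetical helper, not a library function.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# 1000 cases at 5% prevalence: 50 positives, 950 negatives (invented counts).
a = rates(tp=40, fn=10, fp=19, tn=931)  # classifier A
b = rates(tp=40, fn=10, fp=95, tn=855)  # classifier B: five times the false alarms

for name, (tpr, fpr, precision) in (("A", a), ("B", b)):
    print(f"{name}: TPR={tpr:.2f} FPR={fpr:.2f} precision={precision:.2f}")
```

In this toy case the ROC point only shifts from FPR 0.02 to FPR 0.10 at the same TPR, while precision drops from about 0.68 to about 0.30, which is the kind of difference a precision-recall curve makes visible.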

    Cheers,
    YY
  • norita Member Posts: 29 Contributor I
    Thank you very much!
    I will take a closer look at the paper later. Yes, performance measures are a very delicate topic, I think, especially in my case: internal validation on the SMOTE-manipulated data, followed by external validation on data with the original prevalence (5% vs. 95% for the two outcomes).
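As a rough illustration of why results on balanced data and on data with the original prevalence can differ, the sketch below computes precision from TPR, FPR, and prevalence using Bayes' rule. The operating point (TPR 0.80, FPR 0.10) is hypothetical, and the calculation assumes TPR and FPR stay the same on both data sets, which need not hold in practice.

```python
def precision_at(prevalence, tpr, fpr):
    """Precision implied by TPR and FPR at a given class prevalence."""
    true_pos = tpr * prevalence          # share of all cases that are true positives
    false_pos = fpr * (1 - prevalence)   # share of all cases that are false positives
    return true_pos / (true_pos + false_pos)


tpr, fpr = 0.80, 0.10  # hypothetical operating point

balanced = precision_at(0.50, tpr, fpr)  # classes balanced to 50% / 50%
original = precision_at(0.05, tpr, fpr)  # original 5% / 95% prevalence

print(f"{balanced:.3f} {original:.3f}")
```

Under these assumptions precision is about 0.889 at 50% prevalence and about 0.296 at 5%, so prevalence-dependent measures computed on balanced data do not transfer directly to the original population.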

    One question about SMOTE remains for me. I used only SMOTE: I oversampled the minority class so that both outcomes had equal sizes for model development.
    Should I be concerned that I addressed the imbalance with SMOTE alone, rather than combining it with under-sampling of the over-represented outcome?
    Am I right that the SMOTE paper only reports a positive effect when SMOTE is used in combination with under-sampling?

    Is it usual to use only SMOTE to bring the comparison groups to equal size?