
SMOTE

norita Member Posts: 29 Learner III
Hi


My binary classification problem is imbalanced (the outcome occurs in 5% of the cases).
I used SMOTE for the variable selection and for training the model.

SMOTE comes from the paper linked above.

The paper states: "a combination with the method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance than only under-sampling the majority class."

My question now is: is applying SMOTE alone not sufficient to address the imbalance problem, or do I additionally need an operator for "under-sampling the majority class"?



Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @norita,

    If there are too many samples in the majority class, you can add down-sampling (with the "Sample" operator) before SMOTE. You can also use similarity analysis to identify similar data points in the majority class and reduce this population with simple filters. Some R/Python libraries help with under-sampling using more sophisticated algorithms, e.g. Edited Nearest Neighbour Rule, Condensed Nearest Neighbour Rule, Tomek links, One-Sided Selection, Neighbourhood Cleaning Rule, ...

    Note that the ROC curve can be misleading on imbalanced data. TPR depends only on the positives and FPR only on the negatives, so the curve does not change with the class ratio: when negatives are abundant, even a large number of false positives yields a small FPR. AUC does not place more emphasis on one class than the other, so it can look good while the minority class is predicted poorly. Try the Precision-Recall curve on imbalanced data.

    Cheers,
    YY
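
To make the ROC versus Precision-Recall point concrete, here is a small worked example. The counts are hypothetical (a classifier that finds 80% of the positives and flags 10% of the negatives), chosen only to match the 5% prevalence mentioned in the question. It shows that the same ROC point (TPR, FPR) corresponds to very different precision depending on the class ratio.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # recall: uses positives only
    fpr = fp / (fp + tn)        # uses negatives only
    precision = tp / (tp + fp)  # mixes both classes
    return tpr, fpr, precision

# Hypothetical imbalanced test set: 500 positives, 9500 negatives (5% prevalence)
imbalanced = rates(tp=400, fp=950, fn=100, tn=8550)

# Same classifier behaviour on a hypothetical balanced set: 500 vs 500
balanced = rates(tp=400, fp=50, fn=100, tn=450)

for name, (tpr, fpr, precision) in [("imbalanced", imbalanced),
                                    ("balanced", balanced)]:
    print(f"{name}: TPR={tpr:.2f} FPR={fpr:.2f} precision={precision:.2f}")
```

Both cases sit on the same ROC point (TPR 0.80, FPR 0.10), but precision is about 0.30 at 5% prevalence and about 0.89 on the balanced set. This is also why a model checked on SMOTE-balanced data can look much better than it does on data with the original prevalence.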
  • norita Member Posts: 29 Learner III
    Thank you very much!
    I will have a deeper look at the paper later. Thank you! Yes, performance measures are a very delicate topic, I think, especially in my case: internal validation on the SMOTE-manipulated data, followed by external validation on data with the original prevalence (5% vs. 95% for the two outcomes).

    One question still remains for me on the topic of SMOTE. I used SMOTE only: I oversampled the minority class so that both outcomes had equal sizes for model development.
    Should I be concerned that I addressed the imbalance problem with SMOTE alone, and not in combination with undersampling the overrepresented outcome?
    Am I right that the paper by the authors of SMOTE only reported the positive effect when it is used in combination with undersampling?

    Is it usual to use SMOTE alone to bring the comparison groups to equal size?
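
For readers comparing the two options discussed in this thread, the combination (down-sample the majority class, then over-sample the minority class with SMOTE-style interpolation) can be sketched in plain Python. This is a simplified illustration, not the RapidMiner operator and not any library's implementation: the toy data, the target size of 200 per class, `k=5`, and the function names `undersample` and `smote` are all assumptions made for the example.

```python
import math
import random

def undersample(rows, n_keep, rng):
    """Randomly keep n_keep rows, drawn without replacement."""
    return rng.sample(rows, n_keep)

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the line segment between a
    random minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

rng = random.Random(0)

# Hypothetical toy data with 5% prevalence: 950 majority, 50 minority rows
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(950)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(50)]

# Step 1: down-sample the majority class (950 -> 200)
majority_small = undersample(majority, 200, rng)

# Step 2: over-sample the minority class with synthetic points (50 -> 200)
synthetic = smote(minority, 150, 5, rng)
minority_big = minority + synthetic

print(len(majority_small), len(minority_big))  # prints "200 200"
```

Using SMOTE alone would skip step 1 and generate 900 synthetic minority rows instead of 150, so most of the minority class in the training data would be interpolated rather than observed. Either way, resampling belongs only in the training data; validation data should keep the original prevalence.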