SMOTE
Hi
My binary classification problem is imbalanced (the outcome occurs in only 5% of the cases).
I used SMOTE for the variable selection and training of the model.
SMOTE derives from the paper in the link above.
The paper states: "a combination with the method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance than only under-sampling the majority class."
My question now is: is applying SMOTE alone not sufficient to address the imbalance? Or do I need to additionally add an operator for "under-sampling the majority class"?
Answers
If there are too many samples in the majority class, you can add down-sampling (with the "Sample" operator) before SMOTE. You can also use similarity analysis to identify similar data points in the majority class and reduce this population with simple filters. Some R/Python libraries help with under-sampling via more sophisticated algorithms, e.g. Edited Nearest Neighbor Rule, Condensed Nearest Neighbor Rule, Tomek Links, One-Sided Selection, Neighborhood Cleaning Rule, ...
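To make the "down-sample first, then SMOTE" order concrete, here is a minimal sketch in plain Python. It is not the RapidMiner operator and not any library's API: `undersample` and `smote_like` are hypothetical helper names, the toy data and the target sizes (200 per class) are made up, and the interpolation is a simplified version of the SMOTE idea (a new point on the line between a minority point and one of its k nearest minority neighbours).

```python
import random

def undersample(rows, n, rng):
    """Randomly keep n rows of the majority class (without replacement)."""
    return rng.sample(rows, min(n, len(rows)))

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

rng = random.Random(0)
# Toy data (assumed): 950 majority rows and 50 minority rows, i.e. 5% minority.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(950)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(50)]

# Step 1: down-sample the majority class.
majority_small = undersample(majority, 200, rng)
# Step 2: over-sample the minority class with synthetic points.
minority_big = minority + smote_like(minority, 150, 5, rng)

print(len(majority_small), len(minority_big))  # prints "200 200"
```

Whether a 1:1 ratio is the right target depends on the data; the sizes here only illustrate the two steps.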
Note that the ROC curve can give an overly optimistic picture of classifier performance on imbalanced data. TPR depends only on the positives and FPR only on the negatives, so when negatives vastly outnumber positives, even a large number of false positives barely moves the FPR. AUC does not place more emphasis on one class over the other, so it does not reflect the minority class well. Try the Precision-Recall curve on imbalanced data.
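A small worked example of why precision tells you more than the ROC point here. The confusion-matrix counts are invented for illustration (one data set with 5% positives as in the question, one balanced), and `rates` is just a hypothetical helper name.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # true positive rate, same as recall
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    return tpr, fpr, precision

# Imbalanced case (assumed counts): 50 positives, 950 negatives.
imb = rates(tp=40, fp=95, fn=10, tn=855)
# Balanced case (assumed counts): 500 positives, 500 negatives.
bal = rates(tp=400, fp=50, fn=100, tn=450)

print("imbalanced: TPR=%.2f FPR=%.2f precision=%.2f" % imb)
print("balanced:   TPR=%.2f FPR=%.2f precision=%.2f" % bal)
# imbalanced: TPR=0.80 FPR=0.10 precision=0.30
# balanced:   TPR=0.80 FPR=0.10 precision=0.89
```

Both cases sit at the same point in ROC space (TPR 0.80, FPR 0.10), yet in the imbalanced case only about 30% of the predicted positives are real. That difference shows up in the Precision-Recall curve but not in the ROC curve.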
Cheers,
YY