How to apply SMOTE upsampling
hanaabdalrahman
Member Posts: 9 Learner III
Hello, sorry, I am new to data mining. I have a project on loan default classification, and my data is imbalanced.
Where should I apply SMOTE upsampling: before splitting the data or after? My data set is not large, only 1030 samples.
Answers
Hi @hanaabdalrahman,
Do you split the data for validation? You can upsample with SMOTE before the split/cross-validation. If you like, you can also apply a "stratified sample" to split off a 10% holdout test set before SMOTE upsampling. The stratified holdout sample keeps a distribution similar to the original imbalanced data, so it can be considered a 'good' representative set for real-life data. You may want to know how well the model performs on the upsampled balanced set and, more importantly, how well it fits future unseen data from real life.
My example process for handling imbalanced data is attached for reference.
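For readers without RapidMiner at hand, the stratified 10% holdout idea can be sketched in plain Python. This is only an illustration, not the attached process: the function name `stratified_holdout` is hypothetical, and the class counts (130 defaults out of 1030 loans) are invented because the thread does not give the real ratio.

```python
import random
from collections import Counter, defaultdict

def stratified_holdout(labels, holdout_fraction=0.10, seed=0):
    """Return (train_idx, holdout_idx) so that each class keeps roughly
    the same share in the holdout as in the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, holdout_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Take the same fraction from every class (at least one example).
        n_hold = max(1, round(len(idx) * holdout_fraction))
        holdout_idx.extend(idx[:n_hold])
        train_idx.extend(idx[n_hold:])
    return sorted(train_idx), sorted(holdout_idx)

# Assumed, made-up class counts for a 1030-row loan data set.
labels = ["default"] * 130 + ["paid"] * 900
train_idx, holdout_idx = stratified_holdout(labels)
holdout_counts = Counter(labels[i] for i in holdout_idx)
print(len(train_idx), len(holdout_idx), dict(holdout_counts))
```

The holdout indices would be set aside untouched, and any upsampling would then be applied only to the rows in `train_idx`.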
Cheers,
YY
Hi @hanaabdalrahman @yyhuang
I personally wouldn't upsample before splitting, for the simple reason that you would end up with synthetic examples in the test set, which could distort the test results. I would follow common sense, which suggests that upsampling is meant for artificially balancing the data used to train the model, while the model should still be tested on the original unbalanced sample to show its true performance. In this sense, YY's process is the one you need to use.
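To make the "upsample the training data only" point concrete, here is a minimal Python sketch. It is not the RapidMiner SMOTE operator: `smote_sketch` is a hypothetical, simplified version of the SMOTE idea (interpolating between a minority example and one of its nearest minority neighbours), and the data is random toy data.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Toy, made-up data: 90 majority and 10 minority examples with 2 features.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(10)]

# Split first: the test portion keeps the original imbalance (18 vs 2).
train_major, test_major = majority[:72], majority[72:]
train_minor, test_minor = minority[:8], minority[8:]

# Upsample only the training minority class until the classes are balanced.
new_points = smote_sketch(train_minor, n_new=len(train_major) - len(train_minor))
balanced_minor = train_minor + new_points
print(len(train_major), len(balanced_minor), len(test_major), len(test_minor))
```

Because the synthetic points are built only from `train_minor`, nothing derived from the test examples leaks into training, and the test set contains no synthetic examples.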
Vladimir
http://whatthefraud.wtf
Just to chime in here, I think @kypexin's approach is correct. Upsampling during model building is the approach I would use too.
Thanks, but if I apply it after splitting the data, the result is still not OK. You can see the confusion matrix before and after the split.
Hi @hanaabdalrahman,
In your process you are doing split validation to check the performance of the DT model on test data. You will have to upsample before the split.
In my example process, I have two splits. The first split comes before the upsampling to create the 10% holdout, and the second split is inside the validation, which uses the upsampled data. With a validated model trained on balanced data, you can be more confident applying it to the 10% holdout.
In your case, you may have to upsample before split validation.
YY
Where can I find the SMOTE feature in RM 9.3.1? I tried to find it in the Marketplace, but nothing showed up. Thanks.
The SMOTE operator is in the "Operator Toolbox" extension, which needs to be installed from the Marketplace in RapidMiner.
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Thank you so much, I found it.