Resampling / oversampling with holdout sample.

kasper2304 Member Posts: 28 Contributor II
edited November 2018 in Help
Hi guys.

I have a question regarding resampling / oversampling in combination with the use of a holdout sample.

My dataset is the following:

Positive cases: 337
Negative cases: 2661

What I have done so far:
1) Sample 337 positive cases and sample 1500 negative cases
2) Then I filter the 0's in one node and the 1's in another node
3) I use sample bootstrapping on the 1's with a factor of 4.451, giving me 1500 positive cases
4) I append the datasets
5) I am ready to model
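To make the steps above concrete, here is a rough sketch of the same idea in plain Python (not RapidMiner operators). The function name `balance_by_bootstrap` and the placeholder example IDs are made up for illustration; only the counts (337, 2661, 1500) and the factor 4.451 come from my setup.

```python
import random

def balance_by_bootstrap(positives, negatives, n_negative, factor, seed=0):
    """Downsample negatives, bootstrap positives, then append both sets.

    Hypothetical helper for illustration; it mirrors steps 1-4 above.
    """
    rng = random.Random(seed)
    # Step 1: sample negatives without replacement.
    neg_sample = rng.sample(negatives, n_negative)
    # Step 3: bootstrap positives (sampling WITH replacement) by the given factor.
    n_positive = round(len(positives) * factor)
    pos_boot = rng.choices(positives, k=n_positive)
    # Step 4: append the two sets and shuffle so classes are interleaved.
    combined = [(x, 1) for x in pos_boot] + [(x, 0) for x in neg_sample]
    rng.shuffle(combined)
    return combined

# Placeholder documents standing in for the real text examples.
positives = [f"pos_{i}" for i in range(337)]
negatives = [f"neg_{i}" for i in range(2661)]

balanced = balance_by_bootstrap(positives, negatives, n_negative=1500, factor=4.451)
n_pos = sum(label for _, label in balanced)
print(len(balanced), n_pos)
```

Note that bootstrapping repeats the same positive documents several times, so any evaluation done on this balanced set alone will see copies of training cases.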

Now I want to use a holdout sample, as my linear SVM seems to be overfitting (90-95% accuracy).

What I consider the right approach is to extract, let's say, 37 positive cases and 37 negative cases to use for validation BEFORE upscaling the minority class. This leaves me with an evenly distributed holdout sample of 74 cases (I know it is small, but I am mining text, so I need my training cases). It also leaves me with a training and test set of 300/1500, which I can upscale to 1500/1500 cases.
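A minimal sketch of that order of operations, again in plain Python rather than RapidMiner and with a made-up function name (`holdout_then_upsample`) and placeholder IDs. The point it tries to show is that the holdout cases are set aside first, so no bootstrapped copy of a holdout case can end up in the training data.

```python
import random

def holdout_then_upsample(positives, negatives, n_holdout_per_class,
                          n_train_negative, seed=0):
    """Set aside a balanced holdout first, then balance the remaining data.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    pos, neg = positives[:], negatives[:]
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Holdout is taken BEFORE any upsampling.
    n = n_holdout_per_class
    holdout = [(x, 1) for x in pos[:n]] + [(x, 0) for x in neg[:n]]
    train_pos, rest_neg = pos[n:], neg[n:]

    # Downsample the remaining negatives, then bootstrap the remaining
    # positives (with replacement) up to the same count.
    train_neg = rest_neg[:n_train_negative]
    train_pos_up = rng.choices(train_pos, k=len(train_neg))

    train = [(x, 1) for x in train_pos_up] + [(x, 0) for x in train_neg]
    rng.shuffle(train)
    return train, holdout

# Placeholder documents standing in for the real text examples.
positives = [f"pos_{i}" for i in range(337)]
negatives = [f"neg_{i}" for i in range(2661)]

train, holdout = holdout_then_upsample(positives, negatives,
                                       n_holdout_per_class=37,
                                       n_train_negative=1500)
overlap = {x for x, _ in train} & {x for x, _ in holdout}
print(len(train), len(holdout), len(overlap))
```

With these numbers the 300 remaining positives are bootstrapped to 1500, which corresponds to a factor of 5 instead of 4.451.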

My SVM predicts almost all of the negative cases correctly and 2/3 of the positive cases if I use feature extraction on the holdout sample.

What are your thoughts?

Are there other ways to use a holdout sample in RapidMiner?