"Resampling / oversampling with holdout sample."

lansuminc · August 2014

Hi guys.

I have a question regarding resampling / oversampling in combination with the use of a holdout sample.

My dataset is the following:

Positive cases: 337
Negative cases: 2661

What I have done so far:
1) Sample the 337 positive cases and sample 1500 negative cases
2) Filter the 0's in one node and the 1's in another node
3) Use sample bootstrapping on the 1's with a factor of 4.451, giving me 1500 positive cases
4) Append the datasets
5) I am ready to model
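
Outside RapidMiner, the steps above amount to roughly the following in plain Python. This is only a sketch to make the procedure concrete: the function name `balance_by_bootstrap`, the seed, and the placeholder row IDs are assumptions for illustration, not my actual process or data.

```python
import random


def balance_by_bootstrap(positives, negatives, n_negative=1500, seed=42):
    """Downsample the negatives, then bootstrap the positives to the same size."""
    rng = random.Random(seed)
    # Step 1: keep all positives, draw n_negative negatives without replacement.
    neg_sample = rng.sample(negatives, n_negative)
    # Steps 2-3: bootstrap the positives (sampling WITH replacement) up to n_negative.
    pos_boot = rng.choices(positives, k=n_negative)
    # Step 4: append both parts, labelling positives 1 and negatives 0.
    return [(row, 1) for row in pos_boot] + [(row, 0) for row in neg_sample]


# Placeholder row IDs standing in for the real documents (hypothetical data).
positives = [f"pos_{i}" for i in range(337)]
negatives = [f"neg_{i}" for i in range(2661)]

balanced = balance_by_bootstrap(positives, negatives)
print(len(balanced))           # 3000 rows, 1500 per class
print(round(1500 / 337, 3))    # 4.451, the bootstrap factor from step 3
```

Since 337 × 4.451 ≈ 1500, the factor in step 3 is simply the ratio of the target size to the number of positive cases.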

Now I want to use a holdout sample, because my linear SVM seems to be overfitting: it reports 90-95% accuracy.

What I consider the right approach is to extract, let's say, 37 positive cases and 37 negative cases to use for validation BEFORE upsampling the minority class. This leaves me with an evenly distributed holdout sample of 74 cases (I know it is small, but I am mining text, so I need my training cases). It also leaves me with a training and test set of 300/1500 cases, which I can upsample to 1500/1500 cases.
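
As a rough Python sketch of that order of operations (hold out first, upsample afterwards): the function name `split_then_upsample`, the seed, and the placeholder row IDs are hypothetical, and I am assuming the 1500 training negatives are drawn from the negatives that remain after the holdout is removed.

```python
import random


def split_then_upsample(positives, negatives, n_holdout=37, n_negative=1500, seed=42):
    """Set aside a balanced holdout first, then upsample only the training positives."""
    rng = random.Random(seed)
    pos, neg = positives[:], negatives[:]
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Holdout: n_holdout rows per class, taken BEFORE any resampling.
    holdout = [(r, 1) for r in pos[:n_holdout]] + [(r, 0) for r in neg[:n_holdout]]

    # Training pool: the remaining positives and n_negative of the remaining negatives.
    train_pos = pos[n_holdout:]
    train_neg = neg[n_holdout:n_holdout + n_negative]

    # Bootstrap the training positives up to the size of the negative sample.
    train_pos_up = rng.choices(train_pos, k=len(train_neg))
    train = [(r, 1) for r in train_pos_up] + [(r, 0) for r in train_neg]
    return train, holdout


# Placeholder row IDs standing in for the real documents (hypothetical data).
positives = [f"pos_{i}" for i in range(337)]
negatives = [f"neg_{i}" for i in range(2661)]

train, holdout = split_then_upsample(positives, negatives)
print(len(train), len(holdout))  # 3000 74
```

Because the bootstrap draws only from the 300 training positives, no copy of a held-out case can end up in the training data.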

My SVM predicts almost all the negative cases correctly and 2/3 of the positive cases if I use feature extraction on the holdout sample.

What are your thoughts?

Are there other ways to use a holdout sample in RapidMiner?
