# Over-fitting problem

Hi,

I'm working with 1000 attributes, 8000 examples and only 2.5% positive cases. To train the model, I used under-sampling (25% positive, 75% negative). First, I optimized the model parameters. Then I used a Forward Selection of variables followed by a Backward Selection of variables, with different "keep best" values (1, 2, 5, 10). My performance on the training part is 0.824. When I tested, the performance was 0.742. I'm always working with X-Validation. I can't figure out where the over-fitting comes from. Am I using the correct sampling? Should I use over-sampling or a different under-sampling?

Thank you very much,

Ignacio


## Answers

Does your X-Val include the Forward Selection and the Optimization? Otherwise you can easily overfit (you just pick the attributes that happen to be good for this specific (sub)set).

Could you maybe provide an example process doing this? What is the standard deviation of the 0.824?

To improve performance, I would recommend using weights.

Cheers,

Martin

Dortmund, Germany

I am using weights. I think the problem might be in the sampling process.

Is over-sampling a good idea?

Ignacio

Why do you want to sample if you use weights? It has a very similar effect. Are you sure that this does not change your performance in training and testing?
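To make the comparison between weights and sampling concrete, here is a minimal sketch of example weights that give the positive class a chosen share of the total weight. It uses the class counts from the question (8000 examples, 2.5% positive). The function name `class_weights` and its parameters are hypothetical and are not a RapidMiner operator; the snippet only illustrates the arithmetic.

```python
def class_weights(n_pos, n_neg, target_pos_share):
    """Per-example weights so that positives carry target_pos_share of the total weight.

    Hypothetical helper for illustration; the total weight stays equal to the row count.
    """
    total = n_pos + n_neg
    w_pos = target_pos_share * total / n_pos
    w_neg = (1.0 - target_pos_share) * total / n_neg
    return w_pos, w_neg


# Counts from the thread: 8000 examples, 2.5% positive.
n_pos, n_neg = 200, 7800

# Same class balance as a 50/50 under-sample, but no negative example is discarded.
w_pos_50, w_neg_50 = class_weights(n_pos, n_neg, target_pos_share=0.5)

# Same class balance as the 25/75 under-sample mentioned in the question.
w_pos_25, w_neg_25 = class_weights(n_pos, n_neg, target_pos_share=0.25)

print(f"50/50: positive weight {w_pos_50:.3f}, negative weight {w_neg_50:.3f}")
print(f"25/75: positive weight {w_pos_25:.3f}, negative weight {w_neg_25:.3f}")
```

The practical difference is that under-sampling throws negative examples away, while weighting keeps all of them and only changes how much each one counts.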

And again: are your Feature Selection and Optimization inside your X-Val? Otherwise you will overestimate your performance.

Cheers,

Martin


I'm sorry, I mixed up the terms: I used sampling to split the data into 75% for training and 25% for testing. The under-sampling I did was 50/50 positive/negative cases for training; the test set was left at 2.5/97.5.

First I got parameters for an SVM using the top 100 correlated attributes, then I used those parameters for a forward + backward selection. These were two different processes. In both cases the X-Val was INSIDE the Optimize Parameters / Forward Selection operator. Are you saying it should be the other way around, with the optimizers inside a single X-Val node? What would each fold be tested against then, the same training fold? Or should I put an X-Val inside as well?

I haven't tried weighting, but I read that it doesn't work with every algorithm in RM. I am using 5.3.015; which algorithms should I try with it? I normally use SVM, LibSVM, neural net, k-NN, Bayes, decision trees, logistic and linear regression.

Thank you!

Ignacio

Your procedure might run into overfitting. You might choose the attributes (= dimensions) that are well suited to your particular subset of the data. Think about binominal attributes coding whether a customer lives in a given city or not. If you optimize on that, you can end up overtraining on "people from Springfield".

To do it correctly, you need to nest the operators like this:

an outer X-Val; inside it, Optimize Parameters; inside that, Feature Selection with its own inner X-Val.
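Why the nesting matters can be shown with a small experiment outside RapidMiner. The sketch below is plain Python with hypothetical names, not a RapidMiner process: it builds pure-noise attributes, so an honest estimate should hover around chance, and compares selecting attributes once on all the data against selecting them inside each fold. The simple filter (difference of class means) and the nearest-centroid classifier are stand-ins chosen to keep the example short; they are not the forward/backward selection or the SVM from the thread.

```python
import random


def make_noise_data(n_rows, n_cols, seed):
    """Pure-noise attributes with balanced labels: there is nothing to learn."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    labels = [i % 2 for i in range(n_rows)]
    return rows, labels


def class_means(rows, labels, cols):
    """Per-class mean of each column in cols, as {label: [mean, ...]}."""
    means = {}
    for cls in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == cls]
        means[cls] = [sum(r[c] for r in members) / len(members) for c in cols]
    return means


def select_top(rows, labels, k):
    """Keep the k columns whose class means differ most (a simple filter)."""
    cols = range(len(rows[0]))
    m = class_means(rows, labels, cols)
    ranked = sorted(cols, key=lambda c: abs(m[1][c] - m[0][c]), reverse=True)
    return ranked[:k]


def predict(means, cols, row):
    """Nearest-centroid rule on the selected columns."""
    def dist(cls):
        return sum((row[c] - mu) ** 2 for c, mu in zip(cols, means[cls]))
    return 0 if dist(0) <= dist(1) else 1


def cv_accuracy(rows, labels, k_feats, n_folds, select_inside):
    """Cross-validated accuracy, selecting attributes inside or outside the folds."""
    n = len(rows)
    # Selecting once on ALL rows lets the test folds leak into the selection.
    fixed = None if select_inside else select_top(rows, labels, k_feats)
    correct = 0
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        tr_rows = [rows[i] for i in train]
        tr_labels = [labels[i] for i in train]
        cols = select_top(tr_rows, tr_labels, k_feats) if select_inside else fixed
        means = class_means(tr_rows, tr_labels, cols)
        correct += sum(predict(means, cols, rows[i]) == labels[i] for i in test)
    return correct / n


rows, labels = make_noise_data(n_rows=50, n_cols=1000, seed=0)
biased = cv_accuracy(rows, labels, k_feats=10, n_folds=5, select_inside=False)
honest = cv_accuracy(rows, labels, k_feats=10, n_folds=5, select_inside=True)
print(f"selection outside X-Val: {biased:.2f}")
print(f"selection inside X-Val:  {honest:.2f}")
```

Because the attributes carry no signal, any accuracy clearly above chance in the first number comes from the selection having seen the test folds. The same reasoning applies to parameter optimization done outside the validation.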

This takes a lot of time. So if you have enough data, you might do the Feature Selection on a "hold-out" set, which is then not used in the Optimization anymore.
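A hold-out split of this kind can be sketched as follows, assuming the class counts from the question (200 positive and 7800 negative examples). The function name `stratified_holdout` and the 25% hold-out fraction are assumptions made for illustration; the thread does not say how large the hold-out set should be. The split is stratified so that the few positive cases are spread across both parts.

```python
import random


def stratified_holdout(labels, holdout_fraction, seed):
    """Split row indices into (holdout, rest), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    holdout, rest = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * holdout_fraction)
        holdout.extend(idx[:cut])
        rest.extend(idx[cut:])
    return sorted(holdout), sorted(rest)


# Class counts from the thread: 8000 examples, 2.5% positive.
labels = [1] * 200 + [0] * 7800

# fs_idx: rows used only for Feature Selection (hypothetical 25% hold-out).
# model_idx: rows used for parameter optimization and validation afterwards.
fs_idx, model_idx = stratified_holdout(labels, holdout_fraction=0.25, seed=0)

print(len(fs_idx), sum(labels[i] for i in fs_idx))
print(len(model_idx), sum(labels[i] for i in model_idx))
```

The two index lists never overlap, so the attributes chosen on `fs_idx` have not seen any row that is later used to estimate performance.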

For the weighting: you can click on an operator and then press F1 to see what is supported. There is an entry for weights. From a first look, your operators should support weights.

Cheers,

Martin
