What model should I use ( training, validation or testing )

cliftonarms · February 2013

I am looking for some "best practice" advice on which model to apply for live prediction, as I am a little confused about which approach is normally adopted.

The data: My data set has 50 attributes and 3400 rows (90% for training, 10% for unseen testing), with the very last row reserved as the live prediction example.

The training: I use the 90% training data in 10-fold x-validation to find the best training algorithm and attribute mix for my data. I then confirm the chosen setup by applying the resulting model to the 10% of unseen data.

My question is: once I am happy with the above results, which model do I use (or create) for the live prediction of the last row?

1) Do I use the best model created via 10-fold x-validation on the 90% training data?
2) Do I create a model with the 90% training data (without x-validation), using the best settings found from the x-validation training?
3) Do I create a model on 100% of the data (90% training and 10% unseen), using the best settings found from training?

Thank you in advance for your time.

earmijo · February 2013

With a dataset that small, my advice would be to go with (1): select based on X-validation.

With a large dataset you could go with (2) select based on training/test. You can do without X-validation here.

Whatever you pick, don't do (3), ever, as you risk over-fitting the data badly.

There are some authors who recommend splitting the dataset into training/test/validation. Train your models in the training set. Compare the models in the test set. Pick the best. Estimate the error rate of the best model again in the validation set.
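The three-way split described above can be sketched roughly as follows. This is only an illustration in plain Python, not a RapidMiner process: the 60/20/20 fractions, the synthetic labels and the two toy "models" (predict the training mean or the training median) are all assumptions made for the example. The naming follows the post (compare on the test set, final estimate on the validation set); many texts use the two names the other way round.

```python
import random
import statistics

def three_way_split(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into training, test and validation parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_test = round(len(rows) * test_frac)
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]

def mse(prediction, labels):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in labels) / len(labels)

# Synthetic, skewed labels standing in for a real data set (assumption).
rng = random.Random(1)
labels = [rng.expovariate(1.0) for _ in range(3400)]

train, test, validation = three_way_split(labels)

# Train every candidate on the training part only.
candidates = {
    "mean": statistics.mean(train),
    "median": statistics.median(train),
}

# Compare the candidates on the test part and pick the best.
best_name = min(candidates, key=lambda name: mse(candidates[name], test))

# Estimate the error of the winner once more, on data it never influenced.
validation_error = mse(candidates[best_name], validation)
print(best_name, round(validation_error, 3))
```

The point of the third part is that the winner's score on the test part is slightly optimistic, because that part was used to choose it; the validation part played no role in the choice.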

cliftonarms · February 2013

Thanks for the quick response earmijo.

Can I just confirm: you are advocating using the "best" model created by the 10-fold x-validation method, and not retraining a model with the "best" settings on the complete data set?

earmijo · February 2013

The way X-Validation works in RapidMiner is that you use it to estimate the "out-of-sample" error, but the model it reports is the one trained on the entire dataset. Notice, for instance, that with 10-fold X-validation the model is estimated 11 times: once per fold for the error estimate, plus once on all the data.
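To make the "11 times" concrete, here is a minimal sketch of that scheme in plain Python. It is not RapidMiner's implementation, just the general idea under stated assumptions: `fit_mean` is a hypothetical toy learner that predicts the mean label and counts how often it is called, and the 3060 synthetic rows stand in for the 90% training share of a 3400-row data set.

```python
import random

def fit_mean(rows):
    """Toy 'model' (assumption): predicts the mean label of its training rows."""
    fit_mean.calls += 1
    labels = [y for _, y in rows]
    return sum(labels) / len(labels)

fit_mean.calls = 0  # counts how many times a model is fitted

def cross_validate(rows, k=10, seed=0):
    """Return (estimated out-of-sample MSE, final model fitted on all rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    squared_errors = []
    for i in range(k):
        held_out = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_mean(train)  # one fit per fold, used only for scoring
        squared_errors.extend((y - model) ** 2 for _, y in held_out)
    cv_mse = sum(squared_errors) / len(squared_errors)
    final_model = fit_mean(rows)  # the extra fit on all rows: the model you keep
    return cv_mse, final_model

# Synthetic data (assumption): labels scattered around 2.0.
rng = random.Random(42)
data = [(x, 2.0 + rng.gauss(0, 1)) for x in range(3060)]

cv_mse, final_model = cross_validate(data, k=10)
print(fit_mean.calls)  # 10 fold fits + 1 final fit = 11
```

The ten fold models are thrown away after scoring; only the error estimate and the final model survive.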

cliftonarms · February 2013

Fantastic - I understand - Thank you for your advice.

Danyo83 · March 2013

Hi,

I have a question. Do you apply the trained model with the model applier right after the XValidation, or do you have to train again over the whole training set after having applied the XValidation? I am asking because, if you do a feature selection with an inner XValidation, you don't get a model out of the feature selection (there is no connection point). You could, however, save the model with a "remember operator" inside the FS, recall it outside the FS operator, and combine it with the feature weights operator for the unseen test set. But I think one has to retrain over the full training set with the selected features, right?

Altair RapidMiner Community
