What model should I use (training, validation or testing)?

cliftonarms Member Posts: 32 Contributor II
edited November 2018 in Help
I am looking for best-practice advice on which model to apply for live prediction, as I am a little confused about which approach is normally adopted.

The data: My data set has 50 attributes and 3400 rows (90% for training, 10% for unseen testing), with the very last row reserved as the live prediction example.

The training: I use the 90% training data in 10-fold x-validation to find the best training algorithm and attribute mix for my data. I then confirm the chosen setup by applying the resulting model to the 10% of unseen data.
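
The setup described above could be sketched roughly as follows. This is only an illustration in Python with scikit-learn rather than RapidMiner, and it assumes a classification task, synthetic data, and two hypothetical candidate learners, none of which come from the post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 3400-row, 50-attribute data set (assumption).
X, y = make_classification(
    n_samples=3400, n_features=50, n_informative=10, random_state=0
)

# Reserve the very last row as the live prediction example.
X_live, X_hist, y_hist = X[-1:], X[:-1], y[:-1]

# 90% for training and model selection, 10% kept unseen.
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X_hist, y_hist, test_size=0.10, random_state=0, stratify=y_hist
)

# Hypothetical candidate learners, compared with 10-fold cross-validation.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=10).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Confirm the chosen setup on the unseen 10%.
best_model = candidates[best_name].fit(X_train, y_train)
unseen_accuracy = best_model.score(X_unseen, y_unseen)
print(best_name, round(cv_scores[best_name], 3), round(unseen_accuracy, 3))
```

If the cross-validated score and the score on the unseen 10% differ a lot, that is usually a sign the selection step has over-fitted.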

My question is: once I am happy with the above results, which model do I use (or create) for the live prediction of the last row?

1) Do I use the best model created via 10-fold x-validation on the 90% training data?
2) Do I create a model with the 90% training data (without x-validation), using the best settings found during x-validation?
3) Do I create a model on 100% of the data (90% training plus 10% unseen), using the best settings found during training?

Thank you in advance for your time.


  • earmijo Member Posts: 270 Unicorn
    With a dataset that small, my advice would be to go with (1): select based on X-validation.

    With a large dataset you could go with (2): select based on a training/test split. You can do without X-validation in that case.

    Whatever you pick, don't ever do (3), as you risk over-fitting the data badly.

    Some authors recommend splitting the dataset into training/test/validation sets: train your models on the training set, compare them on the test set, pick the best, and then estimate the error rate of the best model again on the validation set.
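
    A minimal sketch of that three-way split, in Python with scikit-learn rather than RapidMiner. The 60/20/20 proportions, the synthetic data and the two candidate learners are assumptions made for illustration; the set names follow the usage in this post (compare on "test", final estimate on "validation").

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=3400, n_features=50, random_state=0)

# 60% training, 20% test (model comparison), 20% validation (final estimate).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Hypothetical candidates: train on the training set, compare on the test set.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
test_scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}
best_name = max(test_scores, key=test_scores.get)

# Estimate the error rate of the winner on data that played no part in selection.
val_error = 1.0 - candidates[best_name].score(X_val, y_val)
print(best_name, round(val_error, 3))
```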
  • cliftonarms Member Posts: 32 Contributor II
    Thanks for the quick response earmijo.

    Can I just confirm: you are advocating using the "best" model created by the 10-fold x-validation method, rather than retraining a model with the "best" settings on the complete data set?
  • earmijo Member Posts: 270 Unicorn
    In RapidMiner, X-Validation is used to estimate the "out-of-sample" error, but the model it reports is the one trained on the entire dataset. Notice, for instance, that with 10-fold X-validation the model is estimated 11 times: once for each of the 10 folds, plus once on the full data.
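
    The "11 times" count can be illustrated with a hand-written loop. This is a sketch in Python with scikit-learn that mimics the behaviour described above; it is not RapidMiner's implementation, and the data and learner are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

n_fits = 0
fold_scores = []

# Ten fits: each one trains on nine folds and is scored on the held-out fold.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X):
    fold_model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    n_fits += 1
    fold_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# The averaged fold score is the out-of-sample estimate.
cv_estimate = sum(fold_scores) / len(fold_scores)

# Eleventh fit: the model that is actually delivered, trained on all the data.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
n_fits += 1

print(n_fits)  # 11
```

    The ten fold models are only used for scoring and are then discarded; `final_model` is the one that would be applied to new rows.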
  • cliftonarms Member Posts: 32 Contributor II
    Fantastic - I understand - Thank you for your advice.
  • Danyo83 Member Posts: 41 Contributor II
    I have a question. Do you apply the trained model with the model applier right after the XValidation, or do you have to train again over the whole training set once the XValidation has run? I am asking because if you do a feature selection with an inner XValidation, you don't get a model out of the feature selection (there is no connection point). You could, however, save the model with a "remember operator" inside the FS, recall the model outside the FS operator, and combine it with the feature weights operator for the unseen test set. But I think one has to retrain over the full training set with the selected features, right?
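
    The retraining workflow suggested in the question could look roughly like this. It is a sketch in Python with scikit-learn and says nothing about how RapidMiner's operators behave; the forward selection method, the number of features kept, the learner and the data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4, random_state=0
)
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# Feature selection with an inner 5-fold cross-validation, run on the
# training data only. It yields a feature mask, not a usable model.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",
    cv=5,
)
selector.fit(X_train, y_train)
selected = selector.get_support()  # boolean mask over the 12 attributes

# Retrain on the full training set, restricted to the selected features.
final_model = LogisticRegression(max_iter=1000).fit(X_train[:, selected], y_train)

# Apply the same mask to the unseen test set before scoring.
unseen_accuracy = final_model.score(X_unseen[:, selected], y_unseen)
print(int(selected.sum()), round(unseen_accuracy, 3))
```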