Cross Validation # of k's

mbiclar Member Posts: 6 Contributor I
edited December 2018 in Help

Hey guys!

Just wondering, are there any guidelines on how many folds (k) to use when doing cross-validation?

Let's say I have 100k data points. How many folds would be considered enough?

Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    The default setting (10) has been a consensus for a long time.

    Depending on your data and the stability of your models, you could get away with fewer or need more.

    Try different values of k and check how much both the main performance number and its calculated variance change across them. If these stay stable, you have enough data and stable enough models, so you can go with fewer iterations.
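
As a rough illustration of that "try different values" check outside RapidMiner, here is a minimal pure-Python sketch. Everything in it is an assumption made for the demo: the synthetic two-class data, the toy threshold classifier, and the helper names `k_fold_indices` and `cross_validate`. It only shows the shape of the comparison: run CV for several k and look at the mean and spread of the fold scores.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k, seed=0):
    """Return the per-fold accuracy of a toy threshold classifier."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # "Training": place the threshold midway between the two class means.
        mean0 = statistics.mean(xs[i] for i in train_idx if ys[i] == 0)
        mean1 = statistics.mean(xs[i] for i in train_idx if ys[i] == 1)
        threshold = (mean0 + mean1) / 2
        # Predict class 1 when x is above the threshold, then count the hits.
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Synthetic data (an assumption for this demo): two overlapping Gaussian classes.
rng = random.Random(42)
ys = [i % 2 for i in range(2000)]
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]

for k in (3, 5, 10, 20):
    scores = cross_validate(xs, ys, k)
    print(f"k={k:2d}  mean={statistics.mean(scores):.3f}  "
          f"stdev={statistics.stdev(scores):.3f}")
```

If the mean barely moves between rows and the spread stays small, that is the "stable" situation described above, and the smaller k is probably good enough.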

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I agree with @BalazsBarany that 10 folds is the default consensus, but with large datasets you can usually get away with 5. As noted, stability of the performance is the key measure. If you have a small dataset, you might consider the leave-one-out option, but for larger datasets it is not at all recommended, since it trains one model per data point.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
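
To put rough numbers on the options above for the 100k-point dataset from the question, here is a small arithmetic sketch. The helper name `fold_plan` is hypothetical, and the row counts are approximate whenever k does not divide the dataset evenly.

```python
def fold_plan(n_rows, k):
    """Return (model fits, approx. training rows per fit, approx. test rows per fit)."""
    test_rows = n_rows // k
    return k, n_rows - test_rows, test_rows


n_rows = 100_000  # dataset size taken from the question
for label, k in (("5-fold", 5), ("10-fold", 10), ("leave-one-out", n_rows)):
    fits, train_rows, test_rows = fold_plan(n_rows, k)
    print(f"{label:14s} fits={fits:>7,}  train={train_rows:>7,}  test={test_rows:>6,}")
```

Leave-one-out is simply k equal to the number of rows, which is why its cost grows with the dataset while 5-fold and 10-fold stay at a fixed number of model fits.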
  • Pekka_Jounela Member, University Professor Posts: 4 University Professor

    I agree. If you need a reference for that, see: Ron Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1137-1143, August 20-25, 1995, Montreal, Quebec, Canada.

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    The choice of k is an example of the bias-variance trade-off present in every estimation.

    Leave-one-out CV is the least biased option, but it can have very high variance (the models are trained on datasets that differ by only one point, so they are highly correlated).

    CV with smaller values of k tends to be more biased (it overestimates the error, because each model is trained on a smaller share of the data) but has lower variance.

    In practical terms, if the estimate of model performance is very important, you can run several CVs with k ranging from 5 to 20 and then choose the one with the highest variance you are still willing to accept. If the estimate is not very important (e.g., it is used only for feature selection or parameter optimization), you can leave k at 10, or reduce it to 5 if you need it to be fast.
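
One possible reading of that selection rule, sketched in plain Python: among the k values you tried, keep those whose fold scores have a standard deviation within your tolerance, then take the largest. The function name `pick_k`, the tolerance values, and the fold scores below are all made-up placeholders for illustration, not results from any real model.

```python
import statistics


def pick_k(fold_scores_by_k, max_stdev):
    """Return the largest k whose fold-score stdev is <= max_stdev, or None."""
    acceptable = [
        k for k, scores in fold_scores_by_k.items()
        if statistics.stdev(scores) <= max_stdev
    ]
    return max(acceptable) if acceptable else None


# Placeholder per-fold accuracies (invented purely for illustration).
fold_scores_by_k = {
    5: [0.80, 0.81, 0.79, 0.80, 0.80],
    10: [0.78, 0.83, 0.80, 0.79, 0.82, 0.81, 0.77, 0.80, 0.82, 0.78],
}

print(pick_k(fold_scores_by_k, max_stdev=0.03))
print(pick_k(fold_scores_by_k, max_stdev=0.01))
```

Taking the largest acceptable k follows from the trade-off described above: larger k means less bias, so you go as high as the variance allows. If nothing meets the tolerance, the function returns `None` rather than guessing.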
