Learner III mbiclar

Cross Validation # of k's

Hey guys!


Just wondering, are there any guidelines on how many folds (k) to use when doing cross-validation?

Let's say I have 100k data points; how many folds would be considered enough?

Community Manager Community Manager

Re: Cross Validation # of k's

The default setting (10) has been a consensus for a long time. 


Depending on your data and the stability of your models, you could get away with less or need more.


Try different values and watch how both the main performance number and its calculated variance change. If these stay stable, you have enough data and stable enough models, so you can go with fewer iterations.
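
As a rough illustration of that experiment outside RapidMiner, here is a minimal pure-Python sketch. Everything in it is an assumption made for the example: the synthetic two-class data, the toy threshold model, and the helper names (`make_data`, `fit_threshold`, `cross_validate`, `summarize`) are hypothetical and not part of any RapidMiner operator. It runs k-fold cross-validation for several values of k and reports the mean accuracy together with the standard deviation across folds.

```python
import random
import statistics


def make_data(n, seed=0):
    """Synthetic 1-D two-class data: class 0 centred at 0.0, class 1 at 1.5."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data


def fit_threshold(train):
    """Toy model: a decision threshold halfway between the two class means."""
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, test):
    """Fraction of test rows where 'x above threshold' matches label 1."""
    hits = sum((x > threshold) == (y == 1) for x, y in test)
    return hits / len(test)


def cross_validate(data, k, seed=0):
    """Return one accuracy value per fold for k-fold cross-validation."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


def summarize(data, ks=(3, 5, 10, 20)):
    """Map each k to (mean accuracy, standard deviation across folds)."""
    result = {}
    for k in ks:
        scores = cross_validate(data, k)
        result[k] = (statistics.fmean(scores), statistics.stdev(scores))
    return result


if __name__ == "__main__":
    for k, (mean, std) in summarize(make_data(2000)).items():
        print(f"k={k:2d}  accuracy={mean:.3f}  std across folds={std:.3f}")
```

If the mean barely moves between k values and the spread stays in a range you are comfortable with, that would be the kind of stability described above. The spread across folds tends to grow with k simply because each test fold gets smaller, so it is worth comparing it with that in mind.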

Balázs Bárány
Data Scientist, Vienna
RM Certified Expert

Re: Cross Validation # of k's

I agree with @BalazsBarany that 10 folds is the default consensus, but with large datasets you can usually get away with 5. As noted, stability of the performance is the key measure. If you have a small dataset you might consider the leave-one-out option, but for larger datasets it is not recommended at all.
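
To put numbers on why leave-one-out does not scale, here is a small sketch of the arithmetic. The helper name `cv_cost` is made up for this example, and the 100,000 rows are taken from the question above. Leave-one-out is treated simply as k-fold with k equal to the number of rows.

```python
def cv_cost(n_rows, k):
    """Return (number of model fits, approximate training rows per fit)."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    test_rows = n_rows // k  # rows held out per fold (rounded down)
    return k, n_rows - test_rows


if __name__ == "__main__":
    n = 100_000
    for label, k in [("5-fold", 5), ("10-fold", 10), ("leave-one-out", n)]:
        fits, train_rows = cv_cost(n, k)
        print(f"{label:>13}: {fits:>7} fits, about {train_rows} training rows each")
```

Going from 5 to 10 folds doubles the number of model fits while the training set grows from 80% to 90% of the data; leave-one-out on the same data would need 100,000 fits.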


Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor I Pekka_Jounela

Re: Cross Validation # of k's

I agree, and if you need a reference for that, see: Ron Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection", Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1137-1143, August 20-25, 1995, Montreal, Quebec, Canada.

RM Staff

Re: Cross Validation # of k's

The choice of k is an example of the Bias-Variance trade-off present in every estimation. 


Leave-one-out CV is the least biased option, but it can have very high variance (the models are trained on datasets that differ by only one point, so they are highly correlated).


CVs with smaller values of k tend to be more biased (they overestimate the error, since each model is trained on a smaller share of the data) but have lower variance.


In practical terms, if the estimate of the model performance is very important, you can run several CVs with k ranging from 5 to 20 and then choose the one whose variance is still acceptable to you. If the estimate is not very important (e.g. it is used only for feature selection or parameter optimization), you can leave k at 10, or reduce it to 5 if you need results quickly.
