
# Correlation value at 0 with leave-one-out cross validation

Hello,

I've been noticing a phenomenon I don't quite know how to explain, and which seems related to what is described in this previous post. I train a linear regression model on a dataset (take the Polynomial dataset, for instance, with 200 examples) using cross-validation with shuffled sampling: on the training side, there is simply a linear regression with the default parameters; on the testing side, an apply model and a performance evaluation. I then vary the number of folds. Here are some values I observed with that dataset:

- 5-fold CV: correlation = 0.894 +/- 0.026 (micro average: 0.892)
- 10-fold CV: correlation = 0.902 +/- 0.038 (micro average: 0.891)
- 20-fold CV: correlation = 0.909 +/- 0.080 (micro average: 0.894)
- 50-fold CV: correlation = 0.899 +/- 0.174 (micro average: 0.894)
- 100-fold CV: correlation = 0.960 +/- 0.197 (micro average: 0.894)
- 150-fold CV: correlation = 0.300 +/- 0.460 (micro average: 0.894)
- 200-fold CV: correlation = 0.000 +/- 0.000 (micro average: 0.894)

So does it mean the "best" value for the number of folds in that case is half the number of examples in the dataset? If so, why is that? Or should I only rely on the micro-averages which are pretty stable?
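To make the effect of small test folds concrete, here is a minimal sketch in plain NumPy (not the tool used above; the helper name `pearson` and the toy numbers are made up for illustration). It shows that a Pearson correlation computed on a single test example is undefined, and that on two test examples it can only be +1 or -1, whatever the model quality. With 200 examples, 200 folds leave one example per test fold and 100 folds leave two, which would explain why the per-fold averages drift while the micro average stays put.

```python
import numpy as np

def pearson(y_true, y_pred):
    """Pearson correlation; NaN when undefined (fewer than 2 points or zero variance)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.size < 2:
        return float("nan")
    dt = y_true - y_true.mean()
    dp = y_pred - y_pred.mean()
    denom = np.sqrt((dt ** 2).sum() * (dp ** 2).sum())
    if denom == 0.0:
        return float("nan")
    return float((dt * dp).sum() / denom)

# Test fold with one example (leave-one-out): no variance, so no correlation.
r_one = pearson([3.0], [2.5])

# Test fold with two examples: the two points always lie on a line,
# so the result is +1 or -1 even when the predictions are poor.
r_two_same_order = pearson([1.0, 4.0], [2.0, 2.1])
r_two_flipped = pearson([1.0, 4.0], [2.1, 2.0])

print(r_one, r_two_same_order, r_two_flipped)
```

A tool that reports an undefined per-fold correlation as 0 (an assumption on my part, but consistent with the 0.000 +/- 0.000 at 200 folds) would then average a mix of zeros and values near +/-1, which says little about the model itself.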


## Answers

https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/

Considering that correlation cannot be calculated appropriately in LOOCV, how can I obtain the as-best-as-we-can-achieve "ideal" estimate of model performance, i.e. the red line on the first figure in your link?*

* For the record, I know other metrics should be considered to evaluate a regression, including at least one like the RMSE to estimate how far off the estimated numbers are from the reality. But those do not have the same problem that correlation has with LOOCV, which is why I don't focus on them here.
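One possible way around this, sketched below under stated assumptions, is to collect the single held-out prediction from every LOOCV round and compute one correlation over all of them at the end, rather than one correlation per fold. The data here is synthetic (it is not the Polynomial dataset), the model is an ordinary least-squares fit via `numpy.linalg.lstsq`, and whether this pooled value is exactly what the "micro average" reports is an assumption.

```python
import numpy as np

# Synthetic regression data (illustrative only): 200 examples, 3 features.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column.
A = np.column_stack([np.ones(n), X])

# Leave-one-out: fit on n-1 examples, predict the one left out,
# and store that prediction instead of scoring the fold on its own.
preds = np.empty(n)
for i in range(n):
    train = np.arange(n) != i
    coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    preds[i] = A[i] @ coef

# One correlation and one RMSE over all 200 held-out predictions.
pooled_corr = np.corrcoef(y, preds)[0, 1]
rmse = np.sqrt(np.mean((y - preds) ** 2))

print(f"pooled correlation: {pooled_corr:.3f}, RMSE: {rmse:.3f}")
```

Pooling keeps every prediction out-of-sample while giving the correlation enough points to be defined, at the cost of losing a per-fold standard deviation.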