Correlation value at 0 with leave-one-out cross validation

Vorty Member Posts: 6 Contributor I
Hello,

I've been noticing a phenomenon I don't quite know how to explain, somehow related to what is described in this previous post. I'm training a linear regression model on a dataset (consider the Polynomial sample dataset, for instance, with 200 examples) using cross-validation with shuffled sampling: on the training side there is simply a linear regression with default parameters; on the testing side, an Apply Model and a performance evaluation. What I'm doing is varying the number of folds the model gets trained on. Here are some values I observed with that dataset:
  • 5-fold CV: correlation = 0.894 +/- 0.026 (micro average: 0.892)
  • 10-fold CV: correlation = 0.902 +/- 0.038 (micro average: 0.891)
  • 20-fold CV: correlation = 0.909 +/- 0.080 (micro average: 0.894)
  • 50-fold CV: correlation = 0.899 +/- 0.174 (micro average: 0.894)
  • 100-fold CV: correlation = 0.960 +/- 0.197 (micro average: 0.894)
  • 150-fold CV: correlation = 0.300 +/- 0.460 (micro average: 0.894)
  • 200-fold CV: correlation = 0.000 +/- 0.000 (micro average: 0.894)
So, clearly, increasing the number of folds at first means the model is trained on more data in each fold, so it makes sense that the performance increases slightly up to 100 folds. What confuses me is what happens after that. I agree it doesn't quite make sense to do 150 folds, since I'm not sure how one divides 200 examples into 150 folds; I assume there may be some repetition in the training sets? Still, I'd expect a warning telling me it's potentially problematic, and I don't quite get why the correlation value collapses. Finally, at 200 folds, which is equivalent to leave-one-out CV, the correlation value is 0.

So does that mean the "best" value for the number of folds in this case is half the number of examples in the dataset? If so, why? Or should I rely only on the micro-averages, which are pretty stable?
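For reference, the whole experiment can be reproduced outside RapidMiner with a short Python sketch (assumptions: synthetic linear data stands in for the Polynomial sample set, and numpy's least squares stands in for the Linear Regression operator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Polynomial sample set (assumption: any small,
# roughly linear regression dataset shows the same effect).
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = 2 * X[:, 0] + rng.normal(0, 2, size=n)

def kfold_correlation(X, y, k):
    """Average per-fold correlation ("macro") and pooled correlation ("micro")."""
    idx = rng.permutation(len(y))                  # shuffled sampling
    per_fold, pred_all, true_all = [], [], []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A = np.c_[np.ones(len(train)), X[train]]   # intercept + slope
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[np.ones(len(fold)), X[fold]] @ coef
        pred_all.extend(pred)
        true_all.extend(y[fold])
        if len(fold) > 1:                          # correlation needs >= 2 points
            per_fold.append(np.corrcoef(pred, y[fold])[0, 1])
    macro = float(np.mean(per_fold)) if per_fold else float("nan")
    micro = float(np.corrcoef(pred_all, true_all)[0, 1])
    return macro, micro

for k in (5, 10, 100, 150, 200):
    macro, micro = kfold_correlation(X, y, k)
    print(f"k={k:3d}  macro={macro:.3f}  micro={micro:.3f}")
```

Note that when 200 examples are split into 150 folds, some folds get two examples and the rest only one; the correlation of two distinct points is always exactly +1 or -1, which plausibly explains the collapse around 150 folds, while at 200 folds every fold has a single point and the per-fold correlation is undefined. The pooled (micro) value is unaffected.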

Answers

  • jacobcybulski Member, University Professor Posts: 388 Unicorn
    edited December 2020
    To calculate correlation you need several data points; when you get only one data point for validation, it is not possible to calculate correlation. There is no such thing as the best "practical" value for the number of folds. Usually, when you have little data, you use LOOCV. However, LOOCV introduces some bias, as you overtrain your model as compared to validation (and of course you cannot calculate correlation). When you have a lot of data, you may wish to reduce the number of folds (e.g. to 3 or 5) to reduce the waiting time, though then you may have issues with variance. Typically, 10 folds are generally accepted as good enough for normal validation.
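To make this concrete: Pearson correlation divides the covariance by the two standard deviations, and with a single validation point both deviations are zero, so the ratio is undefined. A pure-Python sketch (the helper pearson is just illustrative, not a RapidMiner internal):

```python
import math

def pearson(xs, ys):
    """Pearson r = covariance / (std_x * std_y); undefined when a std is 0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:          # always true for a single point
        return float("nan")
    return cov / (sx * sy)

print(pearson([1.0], [1.2]))            # nan: a LOOCV fold has one point
print(pearson([1.0, 2.0], [1.1, 2.3]))  # two distinct points give exactly +/-1
print(pearson([1, 2, 3], [2.1, 3.9, 6.2]))
```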
  • jacobcybulski Member, University Professor Posts: 388 Unicorn
    Also - NEVER try to "optimise" the model performance by changing the number of folds. The aim of cross-validation is to assess the realistic model performance when it is deployed. The deployed model is trained on all available data once all your experiments are done and finished. (This is why cross-validation at the end returns the model trained on all examples)
  • Vorty Member Posts: 6 Contributor I
    Thanks @jacobcybulski - those answers make sense. I was precisely wondering what would be the best assessment of performance in that case: 200 examples is relatively small and so LOOCV made sense to me in that case, with the drawback that indeed correlation cannot be calculated.
    I don't think I'm trying to "optimize" the model performance, because I'm aware the real model is trained on all the data. I'm simply trying to get an idea of which number of folds gives the estimate closest to what the model's performance will really be: from that point of view, the more folds the better the estimate, up to the limit on correlation calculation shown by the numbers above. So should I assume that the "best estimate" of model performance I can obtain, while still being able to calculate a meaningful correlation, is indeed number_of_folds = number_of_examples / 2?
    Empirically it seemed to work, but I didn't want to rely exclusively on such an empirical choice.
  • jacobcybulski Member, University Professor Posts: 388 Unicorn
    It is possible to conduct a sensitivity analysis on your CV k. If you can bear Python, the following article gives an excellent explanation and an example of how to do this:
    https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/
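In the same spirit as that article, here is a self-contained sketch of such a sensitivity analysis (assumption: it scores with MAE rather than correlation, since MAE stays defined even for the one-example folds of LOOCV, so LOOCV can serve as the "ideal" reference):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data of the same size as the thread's example (200).
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = 2 * X[:, 0] + rng.normal(0, 2, size=n)

def cv_mae(k):
    """Mean absolute error of linear regression under shuffled k-fold CV."""
    idx = rng.permutation(n)
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A = np.c_[np.ones(len(train)), X[train]]
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[np.ones(len(fold)), X[fold]] @ coef
        errors.extend(np.abs(pred - y[fold]))
    return float(np.mean(errors))

ideal = cv_mae(n)                       # LOOCV: the "red line" reference
for k in (2, 5, 10, 20, 50):
    print(f"k={k:2d}  MAE={cv_mae(k):.3f}  (LOOCV ideal={ideal:.3f})")
```

The idea is to read off where the estimate stops moving as k grows, not to pick the k with the nicest-looking score.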
  • Vorty Member Posts: 6 Contributor I
    This link is extremely interesting, thanks again! I wasn't familiar with the term "sensitivity analysis", which is indeed what I had in mind (although you are correct that the goal is not so much to maximize the metric, here the correlation score, as to approximate it as accurately as possible).
    So maybe a proper way to reformulate my initial question would be: when performing a sensitivity analysis of the number of folds k used in cross-validation to estimate model performance in terms of correlation*, and given that correlation cannot be calculated appropriately in LOOCV, how does one obtain the best achievable "ideal" estimate of model performance, i.e. the red line in the first figure of your link?

    * for the record, I know other metrics should be considered to evaluate a regression, including at least one like the RMSE to estimate how far off the estimated numbers are from the reality. But those do not have the same problem that correlation has with LOOCV, which is why I don't focus on them here.

    My guess is still that it is with number_of_folds k = number_of_examples / 2

  • jacobcybulski Member, University Professor Posts: 388 Unicorn
    In validation, ultimately you want to minimise the chance that your fold samples have a distribution different from that of the population, and then average the performance across folds. If your data set is large, say 10,000 examples, two random samples splitting the data set 70-30% are very likely to have the distribution profile of the population; the chance that your training or validation set is very different from the population is very low, assuming the entire data set is representative. This means that for large data sets a simple holdout validation is sufficient. For smaller data sets, say 200 examples, things are more difficult, as the chance of a sample being different from the population is high, so we need to repeat sampling and testing several times. In any case, doing so 100 times is probably not needed, and I'd go with the default of 10 folds. If your data set is very small, say 30 examples, you are in big trouble: you really need to use it all, and in this case I'd always go for LOOCV, or, better, get more data.
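The sample-size argument above is easy to check numerically; a minimal sketch (assumption: repeated 70-30 holdout splits of a synthetic linear dataset, comparing the spread of the resulting MAE estimates):

```python
import numpy as np

rng = np.random.default_rng(2)

def holdout_spread(n, repeats=200):
    """Std of the 70/30 holdout MAE estimate across repeated random splits."""
    X = rng.uniform(-3, 3, size=n)
    y = 2 * X + rng.normal(0, 2, size=n)
    scores = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        cut = int(0.7 * n)
        tr, te = idx[:cut], idx[cut:]
        slope, intercept = np.polyfit(X[tr], y[tr], 1)   # fit on the 70% split
        scores.append(np.mean(np.abs(slope * X[te] + intercept - y[te])))
    return float(np.std(scores))

print(f"n=200:    spread={holdout_spread(200):.3f}")
print(f"n=10000:  spread={holdout_spread(10_000):.3f}")
```

The spread of the holdout estimate shrinks roughly as one over the square root of the test-set size, which is why a single split is fine at 10,000 examples but much noisier at 200.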