# Newbie question: XValidation

nicugeorgian, Member, Posts: **31**, Guru, in Help

Hi,

For a cross validation process with, e.g., *XValidation*, the example set *S* is split up into, say, 3 subsets: *S*₁, *S*₂, and *S*₃.

The inner operator of *XValidation* is then applied

once with *S*₁ as *test* set and *S*₂ ∪ *S*₃ as *training* set, **and then**

once with *S*₂ as *test* set and *S*₁ ∪ *S*₃ as *training* set, **and then**

once with *S*₃ as *test* set and *S*₁ ∪ *S*₂ as *training* set.

For each of these runs, a model is returned.

My question is: how do you decide, in general, which model is the best? Or is there **no best** model ...

Thanks,

Geo

## Answers

1,751 RM Founder

This is actually the question we have been asked most often over the last years, and there seems to be a lot of misunderstanding about how to properly evaluate models with cross validation techniques. The answer is as simple as this: *none* of the models created for the single folds is the best. The best one is the one trained on the complete data set or on a well-chosen sample (it is not the task of a cross validation to find such a sample).

If you ask which is the best one, I would ask: "What *should* be the best model?" The one with the lowest error on the corresponding test set? Well, this would again be a kind of overfitting, this time not to a training set but to a test set. So it is probably not a good idea to select a model because of its test error alone.

The best thing one can do is to think of cross validation as a process which is completely independent of the learning process:

1) One process is the learning process, which is performed on the complete data.

2) But now you also want to know how well your model will perform when it is employed on completely unseen data. This is where the second process comes into play: estimating the predictive power of your model. The best estimation you could get is calculated with leave-one-out (LOO), where all but one of the examples are used for training and only the remaining one for the test. Since almost all examples are used for training, the resulting model is the one most similar to the model trained on the complete data. Since LOO is rather slow on large data sets, we often use a k-fold cross validation in order to get a good estimation in less time.
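The two processes described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the nearest-centroid learner, the synthetic one-feature data, and all function names are made up for the example and are not RapidMiner operators. The point is that the fold models are used for the estimate and then thrown away, while the model you keep is trained on all the data.

```python
import random
from statistics import mean

def train(examples):
    """Toy learner (assumption for illustration): one centroid per class on a single feature."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def predict(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def accuracy(model, examples):
    return mean(1.0 if predict(model, x) == y else 0.0 for x, y in examples)

def cross_validate(examples, k, seed=0):
    """Estimate predictive accuracy with k-fold cross validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        fold_model = train(training)  # used for the estimate only, then discarded
        scores.append(accuracy(fold_model, test_fold))
    return mean(scores)

# Synthetic data (assumption): two overlapping classes, "B" and "G".
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), "B") for _ in range(60)] + \
       [(rng.gauss(3.0, 1.0), "G") for _ in range(60)]

estimated_accuracy = cross_validate(data, k=3)  # process 2: performance estimate
final_model = train(data)                       # process 1: the model you actually use

print(f"estimated accuracy: {estimated_accuracy:.3f}")
print("final model centroids:", final_model)
```

Note that `cross_validate` never returns a model at all, which mirrors the answer: the estimate describes how `final_model` is expected to behave on unseen data.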

Hope that makes things a bit clearer. Cheers,

Ingo

31 Guru

I have somehow anticipated your answer when I wrote

1,751 RM Founder

Cheers,

Ingo

23 Maven

By using too much data, too little data, or by comparing and optimising your model on different data sets (bootstrapping), you run the risk of overfitting your model.

Taking a set of cases 'C' and a model 'M', the Coefficient of Concordance (CoC) indicates how well a model can separate the cases into the defined categories. [M.G. Kendall (1948) Rank correlation methods, Griffin, London]

When the CoC of a model is 50%, you actually have a random model (below 50%, your model is "cross-wired"), so 50% is the lowest CoC you will get.

Accuracy is a measure that indicates the share of cases your model scores correctly out of the total number of cases; this is different from your CoC.

These two measures (CoC and accuracy) determine how good a model is.

For instance, when we sort the scored cases by the outcome predicted and the actual outcome:

....BBBBBBBBBBBBBBBBBB|GGGGGGGGGGGGGGGGGG.... Here we have 100% CoC

The accuracy is determined by the number of cases that are actually scored correctly.

....BBBBBBBBB|GBBGBGGBBBGGBGGGB|GGGGGGGGGG.... Here is a more realistic picture; naturally, the CoC is below 100%.

Now, by strategically determining where to place your cut-off, you determine the accuracy.

If you place your cut-off higher, you take a lower risk, and your accuracy will be high.

By accepting more risk, with a lower accuracy, you place your cut-off lower, allowing yourself a bigger market share.
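To make the difference between the two measures concrete, here is a small Python sketch. It rests on an assumption: the CoC is read here as pairwise concordance, i.e. the share of (B, G) pairs in which the G case gets the higher score, with ties counted as half. The function names, the toy rankings, and the rule "predict G at or above the cut-off" are illustrative choices, not taken from the post.

```python
def concordance(scored):
    """Share of (B, G) pairs in which the G case has the higher score; ties count half."""
    goods = [score for score, label in scored if label == "G"]
    bads = [score for score, label in scored if label == "B"]
    total = 0.0
    for g in goods:
        for b in bads:
            if g > b:
                total += 1.0
            elif g == b:
                total += 0.5
    return total / (len(goods) * len(bads))

def accuracy_at(scored, cutoff):
    """Accuracy when every case with score >= cutoff is predicted G, all others B."""
    hits = sum((score >= cutoff) == (label == "G") for score, label in scored)
    return hits / len(scored)

# Perfectly separated ranking: every B scores below every G.
perfect = [(i, "B") for i in range(5)] + [(i, "G") for i in range(5, 10)]

# Overlapping ranking: the position in the string is used as the score.
mixed = list(enumerate("BBBGBBGBGGBGGG"))

print("CoC, perfect ranking:", concordance(perfect))
print("CoC, mixed ranking:  ", round(concordance(mixed), 3))
for cutoff in (3, 7, 11):
    print(f"accuracy of mixed ranking at cut-off {cutoff}:", round(accuracy_at(mixed, cutoff), 3))
```

The CoC is a property of the ranking alone, whereas the accuracy changes each time the cut-off is moved along the same ranking.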

347 Maven

...reviving an old discussion...

My question is (since I am currently checking the possibilities for validating a ranking classifier without applying a cutoff / threshold): why should anyone bother to use the CoC? It is much easier to calculate the sum of the ranks of the TP. This value can easily be transformed to the [0,1] interval (e.g. 1 = optimal ranking, 0 = worst ranking).
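One possible way to do the rescaling mentioned above is sketched below in Python. It assumes that "ranks of the TP" means the ranks of the truly positive cases when all cases are sorted by score, and that there are no tied scores; the function name and the toy data are made up. With these assumptions the rescaled rank sum equals the Mann-Whitney U statistic divided by the number of positive/negative pairs, which is the same quantity as pairwise concordance.

```python
def normalized_rank_sum(scored):
    """Rank sum of the positives, rescaled to [0, 1].

    `scored` is a list of (score, is_positive) pairs without tied scores.
    1.0 means all positives are ranked above all negatives, 0.0 the opposite.
    """
    ordered = sorted(scored, key=lambda pair: pair[0])  # rank 1 = lowest score
    n_pos = sum(1 for _, is_pos in ordered if is_pos)
    n_neg = len(ordered) - n_pos
    rank_sum = sum(rank for rank, (_, is_pos) in enumerate(ordered, start=1) if is_pos)
    lowest_possible = n_pos * (n_pos + 1) // 2  # positives occupy ranks 1..n_pos
    return (rank_sum - lowest_possible) / (n_pos * n_neg)

best = [(0.9, True), (0.8, True), (0.2, False), (0.1, False)]
worst = [(0.2, True), (0.1, True), (0.9, False), (0.8, False)]
mixed = [(0.9, True), (0.7, False), (0.5, True), (0.1, False)]

print(normalized_rank_sum(best))   # 1.0
print(normalized_rank_sum(worst))  # 0.0
print(normalized_rank_sum(mixed))  # 0.75
```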

I know that the CoC is the value of the test statistic for the Kendall CoC test, so a statistical test can be applied. But this test only tells you whether there is any difference (in agreement) at all, just like ANOVA. I am looking for a test for multiple comparisons that shows WHERE the difference occurs (e.g. the Tukey test). The only test I have found for this case is the Friedman rank sum test.

Another one:

[quote author=mierswa]

But now you also want to know how well your model will perform when it is employed on completely unseen data. This is where the second process comes into play: estimating the predictive power of your model. The best estimation you could get is calculated with leave-one-out (LOO), where all but one of the examples are used for training and only the remaining one for the test. Since almost all examples are used for training, the resulting model is the one most similar to the model trained on the complete data. Since LOO is rather slow on large data sets, we often use a k-fold cross validation in order to get a good estimation in less time.

[/quote]

Hm, hm. Recently I read a very interesting PhD thesis by Ron Kohavi, who has shown that LOO reduces the variance but increases the bias (i.e. stability). Imagine a binary classification problem in which 50% of all instances have label=1 and 50% have label=0. Now apply a majority classifier. Using LOO cross validation, the accuracy will be zero.
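The majority-classifier example can be checked with a few lines of Python. This is a minimal sketch with made-up function names: whenever one instance is held out of a perfectly balanced set, its own class becomes the minority in the training data, so the majority vote is wrong every single time.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]

def leave_one_out_accuracy(labels):
    """Leave-one-out accuracy of a classifier that always predicts the training majority."""
    hits = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        if majority_label(training) == held_out:
            hits += 1
    return hits / len(labels)

labels = [1] * 50 + [0] * 50  # perfectly balanced binary problem
loo_accuracy = leave_one_out_accuracy(labels)
print(loo_accuracy)  # 0.0
```

A majority classifier trained on all 100 instances would be right about half of the time on new data from the same distribution, so the LOO estimate of 0% is far too pessimistic here.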

However, Kohavi concludes that it is best to apply 6- to 10-fold cross validation and to repeat it 10-20 times to reduce variance. Note that repeating the CV increases the alpha error if you plan to use statistical tests to validate the results.
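A repeated k-fold cross validation of this kind could look roughly like the following Python sketch. The majority-vote learner, the 52/48 label split, and the function names are assumptions chosen to keep the example short; each repetition reshuffles the data, and the reported estimate is the average over all repetitions.

```python
import random
from collections import Counter
from statistics import mean, stdev

def majority_accuracy(training, test):
    """Toy learner (assumption): predict the majority label of the training fold."""
    majority = Counter(training).most_common(1)[0][0]
    return sum(label == majority for label in test) / len(test)

def k_fold_estimate(labels, k, rng):
    """One k-fold cross validation run on a fresh shuffle of the data."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        training = [y for j, fold in enumerate(folds) if j != i for y in fold]
        scores.append(majority_accuracy(training, test))
    return mean(scores)

def repeated_k_fold(labels, k=10, repeats=10, seed=0):
    """Average several k-fold runs; also report the spread between the runs."""
    rng = random.Random(seed)
    estimates = [k_fold_estimate(labels, k, rng) for _ in range(repeats)]
    return mean(estimates), stdev(estimates)

labels = [1] * 52 + [0] * 48  # slightly imbalanced binary problem
avg, spread = repeated_k_fold(labels, k=10, repeats=10)
print(f"mean accuracy over repetitions: {avg:.3f}, spread between repetitions: {spread:.3f}")
```

The spread between repetitions shows how much a single k-fold run depends on the particular split; averaging over repetitions smooths that out.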

...so we have returned to the suggestion that 10-fold CV is the best procedure you can use. I just wanted to accentuate the argument...

greetings,

Steffen

1,751 RM Founder

Yes, I know Ron's thesis, and this is actually a good point. So my explanation about LOO might be a bit misleading. Anyway, I just wanted to give readers a feeling for how error estimation with any cross-validation-like process and a model learned from the complete data are connected. The reason for this explanation was quite simple: it is probably one of the most frequently asked questions, or at least it used to be some time ago.

Cheers,

Ingo