
Newbie question: XValidation

nicugeorgian Member Posts: 31 Maven
Hi,

For a cross validation process with, e.g., XValidation, the example set S is split up into, say, 3 subsets: S1, S2, and S3.

The inner operator of XValidation is then applied

once with S1 as test set and S2 ∪ S3 as training set, and then

once with S2 as test set and S1 ∪ S3 as training set, and then

once with S3 as test set and S1 ∪ S2 as training set.

For each of these runs, a model is returned.
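
For illustration, here is a minimal sketch of this 3-fold procedure in Python, using scikit-learn's KFold as a stand-in for XValidation (the data set and the learner are placeholders, not from this thread):

[code]
# Minimal 3-fold cross validation sketch; data and learner are placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(90, 4)                # example set S
y = np.random.randint(0, 2, 90)          # binary label

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # e.g. fold 1: S1 is the test set, S2 ∪ S3 is the training set
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {i}: one model returned, test accuracy = {acc:.3f}")
[/code]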

My question is: how do you decide, in general, which model is the best? Or is there no best model ...

Thanks,
Geo

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Geo,

    this is actually one of the questions we have been asked most often during the last years, and there seems to be a lot of misunderstanding about how to properly evaluate models with cross validation techniques. The answer is as simple as this: none of the models created for the single folds is the best. The best one is the one trained on the complete data set or on a well chosen sample (it is not the task of a cross validation to find such a sample).

    If you ask which is the best one, I would ask back: "What should 'the best model' even mean?" The one with the lowest error on its corresponding test set? Well, that would again be a kind of overfitting, only now on the test set instead of the training set. So it is probably not a good idea to simply select a model because of its test error alone.
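
    As a hypothetical aside (not part of the original answer): this selection effect can be seen with a tiny simulation in which many models with identical, chance-level true performance are compared on a single test set; the best observed test accuracy looks far better than any of the models really is.

    [code]
# Toy illustration: 200 "models" that all just guess at random are compared
# on one test set; the winner's test accuracy is well above the true 50%.
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=100)      # labels of a 100-example test set

# test accuracy of each random "model" (true accuracy of every one: 50%)
accs = [np.mean(rng.integers(0, 2, size=100) == y_test) for _ in range(200)]
print(f"best test accuracy among 200 random models: {max(accs):.2f}")
    [/code]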

    The best thing one can do is to think of cross validation as a process which is completely independent of the learning process:

    1) One process is the learning process which is performed on the complete data.

    2) But now you also want to know how well your model will perform when it is applied to completely unseen data. This is where the second process comes into the game: estimating the predictive power of your model. The best estimation you could get is calculated with leave-one-out (LOO), where all but one example are used for training and only the remaining one for testing. Since almost all examples are used for training, each fold's model is as similar as possible to the model trained on the complete data. Since LOO is rather slow on large data sets, we often use a k-fold cross validation in order to get a good estimation in less time.
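
    Here is a rough sketch of these two separate processes in scikit-learn terms (data set and learner are placeholders, nothing RapidMiner-specific):

    [code]
# (1) train the final model on the complete data;
# (2) use cross validation only to estimate how well such a model will
#     perform on unseen data.
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, 200)

# Process 1: the model you would actually deploy
final_model = DecisionTreeClassifier().fit(X, y)

# Process 2: estimate its predictive power; 10-fold CV as a faster stand-in for LOO
cv_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(f"10-fold estimate: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Leave-one-out: one example per test fold; accurate but slow on large data
loo_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(f"LOO estimate:     {loo_scores.mean():.3f}")
    [/code]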

    Hope that makes things a bit clearer. Cheers,
    Ingo
  • nicugeorgian Member Posts: 31 Maven
    Ingo, many thanks for the very detailed answer!

    I had somehow anticipated your answer when I wrote:
    Or is there no best model ...
      :)
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    I had somehow anticipated your answer when I wrote:
    I already thought so, but we get this question so often that I figured a longer answer might be a good idea, so that we can post a link here in the future  ;)

    Cheers,
    Ingo
  • reports01 Member Posts: 23 Maven
    Creating a good model is a tricky business.
    By using too much data or too little, or by comparing and optimising your model on different data sets (bootstrapping), you run the risk of overfitting your model.

    Taking a set of cases 'C' and a model 'M', the Coefficient of Concordance (CoC) is an indication of how well a model can separate cases into the defined categories. [M.G. Kendall (1948) Rank correlation methods, Griffin, London]
    When the CoC of a model is 50%, you actually have a random model (below 50%, your model is "cross-wired"), so 50% is the lowest CoC you will get.
    Accuracy is a measure that indicates the number of correctly classified cases of your model in comparison to the total number of cases; this is different from the CoC.

    These two measures (CoC and Accuracy) determine how good a model is.


    For instance, when we sort the scored cases by the predicted score and mark the actual outcome (B = bad, G = good):

            ....BBBBBBBBBBBBBBBBBB|GGGGGGGGGGGGGGGGGG....      Here we have 100% CoC
    the accuracy is determined by the number of cases that are actually scored correctly

            ....BBBBBBBBB|GBBGBGGBBBGGBGGGB|GGGGGGGGGG....      Here is a more realistic picture; naturally, the CoC is below 100%

    Now, by strategically determining where to place your cut-off, the accuracy can be determined.
    If you place your cut-off higher, you take a lower risk and your accuracy will be high.
    If you accept more risk, and with it a lower accuracy, you will place your cut-off lower, allowing yourself a bigger market share.
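
    Here is a rough sketch of these two measures in Python (my own reading of the post, not Kendall's exact statistic): concordance is computed as the fraction of (good, bad) pairs that the model scores in the right order, so 0.5 corresponds to a random model, while accuracy depends on where the cut-off is placed.

    [code]
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)                 # 1 = good case (G), 0 = bad case (B)
scores = y * 0.5 + rng.normal(0, 0.4, size=200)  # scores of an imperfect model

def concordance(y, scores):
    # fraction of (good, bad) pairs ranked in the right order; ties count half
    good, bad = scores[y == 1], scores[y == 0]
    diff = good[:, None] - bad[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

def accuracy(y, scores, cutoff):
    # fraction of cases on the correct side of the cut-off
    return np.mean((scores >= cutoff).astype(int) == y)

print(f"concordance: {concordance(y, scores):.2f}")
for cutoff in (0.1, 0.25, 0.4):
    print(f"cut-off {cutoff:.2f}: accuracy = {accuracy(y, scores, cutoff):.2f}")
    [/code]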


  • steffen Member Posts: 347 Maven
    Hello

    ...reviving an old discussion...

    My question (since I am currently checking the possibilities for validating a ranking classifier without applying a cutoff / threshold) is: why should anyone bother to use the CoC? It is much easier to calculate the sum of the ranks of the true positives. This value can easily be transformed to the [0,1] interval (e.g. 1 = optimal ranking, 0 = worst ranking).
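
    To make the rank-sum idea concrete, here is a small sketch (my own normalisation, one of several possible): rank all cases by score, sum the ranks of the positives, and rescale so that 1 means all positives are ranked on top and 0 means they are all at the bottom.

    [code]
import numpy as np
from scipy.stats import rankdata

def normalised_rank_sum(y, scores):
    ranks = rankdata(scores)                 # 1 = lowest score; ties get average ranks
    n, n_pos = len(y), int(np.sum(y == 1))
    r = ranks[y == 1].sum()                  # sum of the ranks of the positives
    r_min = n_pos * (n_pos + 1) / 2          # all positives at the very bottom
    r_max = n_pos * (2 * n - n_pos + 1) / 2  # all positives at the very top
    return (r - r_min) / (r_max - r_min)     # 1 = optimal ranking, 0 = worst ranking

y = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])
print(normalised_rank_sum(y, scores))
    [/code]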

    I know that the CoC is the value of the test statistic of Kendall's CoC test, so a statistical test can be applied. But this test only tells you whether there is any difference (in agreement) at all, just like ANOVA. I am looking for a multiple-comparisons test that tells me WHERE the difference occurs (e.g. the Tukey test). The only test I have found for this case is the Friedman rank-sum test.




    another one:
    [quote author=mierswa]
    But now you also want to know how well your model will perform when it is applied to completely unseen data. This is where the second process comes into the game: estimating the predictive power of your model. The best estimation you could get is calculated with leave-one-out (LOO), where all but one example are used for training and only the remaining one for testing. Since almost all examples are used for training, each fold's model is as similar as possible to the model trained on the complete data. Since LOO is rather slow on large data sets, we often use a k-fold cross validation in order to get a good estimation in less time.
    [/quote]

    Hm, hm. Recently I read a very interesting PhD thesis by Ron Kohavi, who has shown that LOO gives an almost unbiased estimate but can have a high variance (i.e. it is less stable). Imagine a binary classification problem where 50% of all instances have label=1 and 50% have label=0. Now apply a majority classifier: with LOO, the estimated accuracy will be zero (even though the true accuracy is 50%).
    However, Kohavi concludes that it is best to apply 6- to 10-fold cross-validation and to repeat it 10-20 times to reduce the variance. Note that repeating the CV increases the alpha error if you plan to use statistical tests to validate the results.
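
    A quick check of that example (a sketch, with scikit-learn's DummyClassifier standing in for the majority classifier):

    [code]
# On a perfectly balanced binary data set the LOO estimate for a majority
# classifier is 0%, although the true accuracy is 50%.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut, RepeatedKFold

X = np.zeros((100, 1))                 # features are irrelevant for a majority classifier
y = np.array([0] * 50 + [1] * 50)      # 50/50 class balance

clf = DummyClassifier(strategy="most_frequent")

loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(f"LOO estimate: {loo.mean():.2f}")   # 0.00: the held-out example is always in the minority

# repeated 10-fold CV is also pessimistic on this degenerate example
rkf = cross_val_score(clf, X, y, cv=RepeatedKFold(n_splits=10, n_repeats=10, random_state=0))
print(f"repeated 10-fold estimate: {rkf.mean():.2f}")
    [/code]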

    ...so we are back at the suggestion that 10-fold CV is the best procedure you can use  :D. I just wanted to add a nuance to the argument...

    greetings,

    Steffen

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Steffen,

    yes, I know Ron's thesis, and this is actually a good point, so my explanation about LOO might be a bit misleading. Anyway, I just wanted to give the readers a feeling for how error estimation with any cross-validation-like process and a model learned from the complete data are connected. The reason for this explanation was quite simple: it is probably one of the most frequently asked questions - or at least it used to be some time ago.

    Cheers,
    Ingo