"[SOLVED] Overlapping folds in cross validation?"

siamak_want Member Posts: 98 Contributor II
edited June 13 in Help
Hi forum,

Today I read many helpful posts about cross-validation (X-Validation), but I still have one important question: do the folds that are constructed "overlap" with each other? That is, do they share any data points, or are they completely separate folds with no overlap?

In RM there are three sampling types for cross-validation: "linear", "shuffled" and "stratified". I think linear sampling produces non-overlapping folds, while the other two may produce overlapping folds. But I got a very surprising result: with 10-fold X-Validation and "linear sampling" I got an accuracy of 31%, but when I simply switched to "stratified sampling" I got 86% accuracy! I am really confused by these results. Does anyone know how I should evaluate the performance of my model?

I would also really appreciate it if someone could explain the issue of overlapping folds in cross-validation from an academic point of view.

regards,

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869   Unicorn
    Hi,

    The test sets of the folds do NOT overlap, but the training sets DO. X-Validation splits the data into (e.g.) 10 partitions, then loops over the partitions, using the current one as the test set and training on the other 9. Any two training sets therefore share 8 of the 10 partitions.

    With linear sampling, each partition contains examples in the order in which they appear in the original data set. If your data is ordered by label or in any other way, the learner probably does not see a representative sample of the data in each fold, only a certain subset, and thus does not generalize well to the held-out partition. You should always use "stratified sampling" on data with a nominal label, and "shuffled sampling" otherwise.

    Best, Marius
  • siamak_want Member Posts: 98 Contributor II
    Thanks for your nice answer, Marius.

    So I will always set the sampling type to stratified.

    Thanks again, Marius.
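
The points in Marius's answer can be checked with a small sketch in plain Python. This is not RapidMiner's implementation: the helper names (linear_folds, stratified_folds, train_test_pairs) and the toy data are made up for illustration, and the stratified variant is simplified (it deals each class round-robin into the partitions and skips the shuffling a real implementation would likely do).

```python
from collections import Counter


def linear_folds(n, k):
    """Linear sampling: cut indices 0..n-1 into k contiguous partitions."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread any remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stratified_folds(labels, k):
    """Simplified stratified sampling: deal each class round-robin into k partitions."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for idx in indices:
            folds[position % k].append(idx)
            position += 1
    return folds


def train_test_pairs(folds):
    """Each partition is the test set once; the other partitions form the training set."""
    pairs = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, test))
    return pairs


# Hypothetical data: 5 classes, 20 examples each, sorted by label.
labels = [c for c in "ABCDE" for _ in range(20)]
k = 5

linear = train_test_pairs(linear_folds(len(labels), k))
stratified = train_test_pairs(stratified_folds(labels, k))

# Test sets never share an example; any two training sets do.
all_test = [idx for _, test in linear for idx in test]
print("test sets disjoint:", len(all_test) == len(set(all_test)))
print("shared by training sets 0 and 1:", len(set(linear[0][0]) & set(linear[1][0])))

# Class counts in the first fold under each sampling type.
for name, pairs in (("linear", linear), ("stratified", stratified)):
    train, test = pairs[0]
    print(name, "train:", dict(Counter(labels[i] for i in train)),
          "test:", dict(Counter(labels[i] for i in test)))
```

On this label-sorted toy data, the first linear fold is tested only on class A, which never appears in its training set, so a learner cannot predict it. Each stratified fold, by contrast, holds the same number of examples from every class in both training and test sets. That is the kind of gap that can explain a jump such as 31% versus 86% accuracy.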