# "[SOLVED] Overlapping folds in cross validation?"

siamak_want
Member, Posts: **98**, Contributor II

Hi forum,

Today I read many helpful posts about cross-validation (X-Validation), but I still have one important question: do the folds that are constructed "overlap" with each other? That is, do they share any data points, or are they completely separate folds with no overlap?

As you know, RM offers three types of cross-validation sampling: "linear", "shuffled" and "stratified". I think linear sampling produces non-overlapping folds, while the other two may produce overlapping folds. But I got a very surprising result: with 10-fold X-Validation and "linear sampling" I got an accuracy of 31%, but when I simply switched to "stratified sampling" I got 86% accuracy! I am really confused by these results. Does anyone know how I should evaluate the performance of my model?

I would also really appreciate it if someone could explain the issue of overlapping folds in cross-validation from an academic point of view.

regards,

## Answers

**1,869** Unicorn

The test sets of the folds do NOT overlap; however, the training sets DO overlap. X-Validation splits the data into (e.g.) 10 partitions. It then loops over the partitions, using the current one as the test set and training on the other 9. Thus, the training sets of the folds necessarily overlap.
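
To make the partitioning concrete, here is a minimal Python sketch. It is not RapidMiner's actual implementation; the helper `k_fold_indices` and the toy size of 20 examples are assumptions made purely for illustration. It builds the folds as described above and checks that the test sets are disjoint while the training sets share examples.

```python
def k_fold_indices(n_examples, k):
    """Return (train, test) index lists for k contiguous, non-overlapping test partitions.

    Hypothetical helper for illustration; not taken from RapidMiner.
    """
    indices = list(range(n_examples))
    # Spread the remainder so partition sizes differ by at most one.
    base, extra = divmod(n_examples, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = indices[start:start + size]
        # Training set = every example that is not in the current test partition.
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(20, 10)
test_sets = [set(test) for _, test in folds]
train_sets = [set(train) for train, _ in folds]

# Test sets are pairwise disjoint and together cover every example exactly once.
tests_disjoint = all(
    test_sets[i].isdisjoint(test_sets[j])
    for i in range(len(folds))
    for j in range(i + 1, len(folds))
)
tests_cover_all = set().union(*test_sets) == set(range(20))

# Two training sets share every example that lies outside both test partitions.
shared_train = len(train_sets[0] & train_sets[1])

print(tests_disjoint, tests_cover_all, shared_train)  # True True 16
```

With 20 examples and 10 folds, each test partition holds 2 examples and each training set holds 18, so any two training sets have 16 examples in common. That overlap is expected and is not a problem for the performance estimate, because every example is still tested exactly once by a model that did not see it during training.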

With linear sampling, each partition contains examples in the order in which they appear in the original data set. If your data is ordered by label or in any other way, your learner probably does not see a representative sample of the data in each fold, only a certain subset, and thus does not generalize well to the held-out data. You should always use "stratified sampling" on data with a nominal label, and "shuffled sampling" otherwise.
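
The effect of ordered data can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `linear_folds` and `stratified_folds` are hypothetical helpers, the stratified version deals examples round-robin without the random shuffling a real implementation would likely add, and the 20-example data set sorted by label is made up.

```python
from collections import Counter, defaultdict


def linear_folds(labels, k):
    """Contiguous test partitions in the original order (assumed analogue of linear sampling)."""
    base, extra = divmod(len(labels), k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stratified_folds(labels, k):
    """Deal each class's examples round-robin over the folds so class ratios are preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


def class_counts(labels, folds):
    """Class distribution inside each test partition."""
    return [dict(Counter(labels[i] for i in fold)) for fold in folds]


# Toy data set sorted by label: ten examples of class "a", then ten of class "b".
labels = ["a"] * 10 + ["b"] * 10

linear_counts = class_counts(labels, linear_folds(labels, 5))
stratified_counts = class_counts(labels, stratified_folds(labels, 5))

print(linear_counts)      # most partitions contain a single class
print(stratified_counts)  # every partition contains 2 x "a" and 2 x "b"
```

On this sorted toy data, four of the five linear partitions contain only one class, so the class mix in each test partition differs strongly from the mix in the corresponding training data. The stratified partitions all mirror the overall 50/50 class ratio. A gap like the 31% versus 86% reported above is consistent with this kind of ordering effect, although the sketch cannot confirm that this is what happened with that particular data set.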

Best, Marius

**98** Contributor II

So I will always set the sampling type to stratified.

Thanks again, Marius.