"[SOLVED] Overlapping folds in cross validation?"

siamak_want Member Posts: 98 Contributor II
edited June 2019 in Help
Hi forum,

Today I read many helpful posts about cross validation (X-Validation), but I still have one important question: do the folds it constructs "overlap" with each other? That is, do they share any data points, or are they completely separate folds with no overlap?

As you know, RM offers three sampling types for cross validation: "linear", "shuffled" and "stratified". I think linear sampling produces non-overlapping folds, while the other two may construct overlapping folds. But I got a very astonishing result: with 10-fold X-Validation and "linear sampling" I got an accuracy of 31%, but when I simply switched to "stratified sampling" I got 86% accuracy! I am really confused by these results. Does anyone know how I should evaluate the performance of my model?

I would also really appreciate it if someone could explain the issue of overlapping folds in cross validation from an academic point of view.



  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn

    The test sets of the folds do NOT overlap, but the training sets DO overlap. X-Validation splits the data into (e.g.) 10 partitions. It then loops over the partitions, using the current one as the test set and training on the other 9. Thus the training sets of the folds necessarily overlap: with 10 partitions, any two training sets share 8 of them.
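
To make the partitioning concrete, here is a minimal Python sketch of the idea. It is not RapidMiner's implementation: the helper name `k_fold_indices`, the contiguous (linear) split and the toy size of 20 examples are all assumptions made for illustration. It checks that the test sets are pairwise disjoint while two training sets share most of their examples.

```python
def k_fold_indices(n_examples, k):
    """Hypothetical helper: split indices 0..n_examples-1 into k contiguous
    partitions and return one (train_indices, test_indices) pair per fold."""
    indices = list(range(n_examples))
    # Spread the remainder over the first partitions so sizes differ by at most 1.
    base, extra = divmod(n_examples, k)
    partitions, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        partitions.append(indices[start:start + size])
        start += size

    folds = []
    for i, test in enumerate(partitions):
        # Train on every partition except the one currently held out.
        train = [idx for j, part in enumerate(partitions) if j != i for idx in part]
        folds.append((train, test))
    return folds


folds = k_fold_indices(n_examples=20, k=10)
test_sets = [set(test) for _, test in folds]
train_sets = [set(train) for train, _ in folds]

# Test sets are pairwise disjoint ...
tests_disjoint = all(
    test_sets[i].isdisjoint(test_sets[j])
    for i in range(len(folds))
    for j in range(i + 1, len(folds))
)
# ... while two training sets share everything except the two held-out partitions.
shared_train = len(train_sets[0] & train_sets[1])

print(tests_disjoint)  # True
print(shared_train)    # 16 (20 examples minus two partitions of 2)
```

Every example ends up in exactly one test set, so each data point is predicted exactly once across the whole validation.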

    With linear sampling, each partition contains examples in the order in which they appear in the original data set. If your data is ordered by label or in any other way, your learner probably does not see a representative sample of the data during training, but only a certain subset, and thus does not generalize well to the held-out data. You should always use "stratified sampling" on data with a nominal label, and "shuffled sampling" otherwise.
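
The effect of ordered data can be sketched in plain Python. Again, this is an illustration under stated assumptions, not RapidMiner code: the toy label list, the helper names `linear_partitions`, `stratified_partitions` and `class_counts_per_fold`, and the round-robin stratification (without the shuffling a real stratified sampler would normally add) are all made up for the example. It compares the class counts seen in training and testing for each fold.

```python
from collections import Counter

# Toy data set sorted by label (assumption: 3 classes, 10 examples each).
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
k = 3


def linear_partitions(n_examples, k):
    """Contiguous blocks in original order (assumes k divides n_examples)."""
    size = n_examples // k
    return [list(range(i * size, (i + 1) * size)) for i in range(k)]


def stratified_partitions(labels, k):
    """Deal the examples of each class round-robin into the k partitions,
    so every partition gets roughly the same class proportions."""
    partitions = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for members in by_class.values():
        for pos, idx in enumerate(members):
            partitions[pos % k].append(idx)
    return partitions


def class_counts_per_fold(partitions, labels):
    """Return (train_class_counts, test_class_counts) for each fold."""
    result = []
    for i, test in enumerate(partitions):
        train = [idx for j, part in enumerate(partitions) if j != i for idx in part]
        result.append((
            dict(Counter(labels[idx] for idx in train)),
            dict(Counter(labels[idx] for idx in test)),
        ))
    return result


linear = class_counts_per_fold(linear_partitions(len(labels), k), labels)
stratified = class_counts_per_fold(stratified_partitions(labels, k), labels)

for name, folds in (("linear", linear), ("stratified", stratified)):
    for train_counts, test_counts in folds:
        print(name, "train:", train_counts, "test:", test_counts)
```

With the linear split, every test fold consists of a class that never appears in the corresponding training set, so no learner can predict it correctly. With the stratified split, training and test sets contain all three classes in equal proportions.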

    Best, Marius
  • siamak_want Member Posts: 98 Contributor II
    Thanks for your nice answer, Marius.

    So I will always set the sampling type to stratified.

    Thanks again, Marius.