Feature Request: Batch validation with optional fold numbers

varunm1 Moderator, Member Posts: 1,207 Unicorn
Dear All,

I have a simple feature request that, if possible, could be added to the Cross Validation operator. Currently, the "Batch Validation" option lets us define batches and creates folds based on the number of batches (one fold per batch). I am looking for an enhancement that lets us control the number of folds created from these batches.

For example, if I have data on 100 subjects with 10 samples each, there are 1,000 samples in total. To do leave-one-subject-out cross-validation, I set 100 batch IDs (one per subject) and run a batch validation in the Cross Validation operator. If I instead want only 5 folds, with 20 subjects per fold, I currently have to regenerate the attribute with 5 batch IDs. Instead, the operator could offer an option that uses the original 100 batch IDs as an index and divides them into the 5 subsets.

This would make it easy to switch between leave-one-batch-out and GroupKFold-style validation; a sketch of the intended behaviour follows below.
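
A minimal sketch of the requested behaviour, using scikit-learn's GroupKFold as a reference point (the data below is placeholder and GroupKFold is named only for comparison, not as part of RapidMiner):

    # 100 subjects x 10 samples each = 1000 samples; the subject id is the batch attribute
    from sklearn.model_selection import GroupKFold

    groups = [subject for subject in range(100) for _ in range(10)]
    X = [[i] for i in range(1000)]  # placeholder features
    y = [0] * 1000                  # placeholder labels

    # n_splits=100 -> leave-one-subject-out; n_splits=5 -> 5 folds of 20 subjects each
    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
        pass  # train on train_idx, evaluate on test_idx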
Regards,
Varun
https://www.varunmandalapu.com/

Be Safe. Follow precautions and Maintain Social Distancing

2 votes · Open for Voting

PROD-897

Comments

  • yzan Member Posts: 66 Unicorn
    edited March 2020
    For the engineers, the required functionality can be obtained with the following function (Python-style pseudocode):
    import random

    def batch_validation(batch_attribute, fold_count, seed):
        # Shuffle the unique batch values reproducibly (sorting first makes the
        # result deterministic for a given seed, since set order is arbitrary).
        unique_values = sorted(set(batch_attribute))
        random.Random(seed).shuffle(unique_values)
        # Deal the shuffled values onto the folds in round-robin fashion.
        value_to_fold = {}
        fold = 0
        for value in unique_values:
            value_to_fold[value] = fold
            fold = (fold + 1) % fold_count
        return value_to_fold
    which returns a map from each unique value of the batch attribute to its fold. The advantages of the proposed solution are:
    1. The assignment is deterministic given the random seed.
    2. If we change the seed, we will (most likely) get a different fold assignment.
    3. As long as the count of unique values in the batch attribute is >= the count of required folds, each fold is guaranteed to contain at least one sample. This is a highly desirable property: we do not want 2-fold cross-validation to fail just because one of the training sets is empty.
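    For illustration, a quick hypothetical call showing the round-robin assignment:
    value_to_fold = batch_validation(["s1", "s2", "s3", "s4", "s5"], fold_count=2, seed=42)
    # Five subjects dealt onto two folds: one fold receives three values and the
    # other two, in a seed-dependent order -- so both folds are non-empty.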
    Note also that sometimes we want to assign folds not just by a single attribute, but by multiple attributes. Personally, I like the illustration in http://www.rogermstein.com/wp-content/uploads/SobehartKeenanStein2000.pdf: we may want to estimate the generalization ability of a model not only across subjects but also across time.

    How could multi-attribute batch validation be implemented in RapidMiner? From the GUI point of view, just add another advanced parameter, "split on attributes" (a checkbox), to the Cross Validation operator. When this checkbox is selected, the irrelevant parameters disappear (as with "split on batch attribute") and a new parameter (button) appears: "Select attributes", which opens a dialog for selecting attributes (like "Select attributes" in the Select Attributes operator). This dialog would have two columns (similar to the "set additional roles" dialog in the Set Role operator): "attribute name" on the left and "maximal count of bins" on the right. The default value of "maximal count of bins" should be a small, strictly positive whole number (e.g., 3).

    What do we do with "maximal count of bins"? We have to bin each batch attribute based on its type. The options are:
    1. Nominal: Use the function above for binning.
    2. Numerical: Use the Discretize by Frequency operator (I prefer folds of the same size where possible, hence the choice of discretization algorithm); see the sketch after this list.
    3. Date: Use the Date to Numerical operator, then treat it as if it were a numerical attribute.
    The treatment of date attributes may look too simplistic (particularly if you are accustomed to "rolling window backtesting" and similar algorithms), but it is actually the approach recommended by Bergmeir and Hyndman: https://www.sciencedirect.com/science/article/pii/S0167947317302384 (they compared this "simplistic" approach to a couple of "more thorough" approaches, and the "simplistic" approach won). If Hyndman says it is OK, I am OK with it as well.
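    For the numerical case, a minimal rank-based sketch of equal-frequency binning (bin_by_frequency is a hypothetical helper; in RapidMiner the Discretize by Frequency operator would do this work):
    def bin_by_frequency(values, bin_count):
        # Sort row indices by value, then slice the ranking into
        # bin_count chunks of (almost) equal size.
        order = sorted(range(len(values)), key=lambda i: values[i])
        value_to_fold = [0] * len(values)
        for rank, i in enumerate(order):
            value_to_fold[i] = rank * bin_count // len(values)
        return value_to_fold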

    Once the batch attributes are binned (i.e., for each batch attribute we have the mapping from values to folds), we can just run the following nested loop (again Python-style):
    def training_testing_split(example_set, batch_attribute_to_map, testing_fold):
        training_set = []
        testing_set = []
        for row in example_set:
            is_training = True
            is_testing = True
            for batch_attribute, value_to_fold in batch_attribute_to_map.items():
                assigned_fold = value_to_fold[row[batch_attribute]]
                if assigned_fold != testing_fold:
                    is_testing = False   # this attribute places the row outside the testing fold
                else:
                    is_training = False  # this attribute places the row inside the testing fold
            if is_training:
                training_set.append(row)
            if is_testing:
                testing_set.append(row)
        return training_set, testing_set
    which returns the training and testing sets. Note that when more than one batch attribute is used, samples are not assigned to training_set or testing_set but rather to training_set, testing_set, or not_used_in_this_split; this is the tax we have to pay for estimating the model's generalization ability over multiple attributes at once. A worked example follows below.
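    A hypothetical two-attribute example of that "tax" (all names and values made up for illustration):
    rows = [
        {"subject": "s1", "year": 2020},  # subject fold 0, year fold 0 -> testing
        {"subject": "s2", "year": 2019},  # subject fold 1, year fold 1 -> training
        {"subject": "s1", "year": 2019},  # mixed folds -> not used in this split
        {"subject": "s2", "year": 2020},  # mixed folds -> not used in this split
    ]
    batch_attribute_to_map = {
        "subject": {"s1": 0, "s2": 1},
        "year": {2020: 0, 2019: 1},
    }
    train, test = training_testing_split(rows, batch_attribute_to_map, testing_fold=0)
    # train holds only the second row, test only the first; the two mixed
    # rows fall into neither set.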

    To provide feedback to the user about how the validation was performed, the Cross Validation operator could offer an "add batch attribute" option when "split on batch attribute" is not checked (similar to the "add cluster attribute" option provided by segmentation operators).

    I apologize for the length of this post. But the ability to quickly get an estimate of the model's generalization ability across some id-like attribute (and possibly across time) is so useful that I felt compelled to write this post.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    wow thank you @yzan. I have copied + pasted your whole post onto our internal system for the engineers.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    wow @yzan, thank you for your contribution!
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany