Sampling (Balancing) and Cross validation

SimonW18272 Member Posts: 13 Newbie
edited February 2022 in Help
Hey everyone, I want to train a decision tree model, and I already use a Cross Validation operator for training. However, I also need to balance my data, since one of my two classes is represented much less often than the other. I'm now unsure how to use the Sample operator. I know how to use it to balance my data; I'm more wondering whether it matters if I put the Sample operator inside the subprocess of the Cross Validation operator, or whether I can also balance the data set right before it. I saw somewhere that it is typical and better to use the Sample operator inside the Cross Validation operator, because otherwise some data points fall out of scope. But does it really matter? When I think about it again, it doesn't make that much sense to me, and it seems it shouldn't matter whether I sample before or inside. Can someone give me an answer about this?

Answers

  • SimonW18272 Member Posts: 13 Newbie
    I mean, I first understood why it makes sense, but now I'm confused, because when I think about it in more detail it shouldn't matter with k-fold cross validation.
    If my data set has, say, an 80/20 class ratio, it shouldn't make a difference whether I reduce it before or inside the cross validation, or am I wrong?
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    If you put the sampling operator into the cross validation (into the left panel, before building the model), you get two benefits:
    1. The model learns on balanced data.
    2. The sampling doesn't affect the test set (the one in the right panel of the validation), so you validate the model on all the data.

    Sampling is something you do to improve the model. Therefore it makes sense to put it into the cross validation. When doing a cross validation, you are not only validating the model: the goal is to validate the entire process that leads to the model. Sampling is part of that process.

    Reducing the data before the cross validation gives you a false impression of the results of the entire process. You want to build the model from balanced data, but if the underlying data set is fundamentally unbalanced, then you should validate on it that way. Balancing before the validation gives you a validation result on an artificially balanced data set, which won't reflect how the model performs on real, unbalanced data.
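    The same idea can be sketched outside of RapidMiner. The following is a minimal Python/scikit-learn illustration (the data set and all names are made up for the example): each training fold is balanced by undersampling the majority class, while every test fold is left untouched, so the validation score reflects the true class distribution.

    ```python
    # Sketch: balancing INSIDE cross-validation (illustrative, not RapidMiner).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    # A synthetic binary data set with roughly an 80/20 class ratio.
    X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

    rng = np.random.default_rng(0)
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Balance ONLY the training fold: undersample the majority class
        # down to the size of the minority class.
        majority = np.flatnonzero(y_tr == 0)
        minority = np.flatnonzero(y_tr == 1)
        keep = rng.choice(majority, size=len(minority), replace=False)
        balanced = np.concatenate([keep, minority])
        model = DecisionTreeClassifier(random_state=0).fit(X_tr[balanced], y_tr[balanced])
        # Evaluate on the untouched, still-imbalanced test fold.
        scores.append(model.score(X[test_idx], y[test_idx]))

    print(round(float(np.mean(scores)), 3))
    ```

    If the undersampling were done once on the whole data set before the split, the test folds would also be balanced, and the reported accuracy would describe a data set that doesn't exist in production.
    
    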

    Regards,
    Balázs
  • SimonW18272 Member Posts: 13 Newbie
    Thank you very much, Balázs, that makes much more sense to me now!