Sampling (Balancing) and Cross validation
SimonW18272
Member Posts: 13 Learner I
Hey everyone, I want to train a decision tree model, and I already use a cross validation operator for training it. However, I also need to balance my data, since one of my two classes is represented far less often than the other. I am unsure how to use the sampling operator. I know how to use it to balance my data; what I am wondering is whether it matters if I put the sampling operator into the subprocess of the cross validation operator, or whether I can just as well balance the data set right before it. I saw somewhere that it is typical and better to use the sampling operator inside the cross validation operator, because otherwise some data points get out of scope. But does it really matter? When I think about it again, it does not make much sense to me, and it should not matter whether I sample before the cross validation or inside it. Can someone give me an answer on this?
Answers
If you put the sampling operator into the cross validation (into the left panel, before building the model), you get two benefits:
1. The model will learn on balanced data
2. The sampling doesn't affect the test set (the one on the right in the validation), so you validate the model on all data.
Sampling is something you do to improve the model. Therefore it makes sense to put it into the cross validation. When doing a cross validation, you are not only validating the model: the goal is to validate the entire process that leads to building the model. Sampling is a part of that.
Reducing the data before the cross validation gives you a false impression of the results of the entire process. You want to build a model from balanced data, but if the underlying data set is fundamentally unbalanced, then you should validate on it that way. Balancing before the validation will give you a validation result on an artificially balanced data set.
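To make the difference concrete, here is a rough sketch in plain Python (not a RapidMiner process) that only counts which classes end up in the training and test part of each fold. All names (`downsample`, `k_fold`, `fold_class_counts`) and the 900/100 data set are made up for illustration, and downsampling the larger class is just one assumed way of balancing.

```python
import random
from collections import Counter


def downsample(indices, labels, rng):
    """Return a balanced subset of `indices` by downsampling every class
    to the size of the rarest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    smallest = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced


def k_fold(indices, k, rng):
    """Shuffle `indices` and yield (train, test) index lists for k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def fold_class_counts(labels, k=5, sample_inside=True, seed=0):
    """Per fold, count the classes the model would train on and be tested on."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    if not sample_inside:
        # Balancing before the validation: the test folds are balanced too.
        indices = downsample(indices, labels, rng)
    results = []
    for train, test in k_fold(indices, k, rng):
        if sample_inside:
            # Balancing inside the validation: only the training part changes.
            train = downsample(train, labels, rng)
        results.append((Counter(labels[i] for i in train),
                        Counter(labels[i] for i in test)))
    return results


# Made-up data: 900 examples of a majority class, 100 of a minority class.
labels = ["majority"] * 900 + ["minority"] * 100

inside = fold_class_counts(labels, sample_inside=True)
before = fold_class_counts(labels, sample_inside=False)

for name, result in (("inside", inside), ("before", before)):
    for train_counts, test_counts in result:
        print(name, dict(train_counts), dict(test_counts))
```

With sampling inside, every training part is balanced while the test folds together still cover all 900 majority and 100 minority examples. With sampling before, the test folds only ever see the 200 examples that survived the balancing, so the performance is measured on a class distribution the real data does not have.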
Regards,
Balázs