Altair RapidMiner Community
Could you please tell me how I can achieve downsampling with imbalanced data in RapidMiner? I have used the random sampling and bootstrapping sampling operators, and I would also like to know the difference between the two.
Moderator, RapidMiner Certified Analyst, Member, University Professor
In the Mannheim Toolbox extension, there is a
Sample - Balance
operator that does just this.
(Opinions on the fundamental techniques aside, you might want to work with weighting instead of sampling.)
All the best,
Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member
The different Sample operators also give you the possibility to downsample (or upsample) your data set. The Sample operator just randomly draws (drawing without replacement) a number of Examples. By default, which Examples are drawn does not depend on the class, so the (possibly imbalanced) class ratio will stay the same (with some random variation) after drawing. You can change this by selecting 'balance data' and drawing different numbers of Examples per class. If you want to force your ratio to 1.0, you can set the sample size for both classes to the same number.
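Outside of RapidMiner, here is a small sketch of what 'balance data' with equal per-class sample sizes effectively does (Python with pandas; the data set and the numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical imbalanced data set: 900 "no" vs. 100 "yes" Examples.
df = pd.DataFrame({
    "label": ["no"] * 900 + ["yes"] * 100,
    "x": range(1000),
})

# Like Sample with 'balance data': draw a fixed number of Examples
# per class, WITHOUT replacement, forcing a 1:1 class ratio here.
balanced = pd.concat([
    df[df["label"] == c].sample(n=100, replace=False, random_state=42)
    for c in df["label"].unique()
])
# Both classes now contribute 100 Examples, 200 rows in total.
```

Because the draw is without replacement, the per-class sample size can never exceed the number of Examples that class actually has.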
Sample (Stratified) will always sample in such a way that the class ratio is kept.
Sample (Bootstrapping) draws with replacement, so a specific Example may occur multiple times after sampling. This can be helpful to upsample a class for which you only have a small number of Examples.
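The drawing-with-replacement idea can be sketched in a few lines of Python (the example IDs and sizes are made up):

```python
import random

random.seed(0)
minority = [f"ex_{i}" for i in range(20)]  # only 20 minority Examples

# Bootstrapping: draw WITH replacement, so the same Example can be
# picked more than once; here we upsample 20 Examples to 100 rows.
upsampled = random.choices(minority, k=100)

# Every drawn row is one of the original 20, so duplicates must occur.
```

This is why bootstrapping can produce a sample that is larger than the original class, which sampling without replacement cannot.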
Hope this helps with the differences between the Sampling operators.
Two other things I would like to mention:
In most cases I would try not to downsample your data for a machine learning task. You remove information which your model could be using to find patterns. You may want to switch to another model instead. There are a few reasons for downsampling:
- Runtime problems
- If you have an extremely large number of Examples for one class (say a class ratio of 20:1 or higher)
If you want to get rid of your imbalanced class ratio, you may also want to try the SMOTE operator from the Operator Toolbox extension. It implements an (advanced) method for upsampling your underrepresented class.
Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member
I second the idea that weighting is my preferred approach, and that downsampling should be used primarily when you have many more cases than needed (either in general, or specifically of the majority class). There are diminishing returns to larger and larger samples, so if your development population is hundreds of thousands of cases, then you likely don't need them all. But if you have an absolutely small number of your minority class, then you probably don't want to downsample the majority class to match it, as too much information would be lost.
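For what it's worth, a common way to weight instead of sample is inverse-frequency class weights, so each class contributes equally in total and no Example is thrown away. A minimal sketch (the counts are made up; RapidMiner has its own weighting operators for this):

```python
from collections import Counter

labels = ["no"] * 900 + ["yes"] * 100  # imbalanced, as above

# Inverse-frequency weights: weight = n / (k * count_of_class), so the
# total weight of each class is the same and the learner is not
# dominated by the majority class.
counts = Counter(labels)
n, k = len(labels), len(counts)
weights = {c: n / (k * counts[c]) for c in counts}
# "yes" Examples get weight 5.0, "no" Examples get weight 1000/1800.
```

With this scheme the summed weight over all Examples equals the data set size, so overall error magnitudes stay comparable to the unweighted case.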
2018-2022 RapidMiner, Inc. All Rights Reserved.