Downsampling operators
Best Answers
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hi,
In the Mannheim Toolbox extension, there is a Sample - Balance operator that does just this.
(Opinions and fundamental techniques aside, you might want to work with weighting instead of sampling.)
All the best,
Rodrigo.
tftemme Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
Hi @20160041,
The different Sample operators also let you downsample (or upsample) your data set. The Sample operator simply draws a number of Examples at random, without replacement. By default the draw does not depend on the class, so the (possibly imbalanced) class ratio stays the same after drawing, apart from some random variation. You can change this by selecting 'balance data' and drawing a different number of Examples per class. If you want to force the ratio to 1.0, set the sample size for both classes to the same number.
Sample (Stratified) always samples in such a way that the class ratio is kept.
Sample (Bootstrapping) draws with replacement, so a specific Example may occur multiple times after sampling. This can be helpful for upsampling a class for which you have only a small number of Examples.
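To make the difference between the three sampling modes concrete, here is a minimal sketch in plain Python. It is not how the RapidMiner operators are implemented; the function names (`sample_balanced`, `sample_stratified`, `sample_bootstrap`) and the toy data are made up for illustration only.

```python
import random
from collections import defaultdict

def group_by_label(rows, label_of):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups

def sample_balanced(rows, label_of, per_class, seed=0):
    """Draw the same number of rows from every class, without replacement."""
    rng = random.Random(seed)
    out = []
    for members in group_by_label(rows, label_of).values():
        out.extend(rng.sample(members, min(per_class, len(members))))
    return out

def sample_stratified(rows, label_of, fraction, seed=0):
    """Draw the same fraction from every class, so the class ratio is kept."""
    rng = random.Random(seed)
    out = []
    for members in group_by_label(rows, label_of).values():
        out.extend(rng.sample(members, round(len(members) * fraction)))
    return out

def sample_bootstrap(rows, size, seed=0):
    """Draw with replacement, so a row may appear more than once."""
    rng = random.Random(seed)
    return rng.choices(rows, k=size)

# Toy data set with a 20:1 class ratio (hypothetical).
rows = [("maj", i) for i in range(200)] + [("min", i) for i in range(10)]
label_of = lambda row: row[0]

balanced = sample_balanced(rows, label_of, per_class=10)      # ratio forced to 1.0
stratified = sample_stratified(rows, label_of, fraction=0.5)  # ratio stays 20:1
minority = [r for r in rows if label_of(r) == "min"]
upsampled = sample_bootstrap(minority, size=50)               # duplicates expected
```

Note that the bootstrap result can only contain copies of the 10 original minority rows, so it adds no new information, it only changes how often each row is seen.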
Hope this helps with the differences between the Sample operators.
Two other things I would like to mention:
In most cases I would try not to downsample your data for a machine learning task. You remove information that your model could use to find patterns. You may want to switch to another model instead. There are a few reasons for downsampling:
- Runtime problems
- If you have an extremely large number of Examples for one class (say a class ratio of 20:1 or higher)
If you want to get rid of your imbalanced class ratio, you may also want to try the SMOTE operator from the Operator Toolbox Extension. It performs an (advanced) method for upsampling your underrepresented class.
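For readers unfamiliar with SMOTE, the core idea is to create synthetic minority Examples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal sketch of that idea in plain Python, not the Operator Toolbox implementation; the function name `smote_sketch`, its parameters, and the toy points are assumptions for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the line segment between
    a random minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority class with two numeric attributes.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Unlike bootstrapping, the new points are not copies of existing Examples, but they only make sense for numeric attributes where interpolation is meaningful.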
Best regards
Fabian
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
I second the suggestion of weighting, which is my preferred approach, and I agree that downsampling should be used primarily when you have many more cases than needed (either in general, or specifically of the majority class). There are diminishing returns to larger and larger samples, so if your development population is hundreds of thousands of cases then you likely don't need them all. But if you have only a small absolute number of minority-class cases, then you probably don't want to downsample the majority class to match it, as too much information would be lost.
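As a rough illustration of the weighting alternative mentioned in this thread, one common heuristic is to weight each Example inversely to the frequency of its class, so that every class contributes the same total weight and no rows are discarded. This is a sketch in plain Python under that assumption, not a description of any particular RapidMiner operator; `balanced_class_weights` is a made-up name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class = total / (number of classes * class count),
    so the summed weight is identical for every class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}

# Hypothetical label column with a 20:1 class ratio.
labels = ["maj"] * 200 + ["min"] * 10
weights = balanced_class_weights(labels)
# Each majority row gets weight 0.525, each minority row 10.5;
# both classes then sum to a total weight of 105.
```

A learner that supports example weights can then use these values in place of removing majority rows.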