RapidMiner

Stratified sampling with multiple strata

SOLVED
Contributor II CausalityvsCorr

There is an operator called "Sample (Stratified)". As far as I can tell, it can handle only one stratification variable at a time, such as gender (girls vs. boys).

But how should I handle sampling with multiple stratification variables, such as gender (girls/boys), location (area1/area2/area3) and nationality (locals/others)?

5 REPLIES
RM Certified Expert

Re: Stratified sampling with multiple strata

"Generate Weight (Stratification)" works fine with multi-class labels and will assign weights so that the sum of weights is distributed equally across all classes.  However, if you are trying to incorporate information from multiple attributes (as your example seems to suggest), that is much more complicated.  But you can always generate your own weights using "Generate Attributes", define them however you like, and then use "Set Role" to assign your weight variable.
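
To make the "equal sum of weights per class" idea concrete, here is a minimal Python sketch. It is not RapidMiner's implementation; the function name `stratification_weights` and the default total weight (the number of rows) are assumptions made only for illustration.

```python
from collections import Counter


def stratification_weights(labels, total_weight=None):
    """Return one weight per row so that every class carries the same total weight.

    Hypothetical helper: it illustrates the idea only and is not taken from RapidMiner.
    """
    counts = Counter(labels)
    if total_weight is None:
        # Assumption: the weights should sum to the number of rows.
        total_weight = float(len(labels))
    per_class = total_weight / len(counts)  # each class gets an equal share
    return [per_class / counts[label] for label in labels]


labels = ["F", "M", "M", "M"]
weights = stratification_weights(labels)
for label, weight in zip(labels, weights):
    print(label, round(weight, 4))
```

The single "F" row receives as much weight as the three "M" rows together. In RapidMiner the corresponding weight column would be built with "Generate Attributes" and marked with "Set Role", as described above.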

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
RM Certified Expert

Re: Stratified sampling with multiple strata

Also, I should have mentioned that "Sample (Stratified)" is designed to ensure that you have the same class distribution across your samples, not to balance your classes.  It does work with multiple classes, but it doesn't do what you want.

If you want a pure sampling solution, you can use the normal "Sample" operator, activate the "balance data" parameter (an advanced parameter), and then specify the sample size (absolute or relative) for each class in a multi-class label.  But you will only be able to downsample, and you can't incorporate information from any other attribute, which is why I first mentioned the weighting alternative.

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II CausalityvsCorr

Re: Stratified sampling with multiple strata

First, thank you for the fast reply.

To select the right approach, I should probably describe the case:

1. This is a survey with a rather restricted budget and limited resources.

2. The population is 100,000 persons. I know the strata distributions, which the target sample (the people to be surveyed) should also match. Gender: F = 45%, M = 55%. Location: area 1 = 20%, area 2 = 65%, area 3 = 15%. Nationality: locals = 80%, others = 20%.

3. The target (to be surveyed) is around 1500 persons. The response rate is expected to be 50%.

=> How could this kind of sampling frame be implemented in RM?

RM Certified Expert
Solution

Re: Stratified sampling with multiple strata

If you need to treat these as independent attributes and stratify across all 3 variables simultaneously, you are probably going to have to create a single new attribute using Generate Attributes (with if statements) that represents all the combinations: for example, male area 1 local, female area 1 local, etc.  It looks like you will have 12 possible values (2 x 3 x 2), and you can then compute the proportion of the total sample that each one should make up by multiplying the corresponding percentages: for example, female, area 1, local is 0.45 x 0.20 x 0.80 = 0.072, or 7.2%.
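
As a rough sketch of the "multiplying through" step, the following Python snippet builds the 12 combined strata from the percentages given earlier in the thread and turns them into per-stratum counts. It runs outside RapidMiner, and several things are assumptions: the helper names `joint_proportions` and `allocate`, the reading that 1500 completed responses are wanted (so 1500 / 0.5 = 3000 people are invited), and the rounding rule (largest remainder). Multiplying the marginals is only valid if the three attributes are independent in the population.

```python
from itertools import product
from math import ceil

# Marginal distributions quoted in the thread.
gender = {"F": 0.45, "M": 0.55}
area = {"area1": 0.20, "area2": 0.65, "area3": 0.15}
nationality = {"local": 0.80, "other": 0.20}


def joint_proportions(*marginals):
    """Multiply marginal shares for every combination (assumes independence)."""
    joint = {}
    for combo in product(*(m.items() for m in marginals)):
        key = "_".join(name for name, _ in combo)
        share = 1.0
        for _, p in combo:
            share *= p
        joint[key] = share
    return joint


def allocate(joint, n):
    """Turn shares into whole counts that sum to n (largest-remainder rounding)."""
    raw = {k: p * n for k, p in joint.items()}
    counts = {k: int(v) for k, v in raw.items()}
    leftover = n - sum(counts.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


joint = joint_proportions(gender, area, nationality)
# Assumption: 1500 completed responses are wanted at a 50% response rate.
invites = ceil(1500 / 0.5)
counts = allocate(joint, invites)
for stratum, n in counts.items():
    print(f"{stratum:20s} {joint[stratum]:.4f} {n}")
```

If 1500 is instead the number of people to invite, pass 1500 to `allocate` directly. Strata whose exact share ends in one half are rounded up or down arbitrarily so that the total still matches.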

Once you have that, you will be able to use the new attribute with the Sample operator to pull the appropriate number (or proportion) of each of the individual classes.
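
For illustration only, the per-class draw can be sketched in plain Python as below. This is not a RapidMiner process: the function `sample_by_stratum`, the key function `stratum_key`, the toy population and the two requested counts are all hypothetical. In RM the equivalent step would be configured on the Sample operator.

```python
import random
from collections import defaultdict


def sample_by_stratum(rows, key, counts, seed=0):
    """Draw counts[stratum] rows without replacement from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    picked = []
    for stratum, n in counts.items():
        pool = groups.get(stratum, [])
        if n > len(pool):
            # Sampling can only downsample; a stratum cannot supply more rows than it has.
            raise ValueError(f"stratum {stratum!r} has {len(pool)} rows, {n} requested")
        picked.extend(rng.sample(pool, n))
    return picked


def stratum_key(row):
    """Combine the three attributes into a single stratum label."""
    return f'{row["gender"]}_{row["area"]}_{row["nationality"]}'


# Toy population, made up purely to exercise the function.
rows = [
    {
        "id": i,
        "gender": "F" if i % 2 else "M",
        "area": f"area{i % 3 + 1}",
        "nationality": "local" if i % 5 else "other",
    }
    for i in range(600)
]

requested = {"F_area1_local": 5, "M_area2_other": 3}  # hypothetical counts
sample = sample_by_stratum(rows, stratum_key, requested)
print(len(sample))
```

Raising an error when a stratum is too small mirrors the point made earlier in the thread that pure sampling can only downsample.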

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II CausalityvsCorr

Re: Stratified sampling with multiple strata

Thanks,

this is what I was thinking of as a potential solution. Right or wrong, at least this explanation confirmed my thinking.