Stratified sampling with multiple strata

CausalityvsCorr Member Posts: 17 Contributor II
edited September 2019 in Help

There is an operator called "Sample (Stratified)". As far as I can tell, it can handle only one stratification variable at a time, such as gender (girls vs. boys).

 

But how should I handle sampling with multiple stratification variables, such as gender (girls/boys), location (area1/area2/area3), and nationality (locals/others)?

 

Best Answer

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    If you need to treat these as independent attributes and stratify across all three variables simultaneously, you will probably have to create a single new attribute using Generate Attributes (with if statements) that represents every combination: for example, male area 1 local, female area 1 local, etc. It looks like you will have 12 possible values (2 genders × 3 areas × 2 nationalities), and you can then compute the proportion of the total sample that each one should make up by multiplying the individual proportions together.

    Once you have that, you will be able to use that attribute when sampling to pull the appropriate number (or proportion) of each of the individual classes.
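
    As an illustration outside RapidMiner, the combined-attribute idea can be sketched in plain Python. The field names (gender, area, nationality) and the helpers add_stratum and sample_by_stratum are hypothetical, chosen only for this example; inside RapidMiner the same steps would be done with Generate Attributes and a sampling operator.

```python
import random
from collections import defaultdict


def add_stratum(rows, keys=("gender", "area", "nationality")):
    """Return copies of the rows with a combined 'stratum' label such as 'F|area1|local'."""
    return [{**r, "stratum": "|".join(str(r[k]) for k in keys)} for r in rows]


def sample_by_stratum(rows, fraction, seed=0):
    """Draw the same fraction, without replacement, from every combined stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in rows:
        groups[r["stratum"]].append(r)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        k = round(len(members) * fraction)  # rows to take from this stratum
        sample.extend(rng.sample(members, k))
    return sample
```

    Because every stratum contributes the same fraction, the sample keeps roughly the same mix of combinations as the input data.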

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    "Generate Weight (Stratification)" works fine with multi-class labels and will assign weights so that the sum of weights is distributed equally across all classes.  However, if you are trying to incorporate information from multiple attributes (as your example seems to suggest), that is much more complicated.  But you can always generate your own weights using "Generate Attributes", define them however you like, and then use "Set Role" to assign your weight variable.
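
    To make the weighting idea concrete, here is a minimal Python sketch of weights whose sum is the same for every class. It illustrates the principle only and is not RapidMiner's implementation; the function name balancing_weights and the choice to scale the weights so they add up to the number of rows are assumptions.

```python
from collections import Counter


def balancing_weights(labels):
    """Give each row a weight so that every class has the same total weight.

    Assumption: weights are scaled so they sum to len(labels) overall.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[label]) for label in labels]
```

    With three rows of class "a" and one row of class "b", each "a" row gets a weight of about 0.67 and the "b" row gets 2.0, so both classes carry a total weight of 2.0.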

     

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Also, I should have mentioned that "Sample (Stratified)" is designed to ensure that you have the same class distribution across your samples, not to balance your classes.  It does work with multiple classes, but it doesn't do what you want.

     

    If you want a pure sampling solution, you can actually use the normal "Sample" operator and activate the "balance data" parameter (an advanced parameter) and then specify the sample size (absolute or relative) for each class in a multi-class label.  But you will only be able to downsample and you can't incorporate information from any other attribute--that's why I first mentioned the weighting alternative.
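
    For comparison, per-class downsampling of this kind can be sketched in Python as follows. The function downsample and its parameters are hypothetical and only mimic the behaviour described above (an absolute sample size per class, downsampling only); they are not the RapidMiner operator's API.

```python
import random
from collections import defaultdict


def downsample(rows, label_key, per_class, seed=0):
    """Take an absolute number of rows from each class, without replacement."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in rows:
        groups[r[label_key]].append(r)
    sample = []
    for label, n in per_class.items():
        members = groups.get(label, [])
        if n > len(members):
            # Downsampling only: a class cannot supply more rows than it has.
            raise ValueError(f"class {label!r} has only {len(members)} rows, {n} requested")
        sample.extend(rng.sample(members, n))
    return sample
```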

  • CausalityvsCorr Member Posts: 17 Contributor II

    First, thank you for the fast reply.

    In order to select the right approach, I should perhaps describe the case:

    1. This is a survey with a rather restricted budget and resources.

    2. The population is 100,000 persons. I know the strata distributions, which the target sample (those to be surveyed) should also match. The gender distribution is F = 45%, M = 55%; the location distribution is area 1 = 20%, area 2 = 65%, area 3 = 15%; the nationality distribution is locals = 80%, others = 20%.

    3. The target (to be surveyed) is around 1500 persons. The response rate is expected to be 50%.

     

    => How could this kind of sampling frame be implemented in RM?
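
    The figures above can be turned into per-stratum targets with a little arithmetic, sketched here in Python rather than RapidMiner. Two assumptions are built in and may not hold: the three attributes are treated as independent (so cell shares are products of the marginal shares), and 1500 is read as the number of completed responses wanted, so invitations are scaled up by the response rate. The helper names allocate and invitations_needed are hypothetical.

```python
import math
from itertools import product

# Marginal distributions as stated in the post.
GENDER = {"F": 0.45, "M": 0.55}
AREA = {"area1": 0.20, "area2": 0.65, "area3": 0.15}
NATIONALITY = {"local": 0.80, "other": 0.20}


def allocate(total, *margins):
    """Split `total` across all attribute combinations, assuming independence."""
    raw = {}
    for combo in product(*(m.items() for m in margins)):
        label = "|".join(name for name, _ in combo)
        share = 1.0
        for _, p in combo:
            share *= p  # "multiplying through" the marginal proportions
        raw[label] = round(share * total, 9)  # rounding trims float noise
    # Largest-remainder rounding so the integer counts still sum to `total`.
    counts = {label: int(value) for label, value in raw.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1
    return counts


def invitations_needed(target_responses, response_rate):
    """Invitations to send if only `response_rate` of invitees answer."""
    return math.ceil(target_responses / response_rate)
```

    Under these assumptions, allocate(1500, GENDER, AREA, NATIONALITY) yields 12 cells, with for example 108 persons in "F|area1|local" (0.45 × 0.20 × 0.80 × 1500). If the real population shows dependence between the attributes, the cell shares should be taken from the actual cross-tabulation instead.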

     

  • CausalityvsCorr Member Posts: 17 Contributor II

    Thanks,

    this is what I was thinking of as a potential solution. Whether it is right or not, at least this explanation confirmed my thinking.
