Re: Stratified sampling with multiple strata

3 weeks ago
There is an operator called "Sample (Stratified)". As far as I can tell, it can handle only one stratum variable at a time, such as girls vs. boys.

But how should I handle sampling with multiple strata, such as gender (girls/boys), location (area 1/area 2/area 3), and nationality (locals/others)?

5 REPLIES

3 weeks ago
"Generate Weight (Stratification)" works fine with multi-class labels and will assign weights so that the sum of weights is distributed equally across all classes. However, if you are trying to incorporate information from multiple attributes (as your example seems to suggest), that is much more complicated. You can always generate your own weights using "Generate Attributes", define them however you like, and then use "Set Role" to assign your weight variable.

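As a rough illustration of that weighting idea, here is a minimal Python sketch (not a RapidMiner process) that gives every class the same weight sum, so each example's weight is inversely proportional to the size of its class. The function name `stratification_weights` and the `total_weight` parameter are assumptions made for this example, not RapidMiner's API.

```python
from collections import Counter


def stratification_weights(labels, total_weight=None):
    """Return one weight per example so that every class has the same weight sum.

    total_weight is a hypothetical parameter: the sum of all weights.
    It defaults to the number of examples.
    """
    counts = Counter(labels)
    if total_weight is None:
        total_weight = float(len(labels))
    # Each class receives an equal share of the total weight ...
    per_class = total_weight / len(counts)
    # ... which is then split evenly among the examples of that class.
    return [per_class / counts[label] for label in labels]


# Toy data: one "F" example and three "M" examples.
labels = ["F", "M", "M", "M"]
weights = stratification_weights(labels)
```

With this toy input, the single "F" example carries as much total weight as the three "M" examples together.
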
Brian T., **Lindon Ventures** - www.lindonventures.com

Analytics Consulting by Certified RapidMiner Analysts

3 weeks ago
Also, I should have mentioned, "Sample (Stratified)" is designed to ensure that you have the same class distribution across your samples, not to balance your classes. It does work with multiple classes, but it doesn't do what you want.

If you want a pure sampling solution, you can actually use the normal "Sample" operator and activate the "balance data" parameter (an advanced parameter) and then specify the sample size (absolute or relative) for each class in a multi-class label. But you will only be able to downsample and you can't incorporate information from any other attribute--that's why I first mentioned the weighting alternative.
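To make the per-class downsampling idea concrete, here is a minimal Python sketch of drawing a fixed number of rows from each class without replacement. It is not how the "Sample" operator is implemented; the names `sample_per_class`, `label_of`, and `sizes` are hypothetical and exist only for this illustration.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_of, sizes, seed=0):
    """Draw sizes[label] rows from each class, without replacement.

    rows     : list of records
    label_of : function returning the class label of a record (assumed helper)
    sizes    : dict mapping class label -> absolute sample size
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    sample = []
    for label, size in sizes.items():
        pool = by_class.get(label, [])
        # Downsampling only: a class cannot yield more rows than it contains.
        if size > len(pool):
            raise ValueError(f"class {label!r} has only {len(pool)} rows")
        sample.extend(rng.sample(pool, size))
    return sample


# Toy data: 10 "F" rows and 30 "M" rows, balanced down to 5 of each.
rows = [("F", i) for i in range(10)] + [("M", i) for i in range(30)]
balanced = sample_per_class(rows, lambda r: r[0], {"F": 5, "M": 5})
```

The `ValueError` mirrors the limitation mentioned above: this approach can shrink a class but never enlarge it.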

Brian T., **Lindon Ventures** - www.lindonventures.com

3 weeks ago
First, thank you for the fast reply.

To select the right approach, I should probably describe the case:

1. It is a survey with a rather restricted budget and resources.

2. The population is 100,000 persons. I know the strata distributions, which the target sample (to be surveyed) should also match. Gender distribution is F = 45%, M = 55%; location distribution is area 1 = 20%, area 2 = 65%, area 3 = 15%; nationality distribution is locals = 80%, others = 20%.

3. The target (to be surveyed) is around 1500 persons. Response rate is expected to be 50%.

=> How could this kind of sampling frame be implemented in RM?

Solution

3 weeks ago
If you need to treat these as independent attributes and stratify across all three variables simultaneously, you will probably have to create a single new attribute using "Generate Attributes" (with if statements) that represents all the combinations: for example, male area 1 local, female area 1 local, etc. You will have 12 possible values (2 × 3 × 2), and you can then compute the proportion of the total sample that each one should make up by multiplying the individual proportions together.

Once you have that, you will be able to use the "Sample" operator with that new attribute to pull the appropriate number (or proportion) of each of the individual classes.
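The arithmetic behind this can be sketched in Python, using the distributions given in the question. It assumes the three attributes are independent in the population, so that each combined share is simply the product of the individual shares; if they are not independent, the shares should come from the actual cross-tabulation instead. The helper names `cell_proportions` and `allocate`, and the choice of largest-remainder rounding, are assumptions for this example.

```python
from itertools import product

# Marginal distributions as stated in the question.
gender = {"F": 0.45, "M": 0.55}
location = {"area 1": 0.20, "area 2": 0.65, "area 3": 0.15}
nationality = {"locals": 0.80, "others": 0.20}


def cell_proportions(*marginals):
    """Share of each combined stratum, ASSUMING the attributes are independent."""
    cells = {}
    for combo in product(*(m.items() for m in marginals)):
        names = tuple(name for name, _ in combo)
        share = 1.0
        for _, p in combo:
            share *= p
        cells[names] = share
    return cells


def allocate(cells, n):
    """Turn shares into whole-number counts that sum to n (largest remainder)."""
    exact = {cell: share * n for cell, share in cells.items()}
    counts = {cell: int(value) for cell, value in exact.items()}
    leftover = n - sum(counts.values())
    # Hand the remaining units to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


cells = cell_proportions(gender, location, nationality)  # 12 combined strata
counts = allocate(cells, 1500)                           # target sample of 1500

for cell in sorted(cells):
    print(cell, round(cells[cell], 4), counts[cell])
```

The resulting per-stratum counts are the absolute sample sizes to request for each value of the combined attribute.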

Brian T., **Lindon Ventures** - www.lindonventures.com
