Creating equally sized clusters that are representative of the population

Kristjan_Mar Member Posts: 2 Learner I
edited February 2021 in Help
Hi all,

I have a set of data (population) with individuals that have signed up to be a part of a group. When they signed up they gave some background information, leaving me with 5 variables that I am mostly focusing on. 

What I want to do is create 4 equally sized groups that are as representative of the whole population as possible. That is, I want to create 4 homogeneous groups.

Also, I have some other columns in the dataset that are important in handling/using the dataset. I would like this information to be included in each of the groups (subsamples) so that they still match the respondent that they should belong to. 

In short: How can I create four homogeneous subsamples that are representative of the population, using only selected variables from the dataset?

Cheers, K


Best Answers

  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Solution Accepted
    Hi @Kristjan_Mar, it seems you need to create 4 stratified samples of your data.
    For that you need to use the Split Data operator with the sampling type stratified.
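
For readers who want to see what a stratified split does outside RapidMiner, here is a minimal Python sketch. It is not the Split Data operator itself, only an illustration of the idea under stated assumptions: the field names `id`, `region` and `age` and the helper `stratified_split` are hypothetical, and `region` stands in for whichever nominal attribute is chosen as the label.

```python
import random
from collections import defaultdict

def stratified_split(rows, label, n_groups=4, seed=42):
    """Deal rows into n_groups so every group gets about the same share of each label class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)

    groups = [[] for _ in range(n_groups)]
    position = 0  # running counter across classes keeps group sizes within one row of each other
    for cls in sorted(by_class):  # assumes label values are sortable, e.g. strings
        members = by_class[cls]
        rng.shuffle(members)  # random order within each class
        for row in members:
            groups[position % n_groups].append(row)
            position += 1
    return groups

# Hypothetical data: 8 respondents from "north", 4 from "south".
rows = [{"id": i, "region": "north" if i < 8 else "south", "age": 20 + i} for i in range(12)]
groups = stratified_split(rows, label="region")

for k, group in enumerate(groups):
    counts = {c: sum(r["region"] == c for r in group) for c in ("north", "south")}
    print(k, len(group), counts)
```

With this toy data every group ends up with 3 rows, 2 from "north" and 1 from "south", and each row keeps all of its columns.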

    Hope that helps you.
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted
    I am a bit confused by the wording of your intended outcome: "as representative of the whole population as possible" and "homogeneous" are typically not synonymous. If you want the groups to be as representative of the whole as possible, you basically want random subsets, which you can create easily with Split Data by choosing the sampling type shuffled. You only need the sampling type stratified if you first set a nominal attribute as the label to stratify on and want each resulting partition to contain the same proportions of the label classes. I suggest having a look at the tutorial and help text of the Split Data operator. (If you only want to look at the 5 attributes you are interested in, you can use Select Attributes before the split to bring in just those.)
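
The shuffled split described above can be sketched in plain Python as follows. This is only an illustration of the idea, not RapidMiner's implementation: the helper `shuffled_split` and the fields `id`, `age` and `notes` are hypothetical. Whole records are shuffled, so any extra columns stay attached to the right respondent, and one selected variable is then compared across groups as a rough representativeness check.

```python
import random
from statistics import mean

def shuffled_split(rows, n_groups=4, seed=42):
    """Shuffle a copy of rows and slice it into n_groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

# Hypothetical population: "age" is a variable of interest, "notes" is an extra column to carry along.
population = [{"id": i, "age": 20 + i % 30, "notes": f"respondent {i}"} for i in range(100)]
parts = shuffled_split(population)

print("population mean age:", round(mean(r["age"] for r in population), 1))
for k, part in enumerate(parts):
    # Each group's mean should be close to the population mean, but not identical.
    print(k, len(part), round(mean(r["age"] for r in part), 1))
```

If the group means differ more than is acceptable for one of the 5 variables, that is the situation in which stratifying on that variable becomes worth considering.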

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

