Sample One row within a group

yrgowtham Member Posts: 6 Contributor II
edited September 2019 in Help

Hi Experts,
I have a table with PatientID, the day of their stay and max vital signs for the day.
I want to create a process that randomly samples one day for each patient.
Table structure:
PatientID    Day Number    Max_Temp    Max_Resp    Max_SBP    Max_HR
ABC          1             98.7        32          90         72
ABC          2             98.8        33          95         75
ABC          3             95          35          90         78
DEF          1             98.7        32          90         72
DEF          2             95          35          90         78
The output of my process should contain one randomly picked day for each patient and look like this:

PatientID    Day Number    Max_Temp    Max_Resp    Max_SBP    Max_HR
ABC          2             98.8        33          95         75
DEF          1             98.7        32          90         72

 

Methods I have tried:

  1. I tried the Sample operator with the "balance data" option, but it requires me to list every PatientID in
    the parameter list (sample size per class). That is impractical because there are more than 50,000 patient IDs.
  2. R code (Execute R) would solve this, but I would like to know whether there is a way to do it in RapidMiner itself.

    I am looking for a more automated method to achieve this in RapidMiner.

    Please let me know if you need more info.
    Thanks in advance :)


Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can sort your dataset by a random variable (which you can add with "Generate Attributes" if you need to) and then simply use "Remove Duplicates" on the patient ID to get rid of the extra records. This should give you one random day per patient in the resulting dataset.
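For readers who want to see the logic outside RapidMiner, the shuffle-then-deduplicate idea can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `sample_one_row_per_group` and the dictionary row layout are hypothetical, and the sketch assumes de-duplication keeps the first row it encounters for each patient, which the thread does not confirm for the "Remove Duplicates" operator.

```python
import random

def sample_one_row_per_group(rows, key, seed=None):
    """Shuffle the rows, then keep the first row seen for each key value."""
    rng = random.Random(seed)
    shuffled = list(rows)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)      # stands in for sorting by a random attribute
    first_seen = {}
    for row in shuffled:
        # Assumed keep-first de-duplication on the key column.
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())

# Data copied from the table in the question.
rows = [
    {"PatientID": "ABC", "Day": 1, "Max_Temp": 98.7, "Max_Resp": 32, "Max_SBP": 90, "Max_HR": 72},
    {"PatientID": "ABC", "Day": 2, "Max_Temp": 98.8, "Max_Resp": 33, "Max_SBP": 95, "Max_HR": 75},
    {"PatientID": "ABC", "Day": 3, "Max_Temp": 95.0, "Max_Resp": 35, "Max_SBP": 90, "Max_HR": 78},
    {"PatientID": "DEF", "Day": 1, "Max_Temp": 98.7, "Max_Resp": 32, "Max_SBP": 90, "Max_HR": 72},
    {"PatientID": "DEF", "Day": 2, "Max_Temp": 95.0, "Max_Resp": 35, "Max_SBP": 90, "Max_HR": 78},
]

sample = sample_one_row_per_group(rows, key="PatientID", seed=42)
for row in sample:
    print(row["PatientID"], row["Day"])
```

Each patient appears exactly once in `sample`, and no per-patient parameter list is needed, so the same approach scales to tens of thousands of patient IDs.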

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    @Telcontar120 - pretty elegant solution! However, why would you want to sort the dataset by a random variable beforehand?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    @kypexin Sorting by a random variable should help ensure it doesn't systematically keep the same day for each patient. (I'm not 100% sure what the internal logic is for removing duplicates, but it might conceivably depend on the order in which the records appear, so if your dataset is sorted by patient and day, that could lead to non-random sampling results.)
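The concern about ordering can be illustrated with a small Python sketch. It assumes a keep-first rule for de-duplication, which is a guess about how such an operator might behave and not something the thread confirms for RapidMiner; `keep_first_per_key` is a hypothetical helper written for this example.

```python
def keep_first_per_key(rows, key):
    """Assumed behaviour: keep the first row encountered for each key value."""
    kept = {}
    for row in rows:
        kept.setdefault(row[key], row)
    return list(kept.values())

# Rows sorted by patient and day, as in the question's table (vitals omitted).
sorted_rows = [
    {"PatientID": "ABC", "Day": 1},
    {"PatientID": "ABC", "Day": 2},
    {"PatientID": "ABC", "Day": 3},
    {"PatientID": "DEF", "Day": 1},
    {"PatientID": "DEF", "Day": 2},
]

days_without_shuffle = [row["Day"] for row in keep_first_per_key(sorted_rows, "PatientID")]
print(days_without_shuffle)  # [1, 1]
```

Under that assumption, de-duplicating the sorted data always returns day 1 for every patient, which is exactly the systematic result that a random sort beforehand is meant to avoid.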

     
