Count examples in overlapping time frame

DocMusherDocMusher Member Posts: 305   Unicorn
edited December 2019 in Help
Hi RM friends,
My data set consists of ID, a time of admission, a time of discharge. I would like  to calculate the number of examples (patients in casu) that are simultaneously admitted. As such a new attribute should provide me the number of patients admitted for each patient (example) during his/her admission time frame. 
Thanks a lot!!!
Sven
Tagged:

Best Answers

Answers

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,316  RM Data Scientist
    do i understand it correctly, that you have something like:
    PatientId     DateOfEntry    DateOfExit
    1                  7AM                9AM
    2                  7AM                10AM
    3                  9AM                10AM

    and want to know at each hour how many patients are in?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • DocMusherDocMusher Member Posts: 305   Unicorn
    Hi Martin, 
    In fact I want to know for each patient how many patients are in at time of admission and at discharge. This might give me insight if length of stay is related with number of patients admitted.
    Sven 
  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 416   Unicorn
    Hi,

    sounds like you are looking for a "generic join" functionality where you can specify the join criterion. In this case, the criterion would be:
    a.id <> b.id and overlaps(a.admission, a.discharge, b.admission, b.discharge)
    Assuming that both a and b refer to copies of your patient example set (a self join). 

    Check out this contribution: https://community.rapidminer.com/discussion/33908/generic-join-script
    You can write the overlaps() function yourself and specify the above join criterion to self-join your example set, and then aggregate by the a.id to count the number of the joined patients. You'll need to rename the attributes in the second copy before joining.

    Here are some example "overlaps" implementations (depends on your requirement, e.g. discharge time could be missing, meaning that the patient is still there):
    https://stackoverflow.com/questions/17106670/how-to-check-a-timeperiod-is-overlapping-another-time-period-in-java

    Regards,
    Balázs
    mschmitzDocMusherTghadiallysgenzer
  • DocMusherDocMusher Member Posts: 305   Unicorn
    Give me a few hours before I can check but already thanks to spend your time 4 this!!!
    Sven
  • DocMusherDocMusher Member Posts: 305   Unicorn

    lionelderkrikor,

    Your python script worked with the example you provided. I considered this as a solution. However with the dataset I am using "sehid" is the (group), "aankomstdt" admissiondatetime and "ontslagdt" dischargedatetime.

    With the python script I adapted the variables but the final count is zero for all examples. What am I doing wrong here?

    Hopefully you could help me?

    Cheers

    Sven

    sgenzer
  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 952   Unicorn
    Hi Sven,

    the final count is zero for all examples
    **** !!! my Python script is .... optimistic !!!!! :)


    More seriously, yes, of course, I can help you : In reality "sehid" is your "id" not the new "group" variable I introduced.
    The "group" variable allows to build group of patients in order to study the overlaps within each group.
    By assimiling the "id" to "group" there is in deed zero for all examples....
    In your case, you have apriori only one group.
    The bad new is that after execution, the process I shared is raising an error if there is only one group.
    The good new is that there is a (far-fetched) workaround.
    Grant me one hour, the time I build a new version of the process and check that it works with the data you provided.

    Sorry for the inconvenience..

    See you soon ! 

    Regards,

    Lionel 

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 952   Unicorn
    Sven,

    As said in my previous post, I found a workaround.
    It consists to extract the first example from your dataset,  assigning it group "B" 
    and append it at the end of your dataset (in order the global dataset has 2 groups)

    Given you have a huge dataset and time computation is very long. I used a Filter Example Range operator
    to execute the process only on a fraction of your data to check the execution. A priori the process works fine...and gives more relevant results ! ;)
    To execute the process on your whole dataset, please remove/disable the last Filter Example Range operator (between
    the Append operator and the Execute Python operator.

    The working process is in attached file.

    Keep me informed ! 

    Regards,

    Lionel
    TghadiallyDocMusher
  • DocMusherDocMusher Member Posts: 305   Unicorn
    Thanks++++++
    lionelderkrikor
  • DocMusherDocMusher Member Posts: 305   Unicorn

    Lionel,

    I gave the process a try on the full example range, its running now for 1 day and 20 hours with only 1.2 GB consumption stable over the entire period. What do you think, just let it run untill the finish (in that case, how long would it require by your estimation?) OR there a way to run the process in steps?

    Cheers

    Sven 

  • DocMusherDocMusher Member Posts: 305   Unicorn

    Hi Lionel and Balazs

    Thanks for the reply, I also thought that overlap is only computable in one batch because patiënt admission is a continuous. Each split of the dataset can bias missing cases that overlap between the subsets. Interesting to "feel" the impact of dimensionality on calculation time. I try to reconstruct how a human brain tries to look for overlap, I wonder if looping with ascending or descending times could not reduce possible combinations. Although theoretically all combinations are possible in overlap, this is only the case if admission and discharge differences are between zero and indefinite which is not realistic. Maybe the number of combinations can be reduced starting from median, average length of stay which could already cover x % of the cases calculated in a fraction of the time? 

    Thanks anyway!!!!

    Sven

  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 416   Unicorn
    Hi Sven,

    you can create "fuzzy" batches, like taking 10 % of the data, calculating minimum and maximum entry and exit times, and filtering the candidate dataset accordingly. Then you would remove duplicates that inevitably appear in the result.

    Regards,
    Balázs
  • DocMusherDocMusher Member Posts: 305   Unicorn

    Hi,

    What is your opinion on interlaps (https://brentp.github.io/interlap/)?

    Regards

    Sven

  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 416   Unicorn
    Hi,

    it sounds good (without having enough Python knowledge to check it).

    Actually, my first idea was to solve your problem by exporting the data to a PostgreSQL database and joining there. Postgres has an interval datatype and matching operators, with index support. I just didn't post that idea because I assumed you would be easily able to solve it in Studio. Maybe now's the time to try it.

    Regards,
    Balázs
Sign In or Register to comment.