Help with Algorthim Selection

macctenmaccten Member Posts: 28 Contributor II

This is more of a theoretical question more than anything else but I was wondering if I could get any ideas from you guys in relation to a dataset I was thinking of data mining

Consider a dataset which contains IDs. These IDs identify a person. Over the course of X amount of days (Timestamp), these attributes that change so for example if this was a patient in a hospital.
They saw a 3 different doctors 20 times over a period of say 60 days where symptoms were recorded. Other information such as weight height, diet, exercise regime are also recorded

Sometimes these symptoms persisted so for example, the patient complained about a headache on visit 2 and 3 with Doctor A but the headache had disappeared on visit 4 with Doctor B.
The visits can be extrapolated from the timestamp by rolling it up to day. Equally diet or exercise levels could change due to fatigue or loss of appetite

The label or target being predicted would be if a patient - based on these changing symptoms - will get a certain disease for example heart disease.
The key factor I want to include here is that the symptoms occur over time and are subject to change

Could someone recommend a data mining method to predict basically a 1 or 0 label factoring in the time element as well as the other variables for the patient.

I should probably mention, I don’t actually have this data set but the theory behind it could be applied to numerous other problems that have an object, that has attributes that change state over time.
It looks like it might be a good fit for a Hidden Markov Model but it never hurts to get other ideas on how you would approach the problem for the education alone.
Also if anyone has seen a good implementation of something similar using rapid miner or even R within Rapidminer, that would be great

Thanks for your time


  • Options
    AsokaAsoka Member Posts: 14 Contributor II
    My first thought is Logistic Regression as it is designed to work with a dependent variable / label that is binary in nature.

    The challenge I see is that you're really looking for a series of categorical answers (many possible label values can result, one for each possible disease, plus at least one more for "no disease expected").  It might also be that a Logistic model that is focused on a single disease with an answer of 1 or 0 is the way to approach this, with the realization that you'll end up with at least 1 model for every disease you want to check for.

    To build this model, using your example of heart disease, you will need a training data set that contains examples or observations of patients with the various attributes you have available and that might be interesting for the analysis.  That training data set will need to include examples of people that don't get heart disease, as well as people that do get heart disease, so that the modeling approach you use has a chance of finding the pattern that it can use to discriminate between the two outcomes.

    Time is going to be interesting, as it most always is.  "Get heart disease" is probably too broad of a value / field definition to model, as many people will develop heart disease over a 50 year period, though whether you know enough 50 years before to predict it is questionable (and more importantly, whether you can take meaningful action on that information).  One formulation of a problem along these lines that I have seen (it was a class problem rather than a real business problem), was a problem where a health insurance company wanted to predict who would be likely to experience an unplanned emergency room visit in the coming year.

    One approach might be to take what you know in the aggregate (one row, one patient, lots of attributes) for each patient as of 3 years ago, and then use what actually happened to those patients 2 years ago as the label.  Then further add to the training data with what you know about the individual as of 2 years ago, with a dependent variable of what happened to them 1 year ago.  With that as training data to train and validate the model, you could then apply that model to what you know about people from the previous year, and use the model to predict what will happen to those people in the coming year.  The implication is that with the information in hand, you can then identify specific risk factors and intervene.

    So that's one kind of an approach to work time into the model, but without going into time series analysis.

    I haven't studied, and don't have any ideas, for how to predict a categorical label, from a time series of data.  At least one problem you will need to deal with at this grain (level of detail of the data), is that the label is going to change over time.
Sign In or Register to comment.