Setup and picking the correct learner

siebertlj Member Posts: 4 Contributor I
edited November 2018 in Help
Hi Folks,

I'm a beginner with data mining in general and RapidMiner in particular, so I hope you will forgive me if I am less than clear about exactly what help I need. The truth is I'm not sure. I'll be paying attention to this thread so I can answer any questions.

I've got a data set of survey responses on breastfeeding from women at four different hospitals, with 30 or so variables which are either nominal or ordinal, and an ordinal outcome variable for most of the participants at 6 months with three levels (no breastfeeding, any breastfeeding, and exclusive breastfeeding (no formula)). I say most because I have some survey data for participants who couldn't be found 6 months later, or who stopped breastfeeding earlier and were dropped from the study. I've set the outcome variable as the label attribute.

I'm a SAS user, so the first thing I did was run a logistic regression, removing non-significant variables one by one. That showed four significant variables. I would have liked to do a survival analysis, but unfortunately the date data was badly coded.

Still, with so many variables, many of which are somewhat correlated (like language and country of origin), my supervisor suggested that a signal detection methodology that automatically establishes cutpoints for variables would be helpful for understanding the data, based on some papers she had read. I eventually realized that this was a form of data mining, and that led me to RapidMiner, which appears to be a great program. Eventually I figured out how to get SAS data into it correctly, but now I'm somewhat stuck.

I've encountered some difficulty in using RapidMiner to understand the data I have. I've tried turning the ordinal outcome into a nominal variable, both with three levels and with two, and using Decision Tree, but it doesn't produce a tree, just one single bar listing one of the values of the label variable.

I've removed all variables but the ones logistic regression in SAS indicated were significant, and still didn't get a tree. In any case, I'm not even sure that Decision Tree is the learner I want to use, save that it seemed closest to what the papers my supervisor suggested I look at used.

In any case, any guidance regarding how to proceed with this analysis, assumptions I need to check etc. would be much appreciated. 



Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the decision tree performs some sort of pruning to avoid overfitting. It might be that the default settings are too rigorous. If you are in expert mode, several parameters that control the pruning are shown. I suggest switching them all off at first and setting the minimal_gain parameter to 0.
    If this does not work, please report back.


    By the way: How many participants did your survey have?

    Greetings,
      Sebastian
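
To illustrate why a minimal gain threshold can leave a tree with a single leaf, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: it uses plain information gain as a stand-in for whatever criterion the operator is configured with, and the split counts and the 0.1 threshold are hypothetical values chosen for illustration. Only the 320 vs. 125 class counts come from the thread.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` counts into `children` counts."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

def should_split(parent, children, minimal_gain):
    """Pre-pruning rule: split only if the gain exceeds the threshold."""
    return information_gain(parent, children) > minimal_gain

# Class counts from the thread: 320 breastfeeding vs. 125 not breastfeeding.
parent = [320, 125]
# Hypothetical split on a weak predictor (made-up counts for illustration).
children = [[170, 55], [150, 70]]

gain = information_gain(parent, children)
print(f"gain = {gain:.4f}")
print("split with minimal_gain=0.1:", should_split(parent, children, 0.1))
print("split with minimal_gain=0.0:", should_split(parent, children, 0.0))
```

With a weak predictor the gain is small, so a nonzero threshold rejects every candidate split at the root and the learner returns one leaf predicting the majority class. Setting the threshold to 0 accepts any split that improves purity at all, which is consistent with the very large tree reported later in the thread.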
  • siebertlj Member Posts: 4 Contributor I
    Hi Sebastian,

    Thanks for your help.  I really appreciate your guidance.

    Both setting the minimal gain to zero and getting rid of prepruning generates a huge tree seemingly with every variable. Presumably there is some middle ground between having every variable and only having the one.

    I think the problem may be in my label variable. There are a great many more breastfeeding mothers than non-breastfeeders (320 vs. 125). So if I preprune, or don't set minimal gain to zero, it just selects breastfeeders as the important variable and decides it is done. Even if I split the breastfeeding mothers into those who exclusively breastfeed and those who don't, I still have one much larger group, and that may lead the tree to prune to only that variable. Is there something I can do to have Decision Tree not consider the label variable in the tree prepruning? I don't expect you to provide the XML for me (though that's always nice), but if there's an operator or two I should look at that may work, that would be good to know. I'm happy to try and experiment.

    I have 976 participants, of which only 445 have values for the label. I may be able to add to the 445, though, by cross-referencing when people were dropped from the survey and seeing who was dropped earlier (indicative of not breastfeeding). Can Decision Tree benefit from participants who don't have a value for the label variable? I've been assuming no, and removed them from the dataset.

    Thanks,
    Lawrence
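
The class imbalance described above can be made concrete with a small sketch, again in plain Python rather than RapidMiner. It assumes a hypothetical record layout (a dict with a `label` key, `None` when the outcome is missing) and hypothetical label names; the counts 976, 445, 320 and 125 are the ones given in the post. A standard supervised tree learner has nothing to learn from rows without a label, so the sketch filters them out before computing the majority-class baseline.

```python
from collections import Counter

def drop_unlabelled(rows, label_key="label"):
    """Keep only rows that have a value for the label."""
    return [row for row in rows if row.get(label_key) is not None]

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Rebuild the counts from the post: 976 participants, 445 with a label.
rows = (
    [{"label": "breastfeeding"}] * 320
    + [{"label": "no breastfeeding"}] * 125
    + [{"label": None}] * (976 - 445)
)

labelled = drop_unlabelled(rows)
label, accuracy = majority_baseline([row["label"] for row in labelled])
print(len(labelled), label, round(accuracy, 3))
```

Always predicting the larger class is already right for about 72% of the labelled cases, so a single-leaf tree is a hard baseline to beat unless a split separates the classes clearly.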

  • siebertlj Member Posts: 4 Contributor I
    Sebastian,

    Okay, I think I've actually fixed the biggest part of this on my own. I increased the number of participants with a label value as I described, by using the point at which people stopped being followed up for surveys, and that changed the result with the default options, since I now had more of the second value of the label than the first.

    Since that confirmed my suspicion that the uneven split of the label values was part of the problem, I turned off prepruning and, looking over the other options, figured that changing the minimal size for split and the minimal leaf size might work. Sure enough it did, and now I have a tree that makes sense.

    I still want to know if I can prevent the program from showing only the label variable when I turn on prepruning and a few other things, but I'll start a new thread in the problems forum.

    Thank you so much for your help Sebastian. 
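
The two size limits mentioned above act as simple stopping rules. The sketch below shows one plausible reading of them in plain Python; the function name, the exact comparisons, and the threshold values are assumptions for illustration and may differ from how RapidMiner's Decision Tree operator applies its parameters.

```python
def split_allowed(node_size, child_sizes, min_size_for_split, min_leaf_size):
    """Size-based stopping rule for one candidate split.

    A node is split only if it holds at least `min_size_for_split` examples
    and every resulting child holds at least `min_leaf_size` examples.
    """
    if node_size < min_size_for_split:
        return False
    return all(size >= min_leaf_size for size in child_sizes)

# Hypothetical thresholds: nodes need 20 examples to split, leaves need 10.
print(split_allowed(445, [225, 220], 20, 10))  # large node, balanced children
print(split_allowed(15, [8, 7], 20, 10))       # node too small to split
print(split_allowed(40, [36, 4], 20, 10))      # one child would be too small
```

Raising these limits caps how far the tree can grow without relying on a gain threshold, which is one way to land between a single leaf and a tree that uses every variable.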



  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    it's always a pleasure answering such detailed questions, especially if the problem is scientifically more interesting than the usual stuff :)

    Greetings,
      Sebastian