Newbie question - cross validation using decision tree

mystic86 Member Posts: 3 Contributor I
edited November 2018 in Help



Really sorry about such a stupid question, I am very new!


I want to do cross-validation with a decision tree. Here is my dataset:


Screen Shot 2016-07-30 at 8.38.09 p.m..png



Here is the setup I have:




  • mystic86 Member Posts: 3 Contributor I

    Sorry, the rest of my first post was cut off :(


    Here is the setup I have:


    Screen Shot 2016-07-30 at 8.39.08 p.m..png


    And here is the setup inside the X-validation:


    Screen Shot 2016-07-30 at 8.39.53 p.m..png

  • mystic86 Member Posts: 3 Contributor I

    When I run this, it seems to be doing nothing at all except summarising my data pretty much - here are some screenshots of the results:


    Screen Shot 2016-07-30 at 8.41.39 p.m..png
    Screen Shot 2016-07-30 at 8.41.17 p.m..png


    Can anyone help me figure out what is going on here? Is it something to do with how my attributes are set up, in terms of their roles etc.?


    Screen Shot 2016-07-30 at 8.44.20 p.m..png
    Screen Shot 2016-07-30 at 8.44.30 p.m..png


    Thanks!! :)

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    Hi Philip,


    It looks like the model is predicting YES for every example, so it is not a useful model.

    Is it possible for you to share the data?


    Are you applying pruning to the decision tree? Try without pruning, or try changing the confidence value.

    You may also try some other models, since decision tree does not seem to be getting close.


    Keep in mind that some learners can only predict binary values, some polynominal values, and some numbers, and there are also limitations on the data types that can be used as input variables.

    You can use RM operators to tweak data to meet those, but plan accordingly.
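One way to confirm that a model is only predicting one class is to compare its accuracy with the share of the majority class and to look at the recall of each class. The following is a minimal Python sketch, not RapidMiner code. The labels and the 80/20 class split are made up for illustration, and `always_yes` is a hypothetical stand-in for a tree that has collapsed to a single leaf.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall_per_class(actual, predicted):
    """For each class: correctly predicted cases / actual cases of that class."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in totals}


def always_yes(rows):
    """Hypothetical model that ignores its input and answers YES every time."""
    return ["YES" for _ in rows]


# Made-up labels (an assumption, not the poster's data): 80 YES and 20 NO.
actual = ["YES"] * 80 + ["NO"] * 20
predicted = always_yes(actual)

majority_share = Counter(actual).most_common(1)[0][1] / len(actual)
model_accuracy = accuracy(actual, predicted)
recalls = recall_per_class(actual, predicted)

print("majority share:", majority_share)
print("accuracy:", model_accuracy)
print("recall per class:", recalls)
```

If the accuracy is the same as the majority share and one class has a recall of zero, the model has learned nothing beyond the class distribution.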




  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Some other questions to ask: does the data need to be balanced too? Is there any feature generation that can be done?

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    @Thomas_Ott raises a good point: unbalanced data is an interesting and very frequent problem in classification.

    As you found empirically, a training set with unequal numbers of examples per class may produce a classifier that is biased towards the majority class. When such a classifier (not only a decision tree) is evaluated on a test set that is also imbalanced, its accuracy looks optimistic. In an extreme case, like your example, the classifier might assign every single test case to the majority class, thereby achieving an accuracy equal to the proportion of test cases that belong to the majority class. Some strategies for learning from unbalanced data:

    1. Undersampling,

    by removing samples from the majority class using an undersampling algorithm, for instance the Sample operator in RapidMiner with an absolute sample size and the balance data option, which lets you specify the sample size per class



    2. Oversampling,

    by generating new samples from the minority class using an oversampling algorithm, for instance the bootstrapping Sample operator in RapidMiner

    3. Cost-sensitive learning,

    by changing the decision tree building algorithm so that misclassifications of minority class samples have a higher cost than misclassifications of majority class samples. The MetaCost operator in RapidMiner is a good choice. Please refer to the built-in tutorial process "Using the MetaCost operator for generating a better Decision Tree".

    4. Ensemble learning,

    by using several decision trees instead of a single decision tree. Check out the Bagging operator in RapidMiner for bootstrap aggregating of decision tree models. In our latest release, RapidMiner 7.2, there are also Gradient Boosted Trees (@IngoRM ;)), so say hello to our favourite learners.


    hello new alg.png

    5. Combination,

    by combining undersampling, oversampling, and ensemble learning strategies. Most state-of-the-art learning methods for imbalanced data use a combination of different strategies. Choose the one that is best for you. I would recommend considering at least two of the mentioned approaches in conjunction.
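Strategies 1 and 2 above can be illustrated with a short Python sketch. This is not how the RapidMiner operators are implemented; it is plain random undersampling and random oversampling with replacement, using made-up rows and a hypothetical `label_of` function to read the class of each row.

```python
import random
from collections import Counter, defaultdict


def group_by_label(rows, label_of):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups


def undersample(rows, label_of, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    groups = group_by_label(rows, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def oversample(rows, label_of, rng):
    """Add random copies (drawn with replacement) until every class
    is as large as the largest class."""
    groups = group_by_label(rows, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Made-up data (an assumption): 80 YES rows and 20 NO rows as (id, label) pairs.
rows = [(i, "YES") for i in range(80)] + [(i, "NO") for i in range(80, 100)]
label_of = lambda row: row[1]
rng = random.Random(42)  # fixed seed so the sketch is repeatable

under_counts = Counter(label_of(r) for r in undersample(rows, label_of, rng))
over_counts = Counter(label_of(r) for r in oversample(rows, label_of, rng))
print("undersampled:", dict(under_counts))
print("oversampled:", dict(over_counts))
```

In a cross-validation, resampling like this would normally be applied only to the training part of each fold, so that the test part keeps the original class distribution.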

    We would be happy to post some additional references to the literature if you would like to follow up on this.
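The idea behind cost-sensitive learning can be sketched in a few lines of Python: instead of predicting the most probable class, predict the class with the lowest expected misclassification cost. This is only an illustration of that decision rule, not the MetaCost operator itself, and the probabilities and cost values are made up.

```python
def min_cost_label(probabilities, cost):
    """Return the label with the lowest expected misclassification cost.

    probabilities: dict mapping each true label to its estimated probability.
    cost[true][predicted]: cost of predicting `predicted` when the truth is `true`.
    """
    def expected_cost(predicted):
        return sum(p * cost[true][predicted] for true, p in probabilities.items())

    return min(probabilities, key=expected_cost)


# Assumed costs: missing a NO (the minority class) is five times as
# expensive as missing a YES.
cost = {
    "YES": {"YES": 0.0, "NO": 1.0},
    "NO": {"YES": 5.0, "NO": 0.0},
}

# Assumed class probabilities from some model for one example.
probabilities = {"YES": 0.7, "NO": 0.3}

plain_label = max(probabilities, key=probabilities.get)
cost_label = min_cost_label(probabilities, cost)
print("most probable label:", plain_label)
print("lowest expected cost label:", cost_label)
```

With these numbers, predicting YES has an expected cost of 0.3 × 5 = 1.5 and predicting NO has an expected cost of 0.7 × 1 = 0.7, so the cost-sensitive choice is NO even though YES is more probable.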





