"A Priori probabilities for Decision Trees"

Mario_Hofmann Member Posts: 9 Contributor II
edited June 2019 in Help

Currently I am looking for a method to determine a priori probabilities for the label value when using decision trees. I browsed the trees and operators but didn't find anything appropriate.




  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Mario,

    The a-priori probabilities can be calculated by aggregating by label with the count() aggregation function and then dividing each count by the total number of examples.

    The next version of RapidMiner, expected to be released within the next 2-3 weeks, will feature a count (fractional) aggregation function, which delivers the a-priori probabilities directly, without manually dividing by the number of examples.

    Best regards and a happy new year!
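
The count-and-divide calculation described in this reply can be sketched outside RapidMiner in a few lines of Python. This is only an illustration: the function name `a_priori_probabilities` and the toy label column are assumptions made for the example, not part of any RapidMiner API.

```python
from collections import Counter


def a_priori_probabilities(labels):
    """Return {label value: relative frequency} for a sequence of labels."""
    counts = Counter(labels)       # the count() aggregation, grouped by label
    total = sum(counts.values())   # total number of examples
    return {value: n / total for value, n in counts.items()}


# Hypothetical toy label column: one "yes" and three "no" examples.
labels = ["no", "yes", "no", "no"]
priors = a_priori_probabilities(labels)
print(priors)
```

The resulting fractions sum to 1, which is what a fractional count aggregation would be expected to return directly.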

  • Mario_Hofmann Member Posts: 9 Contributor II
    Hello Marius,

    Thanks for your feedback. What I was looking for was a method to adjust the a priori probabilities.

    I found that this can help to build different trees that give a higher weight to label values with a low relative frequency. For example, you may want a tree that gives high accuracy on one group, but the model cannot describe that group precisely enough. One way would be to work with MetaCost, but this did not work well on my example. Another way is to ignore the fact that the group has a low frequency (0.01) and increase it significantly (e.g. to 0.2). This leads to a different tree with lower accuracy in exchange for better precision. I'll give the weight-based trees a try, but was hoping that I had just missed an operator.



    p.s. a happy new year to you too, of course. :) 
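
One common way to express such an adjustment is through per-class example weights: each class gets the weight target prior divided by observed prior, so that the weighted class shares match the desired priors. The sketch below uses the 0.01 and 0.2 figures from the post; the function name, the label values "group" and "rest", and the assumption that the tree learner actually honors example weights are all illustrative and not taken from RapidMiner.

```python
from collections import Counter


def prior_adjustment_weights(labels, target_priors):
    """Per-class weight = target prior / observed prior."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        value: target_priors[value] / (n / total)
        for value, n in counts.items()
    }


# Hypothetical data: the group of interest has an observed prior of 0.01.
labels = ["group"] * 10 + ["rest"] * 990
class_weights = prior_adjustment_weights(labels, {"group": 0.2, "rest": 0.8})

# Attach the class weight to every example and check the weighted share.
example_weights = [class_weights[value] for value in labels]
group_mass = sum(w for value, w in zip(labels, example_weights) if value == "group")
weighted_share = group_mass / sum(example_weights)
print(class_weights, weighted_share)
```

With these numbers each "group" example weighs roughly 20 and each "rest" example roughly 0.81, so the group accounts for about 20% of the total weight instead of 1%.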
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn

    The easiest way would probably be to use the Sample operator with the balance_data parameter. This, however, only allows undersampling the majority class, not oversampling minority classes.

    Oversampling can only be done with the Sample (Bootstrapping) operator, which does not have the balance data option. Instead, it can take example weights into account when sampling: with Generate Attributes you can create a new attribute that contains the weight of each example. If you use a formula like if(labelAttr=="majority class label", 1, 10), then each minority-class example will be selected with a probability 10 times higher than each majority-class example.
    To define the new attribute as a weight, use Set Role and assign the weight role to it.

    Best regards,
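
The effect of weighted bootstrapping with the 1 versus 10 weights from this reply can be illustrated in plain Python. This is a sketch of sampling with replacement using per-example weights, not the implementation of the Sample (Bootstrapping) operator; the function name, the class sizes, and the seed are assumptions chosen for the example.

```python
import random


def weighted_bootstrap(examples, weights, sample_size, seed=None):
    """Draw sample_size examples with replacement, proportional to weights."""
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=sample_size)


# Hypothetical data set: 990 majority and 10 minority examples.
labels = ["majority"] * 990 + ["minority"] * 10

# Same rule as if(labelAttr=="majority class label", 1, 10).
weights = [1 if label == "majority" else 10 for label in labels]

sample = weighted_bootstrap(labels, weights, sample_size=10_000, seed=42)
minority_share = sample.count("minority") / len(sample)
print(f"minority share before: {10 / 1000:.3f}, after: {minority_share:.3f}")
```

With these sizes the minority class holds 100 of 1090 weight units, so its expected share in the sample is about 0.092 rather than 0.01. A larger weight than 10 would be needed to reach a share such as 0.2.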