Gradient Boosted Tree and performance

Barborka Member Posts: 8 Learner I
Dear community,

I want to understand my GBT model. I trained it and validated it on new data with quite a good result. Now I would like to understand the model to find out which attributes were the most decisive ones, but here I fail. For example, my Tree 1 is described as

ch1 in {1009351207,1047831207,... (46 more)}: 0.013 {}

ch1 not in {1009351207,1047831207,... (46 more)}

|   ch1 in {1009351207,1000751092,... (49 more)}: -0.009 {}

|   ch1 not in {1009351207,1000751092,... (49 more)}: -0.027 {}


Could you please explain where I can find these 46 more attributes? Or the 49 more attributes?


Thanks a lot.


Best Answer

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi @Barborka,

    If you're looking at the description of one tree and it only contains ch1, then that tree only considers ch1. Other trees might consider different attributes. The weights output of the entire model shows the summary; single trees are not that relevant on their own.

    I couldn't find a way to extract the whole list of values going into the rules. There are some promising operators like Tree to Rules and DecisionTree to ExampleSet (in the Converters extension), but these don't work with GBT, only with single trees.
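
    Outside RapidMiner, the H2O Python API (the GBT operator is based on H2O) can show the full level lists if you retrain an equivalent model there. A rough sketch only, with hypothetical file and column names; the exact node attributes may differ between h2o versions:

    # Sketch: train an equivalent GBM with the H2O Python API and print which
    # categorical values go left/right at each split, i.e. the values hidden
    # behind "... (46 more)" in the textual tree description.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.tree import H2OTree

    h2o.init()

    data = h2o.import_file("training_data.csv")      # hypothetical training data
    data["ch1"] = data["ch1"].asfactor()             # make sure ch1 is categorical

    model = H2OGradientBoostingEstimator(ntrees=50)
    model.train(x=["ch1", "ch2", "ch3"], y="label", training_frame=data)

    # Fetch the first tree of the ensemble ("Tree 1" in the RapidMiner output).
    # For multinomial models, tree_class must be given as well.
    tree = H2OTree(model=model, tree_number=0)

    def print_splits(node, depth=0):
        """Recursively print the split attribute and its categorical level lists."""
        if node is None or not hasattr(node, "split_feature"):
            return                                   # leaf node, nothing to print
        print("  " * depth, node.split_feature,
              "left:", getattr(node, "left_levels", None),
              "right:", getattr(node, "right_levels", None))
        print_splits(node.left_child, depth + 1)
        print_splits(node.right_child, depth + 1)

    print_splits(tree.root_node)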

    Regards,
    Balázs

Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @Barborka,

    With a complex model like GBT, it's very complicated to derive the attribute importance directly from the model.
    In your example, ch1 is the attribute name; the 1009... (46 more) entries are different values (data in the ch1 column).

    So in this example, only the attribute ch1 is relevant at all.
    The Gradient Boosted Trees operator has an output port called "wei" (weights). These are the attribute weights calculated by the model. Higher values in this table mark the more important attributes for predicting the label.

    If I saw a model like this, I would suspect that these are IDs and the model is just learning them. This would mean that the model is overfitted. I hope this is not the case with your data, but you should check.
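
    A rough Python analogy to that weights table, in case you want to cross-check outside RapidMiner (scikit-learn instead of the H2O-based operator, hypothetical file and column names):

    # Sketch: attribute importance for a gradient boosted trees model in Python.
    # scikit-learn needs categoricals encoded as numbers, so this is only an
    # analogy to the "wei" output, not the same computation.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("training_data.csv")                        # hypothetical data
    X = pd.get_dummies(df[["ch1", "ch2", "ch3"]].astype(str))    # one-hot encode
    y = df["label"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = GradientBoostingClassifier(n_estimators=100).fit(X_train, y_train)

    # Impurity-based importances: the rough equivalent of the attribute weights.
    weights = pd.Series(model.feature_importances_, index=X.columns)
    print(weights.sort_values(ascending=False).head(20))

    # Permutation importance on held-out data: attributes that only memorize the
    # training data (e.g. ID-like columns) drop to ~0 here, which exposes overfitting.
    perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)
    print(pd.Series(perm.importances_mean, index=X.columns)
            .sort_values(ascending=False).head(20))

    Because of the one-hot encoding, the column names of X also show which individual ch1 values carry the weight, which gets close to the "46 more" question.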

    Regards,
    Balázs
  • Barborka Member Posts: 8 Learner I
    Dear @BalazsBarany, thanks for the reply. In other trees I also have ch2 and ch3; I just pasted the first one as an example. Is it possible that ch2 and ch3 are not considered in this tree?

    And is there any way to find out which exact values are in these 46 more (and 49 more, etc.)? {1009351207,1047831207,... (46 more)}

    By the way, these are not IDs.


  • Barborka Member Posts: 8 Learner I
    Dear @BalazsBarany, thanks for your help. I will try something different then, maybe Python.