Gradient Boosted Tree and performance

Barborka Member Posts: 8 Learner I
Dear community,

I want to understand my GBT model. I trained it and validated it on new data with quite a good result. Now I would like to understand the model to find out which attributes were the most decisive ones, but here I fail. For example, my Tree 1 is described as

ch1 in {1009351207,1047831207,... (46 more)}: 0.013 {}

ch1 not in {1009351207,1047831207,... (46 more)}

|   ch1 in {1009351207,1000751092,... (49 more)}: -0.009 {}

|   ch1 not in {1009351207,1000751092,... (49 more)}: -0.027 {}


Could you please explain where I can find these 46 more attributes? Or the 49 more attributes?


Thanks a lot.


Best Answer

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi @Barborka,

    If you're looking at the description of one tree and it only contains ch1, then that tree only considers ch1. Other trees might consider different attributes. The weights output of the entire model shows the summary; single trees are not that relevant on their own.

    I couldn't find a way to extract the whole list of values going into the rules. There are some promising operators like Tree to Rules and DecisionTree to ExampleSet (in the Converters extension), but these don't work with GBT, only with single trees.
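
    Outside RapidMiner, the H2O Python API (the GBT operator is based on H2O) can show the full level lists if you retrain an equivalent model there. A rough sketch only, with hypothetical file and column names; the exact node attributes may differ between h2o versions:

    # Sketch: train an equivalent GBM with the H2O Python API and print which
    # categorical values go left/right at each split, i.e. the values hidden
    # behind "... (46 more)" in the textual tree description.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.tree import H2OTree

    h2o.init()

    data = h2o.import_file("training_data.csv")      # hypothetical training data
    data["ch1"] = data["ch1"].asfactor()             # make sure ch1 is categorical

    model = H2OGradientBoostingEstimator(ntrees=50)
    model.train(x=["ch1", "ch2", "ch3"], y="label", training_frame=data)

    # Fetch the first tree of the ensemble ("Tree 1" in the RapidMiner output).
    # For multinomial models, tree_class must be given as well.
    tree = H2OTree(model=model, tree_number=0)

    def print_splits(node, depth=0):
        """Recursively print the split attribute and its categorical level lists."""
        if node is None or not hasattr(node, "split_feature"):
            return                                   # leaf node, nothing to print
        print("  " * depth, node.split_feature,
              "left:", getattr(node, "left_levels", None),
              "right:", getattr(node, "right_levels", None))
        print_splits(node.left_child, depth + 1)
        print_splits(node.right_child, depth + 1)

    print_splits(tree.root_node)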

    Regards,
    Balázs

Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @Barborka,

    With a complex model like GBT, it's very complicated to derive the attribute importance directly from the model.
    In your example, ch1 is the attribute name; the 1009... (46 more) entries are different values (data in the ch1 column).

    So in this example, only the attribute ch1 is relevant at all.
    The Gradient Boosted Trees operator has an output port called "wei" (weights). These are the attribute weights calculated by the model. Higher values in this table mark the more important attributes for predicting the label.

    If I saw a model like this, I would suspect that these are IDs and the model is just learning them. This would mean that the model is overfitted. I hope this is not the case with your data, but you should check.
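
    A rough Python analogy to that weights table, in case you want to cross-check outside RapidMiner (scikit-learn instead of the H2O-based operator, hypothetical file and column names):

    # Sketch: attribute importance for a gradient boosted trees model in Python.
    # scikit-learn needs categoricals encoded as numbers, so this is only an
    # analogy to the "wei" output, not the same computation.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("training_data.csv")                        # hypothetical data
    X = pd.get_dummies(df[["ch1", "ch2", "ch3"]].astype(str))    # one-hot encode
    y = df["label"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = GradientBoostingClassifier(n_estimators=100).fit(X_train, y_train)

    # Impurity-based importances: the rough equivalent of the attribute weights.
    weights = pd.Series(model.feature_importances_, index=X.columns)
    print(weights.sort_values(ascending=False).head(20))

    # Permutation importance on held-out data: attributes that only memorize the
    # training data (e.g. ID-like columns) drop to ~0 here, which exposes overfitting.
    perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)
    print(pd.Series(perm.importances_mean, index=X.columns)
            .sort_values(ascending=False).head(20))

    Because of the one-hot encoding, the column names of X also show which individual ch1 values carry the weight, which gets close to the "46 more" question.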

    Regards,
    Balázs
  • Barborka Member Posts: 8 Learner I
    Dear @BalazsBarany, thanks for the reply. In other trees I also have ch2 and ch3; I just pasted the first one as an example. Is it possible that ch2 and ch3 are not considered in this tree?

    And is there any way to find out which exact values are in these 46 more (and 49 more, etc.)? {1009351207,1047831207,... (46 more)}

    By the way, these are not IDs.


  • Barborka Member Posts: 8 Learner I
    Dear @BalazsBarany, thanks for your help. I will try something different then, maybe Python.