
A Practical Guide to Gradient Boosted Trees - Part I, Regression

MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
edited November 2018 in Knowledge Base

Gradient Boosted Trees are one of the most powerful algorithms for classification and regression. They have all the advantages of trees, including the ability to handle nominal data well. The downside of any powerful, multivariate method is its complexity. This is the first in a series of articles drilling down into the algorithm and explaining how it works.

 

The Setting

In this tutorial we will investigate how a Gradient Boosted Tree performs a regression. As training data we use a sqrt function on the interval between 0 and 1.

 

[Figure Raw.png: the raw training data, sqrt(x) between 0 and 1]

To make the algorithm easier to grasp we use a few simplifications. We do a regression for a label variable (y) with only one independent variable (x).

We use a GBT with a depth of 1 and a learning rate of 1. We will discuss these options in more depth in later articles.
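If you want to follow along in code, a minimal sketch of this setting could look like the following (Python and the number of sample points are my assumptions, not part of the article):

```python
import numpy as np

# Toy training data: y = sqrt(x) on the interval [0, 1]
x = np.linspace(0.0, 1.0, 200)   # 200 evenly spaced points (an assumption)
y = np.sqrt(x)
```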

 

Initialization

 

Before starting with the real algorithm we set a baseline prediction. Our whole algorithm consists of many steps. The resulting function f(x), which approximates the sqrt function, is the sum of the results of the individual steps. The first step is slightly different: we set the baseline prediction f0 to the average label. In our data this is 0.667.
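As a sketch, the baseline is simply the mean of the label (reusing the toy data from above):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
y = np.sqrt(x)

f0 = y.mean()        # baseline prediction f0 = average label
print(round(f0, 3))  # roughly 2/3; the article reports 0.667 for its data
```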

 

Now we can move to the first iteration.

 

Step 1 - Calculate Errors

 

As a first step we calculate the errors. In our case these are the residuals between our previous prediction f(x) and the truth y. In our first iteration we have f(x) = f0 = 0.667.

We define r = y - f(x) as the error for each individual example. It is crucial to note that the tree in each iteration does not predict the label, but r! Since this is only a shift by a constant, our sqrt function still looks like this:

[Figure Init.png: the residuals r after subtracting the baseline]
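In code this step is a single subtraction once the baseline exists (a sketch, continuing the toy example):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
y = np.sqrt(x)

f0 = y.mean()               # baseline from the initialization
fx = np.full_like(y, f0)    # current approximation f(x) = f0 for every example
r = y - fx                  # residuals: the target the first tree is trained on
```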

Step 2 - Tree Building

We now build a regression tree predicting r. Since we limit our tree to a depth of one we get:

[Figure Tree1.png: the first depth-1 tree]

 

 

This tree results in two leaves. We call these leaves R11 and R12, representing the first and the second leaf in the first iteration. For each of the two leaves we need to find the best prediction.
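As a sketch of this step, a depth-1 regression tree (a decision stump) can be fit on the residuals, for example with scikit-learn (the library choice is my assumption; the article itself uses RapidMiner's Gradient Boosted Trees operator):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x = np.linspace(0.0, 1.0, 200).reshape(-1, 1)   # toy data as a column vector
y = np.sqrt(x).ravel()

f0 = y.mean()
r = y - f0                                      # residuals from Step 1

stump = DecisionTreeRegressor(max_depth=1)      # depth of 1, as in the article
stump.fit(x, r)                                 # note: fit on r, not on y
print(stump.tree_.threshold[0])                 # split point of the root; ~0.38 here
```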

 

Step 3 - Prediction Definition

In our regression problem the calculation of what we predict in the leaves is fairly straightforward. We want to find the prediction which minimizes the difference between our function and the truth. In our case this is the average of r in each leaf. We define the prediction in the i-th iteration and the j-th leaf as gij and get for our data:

 

g11 = -0.254

g12 = +0.157
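In code, these values are just the mean residuals on either side of the split (the split point 0.382 is taken from the article's first tree):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
y = np.sqrt(x)
f0 = y.mean()
r = y - f0

split = 0.382                 # split point of the first tree (from the article)
g11 = r[x < split].mean()     # left leaf; the article reports -0.254
g12 = r[x >= split].mean()    # right leaf; the article reports +0.157
```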

 

 

Step 4 - Update of our Function

 

 We can now update our approximated function to estimate y, called f(x). Let's define the result of our first iteration f1:

f1 = -0.254 if x < 0.382, else +0.157

And f(x) = f0 + f1. This results in a total function which approximates our original sqrt function like this:

[Figure f1.png: the approximation f(x) = f0 + f1 after the first iteration]
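As a sketch, the model after one iteration is just the baseline plus a piecewise constant (values taken from the article's first tree):

```python
import numpy as np

f0 = 0.667                                 # baseline
split1, g11, g12 = 0.382, -0.254, 0.157    # first tree (values from the article)

def f(x):
    """Approximation after one iteration: f(x) = f0 + f1(x)."""
    return f0 + np.where(x < split1, g11, g12)

print(f(np.array([0.1, 0.5, 0.9])))        # [0.413, 0.824, 0.824]
```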

Step 1 - Part 2, Build residuals again

We can now do Step 1 again. But this time we build the residuals between our new function approximation f(x) = f0 + f1 and y. So we get a picture like this:

[Figure r2.png: the residuals after the first iteration]

 

In this iteration of the algorithm we try to fit this function of residuals to get a better overall approximation.

Steps 2 and 3 - Part 2, Build the tree and calculate a prediction

Again we build a tree which splits our data into two parts. If you look at the picture above, we search for the point on the x-axis where we can draw two lines parallel to the x-axis which have a minimum distance to the residuals.

The tree tells us that the split is performed at 0.107 and that the two parallel lines g21 and g22 are at -0.193 and +0.023.


[Figure Tree2.png: the second depth-1 tree]
[Figure f2.png: the contribution of the second tree]
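A sketch of the second iteration's residuals and leaf values (the split point 0.107 is taken from the article; the toy data is the same as before):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
y = np.sqrt(x)

f0 = y.mean()
f1 = np.where(x < 0.382, -0.254, 0.157)   # first tree (article's values)
r2 = y - (f0 + f1)                        # residuals for the second iteration

split2 = 0.107                            # split of the second tree (from the article)
g21 = r2[x < split2].mean()               # the article reports -0.193
g22 = r2[x >= split2].mean()              # the article reports +0.023
```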

Step 4 - Updating the function

The last step is to update our function. It is now:

f(x) = f0 + f1 + f2

or in full words:

f0 = 0.667

f1 = -0.254 if x < 0.382, else +0.157

f2 = -0.193 if x < 0.107, else +0.023
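To read the function off for a single example: for x = 0.25 we get f(0.25) = 0.667 - 0.254 + 0.023 = 0.436, while the true value is sqrt(0.25) = 0.5. Further iterations would keep shrinking this gap.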

 

Or in drawn form it looks like this:

[Figure result.png: the approximation f(x) = f0 + f1 + f2 after two iterations]

 

Please be aware that the second tree also raises the right-hand side of the function by a little bit.

 

With this knowledge we can now move on and start with Step 1 again. That's it.
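Wrapped into a loop, the whole procedure for this simplified setting looks roughly like the following sketch (depth-1 scikit-learn trees and a learning rate of 1 mirror the article's simplifications; the library itself is my assumption):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = np.sqrt(x).ravel()

f0 = y.mean()
fx = np.full_like(y, f0)      # current approximation f(x), starts at the baseline
trees = []

for i in range(2):            # two boosting iterations, as walked through above
    r = y - fx                                             # Step 1: residuals
    stump = DecisionTreeRegressor(max_depth=1).fit(x, r)   # Steps 2 + 3: tree on r
    trees.append(stump)
    fx = fx + stump.predict(x)                             # Step 4: update (learning rate 1)
```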


Naming

  • y is the value of the label, i.e. the true value you want to predict
  • x is the independent variable, i.e. one (or many) regular attributes
  • f(x) is the derived formula to predict y, i.e. the model
  • Rij is the j-th leaf in the i-th iteration of the algorithm.
  • gij is the constant added to our function for all examples that fall into the j-th leaf in the i-th iteration. Note: often called gamma in the literature.
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany

Comments

    phivu Member Posts: 34 Guru

    Thank you for a very helpful step-by-step explanation of the GBT algorithm!
    Are there any typos in the last step (Step 4 - Updating the function)? Why do the conditions overlap (x < 0.382 and x < 0.107)? Could you explain more about how to determine the position of a split (as at x = 0.382 and x = 0.107), or how f1 and f2 are built?

    Muhammed_Fatih_ Member Posts: 93 Maven
    Hi @mschmitz,

    thank you for the interesting guide on how Gradient Boosted Trees work. Very interesting!

    You wrote the following: 

    "It is crucial to note that the tree in each iteration is not predicting the label, but r!"

    Does this mean that the prediction of the respective r is considered in each iteration to calculate the overall performance (avg(r's in each iteration)) which is finally displayed by the Performance Vector?

    Or does the Performance Vector represent the r prediction in the final improved state of the tree?

    Thank you in advance for your feedback! 

    Best regards, 

    Fatih
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    I think neither. When you evaluate the trees, you always evaluate all trees and then sum over the individual results. Or, mathematically speaking:
    f(x) = \sum_i f_i(x)
    where i iterates over the different trees.
    The performance vector is then always evaluated on the sum, not on the individual trees.
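    In code, that evaluation looks roughly like this (an illustrative sketch, not RapidMiner's actual implementation):

```python
def predict(x, f0, trees):
    # boosted prediction = baseline + sum over the outputs of all individual trees
    return f0 + sum(tree.predict(x) for tree in trees)
```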



    If you want to go deeper into it, see Hastie, Tibshirani et al.: https://web.stanford.edu/~hastie/Papers/ESLII.pdf . This guide is basically my version of the corresponding chapter from the book.


    Best,
    Martin


    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany