Appendix to Gradient Boosting and XGBoost

original post

A.1 Finding the weights of a decision tree

Start with the loss function, loss, and a decision tree h. The decision tree has two parameters I want to regularize: the number of leaves \(T\) and the leaf weights \(w\). However, since in this instance my tree structure is fixed, let’s ignore the \(T\) term. Note that here, \(y_{res}\) is the real target if this is the first tree, or the gradient of the error if this is a subsequent tree. For additional simplicity, let’s assume only L2 regularization is used (\(\alpha = 0\)).

\[ L = \sum_{i=0}^{n} loss(y_{res}, h(x)) + \frac{1}{2}\lambda \sum_{j=1}^{T}w_{j}^2\]

Now, let’s Taylor expand the loss around \(\hat{y} = 0\) to second order, where \(\hat{y}\) denotes the predictions of \(h(x)\):

\[ L = \sum_{i=0}^{n} \left[ loss(y_{res}, \hat{y}=0) + \left.\frac{\partial loss}{\partial \hat{y}}\right|_{\hat{y}=0} h(x) + \frac{1}{2}\left.\frac{\partial^2 loss}{\partial \hat{y}^2}\right|_{\hat{y}=0} h(x)^2 \right] + \frac{1}{2}\lambda \sum_{j=1}^{T}w_{j}^2\]

Since \(y_{res} \) is a fixed target, \(loss(y_{res},\hat{y}=0)\) is constant. Our goal is to minimize L, so this term won’t have any effect on that. Let’s remove it for simplification.

\[ L = \sum_{i=0}^{n} \left[ \left.\frac{\partial loss}{\partial \hat{y}}\right|_{\hat{y}=0} h(x) + \frac{1}{2}\left.\frac{\partial^2 loss}{\partial \hat{y}^2}\right|_{\hat{y}=0} h(x)^2 \right] + \frac{1}{2}\lambda \sum_{j=1}^{T}w_{j}^2\]
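
To make these terms concrete: with the squared-error loss used in the main post (written here with a \(\frac{1}{2}\) factor, purely to keep the constants simple), the two derivatives evaluated at \(\hat{y}=0\) are

\[ loss = \frac{1}{2}(y_{res} - \hat{y})^2, \qquad \left.\frac{\partial loss}{\partial \hat{y}}\right|_{\hat{y}=0} = -y_{res}, \qquad \left.\frac{\partial^2 loss}{\partial \hat{y}^2}\right|_{\hat{y}=0} = 1 \]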

Since h(x) is a decision tree, it can be expressed relatively simply: each instance falling into a leaf is assigned that leaf’s weight. If I define the leaves of my decision tree as the sets \( \{ I_{j} \}_{j=1}^{T}\), where \(I_j\) holds the indices of the instances that land in leaf \(j\), then each input instance \(x_{i}\) with \(i \in I_{j}\) is assigned the weight \(w_j\), i.e. \(h(x_i) = w_j\).

This allows me to rewrite L in terms of \(I_j\) and \(w_j\):

\[ L = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_{j}} \left.\frac{\partial loss}{\partial \hat{y}}\right|_{\hat{y}=0}\right) w_{j} + \frac{1}{2}\left(\sum_{i \in I_{j}} \left.\frac{\partial^2 loss}{\partial \hat{y}^2}\right|_{\hat{y}=0} + \lambda\right) w_{j}^2 \right] \]

The above expression regroups the sum over instances into a sum over leaves, and folds the regularization term into the quadratic part. Handily, each \(w_j\) now sits in its own independent term, so I can set \(\frac{\partial L}{\partial w_j} = 0\) and solve for \(w_j\) to get my analytical solution (for each weight)

\[ w_{j} = -\frac{\sum_{i \in I_{j}} \left.\frac{\partial loss}{\partial \hat{y}}\right|_{\hat{y} = 0}}{\sum_{i \in I_{j}} \left.\frac{\partial^2 loss}{\partial \hat{y}^2}\right|_{\hat{y} = 0} + \lambda} \]
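
As a quick numerical sanity check (plain Python, not XGBoost code), here is the formula applied to a single leaf, assuming the squared-error loss above and made-up residuals:

```python
import numpy as np

# Made-up residual targets for the instances falling into one leaf I_j.
y_res = np.array([1.2, 0.8, 1.0])

# With loss = 1/2 * (y_res - y_hat)^2, evaluated at y_hat = 0:
g = -y_res                   # first derivatives (gradients)
h = np.ones_like(y_res)      # second derivatives (hessians)
lam = 1.0                    # L2 regularization strength lambda

# Optimal leaf weight from the formula above.
w_j = -g.sum() / (h.sum() + lam)
print(w_j)  # 0.75 -- a shrunken mean of the residuals in this leaf
```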

A.2 The rest of the XGBoost hyperparameters

gamma

When considering a new split, the split must reduce the loss by at least this much to be kept.
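
To make “reduce your loss by this much” precise: the XGBoost paper scores a candidate split of a leaf into children \(I_L\) and \(I_R\) with the gain below, where \(G\) and \(H\) denote the sums of first and second loss derivatives over a node’s instances (the same sums that appear in the weight formula above). The split is only kept if the gain is positive, i.e. if the loss reduction exceeds \(\gamma\):

\[ Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \]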

min_child_weight

This performs a similar function to gamma (in that it regularizes at the splitting step). It defines the “minimum sum of instance weight (hessian) needed in a child”. The hessian is the second derivative of the loss, so for our MSE example (writing the squared error with the conventional \(\frac{1}{2}\) factor), this would be

\[ \frac{\partial^2}{\partial \hat{y}^2}\left[\tfrac{1}{2}(y - \hat{y})^2\right] = 1\]

So effectively, for squared error this says ‘if my leaf would have fewer than min_child_weight instances, don’t create it’.
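
As a toy illustration (a sketch of the rule, not XGBoost’s internal code): a candidate child is only allowed if the summed hessians of its instances reach min_child_weight, which for squared error (hessian of 1 per instance) reduces to a minimum leaf size.

```python
# Toy sketch of the min_child_weight rule; not XGBoost internals.
# For squared error each instance contributes a hessian of 1, so the
# sum of hessians in a candidate child is simply its instance count.
def child_allowed(hessians_in_child, min_child_weight):
    return sum(hessians_in_child) >= min_child_weight

print(child_allowed([1.0, 1.0, 1.0], min_child_weight=5))  # False: only 3 instances
```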

max_delta_step

In the case of very unbalanced classes, if you are using a loss with a sigmoid function, \( \left.\frac{\partial^2 loss}{\partial \hat{y}^2}\right|_{\hat{y} = 0}\) can become close to 0. This is a problem, because it’s in the denominator of the weight formula, so you could get weight values close to infinity (which even a learning rate will do little to control). This parameter caps the absolute value the leaf weights can take.
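
My rough mental model of this (a sketch of the idea, not XGBoost’s exact implementation) is a cap applied to the leaf weight computed in A.1; the helper below is hypothetical:

```python
# Hypothetical helper: cap the A.1 leaf weight at +/- max_delta_step.
# In XGBoost, max_delta_step = 0 means "no constraint".
def capped_leaf_weight(sum_grad, sum_hess, lam, max_delta_step):
    w = -sum_grad / (sum_hess + lam)          # optimal weight from A.1
    if max_delta_step > 0:
        w = max(-max_delta_step, min(w, max_delta_step))
    return w
```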

subsample

This trains each tree h on a random fraction of the training rows. According to the XGBoost paper, an even more effective alternative is column subsampling:

colsample_bytree

This subsamples a certain fraction of features (columns) every time a new tree h is trained.

colsample_bylevel

This subsamples a certain fraction of features at every new depth level reached within a tree (drawn from the columns already selected for that tree).
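
To tie A.2 together, here is a minimal, hypothetical configuration using the xgboost scikit-learn wrapper; the parameter names are the real ones discussed above, but the values are purely illustrative, not recommendations:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    reg_lambda=1.0,         # the lambda from the weight formula in A.1
    gamma=0.1,              # minimum loss reduction required to keep a split
    min_child_weight=5,     # minimum sum of instance hessians in a child
    max_delta_step=0,       # 0 = no cap on leaf weights
    subsample=0.8,          # row subsampling per tree
    colsample_bytree=0.8,   # feature subsampling per tree
    colsample_bylevel=0.8,  # feature subsampling per depth level
)
```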