In the previous post, we learned about tree based learning methods - basics of tree based models and the use of bagging to reduce variance. We also looked at one of the most famous learning algorithms based on the idea of bagging- random forests. In this post, we will look into the details of yet another type of tree-based learning algorithms: boosted trees.

Boosting

Boosting, similar to Bagging, is a general class of learning algorithm where a set of weak learners are combined to get strong learners.

For classification problems, a weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

Recall that bagging involves creating multiple copies of the original training data set via bootstrapping, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees.

In contrast, in Boosting, the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

Unlike fitting a single large decision
tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, a decision tree is fit to the residuals from the model. That is, a tree is fit using the current residuals, rather than the outcome as the response.

The above concept of additively fitting weak learners to get a strong learner is not limited to tree based methods. In fact, we can choose these weak learners to be any model of our choice. In practice, however, decision trees or sometimes linear models are the most common choices.
In this post, we will be focusing mainly on boosted trees - i.e. our choice of weak learners would be decision trees. Boosted trees are one of the most used models in on-line machine learning competitions like
Kaggle.

Let us consider the case of boosted trees more rigorously. In particular, we will look at the case of boosted decision trees for regression problems. To begin, set the resultant model $\hat{F}(x) = 0$ and residual of the training response as the training target $y$.
Now Sequentially, for steps $b=1,2\ldots,B$ update $\hat{F}(x)$ as follows: $\hat{F}(x)\leftarrow \hat{F}(x) + \lambda_b \hat{f}^{b}(x)$, where $\hat{f}^{b}(x)$ is a decision tree learner fit on $(X, r)$ with $d$ splits (or $d+1$ terminal nodes) and $\lambda_b$ is a weighting parameter. Also update residuals, $r$ as,
$r \leftarrow r - \lambda_b \hat{f}^{b}(x)$. At the end of $B$ steps, we will get the final model as $\hat{F}(x) = \sum_{b=1}^{B}\lambda_b\hat{f}^{b}(x)$.

The above algorithm of boosted is quite generic. One can come with several strategies for how the decision trees are grown and fit on residuals, and for the choice of the weighting parameter $\lambda$. Therefore, there are various boosting methods. Some common examples are:

In this post, we will focus specifically on two of the most common boosting algorithms, AdaBoost and Gradient Boosting.

Adaptive Boosting (AdaBoost)

Adaptive Boosting or AdaBoost refers to a particular formulation of boosting where the idea of residual learning is implemented through the concept of sample weights. In particular, at any iteration, $b$, all training samples have a weight $w_i^b$ $\forall i = 1, \ldots N$, which is equal to the current error $E(\hat{f}(x_i))$. The decision tree $\hat{f}^{b}(x)$ is fit to minimize the error $E_b = \sum_{i=1}^{N}{E\big(\hat{F}(x) + \lambda_b \hat{f}^{b}(x)\big)}$. AdaBoost uses an exponential error/loss function: $E(\hat{F}(x_i)) = e^{-y_i \hat{F}(x_i)}$. The choice of the error function, $E()$, also determines the way weights $w_i^b$ and the weighting parameter $\lambda_b$ are updated at different steps.
A neat mathematical derivation of these quantities for the case of binary classification can be found at the Wiki article.

In particular case of the classification problems, we can also think of AdaBoost to be an algorithm where at any step, the misclassified samples get a higher weight than the correctly classified samples. The weights attached to samples are used to inform the training of the weak learner, in this case, decision trees to be grown in a way that favor splitting the sets of samples with high weights.

In practice, the above definition of the AdaBoost model is modified to include a learning rate term, $\gamma$, as a regularization term that shrinks the contribution of all of the models. In particular, the model is updated as, $\hat{F}(x) \leftarrow \hat{F}(x) + \gamma \lambda_b \hat{f}^{b}(x)$. The main tuning parameters of AdaBosst are, the learning rate $\gamma$, number of estimators $B$, and decision tree related parameters like depth of trees $d$, and number of samples per split etc.

AdaBoost in Python

The python scikit-learn library implementations of AdaBoost (AdaBoostClassifier and AdaBoostRegressor) are based on a modified version of the AdaBoost algorithms, AdaBoost SAMME and SAMME.R. Stage-wise Additive Modeling using a Multi-class Exponential loss function (SAMME) algorithm provides an extension of AdaBoost for the case of multi-class classification. The SAMME.R (R for real) variation of the algorithm is for prediction of weighted probabilities rather than the class itself.

AdaBoost Classifier in Python

Recall the US income data that we used in the previous based post on tree based methods. In summary, in this dataset, we are required to predict the income range of adults (<=50K or >50K) based on following features: Race, Sex, Education, Work Class, Capital Loss, Capital Gain, Relationship, Marital Status, Age Group, Occupation and Hours of Work per Week. We have already seen that, with the best decision tree model, we were able to achieve a prediction accuracy of 85.9%. With the use of random forest models, we were able to increase the accuracy to 86.6%.

Once the number of hyper-parameters and the range of those parameter in models becomes large, the time required to find the best parameters becomes exponentially large using simple grid search.
Instead of trying out every possible combination of parameters, we can evolve only the combinations that give the best results using Genetic Algorithms. A scikit-learn friendly implementation of this can be found in the sklearn-deap library. In many of the examples below we will be using this library to search for best model parameters.

The genetic algorithm itself has three main parameters: population size, tournament size and no. of generations. Typically, for the problem of parameter search, we can use a population of 10-15 per hyper-parameter. The tournament size parameter affects the diversity of population being considered in future generation. This roughly translates into global vs local optimization for parameter search. In other words, if the tournament size is too large, we will have higher chance of getting a solution that is from local minima. A typical value of 0.1 times the population works well for the exercise of finding optimal model parameters. The number of generation decides the convergence of genetic algorithms. A larger value leads to better convergence but requires larger computation time.

We see a significant improvement in our results with an accuracy of 87.06% on the testing data!

Given our data is highly imbalanced, let us look at the confusion matrix of our model on the test data. Note that we are using the confusion_matrix() method from the previous post on tree based methods.

We find that our model has a much better predictive power (94.1%) for the dominant class (<=50K), while it has prediction rate of only 64.2% for the other class.

Gradient Boosting

The approach used in the case of AdaBoost can be also viewed as an exponential loss minimization approach. Let us look at this mathematically for the case of a generic loss function, $L\big(y,\hat{F}(x)\big)$. The goal of the method is to find an $\hat{F}(x)$ that minimizes the average value of the loss function $L\big(y,\hat{F}(x)\big)$ on the training set. This can be achieved by starting with a constant function $\hat{F_0}(x)$, and incrementally expanding it in a greedy fashion as follows:

where $f(x) \in \mathcal{H}$ refers to base learners, in this case tree models. The problem with this set up is that, it is computationally infeasible to choose optimal $f(x)$ at every step for an arbitrary loss function $L\big(y,\hat{F}(x)\big)$.

However, this can be simplified using a steepest descent approach. In this approach, at any step the decision tree model is trained on the pseudo-residuals, rather than residuals. The approach can be described in the following algorithm:

In most real implementations of gradient boosted trees, rather than an individual tree weighing parameter $\gamma_b$, different parameters are used at different splits, $\gamma_{jb}$. If you recall from the last post, a decision tree model corresponds to diving the feature space in multiple rectangular regions and hence it can be represented as,

$$\hat{f^b}(x) = \sum_{j=1}^{J} k_{jb} I\big(x\in R_{jb}\big)$$

where, $J$ is the number of terminal nodes (leaves), $I\big(x\in R_{jb}\big)$ is an indicator function which is 1 if $x\in R_{jb}$ and $k_{jb}$ is the prediction in $j^{th}$ leaf.
Now, we can replace $\hat{f^b}(x)$ in above algorithm and replace $\gamma_b$ for the whole tree by $\gamma_{jb}$ per terminal node (leaf).

where $\gamma_{jb}$ is given by the following line search,
$\gamma_{jb} = \mathop{\arg\min}\limits_{\gamma} \sum_{x_i \in R_{jb}} L\big(y_i, \hat{F}_{b-1}(x) + \gamma \big)$

Here $J$ refers to the number of terminal nodes (leaves) in any of constituent decision trees. A value of $J_b =2$, i.e. decision stumps means no interactions among feature variables are considered. With a value of $J_b=3$ the model may include effects of the interaction between up to two variables, and so on. Typically
a value of $4 \le J_b \le 8$ work well for boosting.

Regularization of Gradient Boosted Trees

Gradient Boosted Trees can be regularized by multiple approaches. Some common approaches are:

Shrinkage / Learning Rate: For each gradient step, the step magnitude is multiplied by a factor between 0 and 1 called a learning rate ($0 <\nu < 1$). In other words, each gradient step is shrunken by some factor $\nu$. The rational for this to work as a regularization parameter has never been clear to me. My personal take is the shrinkage enables us to use a different prior. Telgarsky et al. provide a mathematical proof that shrinkage makes gradient boosting to produce an approximate maximum margin classifier, i.e. a classifier which is able to maximize the associated distance from the decision boundary for each example.

Sub-Sampling: Motivated by the bagging method, at each iteration of the algorithm, a decision tree is fit on a subsample of the training set drawn at random without replacement. Also, like in bagging, sub-sampling allows one to define an out-of-bag error of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.

Minimum sample size for splitting trees, and Minimum sample size for tree leaves: It is used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances. Imposing this limit helps to reduce variance in predictions at leaves.

The number of trees or Boosting Iterations, $B$ Increasing $B$ reduces the error on training set, but setting it too high may lead to over-fitting. An optimal value of $B$ is often selected by monitoring prediction error on a separate validation data set.

Sampling Features: We can apply the idea of randomly choosing small subsets of features for different trees, as in the case of Random Forest models.

scikit-learn Implementation

scikit-learn provides a simple and generic implementation of the above described algorithm that is valid of different types of loss functions. Below is a simple implementation for the case of income data.

Turns out we have got a quite good model just by chance! This has an accuracy of 87.11% on our test data!

We could try finding optimal parameters for this as before. However, in my experience this is a very academic implementation of boosted trees. Other implementation like XGBoost, LightGBM and CatBoost are more optimized implementations and hence we will focus our tuning for some of these libraries only.

XGBoost

The biggest drawback of the gradient boosting trees is that the algorithm is quite sequential in nature and hence are very slow and they can not take advantage of advanced computing features for parallelization like multiple threads and cores, GPU etc.

XGBoost is a python (C++ and R as well) library that provides an optimized implementation of gradient boosted trees. It uses various tricks like regularization of number of leaves and leaf weights, sparsity aware split finding, column block for parallel learning and cache-aware access. Using these tricks the implementation provides a parallelized efficient version of the algorithm. The details of these can be found here. XGBoost is one of the most famous machine learning libraries used in on-line machine learning competitions like Kaggle.

Additional Weak Learners: Apart from decision trees, XGBoost also supports linear models and DART (decision trees with dropout) as weak learners. In the DART algorithm, only a subset of available trees are considered in calculating the pseudo-residuals on which the new trees are fit.

XGBoost has many parameters that control the fitting of the model. Below are some of the relevant parameters and tuning them would be helpful in the most common cases. Please note that original XGBoost library parameters might have a different name than before, since I am using the scikit-learn API parameter names below.

max_depth (default=3) : Maximum depth of a tree, increase this value will make the model more complex / likely to be over-fitting. A value of 0 indicates no limit. A Typical Value in the range of 2-10 can be used for model optimization.

n_estimators (default=100) : The number of boosting steps to perform. This can also be seen as number of trees to be used in the boosted model. This number is inversely proportional to the learningrate (eta) parameter, i.e. if we use a smaller value of learning_rate, n_estimators has be made larger. _A Typical Value in the range of $\ge 100$ can be used for model optimization. However, it is best to used XGBoost built-in cv() method to find this parameter (See the example ahead).

min_child_weight (default=1) : The minimum sum of instance weight needed in a child node. If the tree partition step results in a leaf node with the sum of instance weight less than the minchild_weight, then any further partitioning will be stopped. The larger the value of this parameter, the more conservative the algorithm will be. _A Typical Value in the range of 1-10 can be used for model optimization.

learning_rate (default=0.1) : The step size shrinkage used to prevent over-fitting. After each boosting step, we can directly get the weights of new features. and the learningrate parameter (also referred as eta in regular XGBoost API) shrinks the feature weights to make the boosting process more conservative (See above formulation of Gradient Boosted Trees for mathematical details). _A Typical Value in the range of 0.001-0.3 can be used for model optimization.

gamma (default=0) : (Also referred as min_split_loss in regular XGBoost API) The minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be. The value of this parameter depends on the type of loss function being used. A Typical Value in the range of 0.0-0.7 can be used for model optimization.

reg_alpha (default=0) : (Also referred as alpha in regular XGBoost API) L1 regularization term on weights, increasing this value will make model more conservative (see XGBoost paper for mathematical details). A Typical Value in the range of 0-1 can be used for model optimization.

reg_lambda (default=1) : (Also referred as lambda in regular XGBoost API) L2 regularization term on weights, increasing this value will make model more conservative (see XGBoost paper for mathematical details). A Typical Value in the range of 0-1 can be used for model optimization.

subsample (default=1) : Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees. This parameter is used to prevent over-fitting. A Typical Value in the range of 0.5-1.0 can be used for model optimization.

colsample_bytree (default=1) : Subsample ratio of columns when constructing each tree. Similar to Random Forest models, models tend to be more generalizable, if a number between 0 and 1 is used. A Typical Value in the range of 0.5-1.0 can be used for model optimization.

Apart from above set of parameters, there are several parameters that should also be considered while tuning. Some examples of such parameters are: objective (objective function/ loss function to use, depends on problem, for eg. binary vs. multi-class classification), tree_method (The tree construction algorithm used in XGBoost(see description in the paper), random_state (seed for random number generator), n_jobs (number of threads to use to train the model) and many other parameters related to different types of tree methods used. This article can be used as a good general reference for tuning XGBoost models.

Let us tune XGBoost model for our problem of income prediction. A simple sklearn API implementation can be used as below.

With this reasonable guess of parameters based on previous models, we already see an accuracy of 86.75%.

Let us try to find the optimal parameters for the XGBoost model. If we simply try to do a brute force grid search, it can be computationally very expensive and unreasonable on a desktop. Here is a sample parameters list that can give us an idea of what such a grid search could look like.

If we try to do a grid search of this with 5-fold cross validation, it will involve a whopping 0.81 million model training calls! And, even this will not be enough, as we will need additional model training steps to fine-tune our parameter search for finer and/or different range of parameters. Clearly, we need a different approach to solve this.

I will take an approach of optimizing different set of parameters in batches. To begin, we will choose a fixed learning rate of 0.1, and n_estimators=200. We will try to find only tree related parameters (i.e. max_depth, gamma, subsample and colsample_bytree) using grid search or genetic algorithm.

We get test accuracy of 86.99%. Now, this seems odd. After we did all this computation and we still have got a test accuracy that is smaller than an optimized version of scikit-learn's Gradient Boosting Tree implementation. However, if you have paid attention to the metric, you can notice - for cross validation, I started using 'auc' as metric, instead of 'accuracy'. This will give us better accuracy for the less abundant class (> 50K salary) but at the cost of slight decrease in the accuracy of the more abundant class (<= 50K salary). We can verify this by the following confusion matrix plot.

LightGBM

Similar to XGBoost, LightGBM is another optimized implementation of Gradient Boosting developed by Microsoft (similar to XGBoost available in Python, C++ and R). The main difference between LightGBM and other Gradient boosted trees (like XGBoost) implementations is in the way trees are grown. The details of this can be found on their features page. Briefly, LightGBM splits the tree leaf-wise with the best fit whereas other boosting algorithms split the tree depth-wise or level-wise. The two approaches can be best visualized in the following illustrations:

Level-wise Splits:

Leaf-wise Splits:

Leaf-wise splits lead to increase in complexity and may lead to over-fitting, and hence extra caution needs to be taken in tuning. Some of the biggest advantages of LightGBM over XGBoost is in terms of extremely fast training speed, lower memory usage, compatibility with large datasets and highly parallel computational support using threads, MPI and GPUs. LightGBM also has inbuilt support for categorical variables, unlike XGBoost, where one has to pre-process the data to convert all of the categorical features to integer ones using one-hot encoding or label encoding.

Since the trees are grown differently in LightGBM, its tuning procedure is quite different than XGBoost. Note that the latest version of XGBoost also provides a tree building strategy (depth-wise) which is quite similar to LightGBM.

All the parameters described above for XGBoost are also valid for LightGBM library. However, some parameters can have different names.
Given the strategy of growing trees is different in LightGBM, an additional parameter needs to be tuned as well.

num_leaves: Maximum tree leaves for base learners. Note that since trees are grown depth-first, this parameters is independent
of the max_depth parameter and has to be tuned independently. A typical value for starting should be much less than 2max
_depth.

When using the scikit-learn API of LightGBM, one should keep in mind that some of the parameter names are not standard ones (even though described in the API reference). In particular, I found that seed and nthreads as parameters, instead of random_state and n_jpbs, respectively.

Given this library also has many parameters, similar to XGBoost, we need to use a similar strategy of tuning in stages.

First we will fix learning rate to a reasonable value of 0.1 and number of estimators = 200, and tune only the major tree building parameters: max_depth, subsample, colsample_bytree and num_leaves. We will use the genetic algorithm to search for optimal values of these parameters.

Similar to XGBoost, LightGBM also provides a cv() method that can be used to find optimal value of number of estimators using early stopping strategy. Another strategy would be to search for this parameter as well. In the following, I want to use this grid search strategy to find best value of number of estimators, just to show how tedious this can be!

Concluding Remarks

XGBoost has been one of the most famous libraries used to win several machine learning competitions on Kaggle and similar sites. Slowly, LightGBM is also gaining traction due to its speed and parallelization capabilities. CatBoost is another boosting library from Yandex that has been shown to be quite efficient. I have used it personally yet though. If I find it to be worth making a move to, I will write about it in a future post.