How to Configure Gradient Boosting Machines

The v-M trade-off is clearly evident; smaller values of v give rise to larger optimal M-values. They also provide higher accuracy, with a diminishing return for v < 0.125. The misclassification error rate is very flat for M > 200, so that optimal M-values for it are unstable. … the qualitative nature of these results is fairly universal.

He suggests to first set a large value for the number of trees, then tune the shrinkage parameter to achieve the best results. Studies in the paper preferred a shrinkage value of 0.1, a number of trees in the range 100 to 500 and the number of terminal nodes in a tree between 2 and 8.

The “shrinkage” parameter 0 < v < 1 controls the learning rate of the procedure. Empirically …, it was found that small values (v <= 0.1) lead to much better generalization error.

In the paper, Friedman introduces and empirically investigates stochastic gradient boosting (row-based sub-sampling). He finds that almost all subsampling percentages are better than so-called deterministic boosting and perhaps 30%-to-50% is a good value to choose on some problems and 50%-to-80% on others.

… the best value of the sampling fraction … is approximately 40% (f=0.4) … However, sampling only 30% or even 20% of the data at each iteration gives considerable improvement over no sampling at all, with a corresponding computational speed-up by factors of 3 and 5 respectively.

He also studied the effect of the number of terminal nodes in trees finding values like 3 and 6 better than larger values like 11, 21 and 41.

In both cases the optimal tree size as averaged over 100 targets is L = 6. Increasing the capacity of the base learner by using larger trees degrades performance through “over-fitting”.

In his talk titled “Gradient Boosting Machine Learning” at H2O, Trevor Hastie made the comment that in general gradient boosting performs better than random forest, which in turn performs better than individual decision trees.

They comment that a good value the number of nodes in the tree (J) is about 6, with generally good values in the range of 4-to-8.

Although in many applications J = 2 will be insufficient, it is unlikely that J > 10 will be required. Experience so far indicates that 4 <= J <= 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range.

They suggest monitoring the performance on a validation dataset in order to calibrate the number of trees and to use an early stopping procedure once performance on the validation dataset begins to degrade.

As in Friedman’s first gradient boosting paper, they comment on the trade-off between the number of trees (M) and the learning rate (v) and recommend a small value for the learning rate < 0.1.

Smaller values of v lead to larger values of M for the same training risk, so that there is a tradeoff between them. … In fact, the best strategy appears to be to set v to be very small (v < 0.1) and then choose M by early stopping.

Also, as in Friedman’s stochastic gradient boosting paper, they recommend a subsampling percentage (n) without replacement with a value of about 50%.

A typical value for n can be 1/2, although for large N, n can be substantially smaller than 1/2.

Configuration of Gradient Boosting in R

The gradient boosting algorithm is implemented in R as the gbm package.

It is interesting to note that a smaller shrinkage factor is used and that stumps are the default. The small shrinkage is explained by Ridgeway next.

In the vignette for using the gbm package in R titled “Generalized Boosted Models: A guide to the gbm package“, Greg Ridgeway provides some usage heuristics. He suggest firs setting the learning rate (lambda) to as small as possible then tuning the number of trees (iterations or T) using cross validation.

In practice I set lambda to be as small as possible and then select T by cross-validation. Performance is best when lambda is as small as possible performance with decreasing marginal utility for smaller and smaller lambda.

He comments on his rationale for setting the default shrinkage to the small value of 0.001 rather than 0.1.

It is important to know that smaller values of shrinkage (almost) always give improved predictive performance. That is, setting shrinkage=0.001 will almost certainly result in a model with better out-of-sample predictive performance than setting shrinkage=0.01. … The model with shrinkage=0.001 will likely require ten times as many iterations as the model with shrinkage=0.01

Ridgeway also uses quite large numbers of trees (called iterations here), thousands rather than hundreds

I usually aim for 3,000 to 10,000 iterations with shrinkage rates between 0.01 and 0.001.

Configuration of Gradient Boosting in scikit-learn

It is useful to review the default configuration for the algorithm in this library.

There are many parameters, but below are a few key defaults.

learning_rate=0.1 (shrinkage).

n_estimators=100 (number of trees).

max_depth=3.

min_samples_split=2.

min_samples_leaf=1.

subsample=1.0.

It is interesting to note that the default shrinkage does match Friedman and that the tree depth is not set to stumps like the R package. A tree depth of 3 (if the created tree was symmetrical) will have 8 leaf nodes, matching the upper bound of the preferred number of terminal nodes in Friedman’s studies (alternately max_leaf_nodes can be set).

In the scikit-learn user guide under the section titled “Gradient Tree Boosting” the authors comment that setting the maximum leaf nodes has a similar effect to setting the max depth to one minus the maximum leaf nodes, but results in worse performance.

We found that max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to train at the expense of a slightly higher training error.

In a small study demonstrating regularization methods for gradient boosting titled “Gradient Boosting regularization“, the results show the benefit of using both shrinkage and sub-sampling.

If the system is overlearning, slow the learning down (using shrinkage?).

If the system is underlearning, speed the learning up to be more aggressive (using shrinkage?).

In Owen Zhang’s talk to the NYC Data Science Academy in 2015 titled “Winning Data Science Competitions“, he provides some general tips for configuring gradient boost with XGBoost. Owen is a heavy user of gradient boosting.

Hi Jason,
Hope you are well. Thanks for the amazing information on machine learning techniques. Currently I’m working on the GBM in R and I’m trying to figureout the best parameters for our GBM model. However, based on what I found and read and also based on your information about the GBM parameter tuning, the lower learning rate should result in a better AUC. However, I have checked this with the piece of code below:

gbm1 h2o.auc(gbm1)
[1] 0.8592122
> gbm2 h2o.auc(gbm2)
[1] 0.8628086
> gbm3 h2o.auc(gbm3)
[1] 0.8628086
************************************************************************************************************************************************************************************************************************************
The difference between gbm1 and gbm2 in their learn_rate. GBM 2 has bigger learn_rate than the first one but ends to a better result. It made me very confused about what I have read and I would appreciate if you could please help me to understand the problem.

I have a binary classification where positive class is 10 to 20% . Have 50 to 100k data but lot of columns 1000 to 2000 many of them 0( one hot vectors)

My problem is i am getting a very low recall on positive class which is the one i care more about. Tried default sklearn and did a few tweakssimilar to ones above.Any ideas on specific things to try other than undersampling?

Hi Jason,
Thanks for this crisp article. I have highly unbalanced data. what exact parameters/Attributes should is tune in GradientBoostingClassifier() and from what value i should start. also i am ok if my Recall value/False Positive number is higher. Please help me with this