Extreme gradient boosting

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning and has recently been applied very successfully in Kaggle competitions. Because I’ve heard XGBoost’s praise being sung everywhere lately, I wanted to get my feet wet with it too. So this week I am comparing the prediction success of extreme gradient boosting with last week’s gradient boosting models on the same dataset. Additionally, I want to test the influence of different preprocessing methods on the outcome.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” (Tianqi Chen, developer of xgboost)

XGBoost is a tree ensemble model, meaning that its predictions are the sum of the predictions from a set of classification and regression trees (CART). In that respect, XGBoost is similar to Random Forests, but it uses a different approach to model training.
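In standard boosting notation (my addition, not from the original post), the ensemble prediction for a case with features x_i is the sum of the outputs of K trees:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

where \mathcal{F} is the space of possible CARTs. Unlike Random Forests, which grow their trees independently and average them, boosting adds trees sequentially, each one fitted to correct the errors of the current ensemble.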

Starting with the same test and training data (partitioned into validation test and validation train subsets) from last week’s post, I am training extreme gradient boosting models as implemented in the xgboost and caret packages with different preprocessing settings.
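For reference, such a split can be done with caret’s createDataPartition(); a minimal sketch, assuming a labeled data frame train_data with an outcome column (the object names here are placeholders, not necessarily the ones from last week’s post):

library(caret)

set.seed(27)
# split the labeled training data 70/30 into validation train and validation test subsets
index <- createDataPartition(train_data$outcome, p = 0.7, list = FALSE)
val_train_data <- train_data[index, ]
val_test_data  <- train_data[-index, ]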

Out of the different implementations and variations of gradient boosting algorithms, caret trained on PCA-preprocessed data performed best on the validation set. These parameters were then used to predict the outcome in the test set and to compare it with last week’s predictions.
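A sketch of what training xgboost through caret with PCA preprocessing can look like (the tuning and cross-validation setup below is an assumption, not necessarily the exact configuration used here):

library(caret)

set.seed(27)
model_xgb_pca <- train(outcome ~ .,
                       data = val_train_data,
                       method = "xgbTree",   # xgboost via caret
                       preProcess = "pca",   # apply PCA to the predictors
                       trControl = trainControl(method = "repeatedcv",
                                                number = 10,
                                                repeats = 5))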

Compared to last week, there is much less uncertainty in the predictions from XGBoost. Overall, I would say that this algorithm is superior to the others I have used before.

xgboost

Extreme gradient boosting is implemented in the xgboost package.

# install the stable/pre-compiled version from CRAN
install.packages('xgboost')

# or install from weekly updated drat repo
install.packages("drat", repos = "https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("xgboost", repos = "http://dmlc.ml/drat/", type = "source")

XGBoost supports only numeric input, so the outcome classes have to be converted to integers and both training and test data have to be in numeric matrix format.
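A minimal sketch of this conversion, assuming a data frame val_train_data with a factor column outcome and otherwise numeric feature columns (names are placeholders):

library(xgboost)

# convert the factor outcome to 0/1 integers, as required by binary:logistic
train_label <- as.integer(val_train_data$outcome) - 1

# the features have to be in a numeric matrix
train_features <- as.matrix(val_train_data[, names(val_train_data) != "outcome"])

# xgb.DMatrix is xgboost's optimized internal data structure
xgb_train_matrix <- xgb.DMatrix(data = train_features, label = train_label)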

Training with gbtree

gbtree is the default booster for xgb.train.
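The watchlist argument used below tells xgb.train to print the evaluation metric for each listed DMatrix after every boosting round; a sketch, assuming a test matrix xgb_test_matrix built the same way as the training matrix:

# evaluation is reported on both matrices after every boosting round
watchlist <- list(train = xgb_train_matrix, test = xgb_test_matrix)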

bst_1 <- xgb.train(data = xgb_train_matrix,
                   label = getinfo(xgb_train_matrix, "label"),
                   max.depth = 2,
                   eta = 1,
                   nthread = 4,
                   nround = 50,  # number of trees used for model building
                   watchlist = watchlist,
                   objective = "binary:logistic")

Training with gblinear

The gblinear booster fits a regularized linear model instead of trees in each boosting round.

bst_2 <- xgb.train(data = xgb_train_matrix,
                   booster = "gblinear",
                   label = getinfo(xgb_train_matrix, "label"),
                   max.depth = 2,
                   eta = 1,
                   nthread = 4,
                   nround = 50,  # number of trees used for model building
                   watchlist = watchlist,
                   objective = "binary:logistic")

For the comparison plot below, I am reshaping the combined prediction results into long format:

library(tidyr)
library(plyr)   # for mapvalues(); load before dplyr to avoid masking issues
library(dplyr)  # for the pipe operator %>%

# gather the two date columns into long format
results_combined_gather <- results_combined %>%
  gather(group_dates, date, date.of.onset:date.of.hospitalisation)

results_combined_gather$group_dates <- factor(results_combined_gather$group_dates,
                                              levels = c("date.of.onset", "date.of.hospitalisation"))
results_combined_gather$group_dates <- mapvalues(results_combined_gather$group_dates,
                                                 from = c("date.of.onset", "date.of.hospitalisation"),
                                                 to = c("Date of onset", "Date of hospitalisation"))

# recode gender and make missing values explicit
results_combined_gather$gender <- mapvalues(results_combined_gather$gender,
                                            from = c("f", "m"),
                                            to = c("Female", "Male"))
levels(results_combined_gather$gender) <- c(levels(results_combined_gather$gender), "unknown")
results_combined_gather$gender[is.na(results_combined_gather$gender)] <- "unknown"

# gather the two prediction columns into long format
results_combined_gather <- results_combined_gather %>%
  gather(group_pred, prediction, predicted_outcome_xgboost:predicted_outcome_last_week)
results_combined_gather$group_pred <- mapvalues(results_combined_gather$group_pred,
                                                from = c("predicted_outcome_xgboost", "predicted_outcome_last_week"),
                                                to = c("Predicted outcome from XGBoost", "Predicted outcome from last week"))

ggplot(data = subset(results_combined_gather, group_dates == "Date of onset"),
       aes(x = date, y = as.numeric(age), fill = prediction)) +
  stat_density2d(aes(alpha = ..level..), geom = "polygon") +
  geom_jitter(aes(color = prediction, shape = gender), size = 2) +
  geom_rug(aes(color = prediction)) +
  labs(fill = "Predicted outcome",
       color = "Predicted outcome",
       alpha = "Level",
       shape = "Gender",
       x = "Date of onset in 2013",
       y = "Age",
       title = "2013 Influenza A H7N9 cases in China",
       subtitle = "Predicted outcome of cases with unknown outcome",
       caption = "") +
  facet_grid(group_pred ~ province) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1")

There is much less uncertainty in the XGBoost predictions, even though I used slightly different methods for classifying uncertainty: in last week’s analysis, I based uncertainty on the ratio of combined prediction values from all analyses; this week, uncertainty is based on the prediction value from a single analysis.