Using XGBoost with Tidymodels

Andy Merlino and Nick Merlino

2020/05/19

Background

XGBoost is a machine learning library originally written in C++ and ported to R in the xgboost R package. Over the last several years, XGBoost’s effectiveness in Kaggle competitions has catapulted it to popularity. At Tychobra, XGBoost is our go-to machine learning library.

François Chollet and JJ Allaire summarize the value of XGBoost in the intro to “Deep Learning with R”:

In 2016 and 2017, Kaggle was dominated by two approaches: gradient boosting machines and deep learning. Specifically, gradient boosting is used for problems where structured data is available, whereas deep learning is used for perceptual problems such as image classification. Practitioners of the former almost always use the excellent XGBoost library.

These are the two techniques you should be the most familiar with in order to be successful in applied machine learning today: gradient boosting machines, for shallow-learning problems; and deep learning, for perceptual problems. In technical terms, this means you’ll need to be familiar with XGBoost and Keras—the two libraries that currently dominate Kaggle competitions.

At Tychobra, we have trained XGBoost models using the caret R package created by Max Kuhn. caret has treated us very well over the years (check out our post Machine Learning for Insurance Claims for an example of using xgboost with caret).

Max Kuhn and others at RStudio have more recently turned their attention from caret to “tidymodels” (the successor to caret). “tidymodels” is a collection of R packages that work together to simplify and supercharge model training and tuning. With the recent launch of tidymodels.org, we felt it was time to give the tidymodels R packages a shot.

Overview

In this post we will train and tune an XGBoost model using the tidymodels R packages. We use the AmesHousing dataset, which contains housing data from Ames, Iowa. Our model will predict house sale price.

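# load the packages used below: AmesHousing provides make_ames(), dplyr provides %>%
library(AmesHousing)
library(dplyr)
# set the random seed so we can reproduce any simulated results.
set.seed(1234)
# load the housing data and clean names
ames_data <- make_ames() %>%
  janitor::clean_names()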

Step 0: EDA (Exploratory Data Analysis)

At this point we would normally make a few simple plots and summaries of the data to get a high-level understanding of it. For simplicity, we are going to cut the EDA process from this post, but, in a real-world analysis, understanding the business problems and doing effective EDA are often the most time-consuming and crucial aspects of the analysis.

Step 1: Initial Data Split

Now we split the data into training and test data. Training data is used for the model training and hyperparameter tuning. Once trained, the model can be evaluated against test data to assess accuracy.
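As a minimal sketch of this split, the rsample package provides initial_split(), training(), and testing(). The 80/20 proportion, the stratification on sale_price, and the names ames_split, train_data, and test_data are our assumptions for illustration, not necessarily the values used for the results in this post.

```r
library(AmesHousing)
library(janitor)
library(dplyr)
library(rsample)

set.seed(1234)

# Load the housing data and clean the column names
ames_data <- make_ames() %>%
  clean_names()

# Assumption: hold out 20% of the rows as test data, stratified on the outcome
ames_split <- initial_split(ames_data, prop = 0.8, strata = sale_price)

# Extract the two data frames from the split object
train_data <- training(ames_split)
test_data <- testing(ames_split)

# Row counts for each portion
c(train = nrow(train_data), test = nrow(test_data))
```

Stratifying on the outcome keeps the distribution of sale prices similar in both portions, which is helpful when the outcome is skewed.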

Step 2: Preprocessing

Preprocessing alters the data to make our model more predictive and the training process less compute-intensive. Many models require careful and extensive variable preprocessing to produce accurate predictions. XGBoost, however, is robust against highly skewed and/or correlated data, so the amount of preprocessing required with XGBoost is minimal. Nevertheless, we can still benefit from some preprocessing.

In tidymodels, we use the recipes package to define these preprocessing steps, in what is called a “recipe”.
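A recipe along these lines might look like the sketch below. The 1% threshold for step_other() comes from this post; the step_nzv() step, the 80/20 split, and the object names are assumptions we add for illustration.

```r
library(AmesHousing)
library(janitor)
library(dplyr)
library(rsample)
library(recipes)

set.seed(1234)

ames_data <- make_ames() %>%
  clean_names()

# Assumption: an 80/20 split stratified on the outcome
ames_split <- initial_split(ames_data, prop = 0.8, strata = sale_price)

preprocessing_recipe <- recipe(sale_price ~ ., data = training(ames_split)) %>%
  # Lump factor levels seen in fewer than 1% of rows into an "other" level
  step_other(all_nominal(), threshold = 0.01) %>%
  # Assumption: also drop near-zero-variance predictors
  step_nzv(all_predictors()) %>%
  # Estimate the steps on the training data
  prep()

# Inspect the preprocessed training data
train_preview <- bake(preprocessing_recipe, new_data = NULL)
levels(train_preview$neighborhood)
```

Calling prep() estimates each step from the training data only, so the same lumping can later be applied to new data without peeking at it.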

As you can see in the chart below, for the “neighborhood” variable, several of the factor levels with the fewest observations (fewer than 1% of the total number of observations) have been lumped into an “other” factor level. This lumping is done by step_other() in the above recipe.

Step 3: Splitting for Cross Validation

We apply our previously defined preprocessing recipe with bake(). Then we use cross-validation to randomly split the training data into further training and test sets. We will use these additional cross-validation folds to tune our hyperparameters in a later step.
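The sketch below shows one way to do this with recipes::bake() and rsample::vfold_cv(). The choice of 5 folds, the simplified one-step recipe, the 80/20 split, and the object names are assumptions for illustration.

```r
library(AmesHousing)
library(janitor)
library(dplyr)
library(rsample)
library(recipes)

set.seed(1234)

ames_data <- make_ames() %>%
  clean_names()

# Assumption: an 80/20 split stratified on the outcome
ames_split <- initial_split(ames_data, prop = 0.8, strata = sale_price)

# Assumption: a simplified recipe that only lumps rare factor levels
preprocessing_recipe <- recipe(sale_price ~ ., data = training(ames_split)) %>%
  step_other(all_nominal(), threshold = 0.01) %>%
  prep()

# Apply the prepared recipe to the training data
train_processed <- bake(preprocessing_recipe, new_data = training(ames_split))

# Assumption: 5 folds; each fold holds out roughly one fifth of the rows
ames_cv_folds <- vfold_cv(train_processed, v = 5)

ames_cv_folds
```

Each row of ames_cv_folds holds one split; rsample::analysis() and rsample::assessment() return the fitting and held-out portions of a given split.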

Step 5: Grid Specification

Next we set up the grid space. The dials::grid_* functions support several methods for defining the grid space. We are using the dials::grid_max_entropy() function, which covers the hyperparameter space such that any portion of the space has an observed combination that is not too far from it.
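A small sketch of such a grid follows. The four hyperparameters and the grid size of 60 are assumptions chosen for illustration. Note that recent dials releases mark grid_max_entropy() as deprecated in favor of grid_space_filling(), so depending on your installed version you may see a deprecation warning.

```r
library(dials)

set.seed(1234)

# Assumption: tune these four hyperparameters over their default dials ranges
xgboost_grid <- grid_max_entropy(
  min_n(),
  tree_depth(),
  learn_rate(),
  loss_reduction(),
  size = 60  # assumption: number of candidate combinations to generate
)

# One row per candidate combination, one column per hyperparameter
head(xgboost_grid)
```

A space-filling design like this usually explores the space with far fewer candidates than a regular grid, which would need many more rows to reach the same coverage across four dimensions.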

The above metrics on the test data are significantly worse than our training data metrics, so we know that there is some overfitting going on in our model. This highlights the importance of using test data, rather than training data, to evaluate model performance.
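To make the train-versus-test comparison concrete, here is a hedged sketch using yardstick::metrics(). It fits an untuned stand-in model with parsnip rather than the tuned model from this post, so the numbers it produces will differ from ours; the helper score_data() and the other object names are hypothetical.

```r
library(AmesHousing)
library(janitor)
library(dplyr)
library(rsample)
library(parsnip)
library(yardstick)

set.seed(1234)

ames_data <- make_ames() %>%
  clean_names()

# Assumption: an 80/20 split stratified on the outcome
ames_split <- initial_split(ames_data, prop = 0.8, strata = sale_price)
train_data <- training(ames_split)
test_data <- testing(ames_split)

# Assumption: an untuned stand-in model, not the tuned model from the post
xgboost_fit <- boost_tree(mode = "regression", trees = 100) %>%
  set_engine("xgboost") %>%
  fit(sale_price ~ ., data = train_data)

# Hypothetical helper: predict on a data frame and compute rmse, rsq, and mae
score_data <- function(data) {
  data %>%
    bind_cols(predict(xgboost_fit, new_data = data)) %>%
    metrics(truth = sale_price, estimate = .pred)
}

# Stack the two sets of metrics for a side-by-side comparison
metric_comparison <- bind_rows(
  train = score_data(train_data),
  test = score_data(test_data),
  .id = "dataset"
)

metric_comparison
```

A large gap between the train and test rows for the same metric is the overfitting signal described above.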

To quickly check that there is not an obvious issue with our model’s predictions, let’s plot the test data residuals.
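A residual plot of this kind could be sketched as follows. As an assumption, we express each residual as a percentage of the predicted price and use an untuned stand-in model, so the resulting chart will not match ours exactly; all object names are illustrative.

```r
library(AmesHousing)
library(janitor)
library(dplyr)
library(rsample)
library(parsnip)
library(ggplot2)

set.seed(1234)

ames_data <- make_ames() %>%
  clean_names()

# Assumption: an 80/20 split stratified on the outcome
ames_split <- initial_split(ames_data, prop = 0.8, strata = sale_price)
train_data <- training(ames_split)
test_data <- testing(ames_split)

# Assumption: an untuned stand-in model, not the tuned model from the post
xgboost_fit <- boost_tree(mode = "regression", trees = 100) %>%
  set_engine("xgboost") %>%
  fit(sale_price ~ ., data = train_data)

# Residual as a fraction of the predicted sale price
test_residuals <- test_data %>%
  bind_cols(predict(xgboost_fit, new_data = test_data)) %>%
  mutate(residual_pct = (sale_price - .pred) / .pred)

# Residuals against predictions; a visible trend would suggest systematic error
residual_plot <- ggplot(test_residuals, aes(x = .pred, y = residual_pct)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Predicted sale price", y = "Residual (%)")

residual_plot
```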

The above chart does not show any super obvious trends in the residuals. This indicates that, at a very high level, our model is not systematically making inaccurate predictions for houses with certain predicted sale prices. We would do more model validation here for a real-world analysis, but, for the sake of this post, the above chart is good enough for us.

Conclusion

In this post, we were not overly concerned with our model’s performance. Our goal was to simply work through the process of training an XGBoost model using tidymodels, and to learn the tidymodels basics along the way.

Tidymodels gives us a standard process and vocabulary to handle resampling (rsample), data preprocessing (recipes), model specification (parsnip), tuning (tune), and model validation (yardstick). The work done by the tidymodels team to “tidy” the machine learning process is a step-change improvement in the approachability of machine learning in R; with the tidymodels packages, it is easier than ever to train models and (more importantly) to understand the training process. Thank you, tidymodels team!

We are still just getting started with tidymodels, so please let us know if you see errors or have suggestions for improvements!