Modeling with `parsnip` and `tidymodels`

Sep 1, 2018
10 min read

Before I enrolled in the Data Challenge Lab at Stanford, my approach to data science in R was almost entirely ad hoc. Beyond the very fundamentals of base R, I relied on a combination of Stack Overflow and code copied from textbooks and lecture slides to get anything done. Naturally, I couldn’t do much. The DCL changed all of that by introducing me to the Tidyverse, which not only gave me a unified set of tools to explore, visualize, and customize my data, but also gave me a much more intuitive sense for how data works and feels.

That said, I’ve never felt the same way about statistical modeling. While I’ve studied statistical learning at Stanford and know my way around most common models and techniques, I’m still working to achieve the kind of fluency in modeling that the Tidyverse has given me in data transformation and visualization.

One obstacle I’ve faced has been the ad hoc nature of modeling itself. Because modeling is pretty much always oriented toward a single purpose — accurate, useful prediction — it’s easy to approach it with a reductive mindset. Even tasks like variable selection and parameter tuning are typically automated or routinized to the point that exploration is all but relegated to the preliminary stages of building a model. Part of the joy of the Tidyverse is that it allows users to be expressive — why can’t modeling be the same?

Enter tidymodels, a meta-package that includes a growing set of tools under development by Max Kuhn and his colleagues at RStudio. Along with parsnip, which marks an attempt to unify the expansive universe of R modeling packages into a common interface, tidymodels provides the tools needed to iterate and explore modeling tasks with a tidy philosophy.

In this post, I’ll demonstrate the basic workflow provided by these packages and comment briefly on what they add to the model building process. I’ll be using the Wisconsin Breast Cancer data set provided by UC Irvine’s Machine Learning Repository, which I found on Kaggle. I picked up the basics of the tidymodels packages from Max Kuhn’s vignettes on GitHub and Clayton Yochum’s slides from a meeting of the Ann Arbor R User Group — I highly recommend you check them out for more detailed explanations than I provide here.

The data

The data consists of 569 observations of cell samples, the physical features of which are summarized by three statistics: mean, standard error, and “worst” — in this case, the largest observed values for each of the features measured. With ten features (such as cell area, concavity, and fractal dimension), there are 30 total predictors in the dataset. Each observation is also labeled with a unique identifier and a diagnosis which will form the target of our prediction: malignant or benign.
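To set the stage, reading in the Kaggle CSV and inspecting its structure might look something like this (the file name `data.csv` and the column names are assumptions based on the Kaggle version of the data):

```r
library(tidyverse)

# Hypothetical file name from the Kaggle download
df <-
  read_csv("data.csv") %>%
  mutate(diagnosis = factor(diagnosis, levels = c("M", "B")))

# 569 rows: an id, a diagnosis, and 30 numeric predictors
glimpse(df)
```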

Because this is a relatively rich but small dataset, I’ll follow best practice and try to be picky about which predictors I include. Let’s assume that malignant cancer cells will have several physical differences from benign cells and that our measures will be able to detect these differences. In that case, a ratio of these measures’ means might be a reasonable heuristic for identifying which dimensions exhibit the most difference on average, and thus which measures are likely to have the most predictive power.
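A sketch of that heuristic, assuming the data lives in a tibble `df` with `id` and `diagnosis` columns as above:

```r
library(tidyverse)

# Ratio of malignant to benign means for each numeric feature --
# larger ratios suggest more average separation between the classes
ratios <-
  df %>%
  select(-id) %>%
  group_by(diagnosis) %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(-diagnosis, names_to = "feature", values_to = "mean") %>%
  pivot_wider(names_from = diagnosis, values_from = mean) %>%
  mutate(ratio = M / B) %>%
  arrange(desc(ratio))
```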

There seems to be a fair amount of separation along measures of area, concavity, and compactness. The average area of a malignant cell sample is nearly twice as large as a benign sample, for instance.

Let’s explore our hunch about separation further by visualizing pairs of these dimensions with scatter plots — which features seem more predictive? Along what dimensions would the diagnoses be easier to predict?
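One such plot, using two feature names assumed from the Kaggle data:

```r
library(ggplot2)

# Do area and concavity separate the two diagnoses?
ggplot(df, aes(area_mean, concavity_mean, color = diagnosis)) +
  geom_point(alpha = 0.6) +
  labs(title = "Concavity vs. area, colored by diagnosis")
```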

Next, recipes handles the pre-processing. This, in my opinion, is where the real magic of tidymodels comes in. Given a sample of training data, you first specify a model formula (using the traditional y ~ x notation) and then assign any additional variable roles with add_role() or update_role(). Once roles are assigned, variables can be referenced with dplyr-like helper functions such as all_predictors() or all_nominal() — this comes in handy for the processing steps that follow.

The various step_ functions allow for easy rescaling and transformation, but more importantly they allow you to specify a routine that will consistently reshape all the data you’re feeding into your model. Apart from removing predictors and rescaling values, there are step_ functions for PCA, missing value imputation and more.
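Putting roles and steps together, a minimal recipe might look like this (assuming a training split named `train` with a `diagnosis` outcome and an `id` column):

```r
library(recipes)

rec <-
  recipe(diagnosis ~ ., data = train) %>%
  # Keep the id column around without treating it as a predictor
  update_role(id, new_role = "id variable") %>%
  # Center and scale every remaining (numeric) predictor
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
```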

Once we’ve created a recipe() object, the next step is to prep() it. In the baking analogy, the recipe we created is simply a specification for how we want to process our data, and prepping is the process of getting our ingredients and tools in order so that we can bake it. We specify retain = TRUE in the prepping process if we want to hold onto the recipe’s initial training data for later.

# Prepping
prepped <-
  rec %>%
  prep(retain = TRUE)

Now that we have a recipe and a prepped object, we’re ready to start baking. The bake() function allows us to apply a prepped recipe to new data, which will be processed according to our exact specifications. The juice() function is essentially a shortcut for bake() that’s useful when we want to process and output the training data used to originally specify the recipe (with retain = TRUE during prepping).
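In code, assuming a held-out split named `test`, that looks like:

```r
# Apply the prepped recipe to new data...
test_baked <- bake(prepped, new_data = test)

# ...or retrieve the processed training data retained during prep()
train_baked <- juice(prepped)
```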

Now we’re ready to train our model with parsnip. I won’t say too much about parsnip, which still appears to be in beta, but the gist is that, like its predecessor caret, it lets R users specify model families, engines, and arguments through a common interface. In other words, it takes a huge headache out of the model building process. It also has the benefit of working seamlessly with the rest of the tidymodels family, likewise developed by Max Kuhn and co.
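Here’s a sketch of that interface; the choice of logistic regression with the "glm" engine is illustrative, and the baked data frames are the ones assumed above:

```r
library(parsnip)

# Specify the model family, pick an engine, then fit
model_fit <-
  logistic_reg() %>%
  set_engine("glm") %>%
  fit(diagnosis ~ ., data = train_baked)

# Class predictions on the processed test set
predict(model_fit, new_data = test_baked)
```

Swapping in a different model family (say, rand_forest() with the "ranger" engine) would change only the first two lines of the pipeline.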

Not bad for a first try! Using a fraction of the original predictors, we’re able to achieve 93.0% accuracy. Of course, a figure this high is not too surprising given the clean separation we observed during EDA, and we could probably do just as well with even fewer predictors if we wanted. One of the advantages of unifying every step of the process within a tidy workflow is that it’s easy to step back and make adjustments without rewriting code.

We can further validate our test set accuracy with 10-fold cross validation using rsample to create the folds and purrr to fit all ten models within a nested tibble. One quick note: rather than create 10 tibbles, vfold_cv() creates a list of splits containing the indices for each fold. To get training data from a splits object, simply call analysis(), and to get the test set, call assessment().
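A sketch of that pattern (for brevity this fits directly on the training split; in a stricter workflow you’d prep the recipe within each fold):

```r
library(rsample)
library(purrr)
library(dplyr)

set.seed(27)  # arbitrary seed for reproducible folds

cv_fits <-
  vfold_cv(train, v = 10) %>%
  mutate(
    # Fit one model per fold, training on the analysis set
    fit = map(
      splits,
      ~ logistic_reg() %>%
        set_engine("glm") %>%
        fit(diagnosis ~ ., data = analysis(.x))
    )
  )
```

From here, mapping predict() over each fit and its assessment() set yields the per-fold accuracies.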

Closing thoughts

This data set turned out to be pretty easy to predict on and we got some satisfying results on the first go-round, but there’s always room for improvement. The beauty of tidymodels is that with the above code as a foundation, it would only take a few lines of edits to change the model type with parsnip, the pre-processing with recipes, or our assessment with yardstick and rsample. While modeling will always be ad hoc by nature, tidymodels opens up the process to greater expressiveness and more purposeful exploration.