k-fold cross validation with modelr and broom

@drsimonj here to discuss how to conduct k-fold cross validation, with an emphasis on evaluating models supported by David Robinson’s broom package. Full credit also goes to David, as this is a slightly more detailed version of a past post of his, which I read some time ago and felt like unpacking.

This function takes a data frame and randomly partitions its rows (1 to 32 for mtcars) into k roughly equal groups. Here, we’ve partitioned the row numbers into k = 5 groups. The results are returned as a tibble (data frame) like the one above.
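The partitioning step described above can be reproduced with modelr’s crossv_kfold() (a minimal sketch; the seed is my own addition, purely for reproducibility):

```r
# Partition mtcars into k = 5 cross-validation folds.
# crossv_kfold() returns a tibble with list-columns `train` and `test`
# (resample objects) plus a fold identifier `.id`.
library(modelr)

set.seed(1)  # assumption: seed added only so the partitions are reproducible
folds <- crossv_kfold(mtcars, k = 5)
folds
```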

Each cell in the test column contains a resample object, which is an efficient way of referencing a subset of rows in a data frame (?resample to learn more). Think of each cell as a reference to the rows of the data frame belonging to each partition. For example, the following tells us that the first partition of the data references rows 5, 9, 17, 20, 27, 28, 29, which accounts for roughly 1 / k of the total data set (7 of the 32 rows).

folds$test[[1]]
#> 5, 9, 17, 20, 27, 28, 29

Each cell in train also contains a resample object, but referencing the rows in all other partitions. For example, the first train object references all rows except 5, 9, 17, 20, 27, 28, 29:

folds$train[[1]]
#> 1, 2, 3, 4, 6, 7, 8, 10, 11, 12, ...
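To materialise the rows that a resample object refers to, it can be coerced with as.data.frame() (or with as.integer() for just the row indices). A small sketch, repeating the fold setup so it runs on its own:

```r
library(modelr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 5)

as.integer(folds$test[[1]])     # the row indices in the first test fold
as.data.frame(folds$test[[1]])  # the corresponding rows of mtcars
```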

We can now run a model on the data referenced by each train object, and validate the model results on each corresponding partition in test.

Fitting models to training data

Say we’re interested in predicting Miles Per Gallon (mpg) with all other variables. With the whole data set, we’d do this via:

lm(mpg ~ ., data = mtcars)

Instead, we want to run this model on each set of training data (data referenced in each train cell). We can do this as follows:
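A sketch of that step, using purrr’s map() to fit lm(mpg ~ .) to each train resample (the fold-creation lines are repeated so the snippet is self-contained):

```r
library(modelr)
library(dplyr)
library(purrr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 5)

# Fit the model to the data referenced by each `train` resample;
# the fitted lm objects are stored in a new `model` list-column.
folds <- folds %>%
  mutate(model = map(train, ~ lm(mpg ~ ., data = .)))
```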

Predicting the test data

The next step is to use each model for predicting the outcome variable in the corresponding test data. There are many ways to achieve this. One general approach might be:

folds %>% mutate(predicted = map2(model, test, <function for predicting the test data>))

map2(model, test, ...) iterates through each model and its corresponding set of test data in parallel. Supplying a function that predicts the test data from the model would add a predicted column containing the predicted results.
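One concrete way to fill in that placeholder is predict(), coercing each test resample to a plain data frame first (a sketch, with the fold and model setup repeated so it runs on its own):

```r
library(modelr)
library(dplyr)
library(purrr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 5) %>%
  mutate(model = map(train, ~ lm(mpg ~ ., data = .)))

# Predict mpg for each test fold using the model fitted on its train fold
folds <- folds %>%
  mutate(predicted = map2(model, test,
                          ~ predict(.x, newdata = as.data.frame(.y))))
```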

For many common models, an elegant alternative is to use augment from broom. For regression, augment will take a fitted model and a new data frame, and return a data frame of the predicted results, which is what we want! Following above, we can use augment as follows:
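A sketch of the augment approach, consistent with the description above (each resample is coerced with as.data.frame() for safety, and tidyr’s unnest() then flattens the per-fold predictions into one data frame):

```r
library(modelr)
library(dplyr)
library(purrr)
library(broom)
library(tidyr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 5) %>%
  mutate(model = map(train, ~ lm(mpg ~ ., data = .)))

# augment() returns the test rows plus a .fitted column of predictions
predicted <- folds %>%
  mutate(predicted = map2(model, test,
                          ~ augment(.x, newdata = as.data.frame(.y)))) %>%
  select(.id, predicted) %>%
  unnest(predicted)
```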