
cvms

Overview

R package: Cross-validate one or multiple regression models and get
relevant evaluation metrics in a tidy format. Validate the best model on
a test set and compare it to a baseline evaluation. Alternatively,
evaluate predictions from an external model. Currently supports linear
regression (gaussian), logistic regression (binomial), and, in some
functions, multiclass classification (multinomial).

Main functions:

cross_validate()

validate()

evaluate()

baseline()

combine_predictors()

cv_plot()

select_metrics()

reconstruct_formulas()
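
As a quick sketch of the core workflow (a hedged example: the data frame
and its score, diagnosis and participant columns are placeholders, and
later versions of cvms rename the models argument to formulas):

library(cvms)
library(groupdata2)

set.seed(1)

# Create a fold column, balancing the folds by diagnosis and keeping
# all rows from the same participant in the same fold
data <- fold(data, k = 4, cat_col = "diagnosis", id_col = "participant")

# Cross-validate two linear models and compare them in the output tibble
CV1 <- cross_validate(
  data,
  models = c("score ~ diagnosis", "score ~ diagnosis + age"),
  fold_cols = ".folds",
  family = "gaussian"
)

# Extract the core reporting metrics
select_metrics(CV1)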

Important News

Adds 'multinomial' family to baseline() and evaluate().

evaluate() is added. Evaluate your model’s predictions with the
same metrics as used in cross_validate().

AUC calculation has changed. The direction is now set explicitly in
the call to pROC::roc. (27th of May 2019)

Argument "positive" now defaults to 2. If a dependent variable
has the values 0 and 1, 1 is now the default positive class, as
that’s the second smallest value. If the dependent variable is of
type character, it’s in alphabetical order.

Results now contain a count of singular fit messages. See
?lme4::isSingular for more information.

Repeated cross-validation

Let’s first add some extra fold columns. We will use the num_fold_cols
argument to add 3 unique fold columns. We tell fold() to keep the
existing fold column and simply add three extra columns. We could also
choose to remove the existing fold column if, for instance, we were
changing the number of folds (k). Note that the original fold column
will be renamed to “.folds_1”.
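
A sketch of this step, assuming the data frame already holds a “.folds”
column from an earlier fold() call (the cat_col and id_col settings are
placeholders and should match the original folding):

library(cvms)
library(groupdata2)

set.seed(2)

# Add three extra unique fold columns, keeping (and renaming) the
# existing ".folds" column
data <- fold(
  data,
  k = 4,
  cat_col = "diagnosis",
  id_col = "participant",
  num_fold_cols = 3,
  handle_existing_fold_cols = "keep"
)

# Repeated cross-validation: pass all four fold columns
CV_rep <- cross_validate(
  data,
  models = "score ~ diagnosis",
  fold_cols = paste0(".folds_", 1:4),
  family = "gaussian"
)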

Evaluating predictions

Evaluate predictions from a model trained outside cvms. Works with
linear regression (gaussian), logistic regression (binomial), and
multiclass classification (multinomial). The following is an example
of multinomial evaluation.

Multinomial

Create a dataset with 3 predictors and a target column. Partition it
with groupdata2::partition() to create a training set and a validation
set. multiclass_probability_tibble() is a simple helper function for
generating random tibbles.
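
A minimal sketch that evaluates random class probabilities against
random targets, skipping the model-training step (the names and sizes
are illustrative):

library(cvms)

set.seed(3)

# Random probability columns for three classes (softmaxed to sum to 1 per row)
data_mc <- multiclass_probability_tibble(
  num_classes = 3,
  num_observations = 45,
  apply_softmax = TRUE
)

# The columns are named "class_1", "class_2" and "class_3" by default
class_names <- colnames(data_mc)

# Add a random target column drawn from the same class names
data_mc$target <- sample(class_names, nrow(data_mc), replace = TRUE)

# Evaluate the probability columns against the targets
ev <- evaluate(
  data = data_mc,
  target_col = "target",
  prediction_cols = class_names,
  type = "multinomial"
)
ev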

Baseline evaluations

Create baseline evaluations of a test set.

Gaussian

Approach: The baseline model (y ~ 1), where 1 is simply the intercept
(i.e. mean of y), is fitted on n random subsets of the training set and
evaluated on the test set. We also perform an evaluation of the model
fitted on the entire training set.
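
A sketch of that approach, with placeholder column and dataset names:

library(cvms)
library(groupdata2)

set.seed(4)

# Split the data into a 75% training and a 25% test partition
partitions <- partition(data, p = 0.75, cat_col = "diagnosis")
train_set <- partitions[[1]]
test_set <- partitions[[2]]

# Fit y ~ 1 on 100 random subsets of the training set (and once on the
# entire training set) and evaluate each fit on the test set
baseline_results <- baseline(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  family = "gaussian",
  n = 100
)

baseline_results$summarized_metrics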

Plot results

There is currently a small set of plots for quick visualization of the
results. It is meant to be easy to extract the information needed to
create your own plots. If you lack access to any information, or have
other requests or ideas, feel free to open an issue.

Gaussian

cv_plot(CV1, type="RMSE") +
theme_bw()

cv_plot(CV1, type="r2") +
theme_bw()

cv_plot(CV1, type="IC") +
theme_bw()

cv_plot(CV1, type="coefficients") +
theme_bw()

Binomial

cv_plot(CV2, type = "ROC") +
  theme_bw()

Generate model formulas

Instead of manually typing all possible model formulas for a set of
fixed effects (including the possible interactions),
combine_predictors() can do it for you (with some constraints).

When including interactions, >200k formulas have been precomputed for
up to 8 fixed effects, with a maximum interaction size of 3, and a
maximum of 5 fixed effects per formula. It’s possible to further limit
the generated formulas.

We can also append a random effects structure to the generated formulas.
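
A small sketch (the effect and grouping names are placeholders):

library(cvms)

# All valid formulas for three fixed effects, allowing interactions,
# with a random intercept per participant appended to each formula
formulas <- combine_predictors(
  dependent = "score",
  fixed_effects = c("diagnosis", "age", "session"),
  random_effects = "(1|participant)"
)

length(formulas)  # number of generated formula strings
head(formulas)    # e.g. "score ~ age + (1|participant)"

The formulas are returned as character strings, so they can be passed
directly to cross_validate().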