K-fold cross-validation is used to validate a model internally, i.e., to
estimate the model's performance without having to sacrifice a validation
split. It also avoids the statistical issues of a single validation split
(it might be a "lucky" split, especially for imbalanced data). Good values
for K are around 5 to 10. Comparing the K validation metrics is always a
good idea, to check the stability of the estimate, before "trusting"
the main model.

You have to make sure, however, that the holdout sets for each of the K
models are good. For i.i.d. data, random splitting of the data into
K pieces (the default behavior) or modulo-based splitting is fine. For
temporal or otherwise structured data with distinct "events", you have
to make sure to split the folds based on those events. For example, suppose
you have observations (e.g., user transactions) from K cities and you want
to build models on users from only K-1 cities and validate them on the
remaining city (to study generalization to new cities, for example). In
that case, you will need to set the "fold_column" parameter to the city
column. Otherwise, rows (users) from all K cities will be randomly blended
into the K folds, all K cross-validation models will see all K cities, and
the validation will be less useful (or totally wrong, depending on the
distribution of the data). This is known as "data leakage":
https://youtu.be/NHw_aKO5KUM?t=889
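As a sketch of that leave-one-city-out scenario (the frame, column names, and predictors below are illustrative, not from the original; fold_column is a real parameter of H2O's supervised algos):

```r
library(h2o)
h2o.init()

# Hypothetical frame of user transactions with a "city" column and a
# binary response "churn" (all names here are illustrative)
transactions.hex <- h2o.importFile("transactions.csv")
transactions.hex$city <- as.factor(transactions.hex$city)

# Each fold holds out exactly one city, so every cross-validation model
# is trained on K-1 cities and validated on the remaining one
fit <- h2o.gbm(x = c("amount", "num_items"),   # illustrative predictors
               y = "churn",
               training_frame = transactions.hex,
               fold_column = "city")           # instead of nfolds
```

With fold_column set, the number of folds is determined by the number of distinct values in that column, and nfolds must not be given.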

In general, for all algos that support the nfolds parameter, H2O’s
cross-validation works as follows:

For example, for nfolds=5, 6 models are built. The first 5 models
(cross-validation models) are built on 80% of the training data, and a
different 20% is held out for each of the 5 models. Then the main model
is built on 100% of the training data. This main model is the model you
get back from H2O in R, Python and Flow (though the CV models are also stored and available to access later).

This main model contains training metrics and cross-validation metrics
(and optionally, validation metrics if a validation frame was provided).
The main model also contains pointers to the 5 cross-validation models
for further inspection.

All 5 cross-validation models contain training metrics (from the 80%
training data) and validation metrics (from their 20% holdout/validation
data). To compute their individual validation metrics, each of the 5
cross-validation models had to make predictions on their 20% of rows
of the original training frame, and score against the true labels of the
20% holdout.

For the main model, this is how the cross-validation metrics are
computed: The 5 holdout predictions are combined into one prediction for
the full training dataset (i.e., predictions for every row of the
training data, but the model making the prediction for a particular row
has not seen that row during training). This “holdout prediction” is
then scored against the true labels, and the overall cross-validation
metrics are computed.

This approach has some implications. Scoring the holdout predictions
freshly can result in different metrics than taking the average of the 5
validation metrics of the cross-validation models. For example, if the
sizes of the holdout folds differ a lot (e.g., when a user-given
fold_column is used), then the average should probably be replaced with
a weighted average. Also, if the cross-validation models map to slightly
different probability spaces, which can happen for small DL models that
converge to different local minima, then the inconsistent rank ordering of
the combined predictions can lead to a significantly different AUC than
the average.

To gain more insight into the variance of the holdout metrics (e.g.,
AUCs), you can look up the cross-validation models and inspect their
validation metrics.
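A minimal sketch of the two approaches, assuming a binomial model named "fit" built with nfolds = 5 (h2o.auc, h2o.getModel, and the cross_validation_models slot are standard parts of H2O's R API):

```r
# Approach 1: the overall cross-validation metric, computed by scoring
# the combined holdout predictions against the true labels (stored on
# the main model)
h2o.auc(fit, xval = TRUE)

# Approach 2: look up the 5 cross-validation models and average their
# individual validation AUCs (each computed on that model's holdout fold)
cv_models <- lapply(fit@model$cross_validation_models,
                    function(m) h2o.getModel(m$name))
cv_aucs <- sapply(cv_models, function(m) h2o.auc(m, valid = TRUE))
mean(cv_aucs)   # may differ from Approach 1, for the reasons above
sd(cv_aucs)     # spread of the holdout AUCs across the 5 folds
```

The standard deviation in the last line is a quick way to judge how stable the estimate is across folds.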

Each cross-validation model produces a prediction frame pertaining to its
fold. It can be saved and probed from the various clients if the
keep_cross_validation_predictions parameter is set in the model
constructor.

These holdout predictions have some interesting properties. First they
have names like:

prediction_GBM_model_1452035702801_1_cv_1

and they contain, unsurprisingly, predictions for the data held out in
the fold. They also have the same number of rows as the entire input
training frame, with 0s filled in for all rows that are not in the
holdout.
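For instance, such a frame can be fetched by its key with h2o.getFrame (the key below is the example name from above; yours will differ, and keep_cross_validation_predictions must have been set):

```r
# Fetch the holdout prediction frame of the first fold by its key
cv1 <- h2o.getFrame("prediction_GBM_model_1452035702801_1_cv_1")

# Same number of rows as the full training frame; rows outside
# fold 1 are filled with 0s
nrow(cv1)
```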

Let’s look at an example.

Here is a snippet of a three-class classification dataset (last column
is the response column), with a 3-fold identification column appended to
the end:
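The original snippet is not reproduced here; an equivalent frame (three-class response in the last column, plus an appended fold column) can be sketched with R's built-in iris dataset, where Species is the three-class response:

```r
library(h2o)
h2o.init()

# Build an illustrative three-class frame with a 3-fold identification
# column appended after the response column
df <- iris
df$fold <- rep_len(0:2, nrow(df))   # fold ids 0, 1, 2
iris.hex <- as.h2o(df)
head(iris.hex)
```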

The frame of cross-validated predictions is a single-column frame, where each row is the cross-validated prediction for that row. If you want H2O to keep these cross-validated predictions, you must set keep_cross_validation_predictions to TRUE. Here's an example in R:

library(h2o)
h2o.init()

# H2O Cross-validated K-means example
prosPath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prosPath)
fit <- h2o.kmeans(training_frame = prostate.hex,
                  k = 10,
                  x = c("AGE", "RACE", "VOL", "GLEASON"),
                  nfolds = 5,  # to specify folds directly, use the "fold_column" arg instead
                  keep_cross_validation_predictions = TRUE)

# This is where the list of cv preds is stored (one element per fold):
fit@model[["cross_validation_predictions"]]

# However, you most likely want a single-column frame including all cv preds:
cvpreds <- h2o.getFrame(fit@model[["cross_validation_holdout_predictions_frame_id"]][["name"]])