In real-world scenarios, data can change. For example, you may have a model currently in production that was built using 1 million records. At a later date, you may receive several hundred thousand more records. Rather than building a new model from scratch, you can use the checkpoint option to create a new model based on the existing model.

The checkpoint option is available for the DRF, GBM, and Deep Learning algorithms. It allows you to specify the model key (model_id) of a previously trained model; H2O then builds the new model as a continuation of that model. If checkpoint is not specified, the algorithm trains a new model from scratch rather than continuing a previous one.

When setting parameters that continue to build on a previous model, specifically ntrees (in GBM/DRF) or epochs (in Deep Learning), specify the total amount of training you want, as if you had started from scratch, not the number of additional trees or epochs. Note that this means the ntrees or epochs value for the checkpointed model must always be greater than the original value. For example:

If the first model builds 1 tree, and you want your new model to build 50 trees, then the continuation model (using checkpointing) would specify ntrees=50. This gives you a total of 50 trees including 49 new ones.

If your original model included 20 trees, and you specify ntrees=50 for the continuation model, then the new model will add 30 trees to the model, again giving you a total of 50 trees.

If your original model included 20 trees, and you specify ntrees=10 (a lower value), then you will receive an error indicating that the requested ntrees must be higher than 21.
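The total-not-additional rule above can be sketched in a few lines of plain Python. This is an illustrative helper, not part of the H2O API; it only mirrors the arithmetic and the error condition described in the examples above.

```python
def new_trees_built(original_ntrees, requested_ntrees):
    """Illustrative only (not an H2O function): given the ntrees of the
    checkpointed model and the ntrees requested for the continuation
    model, return how many trees the continuation actually adds."""
    if requested_ntrees <= original_ntrees:
        # H2O rejects a requested total that does not exceed the original.
        raise ValueError("requested ntrees must be higher than %d"
                         % (original_ntrees + 1))
    return requested_ntrees - original_ntrees

print(new_trees_built(1, 50))   # first example above: 49 new trees
print(new_trees_built(20, 50))  # second example above: 30 new trees
```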

Notes:

The response type and model type of the training data must be the same as for the checkpointed model.

The columns of the training data must be the same as for the checkpointed model.

Categorical factor levels of the training data must be the same as for the checkpointed model.

The total number of predictors of the training data must be the same as for the checkpointed model.

Cross-validation is not currently supported for checkpointing. In addition, if you use a dataset for validation (with the validation_frame parameter), you must use this same validation set each time you continue training through checkpointing.

The parameters that you can specify with checkpointing vary based on the algorithm that was used for model training. Scenarios for different algorithms are described in the sections that follow.

In Deep Learning, checkpoint can be used to continue training on the same dataset for additional epochs or to train on new data for additional epochs.

To resume model training, use checkpoint model keys (model_id) to incrementally train a specific model using more iterations, more data, different data, and so forth. To further train the initial model, use it (or its key) as a checkpoint argument for a new model.

To get the best possible model in a general multi-node setup, we recommend building a model with train_samples_per_iteration=-2 (default, auto-tuning) and saving it to disk so that you'll have at least one saved model.

To improve this initial model, start from the previous model and add iterations by building another model, specifying checkpoint=previous_model_id, and changing train_samples_per_iteration, target_ratio_comm_to_comp, or other parameters. Many parameters can be changed between checkpoints, especially those that affect regularization or performance tuning.

Checkpoint restart suggestions:

For multi-node only: Leave train_samples_per_iteration=-2 and increase target_ratio_comm_to_comp from 0.05 to 0.25 or 0.5 (more communication). This should lead to a better model when using multiple nodes. Note: This has no effect on single-node performance at all because there is no actual communication needed.

For both single and multi-node (bagging-like): Explicitly set train_samples_per_iteration=N, where \(N\) is the number of training samples for the whole cluster to train with for one iteration. Each of the \(n\) nodes will then train on \(N/n\) randomly chosen rows for each iteration. Obviously, a good choice for \(N\) depends on the dataset size and the model complexity. Refer to the logs to see what values of \(N\) are used in option 1 (when auto-tuning is enabled). Typically, option 1 is sufficient.
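To make the relationship between \(N\) and \(N/n\) concrete, here is a small illustrative calculation in plain Python (not H2O code; the sample counts are made up):

```python
def rows_per_node_per_iteration(total_samples, num_nodes):
    """Illustrative only: with train_samples_per_iteration=N on an
    n-node cluster, each node trains on roughly N/n randomly chosen
    rows per iteration (integer division for a whole-row count)."""
    return total_samples // num_nodes

# Hypothetical example: N=100000 samples per iteration on a 4-node cluster
print(rows_per_node_per_iteration(100000, 4))  # 25000 rows per node
```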

For both single and multi-node: Change regularization parameters such as l1, l2, max_w2, input_dropout_ratio, hidden_dropout_ratios. For best results, build the first model with RectifierWithDropout, input_dropout_ratio=0, and hidden_dropout_ratios set to a list of all 0s, just to be able to enable dropout regularization later. Hidden dropout often improves generalization, so it is a reasonable choice even for initial models. Input dropout is especially useful if there is some noise in the input.

Options 1 and 3 should result in a good model. Of course, grid search can be used with checkpoint restarts to scan a broad range of good continuation models.

Note: The following parameters cannot be modified during checkpointing:

activation

autoencoder

backend

channels

distribution

drop_na20_cols

ignore_const_cols

max_categorical_features

mean_image_file

missing_values_handling

momentum_ramp

momentum_stable

momentum_start

network

network_definition_file

nfolds

problem_type

standardize

use_all_factor_levels

y (response column)

The following example demonstrates how to build a deep learning model that will later be used for checkpointing. This example will cover both types of checkpointing: checkpointing with the same dataset and checkpointing with new data. This example uses the famous MNIST dataset, which is used to classify handwritten digits from 0 through 9.

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
h2o.init()

# Import the mnist dataset
mnist_original = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")

# The last column, C785, is the target that lists whether the
# handwritten digit was a 0,1,2,3,4,5,6,7,8, or 9. Before we
# set the variables for our predictors and target, we will
# convert our target column from type int to type enum.
mnist_original['C785'] = mnist_original['C785'].asfactor()
predictors = mnist_original.columns[0:-1]
target = 'C785'

# Split the data into training and validation sets, and split
# a piece off to demonstrate adding new data with checkpointing.
# In a real world scenario, however, you would not have your
# new data at this point.
train, valid, new_data = mnist_original.split_frame(ratios=[.7, .15], seed=1234)

# Build the first deep learning model, specifying the model_id so you
# can indicate which model to use when you want to continue training.
# We will use 4 epochs to start off with and then build an additional
# 16 epochs with checkpointing.
dl = H2ODeepLearningEstimator(distribution='multinomial',
                              model_id='dl',
                              epochs=4,
                              activation='rectifier_with_dropout',
                              hidden_dropout_ratios=[0, 0],
                              seed=1234)
dl.train(x=predictors, y=target, training_frame=train, validation_frame=valid)

print('Validation Mean Per Class Error for DL:', dl.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for DL:', 0.0665710328899672)
print('Validation Logloss for DL:', dl.logloss(valid=True))
('Validation Logloss for DL:', 0.38771905396189366)

# Checkpoint on the same dataset. This shows how to train an additional
# 16 epochs on top of the first 4. To do this, set epochs equal to
# 20 (not 16). This example also changes the list of hidden dropout ratios.
dl_checkpoint1 = H2ODeepLearningEstimator(distribution='multinomial',
                                          model_id='dl_w_checkpoint1',
                                          checkpoint='dl',
                                          epochs=20,
                                          activation='rectifier_with_dropout',
                                          hidden_dropout_ratios=[0, 0.5],
                                          seed=1234)
dl_checkpoint1.train(x=predictors, y=target, training_frame=train, validation_frame=valid)

print('Validation Mean Per Class Error for DL with Checkpointing:', dl_checkpoint1.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for DL with Checkpointing:', 0.05596493320234874)
print('Validation Logloss for DL with Checkpointing:', dl_checkpoint1.logloss(valid=True))
('Validation Logloss for DL with Checkpointing:', 0.2622290756893055)

improvement_dl = dl.logloss(valid=True) - dl_checkpoint1.logloss(valid=True)
print('Overall improvement in logloss is {0}'.format(improvement_dl))
Overall improvement in logloss is 0.142712240337

# Checkpoint on a new dataset. Notice that to train on new data,
# you set training_frame to new_data (not train) and leave the
# same dataset to use for validation.
dl_checkpoint2 = H2ODeepLearningEstimator(distribution='multinomial',
                                          model_id='dl_w_checkpoint2',
                                          checkpoint='dl',
                                          epochs=15,
                                          activation='rectifier_with_dropout',
                                          hidden_dropout_ratios=[0, 0],
                                          seed=1234)
dl_checkpoint2.train(x=predictors, y=target, training_frame=new_data, validation_frame=valid)

print('Validation Mean Per Class Error for DL:', dl_checkpoint2.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for DL:', 0.06465957648350525)
print('Validation Logloss for DL:', dl_checkpoint2.logloss(valid=True))
('Validation Logloss for DL:', 0.3616085918270951)

improvement_dl = dl.logloss(valid=True) - dl_checkpoint2.logloss(valid=True)
print('Overall improvement in logloss is {0}'.format(improvement_dl))
Overall improvement in logloss is 0.0261104621348

In DRF, checkpoint can be used to continue training on the same dataset, or to train on new data, for additional trees.

Note: The following parameters cannot be modified during checkpointing:

build_tree_one_node

max_depth

min_rows

nbins

nbins_cats

nbins_top_level

sample_rate

The following example demonstrates how to build a distributed random forest model that will later be used for checkpointing. This checkpoint example shows how to continue training on an existing model and also builds with new data. This example uses the cars dataset, which classifies whether or not a car is economical based on the car’s displacement, power, weight, and acceleration, and the year it was made.

library(h2o)
h2o.init()

# Import the cars dataset.
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# Convert the response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

# Set the predictor names and the response column name
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "economy_20mpg"

# Split the data into training and validation sets, and split
# a piece off to demonstrate adding new data with checkpointing.
# In a real world scenario, however, you would not have your
# new data at this point.
cars.split <- h2o.splitFrame(data = cars, ratios = c(0.7, 0.15), seed = 1234)
train <- cars.split[[1]]
valid <- cars.split[[2]]
new_data <- cars.split[[3]]

# Build the first DRF model, specifying the model_id so you can
# indicate which model to use when you want to continue training.
# We will use 1 tree to start off with and then build an additional
# 9 trees with checkpointing.
drf <- h2o.randomForest(model_id = 'drf',
                        x = predictors,
                        y = response,
                        training_frame = train,
                        validation_frame = valid,
                        ntrees = 1,
                        seed = 1234)
print(h2o.mean_per_class_error(drf, valid = TRUE))
[1] 0.09453782
print(h2o.logloss(drf, valid = TRUE))
[1] 3.597789

# Checkpoint on the same dataset. This shows how to train an additional
# 9 trees on top of the first 1. To do this, set ntrees equal to 10.
drf_continued <- h2o.randomForest(model_id = 'drf_continued',
                                  x = predictors,
                                  y = response,
                                  training_frame = train,
                                  validation_frame = valid,
                                  checkpoint = 'drf',
                                  ntrees = 10,
                                  seed = 1234)
print(h2o.mean_per_class_error(drf_continued, valid = TRUE))
[1] 0.06512605
print(h2o.logloss(drf_continued, valid = TRUE))
[1] 0.1826136
print(improvement_drf <- h2o.logloss(drf, valid = TRUE) - h2o.logloss(drf_continued, valid = TRUE))
[1] 3.415176

# Checkpoint on a new dataset. Notice that to train on new data,
# you set training_frame to new_data (not train) and leave the
# same dataset to use for validation.
drf_newdata <- h2o.randomForest(model_id = 'drf_newdata',
                                x = predictors,
                                y = response,
                                training_frame = new_data,
                                validation_frame = valid,
                                checkpoint = 'drf',
                                ntrees = 15,
                                seed = 1234)
print(h2o.mean_per_class_error(drf_newdata, valid = TRUE))
[1] 0.07142857
print(h2o.logloss(drf_newdata, valid = TRUE))
[1] 0.1767007
print(improvement_drf <- h2o.logloss(drf, valid = TRUE) - h2o.logloss(drf_newdata, valid = TRUE))
[1] 3.421088

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()

# Import the cars dataset.
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# Convert the response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

# Set the predictor names and the response column name
predictors = ["displacement", "power", "weight", "acceleration", "year"]
response = "economy_20mpg"

# Split the data into training and validation sets, and split
# a piece off to demonstrate adding new data with checkpointing.
# In a real world scenario, however, you would not have your
# new data at this point.
train, valid, new_data = cars.split_frame(ratios=[.7, .15], seed=1234)

# Build the first DRF model, specifying the model_id so you can
# indicate which model to use when you want to continue training.
# We will use 1 tree to start off with and then build an additional
# 9 trees with checkpointing.
drf = H2ORandomForestEstimator(model_id="drf", ntrees=1, seed=1234)
drf.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

print('Validation Mean Per Class Error for DRF:', drf.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for DRF:', [[1.0, 0.09453781512605042]])
print('Validation Logloss for DRF:', drf.logloss(valid=True))
('Validation Logloss for DRF:', 3.597789207803196)

# Checkpoint on the same dataset. This shows how to train an additional
# 9 trees on top of the first 1. To do this, set ntrees equal to 10.
drf_continued = H2ORandomForestEstimator(model_id='drf_continued',
                                         checkpoint=drf,
                                         ntrees=10,
                                         seed=1234)
drf_continued.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

print('Validation Mean Per Class Error for DRF with Checkpointing:', drf_continued.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for DRF with Checkpointing:', [[0.7, 0.06512605042016806]])
print('Validation Logloss for DRF with Checkpointing:', drf_continued.logloss(valid=True))
('Validation Logloss for DRF with Checkpointing:', 0.1826135624064031)

improvement_drf = drf.logloss(valid=True) - drf_continued.logloss(valid=True)
print('Overall improvement in logloss is {0}'.format(improvement_drf))
Overall improvement in logloss is 3.4151756454

# Checkpoint on a new dataset. Notice that to train on new data,
# you set training_frame to new_data (not train) and leave the
# same dataset to use for validation.
drf_newdata = H2ORandomForestEstimator(model_id='drf_newdata',
                                       checkpoint='drf',
                                       ntrees=15,
                                       seed=1234)
drf_newdata.train(x=predictors, y=response, training_frame=new_data, validation_frame=valid)

print('Validation Mean Per Class Error for DRF:', drf_newdata.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for DRF:', [[0.5575757582982381, 0.06512605042016806]])
print('Validation Logloss for DRF:', drf_newdata.logloss(valid=True))
('Validation Logloss for DRF:', 0.17670074914138334)

improvement_drf = drf.logloss(valid=True) - drf_newdata.logloss(valid=True)
print('Overall improvement in logloss is {0}'.format(improvement_drf))
Overall improvement in logloss is 3.42108845866

In GBM, checkpoint can be used to continue training on a previously generated model rather than rebuilding the model from scratch. For example, you may train a model with 50 trees and wonder what the model would look like if you trained 10 more.

Note: The following parameters cannot be modified during checkpointing:

build_tree_one_node

max_depth

min_rows

nbins

nbins_cats

nbins_top_level

sample_rate

The following example demonstrates how to build a gradient boosting model that will later be used for checkpointing. This checkpoint example shows how to continue training on an existing model. We do not recommend checkpointing a GBM on new data. This example uses the cars dataset, which classifies whether or not a car is economical based on the car’s displacement, power, weight, and acceleration, and the year it was made.

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

# Import the cars dataset.
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# Convert the response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

# Set the predictor names and the response column name
predictors = ["displacement", "power", "weight", "acceleration", "year"]
response = "economy_20mpg"

# Split the data into training and validation sets, and split
# a piece off to demonstrate adding new data with checkpointing.
# In a real world scenario, however, you would not have your
# new data at this point.
train, valid, new_data = cars.split_frame(ratios=[.7, .15], seed=1234)

# Build the first GBM model, specifying the model_id so you can
# indicate which model to use when you want to continue training.
# We will use 5 trees to start off with and then build an additional
# 45 trees with checkpointing.
gbm = H2OGradientBoostingEstimator(model_id="gbm", ntrees=5, seed=1234)
gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

print('Validation Mean Per Class Error for GBM:', gbm.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for GBM:', [[0.6978087517334117, 0.05882352941176472]])
print('Validation Logloss for GBM:', gbm.logloss(valid=True))
('Validation Logloss for GBM:', 0.38223687802228534)

# Checkpoint on the same dataset. This shows how to train an additional
# 45 trees on top of the first 5. To do this, set ntrees equal to 50.
gbm_continued = H2OGradientBoostingEstimator(model_id='gbm_continued',
                                             checkpoint=gbm,
                                             ntrees=50,
                                             seed=1234)
gbm_continued.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

print('Validation Mean Per Class Error for GBM with Checkpointing:', gbm_continued.mean_per_class_error(valid=True))
('Validation Mean Per Class Error for GBM with Checkpointing:', [[0.8908495796146818, 0.02941176470588236]])
print('Validation Logloss for GBM with Checkpointing:', gbm_continued.logloss(valid=True))
('Validation Logloss for GBM with Checkpointing:', 0.19595254685018604)

improvement_gbm = gbm.logloss(valid=True) - gbm_continued.logloss(valid=True)
print('Overall improvement in logloss is {0}'.format(improvement_gbm))
Overall improvement in logloss is 0.186284331172

# See how the variable importance changes between the original model
# trained on 5 trees and the checkpointed model that adds 45 more trees
gbm.varimp(use_pandas=True).head()
       variable  relative_importance  scaled_importance  percentage
0  displacement           157.492630           1.000000    0.826301
1          year            16.086107           0.102139    0.084397
2        weight            13.484656           0.085621    0.070749
3         power             1.995252           0.012669    0.010468
4  acceleration             1.540924           0.009784    0.008085

gbm_continued.varimp(use_pandas=True).head()
       variable  relative_importance  scaled_importance  percentage
0  displacement           207.983673           1.000000    0.612753
1        weight            74.307816           0.357277    0.218923
2          year            34.255642           0.164704    0.100923
3         power            12.948729           0.062258    0.038149
4  acceleration             9.929341           0.047741    0.029253

# Train a GBM with cross validation (nfolds=3)
gbm_cv = H2OGradientBoostingEstimator(distribution='multinomial',
                                      model_id='gbm_cv',
                                      ntrees=5,
                                      nfolds=3)
gbm_cv.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

# Recall that cross validation is not supported for checkpointing.
# Add 2 more trees to the GBM without cross validation.
gbm_nocv_checkpoint = H2OGradientBoostingEstimator(distribution='multinomial',
                                                   model_id='gbm_nocv_checkpoint',
                                                   checkpoint='gbm_cv',
                                                   ntrees=(5 + 2),
                                                   seed=1234)
gbm_nocv_checkpoint.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

# Logloss on cross validation hold out does not change on checkpointed model
gbm_cv.logloss(xval=True) == gbm_nocv_checkpoint.logloss(xval=True)
True

# Logloss on training and validation data changes as more trees are added (checkpointed model)
print('Validation Logloss for GBM: ' + str(round(gbm_cv.logloss(valid=True), 3)))
Validation Logloss for GBM: 0.382
print('Validation Logloss for GBM with Checkpointing: ' + str(round(gbm_nocv_checkpoint.logloss(valid=True), 3)))
Validation Logloss for GBM with Checkpointing: 0.331