CloudML is a managed service where you pay only for the hardware resources that you use. Prices vary depending on configuration (e.g. CPU vs. GPU vs. multiple GPUs). See https://cloud.google.com/ml-engine/pricing for additional details.

Local Development

Working on a CloudML project always begins with developing a training script that runs on your local machine. This will typically involve using one of these packages:

keras — A high-level interface for neural networks, with a focus on enabling fast experimentation.

tfestimators — High-level implementations of common model types such as regressors and classifiers.

tensorflow — Lower-level interface that provides full access to the TensorFlow computational graph.

There are no special requirements for your training script; however, there are a couple of things to keep in mind:

When you train a model on CloudML all of the files in the current working directory are uploaded. Therefore, your training script should be within the current working directory and references to other scripts, data files, etc. should be relative to the current working directory. The most straightforward way to organize your work on a CloudML application is to use an RStudio Project.

Your training data may be contained within the working directory, or it may be located within Google Cloud Storage. If your training data is large and/or located in cloud storage, the most straightforward workflow for development is to use a local subsample of your data. See the article on Google Cloud Storage for a detailed example of using distinct data for local and CloudML execution contexts, as well as reading data from Google Cloud Storage buckets.

Once your script is working the way you expect, you are ready to submit it as a job to CloudML.

Submitting Jobs

The core unit of work in CloudML is a job. A job consists of a training script and related files (e.g. other scripts, data files, etc. within the working directory). To submit a job to CloudML you use the cloudml_train() function, passing it the name of the training script to run. For example:
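A minimal sketch of such a submission, assuming the training script is named train.R (a hypothetical file name) and sits in the current working directory. The call is wrapped in a small helper, submit_job(), which is an illustrative name and not part of the cloudml package, so that nothing is uploaded until you call it:

```r
# Hypothetical helper: submit a training script to CloudML.
# Requires the cloudml package and a configured Google Cloud account.
submit_job <- function(script = "train.R") {
  # Uploads the current working directory and runs `script` remotely.
  cloudml::cloudml_train(script)
}
```

Calling submit_job() from the project root would then start the job.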

Note that the very first time you submit a job to CloudML, the various packages required to run your script will be compiled from source. This will make the execution time of the job considerably longer than you might expect. Only the first job incurs this overhead, though (since the package installations are cached), and subsequent jobs will run more quickly.

The cloudml_train() function returns a job object. This is a reference to the training job which you can use later to check its status, collect its output, etc. For example:
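The sketch below keeps the returned reference and uses it to query the job. The wrapper name submit_and_check() and the script name train.R are assumptions made for illustration; job_status() is the cloudml function for checking a job:

```r
# Hypothetical helper: submit a script, report its status, return the job.
submit_and_check <- function(script = "train.R") {
  job <- cloudml::cloudml_train(script)  # keep the returned job reference
  print(cloudml::job_status(job))        # ask CloudML for the job's current state
  invisible(job)                         # hand the reference back for later use
}
```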

Collecting Job Results

If you are using RStudio v1.1 or higher, you’ll be given the option to monitor and collect submitted jobs in the background using an RStudio terminal.

In this case you don’t need to call job_collect() explicitly as this will be done from within the background terminal after the job completes.
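When the background terminal is not used, collection is a manual step. A minimal sketch, assuming job is the object returned earlier by cloudml_train(); the wrapper name collect_results() is hypothetical:

```r
# Hypothetical helper: download the results of a finished job.
collect_results <- function(job) {
  # Waits for the job if needed, then downloads its output locally.
  cloudml::job_collect(job)
}
```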

Once the job is complete, its results will be downloaded and a report will be displayed automatically.

Training Runs

Each training job will produce one or more training runs (typically only a single run; however, when doing hyperparameter tuning there will be multiple runs). When you collect a job from CloudML, it is automatically downloaded into the runs sub-directory of the current working directory.

You can list all of the runs as a data frame using the ls_runs() function:
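A short sketch of listing runs, assuming the tfruns package (which provides ls_runs()) is installed and that collected results live in the default runs sub-directory:

```r
library(tfruns)

# Collected CloudML results are stored under ./runs by default.
if (dir.exists("runs")) {
  runs <- ls_runs(runs_dir = "runs")  # one row per training run
  print(runs)
}
```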

Tuning Your Application

Tuning your application typically requires choosing and then optimizing a set of hyperparameters that influence your model’s performance. This could include the number and type of layers, units within layers, drop rates, regularization, etc.

You can experiment with hyperparameters on an ad-hoc basis, but in general it’s better to explore them more systematically. The key to doing this with CloudML is defining training flags within your script and then parameterizing runs using those flags.
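As a rough illustration, training flags can be declared with the flags() function from the tfruns package. The flag names and default values below (dropout, dense_units) are hypothetical and chosen only for this sketch:

```r
library(tfruns)

# Declare tunable hyperparameters together with their default values.
FLAGS <- flags(
  flag_numeric("dropout", 0.4),      # dropout rate for a hidden layer
  flag_integer("dense_units", 128)   # number of units in a dense layer
)

# The rest of the training script reads FLAGS$dropout and
# FLAGS$dense_units instead of hard-coded constants.
```

Locally, a run can then override a default with tfruns::training_run("train.R", flags = list(dropout = 0.2)). The cloudml_train() function is understood to accept a similar flags argument, but check the package reference before relying on it.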

Training with a GPU

By default, CloudML utilizes “standard” CPU-based instances suitable for training simple models with small to moderate datasets. You can request the use of other machine types, including ones with GPUs, using the master_type parameter of cloudml_train().

For example, the following would train the same model as above but with a Tesla K80 GPU:
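A minimal sketch of such a request, assuming the script is named train.R (hypothetical) and using the "standard_gpu" machine type, which CloudML documents as a single Tesla K80 instance. The wrapper name submit_gpu_job() is illustrative only:

```r
# Hypothetical helper: submit a training script to a single-GPU machine.
submit_gpu_job <- function(script = "train.R") {
  # master_type selects the CloudML machine type for the master worker.
  cloudml::cloudml_train(script, master_type = "standard_gpu")
}
```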

See the CloudML website for documentation on available machine types. Also note that GPU instances can be considerably more expensive than CPU ones! See the documentation on CloudML Pricing for details.

Training Configuration

You can provide custom configuration for training by creating a cloudml.yml file within the working directory from which you submit your training job. This file can be used to customize various aspects of training behavior including the virtual machines used as well as the runtime version of CloudML used in the job.

For example, the following config file specifies a custom scale tier with a master type of “large_model”. It also specifies that the CloudML runtime version should be 1.2.
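A sketch of what such a cloudml.yml might contain. The field names follow the trainingInput section of the CloudML job specification (scaleTier, masterType, runtimeVersion); treat the exact layout as an assumption and check it against the package's configuration reference:

```yaml
# cloudml.yml (placed in the working directory used for submission)
trainingInput:
  scaleTier: CUSTOM         # required when choosing a master type explicitly
  masterType: large_model   # machine type for the master worker
  runtimeVersion: 1.2       # CloudML runtime version for the job
```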

Learning More

The following articles provide additional documentation on training and deploying models with CloudML:

Hyperparameter Tuning explores how you can improve the performance of your models by running many trials with distinct hyperparameters (e.g. number and size of layers) to determine their optimal values.

Google Cloud Storage provides information on copying data between your local machine and Google Storage and also describes how to use data within Google Storage during training.

Deploying Models describes how to deploy trained models and generate predictions from them.