Using GPUs for training models in the cloud

Graphics Processing Units (GPUs) can significantly accelerate the training
process for many deep learning models. Training models for tasks like image
classification, video analysis, and natural language processing involves
compute-intensive matrix multiplication and other operations that can take
advantage of a GPU's massively parallel architecture.

Training a deep learning model that involves intensive compute tasks on
extremely large datasets can take days to run on a single processor. However, if
you design your program to offload those tasks to one or more GPUs, you can
reduce training time to hours instead of days.

Some models don't benefit from running on GPUs. We recommend GPUs for large,
complex models that have many mathematical operations. Even then, you should
test the benefit of GPU support by running a small sample of your data through
training.

Requesting GPU-enabled machines

To use GPUs in the cloud, configure your training job to access GPU-enabled
machines in one of three ways:

Use the BASIC_GPU scale tier.

Use GPU-enabled Cloud ML Engine machine types.

Use Compute Engine machine types and attach GPUs.

Basic GPU-enabled machine

If you are learning how to use Cloud ML Engine or
experimenting with GPU-enabled machines, you can set the scale tier to
BASIC_GPU to get a single worker instance with a single NVIDIA Tesla K80 GPU.
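
As a minimal sketch, a config.yaml file that requests this tier needs only
the scale tier setting; you pass the file to gcloud with the --config flag,
as shown in the examples later in this document:

    trainingInput:
      scaleTier: BASIC_GPU

You can also set the tier directly on the command line with
--scale-tier BASIC_GPU instead of using a configuration file.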

Compute Engine machine types with GPU attachments

Beta

This is a Beta release of
Compute Engine machine types and GPU attachments for training. This feature
is not covered by any SLA or deprecation policy and might be subject to backward-incompatible
changes.

Alternatively, if you configure your training job with Compute Engine
machine types, which
do not include GPUs by default, you can attach a custom number of GPUs to
accelerate your job:

Add an acceleratorConfig field, specifying the type and number of GPUs you
want, to masterConfig, workerConfig, or parameterServerConfig, depending on
which virtual machines you would like to accelerate. You can use the following
GPU types:

NVIDIA_TESLA_K80

NVIDIA_TESLA_P4 (Beta)

NVIDIA_TESLA_P100

NVIDIA_TESLA_V100

To create a valid acceleratorConfig, you must account for several restrictions:

You can only use certain numbers of GPUs in your configuration. For example,
you can attach 2 or 4 NVIDIA Tesla K80s, but not 3. For the counts that are
valid for each type of GPU, see the compatibility table below.

You must make sure each of your GPU configurations provides sufficient
virtual CPUs and memory to the machine type you attach it to. For example, if
you use n1-standard-32 for your workers, then each worker has 32 virtual
CPUs and 120 GB of memory. Since each NVIDIA Tesla V100 can provide up to 8
virtual CPUs and 52 GB of memory, you must attach at least 4 to each
n1-standard-32 worker to support its requirements: two GPUs would cover only
16 virtual CPUs and 104 GB of memory, and attaching three is not a valid
count.

The following table provides a quick reference of how many of each type of
accelerator you can attach to each Compute Engine machine type:

Valid numbers of GPUs for each machine type

Machine type    NVIDIA Tesla K80  NVIDIA Tesla P4 (Beta)  NVIDIA Tesla P100  NVIDIA Tesla V100
n1-standard-4   1, 2, 4, 8        1, 2, 4                 1, 2, 4            1, 2, 4, 8
n1-standard-8   1, 2, 4, 8        1, 2, 4                 1, 2, 4            1, 2, 4, 8
n1-standard-16  2, 4, 8           1, 2, 4                 1, 2, 4            2, 4, 8
n1-standard-32  4, 8              2, 4                    2, 4               4, 8
n1-standard-64  -                 4                       -                  8
n1-standard-96  -                 4                       -                  8
n1-highmem-2    1, 2, 4, 8        1, 2, 4                 1, 2, 4            1, 2, 4, 8
n1-highmem-4    1, 2, 4, 8        1, 2, 4                 1, 2, 4            1, 2, 4, 8
n1-highmem-8    1, 2, 4, 8        1, 2, 4                 1, 2, 4            1, 2, 4, 8
n1-highmem-16   2, 4, 8           1, 2, 4                 1, 2, 4            2, 4, 8
n1-highmem-32   4, 8              2, 4                    2, 4               4, 8
n1-highmem-64   -                 4                       -                  8
n1-highmem-96   -                 4                       -                  8
n1-highcpu-16   2, 4, 8           1, 2, 4                 1, 2, 4            2, 4, 8
n1-highcpu-32   4, 8              2, 4                    2, 4               4, 8
n1-highcpu-64   8                 4                       4                  8
n1-highcpu-96   -                 4                       -                  8

A dash indicates that you cannot attach that type of GPU to that machine type.

For an example of submitting a job using Compute Engine machine types with
GPUs attached, see the configuration and command examples following the next
section.

Regions that support GPUs

You must run your job in a region that supports GPUs. The following regions
currently provide access to GPUs:

us-east1

us-central1

us-west1

asia-east1

europe-west1

europe-west4

In addition, some of these regions only provide access to certain types of GPUs.
To fully understand the available regions for Cloud ML Engine services,
including model training and online/batch prediction, read the guide to
regions.

If your training job uses multiple types of GPUs, they must all be available in a single zone in
your region. For example, you cannot run a job in us-central1 with a master worker
using NVIDIA Tesla V100 GPUs, parameter servers using NVIDIA Tesla K80 GPUs, and workers using
NVIDIA Tesla P100 GPUs. While all of these GPUs are available for training jobs in
us-central1, no single zone in that region provides all three types of GPU. To
learn more about the zone availability of GPUs, see the
comparison of GPUs for compute workloads.

The following example shows a configuration file (config.yaml) for a job that
uses Compute Engine machine types with GPUs attached. This type of
configuration is in beta.
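
The file below is a minimal sketch rather than a tested configuration: it
requests a master and nine workers, each an n1-highcpu-16 machine with four
NVIDIA Tesla K80 GPUs attached, plus three parameter servers with no GPUs.
The machine types, GPU counts, and parameter server type are illustrative
choices, not requirements:

    trainingInput:
      scaleTier: CUSTOM
      # Master: one n1-highcpu-16 machine with 4 K80 GPUs attached.
      masterType: n1-highcpu-16
      masterConfig:
        acceleratorConfig:
          count: 4
          type: NVIDIA_TESLA_K80
      # Workers: 9 n1-highcpu-16 machines, each with 4 K80 GPUs attached.
      workerCount: 9
      workerType: n1-highcpu-16
      workerConfig:
        acceleratorConfig:
          count: 4
          type: NVIDIA_TESLA_K80
      # Parameter servers: 3 n1-highmem-8 machines with no GPUs.
      parameterServerCount: 3
      parameterServerType: n1-highmem-8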

Use the gcloud command to submit the job, including a --config
argument pointing to your config.yaml file. The following example assumes
you've set up environment variables, indicated by a $ sign followed by
capital letters, for the values of some arguments:
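
In this sketch, JOB_NAME, APP_PACKAGE_PATH, MAIN_APP_MODULE, and JOB_DIR are
placeholder variables you define yourself, and the region is an illustrative
choice:

    gcloud ml-engine jobs submit training $JOB_NAME \
        --package-path $APP_PACKAGE_PATH \
        --module-name $MAIN_APP_MODULE \
        --job-dir $JOB_DIR \
        --region us-central1 \
        --config config.yaml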

Alternatively, if you install the gcloud beta component, you may specify
cluster configuration details with command-line flags, rather than in a
configuration file. Learn more about how to use these
flags. To install or
update the gcloud beta component, run gcloud components install beta.

The following example shows how to submit a job with the same configuration as
the previous example (using Compute Engine machine types with GPUs
attached), but it does so without using a config.yaml file:
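
The command below is a sketch of that approach; the flag spellings follow the
gcloud beta reference, but verify them against
gcloud beta ml-engine jobs submit training --help for your SDK version:

    gcloud beta ml-engine jobs submit training $JOB_NAME \
        --package-path $APP_PACKAGE_PATH \
        --module-name $MAIN_APP_MODULE \
        --job-dir $JOB_DIR \
        --region us-central1 \
        --scale-tier CUSTOM \
        --master-machine-type n1-highcpu-16 \
        --master-accelerator count=4,type=nvidia-tesla-k80 \
        --worker-count 9 \
        --worker-machine-type n1-highcpu-16 \
        --worker-accelerator count=4,type=nvidia-tesla-k80 \
        --parameter-server-count 3 \
        --parameter-server-machine-type n1-highmem-8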

Assigning ops to GPUs

To make use of the GPUs on a machine, make the appropriate changes to your
TensorFlow training application:

High-level Estimator API: No code changes are necessary as long as your
ClusterSpec
is configured properly. If a cluster is a mixture of CPUs and GPUs, map the
ps job name to the CPUs and the worker job name to the GPUs.

Core TensorFlow: If you build your model with the core TensorFlow API, you
must assign ops to run on GPUs yourself, using the device strings described
below.

When you assign a GPU-enabled machine to a Cloud ML Engine process,
that process has exclusive access to that machine's GPUs; you can't share the
GPUs of a single machine in your cluster among multiple processes. The process
corresponds to the distributed TensorFlow task in your cluster specification.
The
distributed TensorFlow documentation
describes cluster specifications and tasks.

GPU device strings

A standard_gpu machine's single GPU is identified as "/gpu:0".
Machines with multiple GPUs use identifiers starting with "/gpu:0", then
"/gpu:1", and so on. For example, complex_model_m_gpu machines have four
GPUs identified as "/gpu:0" through "/gpu:3".
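
As a minimal sketch of manual assignment in core TensorFlow (1.x style,
matching the APIs this guide references), you can pin ops to one of these
device strings with tf.device; setting allow_soft_placement lets TensorFlow
fall back to the CPU for any op that has no GPU kernel:

    import tensorflow as tf

    # Pin this matrix multiplication to the machine's first GPU.
    with tf.device("/gpu:0"):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
        product = tf.matmul(a, b)

    # Fall back to the CPU if an op has no GPU kernel.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(product))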

Python packages on GPU-enabled machines

GPU-enabled machines come pre-installed with tensorflow-gpu, the TensorFlow
Python package with GPU support. For a list of all pre-installed packages, see
the runtime version list.

Maintenance events

If you use GPUs in your training jobs, be aware that the underlying virtual
machines are occasionally subject to Compute Engine host
maintenance.
The GPU-enabled virtual machines used in your training jobs are configured to
restart automatically after such maintenance events, but you may have to do
some extra work to ensure that your job is resilient to these shutdowns.
Configure your training application to regularly save model checkpoints
(usually along the Cloud Storage path you specify through the --job-dir
argument to gcloud ml-engine jobs submit training) and to restore the most
recent checkpoint if one exists.

The TensorFlow Estimator API
implements this functionality for you, so if your model is already wrapped in an
Estimator, you do not have to worry about maintenance events on your GPU
workers.
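
For instance, here is a sketch of an Estimator configured to checkpoint
regularly; the bucket path and checkpoint interval are illustrative, and
DNNClassifier stands in for any canned or custom Estimator:

    import tensorflow as tf

    # Point model_dir at your job directory and pick a checkpoint interval
    # that suits your training speed. Both values here are placeholders.
    run_config = tf.estimator.RunConfig(
        model_dir="gs://your-bucket/your-job-dir",
        save_checkpoints_steps=500,
    )

    # An Estimator built with this config saves checkpoints to model_dir and
    # resumes from the latest one automatically after a restart.
    estimator = tf.estimator.DNNClassifier(
        feature_columns=[tf.feature_column.numeric_column("x", shape=[4])],
        hidden_units=[64, 32],
        n_classes=3,
        config=run_config,
    )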

If it is not feasible for you to wrap your model in a TensorFlow Estimator and
you want your GPU-enabled training jobs to be resilient to maintenance events,
you must write the checkpoint saving and restoration functionality into
your model manually. TensorFlow provides useful resources for such an
implementation in the tf.train module, specifically tf.train.checkpoint_exists
and tf.train.latest_checkpoint.
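
The following is a minimal sketch of that pattern in TensorFlow 1.x, with a
trivial counter standing in for a real training op and a hypothetical local
checkpoint directory; in a real job this would be your --job-dir path:

    import os
    import tensorflow as tf

    # Hypothetical checkpoint location; in a real job, use the --job-dir path
    # you pass to gcloud ml-engine jobs submit training.
    checkpoint_dir = "/tmp/checkpoints"
    os.makedirs(checkpoint_dir, exist_ok=True)

    global_step = tf.train.get_or_create_global_step()
    train_op = tf.assign_add(global_step, 1)  # stand-in for a real training op
    saver = tf.train.Saver()

    with tf.Session() as sess:
        # Restore the most recent checkpoint if one exists; otherwise start
        # from freshly initialized variables.
        latest = tf.train.latest_checkpoint(checkpoint_dir)
        if latest is not None and tf.train.checkpoint_exists(latest):
            saver.restore(sess, latest)
        else:
            sess.run(tf.global_variables_initializer())

        while sess.run(global_step) < 1000:
            step = sess.run(train_op)
            if step % 100 == 0:
                # Save regularly so a maintenance restart loses little work.
                saver.save(sess, os.path.join(checkpoint_dir, "model.ckpt"),
                           global_step=step)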