TensorFlow

TensorFlow Distributed Training: Introduction and Tutorials

Large-scale deep learning models take a long time to run and can benefit from distributing the work across multiple resources. TensorFlow can help you distribute training across multiple CPUs or GPUs.

We’ll explain how TensorFlow distributed training works and show brief tutorials to get you oriented. We’ll also show how to take distributed resources to the next level, scaling up experiments on hundreds of machines on and off the cloud with the MissingLink deep learning platform.

Introduction to TensorFlow Distributed Training

To run large deep learning models, or a large number of experiments, you will need to distribute them across multiple CPUs, GPUs or machines. TensorFlow can help you distribute your training by splitting models over many devices and carrying out the training in parallel on the devices.

Types of Parallelism in Distributed Deep Learning

There are two main approaches you can use to distribute deep learning model training: model parallelism and data parallelism. In some cases a single approach yields better performance; in others, a combination of the two works best.

Model parallelism

In model parallelism, the model is segmented into different parts so you can run it in parallel. You can run each part on the same data in different nodes. This approach may decrease the need for communication between workers, as workers only need to synchronize the shared parameters.

Model parallelism also works well for GPUs in a single server that share a high-speed bus, and for larger models, since per-node hardware constraints are no longer a limitation.
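As a minimal sketch of this idea, the snippet below places two halves of a model on two different devices. It splits the CPU into two logical devices so it runs on any machine; on a real multi-GPU server you would use '/GPU:0' and '/GPU:1' instead (the layer sizes here are arbitrary, purely for illustration):

```python
import tensorflow as tf

# Split the physical CPU into two logical devices so the sketch runs
# anywhere; on a multi-GPU machine use '/GPU:0' and '/GPU:1' instead.
cpus = tf.config.list_physical_devices('CPU')
tf.config.set_logical_device_configuration(
    cpus[0],
    [tf.config.LogicalDeviceConfiguration(),
     tf.config.LogicalDeviceConfiguration()])

x = tf.random.normal([8, 32])

# First part of the model runs on the first device.
with tf.device('/CPU:0'):
    h = tf.keras.layers.Dense(64, activation='relu')(x)

# Second part runs on the second device; only the activations (and any
# shared parameters) cross the device boundary, which is why model
# parallelism can need relatively little inter-worker communication.
with tf.device('/CPU:1'):
    y = tf.keras.layers.Dense(10)(h)
```
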

Data parallelism

In this mode, the training data is divided into multiple subsets. You run each subset through the same replicated model on a different node. Because the prediction errors are computed independently on each machine, you must synchronize the model parameters at the end of each batch computation to ensure that all replicas are training a consistent model.

Each device must therefore communicate all of its parameter changes to the model replicas on all of the other devices.
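The synchronization step can be illustrated with a small NumPy sketch of synchronous data parallelism (the toy one-parameter model and worker count are illustrative, not part of any TensorFlow API): each "worker" computes a gradient on its own data shard, the gradients are averaged (the role an all-reduce plays in a real cluster), and every replica applies the same update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated from y = 3x; a single shared parameter w.
X = rng.normal(size=(64, 1))
y = 3.0 * X[:, 0]
w = 0.0
num_workers = 4

for step in range(200):
    grads = []
    # Each worker computes the MSE gradient on its own data shard.
    for shard_x, shard_y in zip(np.array_split(X, num_workers),
                                np.array_split(y, num_workers)):
        pred = shard_x[:, 0] * w
        grads.append(2.0 * np.mean((pred - shard_y) * shard_x[:, 0]))
    # Synchronization: average the gradients (an "all-reduce") so every
    # replica applies the identical update and stays consistent.
    w -= 0.1 * np.mean(grads)

# w converges toward the true slope, 3.0.
```
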

Distributed TensorFlow using tf.distribute.Strategy

The DistributionStrategy API (tf.distribute.Strategy) is an easy way to distribute training workloads across multiple machines. It is designed to work with users' existing models and code: only minimal changes are needed to enable distributed training.

Supported types of distribution strategies

MirroredStrategy: Supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called a MirroredVariable. MirroredVariables are kept in sync with each other by applying identical updates.

MultiWorkerMirroredStrategy: This is a version of MirroredStrategy for multi-worker training. It implements synchronous distributed training across multiple workers, each of which can have multiple GPUs. It creates copies of all variables in the model on each device across all workers.

ParameterServerStrategy: Supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training. When used to train locally on one machine, variables are not mirrored; instead, they are placed on the CPU and operations are replicated across all local GPUs.
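To see how little code a strategy requires, here is a minimal Keras sketch using MirroredStrategy (the layer sizes and random data are arbitrary). Variables created inside strategy.scope() become MirroredVariables; with no GPUs available, MirroredStrategy falls back to a single CPU replica, so the sketch runs anywhere:

```python
import tensorflow as tf

# With no arguments, MirroredStrategy uses all visible GPUs; on a
# machine without GPUs it falls back to a single CPU replica.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

# Variables created inside strategy.scope() are mirrored across replicas
# and kept in sync by applying identical updates.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='sgd', loss='mse')

# Keras splits each global batch across the replicas automatically.
x = tf.random.normal([32, 8])
y = tf.random.normal([32, 1])
model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```
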

Make the Most of TensorFlow Multi-GPU Machines with MissingLink

When running distributed experiments on multi-GPU machines, it is very important to ensure high utilization. MissingLink can help you get more out of expensive GPU resources by scheduling multiple experiments, tracking TensorFlow hyperparameters and datasets, and showing exactly what works.

Quick Tutorial #1: Distribution Strategy API with Estimator

In this example, we will use the MirroredStrategy with the Estimator class to train and evaluate a simple TensorFlow model. This example is based on the official TensorFlow documentation.
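A condensed sketch along the lines of that documentation example follows; the model_fn, input_fn, and toy data below are illustrative, not the exact code from the official guide. The key step is passing the strategy to the Estimator through RunConfig's train_distribute argument:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # A one-layer model; the Estimator builds this graph on each replica.
    dense = tf.compat.v1.layers.Dense(1)
    logits = dense(features)
    loss = tf.reduce_mean(tf.square(logits - labels))
    if mode == tf.estimator.ModeKeys.TRAIN:
        opt = tf.compat.v1.train.GradientDescentOptimizer(0.1)
        train_op = opt.minimize(
            loss, global_step=tf.compat.v1.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

def input_fn():
    # Toy dataset: the same (x, y) pair repeated and batched.
    return tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)

# Handing the strategy to RunConfig is the only distribution-specific
# change: the Estimator then mirrors variables and splits each batch
# across the available devices.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, steps=10)
```
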

With Horovod's typical setup of one GPU per process, the device can be selected using the process's local rank. In that case, the first process on the server is allocated the first GPU, the second process the second GPU, and so forth.

The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.

5. Add hvd.BroadcastGlobalVariablesHook(0)

This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint. Alternatively, if you’re not using MonitoredTrainingSession, you can simply execute the hvd.broadcast_global_variables op after global variables have been initialized.

6. Save checkpoints only on worker 0

Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.