NVIDIA: Accelerating Deep Learning with Uber’s Horovod

NVIDIA, inventor of the GPU, creates solutions for building and training AI-enabled systems. In addition to providing hardware and software for much of the industry’s AI research, NVIDIA is building an AI computing platform for developers of self-driving vehicles. With over 370 automotive companies among their self-driving partners, NVIDIA has established itself as the industry leader in creating systems for sensing, perceiving, mapping, and driving this next generation of driverless transportation.

These AI perception models need to be trained under intense conditions, across eight Volta GPUs inside DGX-1 servers, so that the vehicles using them can reliably assess and safely react to the world around them. To determine the performance capacities of their GPUs, Tim Zaman, deep learning software engineer at NVIDIA, and his team leverage machine learning software that enables each new generation of GPU to work faster and more efficiently, both individually and as part of a distributed system.

“Researchers just want something that works, is fast, has a straightforward API, and is simple to use,” said Tim. “At the end of the day, they don’t want to have to worry about the software so that they can just focus on their research.”

Horovod, Uber’s open source distributed deep learning system, was a clear choice for NVIDIA. With only a few lines of code, Horovod allowed them to scale from one to eight GPUs, optimizing model training for their self-driving sensing and perception technologies, leading to faster, safer systems.

Large-scale GPU training

NVIDIA assessed a variety of options when it came to selecting a framework that could meet these needs. At first, they could only train non-parallel workloads on a single device, making distributed training for autonomous technologies extremely difficult.

To ensure that their GPUs are battle-tested for high-performance training and can adapt to the ever-evolving nature of deep learning, NVIDIA needed an API that was easy to use, quick to iterate on, and able to distribute entire workloads. Horovod presented the ultimate solution.

In fact, to develop Horovod, the Uber team leveraged some of NVIDIA’s hardware and open source software, including NCCL, an open source low-level library used to communicate between GPUs. Horovod’s seamless implementation on NVIDIA’s GPUs was another testament to the natural partnership between the two AI-focused companies.

“Working with the NCCL team and the rest of NVIDIA was a true pleasure,” said Alex Sergeev, Horovod project lead. “We launched this collaboration over a year ago when NCCL 2 was entering early access phase and, through this collaboration, we were able to quickly build a solution that improved on both usability and performance aspects of distributed deep learning. Anytime we had an issue or suggestion, the NCCL team was there to make the product better for end users.”

Enter Horovod

According to Tim, Horovod far outperformed any other high-level library his team had previously tried. For him, usability and speed were Horovod’s key differentiating factors.

“It’s actually quite remarkable that Horovod scored so well on these two metrics because usually you make a tradeoff where it’s very usable but a lot slower, or vice versa,” said Tim. “Horovod brings together a lot of pieces into one package that was easy to use and generated great performance for our team.”

As part of their research, Tim’s team optimized their GPUs in 2017 to work with TensorFlow, a popular and widely used deep learning framework that supports distributed training, and the set-up has been stable ever since. However, a frequent complaint from their users was that TensorFlow code, when parallelized, is prone to user error and hard to reason about. Horovod filled a big gap in this process by making TensorFlow easy to work with, particularly when it came to distributed training. According to Tim, Horovod’s ease of use and simplicity drove changes made by the TensorFlow team themselves to ensure more user-friendly multi-device distribution.

Using Horovod

As NVIDIA continues training on their GPUs, Horovod becomes ever more important to the robust development of its autonomous solutions. NVIDIA leverages Horovod for training perception models processed by its DGX Systems. Building such systems demands an infrastructure capable of training on thousands of hours of data and millions of images via deep learning and AI.

At NVIDIA, Horovod training jobs are run on their DGX SATURNV cluster. There, the jobs run in Docker containers built from pre-made Docker images (hosted on NGC) that include deep learning frameworks configured to be highly optimized. To train their self-driving systems, they use TensorFlow images that come with Horovod pre-installed alongside CUDA, cuDNN, and NCCL. With Horovod, researchers see a scaling factor of more than seven on an eight-GPU system, with hundreds of multi-GPU jobs launched per day per perception model (e.g., lane detector, road signs). They automate the process of launching jobs and finding optimized parameters using MagLev, NVIDIA’s AI training and inference infrastructure.
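To put the scaling figure in context, a speedup of more than seven on eight GPUs corresponds to a scaling efficiency above 87.5 percent. The short Python sketch below shows that arithmetic. The function name is illustrative, not part of Horovod or any NVIDIA tool, and the input is the threshold quoted above rather than a measured result.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# Threshold quoted in the article: a 7x speedup on an 8-GPU system.
efficiency = scaling_efficiency(7.0, 8)
print(f"{efficiency:.1%}")  # prints 87.5%
```

Any speedup above seven on the same eight-GPU system yields an efficiency above this value.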

Specifically, Horovod exposes a few low- and high-level primitives that are easy for most deep learning practitioners to use. One low-level example, Tim notes, is allreduce, which takes a tensor (a value held by each of the distributed tasks that are running) and returns its reduction across those tasks (by default, the mean). Horovod allows users to obtain the value averaged across all nodes using one line of code. A high-level example is the optimizer object, which takes care of the training in TensorFlow; Horovod offers a one-line optimizer wrapper that enables developers to train across distributed nodes, affording greater speed and resource optimization.
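The two primitives described above can be illustrated with a minimal single-process simulation in plain Python. This is a sketch of the semantics only, not Horovod's implementation: the function names and the sample gradients are hypothetical, and real workers would exchange data over NCCL or MPI rather than share a list. In Horovod's TensorFlow API, the corresponding calls are `hvd.allreduce` and `hvd.DistributedOptimizer`.

```python
from typing import List


def allreduce_mean(per_worker: List[List[float]]) -> List[List[float]]:
    """Simulate an averaging allreduce: every worker ends up holding
    the element-wise mean of all workers' tensors."""
    n_workers = len(per_worker)
    length = len(per_worker[0])
    mean = [sum(w[i] for w in per_worker) / n_workers for i in range(length)]
    return [list(mean) for _ in range(n_workers)]


def distributed_sgd_step(weights: List[float],
                         per_worker_grads: List[List[float]],
                         lr: float) -> List[float]:
    """Simulate what a distributed optimizer does on each worker:
    average the gradients across workers, then apply a plain SGD update."""
    avg_grad = allreduce_mean(per_worker_grads)[0]
    return [w - lr * g for w, g in zip(weights, avg_grad)]


# Hypothetical gradients computed by four workers on different data shards.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

reduced = allreduce_mean(grads)
print(reduced[0])  # prints [4.0, 5.0]

new_weights = distributed_sgd_step([10.0, 10.0], grads, lr=0.5)
print(new_weights)  # prints [8.0, 7.5]
```

Because every worker applies the same averaged gradient, all model replicas stay in sync after each step, which is why wrapping the optimizer is enough to distribute training.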

Scaling AI with NVIDIA and Horovod

Once Horovod was implemented in their self-driving perception specs, NVIDIA was able to iterate quickly, receiving nearly immediate assistance from Uber’s Horovod team when issues arose. Over time, support for Keras and PyTorch was added to Horovod, offering even more expansive opportunities for NVIDIA’s deep learning training.

“It’s very important that an open source project is maintained, and with Horovod, we have no doubt that our questions will be answered as soon as possible,” Tim said. “We really enjoy working with this team and seeing where Horovod can take us.”

As NVIDIA continues to develop self-driving systems for production deployment, the team looks forward to leveraging Horovod to build GPU and software technologies that power safer, smarter autonomous vehicles.