How AI Training Scales

We've discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training on a wide range of tasks. Since complex tasks tend to have noisier gradients, increasingly large batch sizes are likely to become useful in the future, removing one potential limit to further growth of AI systems. More broadly, these results show that neural network training need not be considered a mysterious art, but can be rigorized and systematized.

In the last few years AI researchers have had increasing success in speeding up neural network training through data-parallelism, which splits large batches of data across many machines. Researchers have successfully used batch sizes of tens of thousands for image classification and language modeling, and even millions for RL agents that play the game Dota 2. These large batches allow increasing amounts of compute to be efficiently poured into the training of a single model, and are an important enabler of the fast growth in AI training compute. However, batch sizes that are too large show rapidly diminishing algorithmic returns, and it's not clear why these limits are larger for some tasks and smaller for others [1].

The gradient noise scale (appropriately averaged over training) explains the vast majority (r2 = 80%) of the variation in critical batch size over a range of tasks spanning six orders of magnitude. Batch sizes are measured in either number of images, tokens (for language models), or observations (for games).

We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients[2], we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data.

This type of statistic is widely used for samplesizeselection and has been proposed for use in deeplearning, but has not been measured or applied systematically for modern training runs. We verified this prediction for a wide range of machine learning tasks shown in the figure above, including image recognition, language modeling, Atari games, and Dota. Specifically, we did training runs at a wide range of batch sizes (tuning the learning rate separately for each) for all of these tasks and compared the speedups in training to what the noise scale predicts should happen. Since large batch sizes often require careful and expensive tuning or special learning rate schedules to be effective, knowing an upper limit ahead of time provides a significant practical advantage in training new models.

We've found it helpful to visualize the results of these experiments in terms of a tradeoff between wall time for training and total bulk compute that we use to do the training (proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn't lead to faster training. There is a "bend" in the curve in the middle, and the gradient noise scale predicts where that bend occurs.

Increasing parallelism makes it possible to train more complex models in a reasonable amount of time. We find that a Pareto frontier chart is the most intuitive way to visualize comparisons between algorithms and scales.

We make these curves by setting a level of performance (say a score of 1000 on the Atari game of Beam Rider) and seeing how long it takes to train to that performance at various batch sizes. The results match our model's predictions relatively closely, across many different values of the performance target.

The tradeoff between experience and training time needed to achieve a given score is predictable.

Patterns in the Gradient Noise Scale

We've observed several patterns in the gradient noise scale which offer clues as to what the future of AI training may hold.

First, in our experiments, the noise scale typically increases by an order of magnitude or more over the course of training. Intuitively, this means the network learns the more "obvious" features of the task early in training and learns more intricate features later. For example, in the case of an image classifier, the network might first learn to identify small-scale features such as edges or textures that are present in most images, while only later putting these pieces together into more general concepts such as cats and dogs. To see the full variety of edges or textures, the network only needs to see a small number of images, so the noise scale is smaller; once the network knows more about larger objects, it can process many more images at once without seeing duplicative data.

We see some preliminary indications that the same effect holds across different models on the same dataset – more powerful models have a higher gradient noise scale, but only because they achieve a lower loss. Thus, there's some evidence that the increasing noise scale over training isn't just an artifact of convergence, but occurs because the model gets better. If this is true, then we expect future, more powerful models to have higher noise scale and therefore be more parallelizable.

Second, tasks that are subjectively more difficult are also more amenable to parallelization. In the context of supervised learning, there is a clear progression from MNIST, to SVHN, to ImageNet. In the context of reinforcement learning, there is a clear progression from Atari Pong to Dota 1v1 to Dota 5v5, with the optimal batch sizes differing by a factor of more than 10,000. Thus, as AI advances to new and more difficult tasks, we expect models to tolerate higher batch size.

Implications

The degree of data parallelism significantly affects the speed at which AI capabilities can progress. Faster training makes more powerful models possible and accelerates research through faster iteration times.

In an earlier study, AI and Compute, we observed that the compute being used to train the largest ML models is doubling every 3.5 months, and we noted that this trend is driven by a combination of economics (willingness to spend money on compute) and the algorithmic ability to parallelize training. The latter factor (algorithmic parallelizability) is harder to predict and its limits are not well-understood, but our current results represent a step toward systematizing and quantifying it. In particular, we have evidence that more difficult tasks and more powerful models on the same task will allow for more radical data-parallelism than we have seen to date, providing a key driver for the continued fast exponential growth in training compute. (And this is without even considering recent advances in model-parallelism, which may allow for even further parallelization on top of data-parallelism).

The continued growth of training compute, and its apparently predictable algorithmic basis, further highlights the possibility of rapid increases in AI capabilities over the next few years, and emphasizes the urgency of research into making sure such systems are safe and that they are used responsibly. A central challenge of AI policy will be to work out how to use measures like this to make predictions about the characteristics of future AI systems, and use this knowledge to conceive of policies that let society maximize the upsides and minimize the downsides of these technologies.

OpenAI is committed both to continuing to conduct rigorous analyses to give us foresight on what the future of AI holds, and to acting to address the issues that these analyses raise.

If you want to study the 'Science of AI' and help us make neural network training more data-driven, consider applying to work at OpenAI.

[1] A complementary study done in parallel with this one performed meticulous experimental tests of large batch training as well as an extensive review of the previous literature, clearing up a number of inconsistencies in earlier work. They found a significant variation in the potential for parallelism between tasks, and our work appears to explain a major portion of this variance. Their work also suggests that large batch training does not affect generalization. We believe that systematic surveys such as this one and others are incredibly valuable to the field, and we are committed to continued work on the 'Science of AI.'

[2] When training neural networks, we typically process only a small batch of data at a time, which gives a noisy estimate of the true network gradient. We find that the gradient noise scaleBnoise=E[|G-Gtrue|2] / |Gtrue|2, where the expectation is taken over individual data points, estimates the maximum useful batch size. When the gradient is computed from a batch of size B, the normalized distance between the estimated gradient and the true gradient is given by E[|GB-Gtrue|2/|Gtrue|2] = Bnoise/B. The point at which increasing B stops reducing the noisiness of the gradient significantly occurs around B = Bnoise, and this is also the point at which gains in training speed taper off.

Thanks to Ben Barry, Jack Clark, Ashley Pilipiszyn, and Justin Wang for their work on this post.

We also thank the following for feedback on drafts of this post: Greg Brockman, Paul Christiano, Danny Hernandez, Joel Hestness, Heewoo Jun, Jaehoon Lee, Aleksander Madry, Chris Olah, and John Schulman.