Measuring the Effects of Data Parallelism on Neural Network Training

Recent hardware developments have made unprecedented amounts of data
parallelism available for accelerating neural network training. Among the
simplest ways to harness next-generation accelerators is to increase the batch
size in standard mini-batch neural network training algorithms. In this work,
we aim to experimentally characterize the effects of increasing the batch size
on training time, as measured in the number of steps necessary to reach a goal
out-of-sample error. Eventually, increasing the batch size will no longer
reduce the number of training steps required, but the exact relationship
between the batch size and how many training steps are necessary is of critical
importance to practitioners, researchers, and hardware designers alike. We
study how this relationship varies with the training algorithm, model, and
dataset and find extremely large variation between workloads. Along the way, we
reconcile disagreements in the literature on whether batch size affects model
quality. Finally, we discuss the implications of our results for efforts to
train neural networks much faster in the future.
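The measurement the abstract describes can be illustrated with a toy experiment. This is a hedged sketch, not the paper's actual code: it trains a small logistic-regression model with plain mini-batch SGD on synthetic data and counts how many steps each batch size needs to reach a goal classification error. All names and hyperparameters here are illustrative.

```python
import numpy as np

def steps_to_target(batch_size, target_error=0.1, lr=0.5, max_steps=10000, seed=0):
    """Count SGD steps until classification error drops to target_error."""
    rng = np.random.default_rng(seed)
    # Synthetic linearly separable data: 1000 examples, 2 features.
    X = rng.normal(size=(1000, 2))
    y = (X @ np.array([2.0, -1.0]) > 0).astype(float)
    w = np.zeros(2)
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, len(X), size=batch_size)
        xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))        # sigmoid predictions
        w -= lr * xb.T @ (p - yb) / batch_size      # mini-batch SGD step on log loss
        # Stand-in for out-of-sample error: error on the full dataset.
        if np.mean((X @ w > 0).astype(float) != y) <= target_error:
            return step
    return max_steps

for bs in [1, 8, 64, 512]:
    print(f"batch size {bs:4d}: {steps_to_target(bs)} steps to target error")
```

Plotting steps-to-target against batch size on such a sweep is what reveals the regimes the paper characterizes: initially a larger batch reduces the step count, but beyond some batch size the curve flattens and additional data parallelism stops helping.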

Underfox3:

Google researchers have studied the effects of data parallelism by increasing the batch size during neural network training.
https://t.co/jShuujoFAu

thuereyGroup:
Our work on video super-resolution with GANs is online now as a preview! The main trick is a special discriminator CNN that learns to supervise in terms of detail as well as temporal coherence. https://t.co/GaTsaurABM , and the paper https://t.co/C4TBe1PYPF https://t.co/rXyImR2RUw

RogerGrosse:
Important paper from Google on large batch optimization. They do impressively careful experiments measuring # iterations needed to achieve target validation error at various batch sizes. The main "surprise" is the lack of surprises. [thread]
https://t.co/7QIx5CFdfJ https://t.co/rYWst5SOmY

dcpage3:
Lots of evidence + a few puzzles for our theory of distinct NN training regimes dominated by catastrophic forgetting/curvature in the huge new study from @GoogleAI https://t.co/VUbqq81r1u.
Background here:
https://t.co/x8w3tgL07e

NirantK:
If you are into making training and inference times faster (e.g. for edge devices or mobile), consider taking a look at this paper as well
https://t.co/dOGhsgPWcT
They mostly explore batch size and the effects of such hyperparams

SalehCU:
@mraginsky @SebastienBubeck @prfsanjeevarora Interesting work by colleagues at Google looking into SGD, SGD
with momentum, and Nesterov momentum as optimizers in
a data-parallel training framework: https://t.co/cLmNpNNzUY
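For reference, the three optimizer families mentioned above can be sketched as follows. This is a hedged illustration of the standard update rules (not code from the paper); the function names and the toy quadratic objective are my own.

```python
def sgd(w, grad, lr):
    """Plain SGD: step against the gradient."""
    return w - lr * grad(w)

def momentum(w, v, grad, lr, mu=0.9):
    """Heavy-ball momentum: accumulate a velocity, then step."""
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov(w, v, grad, lr, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

# Toy usage: minimize f(w) = w**2, whose gradient is 2*w.
grad = lambda w: 2.0 * w
w_s = 1.0
w_m, v_m = 1.0, 0.0
w_n, v_n = 1.0, 0.0
for _ in range(100):
    w_s = sgd(w_s, grad, lr=0.1)
    w_m, v_m = momentum(w_m, v_m, grad, lr=0.1)
    w_n, v_n = nesterov(w_n, v_n, grad, lr=0.1)
```

All three drive the toy objective toward its minimum at zero; the paper's finding is that, across workloads, none of these first-order methods escapes the eventual flattening of the batch-size/steps trade-off.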
@roydanroy

jaschasd:
Everything you wanted to know about the role of batch size in neural net training, but didn't have the computational resources to ask!
With Chris Shallue, Jaehoon Lee, Joe @joe_antognini, Roy Frostig, and George Dahl.
https://t.co/TWKtsaJiRu

zacharynado:
Awesome large scale, large batch paper! Turns out while you can scale the batch size pretty high, first order optimizers don't converge quicker after a certain batch size (across datasets and model sizes!) https://t.co/L5GFfvTMCq https://t.co/G1jtqD9fY4