Sunday, April 1, 2018

Aesthetically Pleasing Learning Rates

In this blog post, we abandon all pretense of theoretical rigor and use pixel values from natural images as learning rate schedules.

The learning rate schedule is an important hyperparameter to choose when training neural nets. Set the learning rate too high, and the loss function may fail to converge. Set the learning rate too low, and the model may take a long time to train, or even worse, overfit.

The optimal learning rate is commensurate with the smoothness of the loss function's gradient (or, in fancy words, the Lipschitz constant of the gradient). The smoother the loss, the larger the learning rate we are allowed to take without the optimization “blowing up”.
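To make that concrete, here is the textbook descent lemma (a standard result, not anything specific to this post): for a loss whose gradient is L-Lipschitz, plain gradient descent with a step size of at most 1/L is guaranteed to make progress on every step.

```latex
% Descent lemma for an L-smooth loss f, i.e. \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|:
% gradient descent x_{t+1} = x_t - \eta \nabla f(x_t) with step size \eta \le 1/L satisfies
f(x_{t+1}) \;\le\; f(x_t) - \frac{\eta}{2}\,\big\|\nabla f(x_t)\big\|^{2},
% so the loss cannot blow up; a less smooth loss (larger L) forces a smaller learning rate.
```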

What is the right learning rate schedule? Exponential decay? Linear decay? Piecewise constant? Ramp up then down? Cyclic? Warm restarts? How big should the batch size be? How does this all relate to generalization (what ML researchers care about)?

Tragically, in the non-convex, messy world of deep neural networks, all theoretical convergence guarantees are off. Those guarantees often rely on a restrictive set of assumptions, and the theory-practice gap is then papered over by showing that the method also “empirically works well” at training neural nets.

Fortunately, the research community has spent thousands of GPU years establishing empirical best practices for the learning rate, the most celebrated being Andrej Karpathy’s pronouncement that 3e-4 is the best learning rate for Adam, hands down.

Given that theoretically-motivated learning rate scheduling is really hard, we ask, "why not use a learning rate schedule which is at least aesthetically pleasing?" Specifically, we scan across a nice-looking image one pixel at a time, and use the pixel intensities as our learning rate schedule.
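Concretely, the scheme can be sketched in a few lines. The function name, the raster-scan order, and the rescaling to a maximum learning rate below are illustrative choices on our part, not a canonical recipe:

```python
import numpy as np
from PIL import Image

def image_to_lr_schedule(path, max_lr=0.1, num_steps=100_000):
    """Turn a nice-looking image into a learning rate schedule.

    Pixels are read in raster order (left-to-right, top-to-bottom),
    normalized to [0, 1], and rescaled so the brightest pixel maps to
    max_lr. The result is resampled to one value per training step.
    """
    img = Image.open(path).convert("L")                  # grayscale
    pixels = np.asarray(img, dtype=np.float32).ravel()   # 1-D raster scan
    pixels /= 255.0                                      # intensities in [0, 1]
    # Resample so the schedule has exactly one value per training step.
    idx = np.linspace(0, len(pixels) - 1, num_steps).astype(int)
    return max_lr * pixels[idx]

# Example: a Mona Lisa learning rate schedule.
# schedule = image_to_lr_schedule("mona_lisa.jpg")
# lr_at_step_42 = schedule[42]
```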

Several recent papers (5, 6, 7) suggest that waving the learning rate up and down is good for deep learning, and conventional wisdom still says the learning rate should decay over the course of training.
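For reference, the kind of “waving” schedule those papers advocate can be written down as a simple triangular cycle; the constants below are placeholders rather than values taken from any of the cited papers.

```python
def triangular_cyclic_lr(step, base_lr=1e-4, max_lr=1e-2, cycle_len=2000):
    """Triangular cyclic learning rate: ramp linearly from base_lr up to
    max_lr and back down once every cycle_len steps."""
    x = (step % cycle_len) / cycle_len   # position in the current cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * x - 1.0)       # 0 -> 1 -> 0 over the cycle
    return base_lr + (max_lr - base_lr) * tri
```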

Pixel values from natural images have both of the above properties. When reshaped into a 1-D signal, an image waves up and down in a random manner, sort of like Brownian motion. Natural images also tend to be lit from above, which gives the signal a decaying trend as the scan reaches the darker bottom of the image.

We compared several learning rate schedules on MNIST and CIFAR-10 classification benchmarks, training each model for about 100K steps. The contenders: pixel schedules derived from the Mona Lisa, a puppy photo, a badly-drawn (“bad”) MNIST digit, and a portrait of Geoff Hinton, alongside a cyclic learning rate and a fixed learning rate of 3e-4 (with and without Adam).
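For concreteness, here is a minimal sketch of how a pixel schedule plugs into a training loop, written against PyTorch and reusing the hypothetical image_to_lr_schedule helper from above; the model, loss, and data loader are stand-ins, and this is not the exact code behind our experiments.

```python
import itertools
import torch

def train_with_pixel_schedule(model, loss_fn, data_loader, schedule):
    """Train for len(schedule) steps, using schedule[t] as the learning
    rate at step t (e.g. the ~100K-value output of image_to_lr_schedule)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=float(schedule[0]))
    batches = itertools.cycle(data_loader)   # loop over the dataset indefinitely
    for lr in schedule:
        for group in optimizer.param_groups:
            group["lr"] = float(lr)          # pixel intensity -> this step's learning rate
        inputs, targets = next(batches)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
```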

The Mona Lisa and puppy images turn out to be pretty good schedules, even better than cyclic learning rates and Andrej Karpathy’s favorite 3e-4 with Adam. The “bad MNIST” digit appears to be a pretty dank learning rate schedule too, just edging out Geoff’s portrait (you’ll have to imagine the error bars on your own). All learning rates perform about equally well on MNIST.

The fixed learning rate of 3e-4 is quite bad (unless one uses the Adam optimizer). Our experiments suggest that pretty much any learning rate schedule can outperform a fixed one, so if you ever see, or think about writing, a paper with a constant learning rate, just use literally any schedule instead. Even a silly one. And then cite this blog post.