A Bayesian Perspective on Generalization and Stochastic Gradient Descent

Abstract

Zhang et al. (2016) argued that understanding deep learning requires rethinking
generalization. To justify this claim, they showed that deep networks can easily
memorize randomly labeled training data, despite generalizing well when shown real
labels of the same inputs. We show here that the same phenomenon occurs in small
linear models with fewer than a thousand parameters; however, there is no need to
rethink anything, since our observations are explained by evaluating the Bayesian
evidence in favor of each model. This Bayesian evidence penalizes sharp minima. We
also explore the “generalization gap” observed between small- and large-batch
training, identifying an optimum batch size that scales linearly with both the
learning rate and the size of the training set. Surprisingly, in our experiments
the generalization gap was closed by regularizing the model.
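
To illustrate the memorization claim, here is a minimal, hypothetical sketch (not
the paper's experimental setup; the dataset sizes, dimension, and use of
scikit-learn's LogisticRegression are all assumptions for illustration): a linear
classifier with roughly 800 parameters fits random labels perfectly in the
overparameterized regime, while the same model trained on real labels of the same
inputs also generalizes.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, d = 200, 200, 800  # ~800 parameters: more than training points

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))

w_true = rng.normal(size=d)                      # ground-truth linear rule
y_real_train = (X_train @ w_true > 0).astype(int)
y_real_test = (X_test @ w_true > 0).astype(int)
y_rand_train = rng.integers(0, 2, size=n_train)  # coin-flip labels, same inputs

for name, y in [("real labels", y_real_train), ("random labels", y_rand_train)]:
    # Weak regularization (large C) lets the model interpolate the training set.
    clf = LogisticRegression(C=1e4, max_iter=10_000).fit(X_train, y)
    print(f"{name}: train acc {clf.score(X_train, y):.2f}, "
          f"test acc vs. real rule {clf.score(X_test, y_real_test):.2f}")

Because 200 generic points in 800 dimensions are linearly separable under any
labeling, both runs reach perfect training accuracy, but only the model trained on
real labels scores above chance on the test set.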
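
The claim that the Bayesian evidence penalizes sharp minima can be unpacked with
the standard Laplace approximation to the evidence (a textbook sketch; the paper's
precise treatment may differ):

P(D \mid M) = \int P(D \mid \omega, M)\, P(\omega \mid M)\, d\omega
\approx P(D \mid \omega_0, M)\, P(\omega_0 \mid M)\, (2\pi)^{p/2} \det(H)^{-1/2}

Here \omega_0 is a minimum of the training cost, p is the number of parameters,
and H is the Hessian of the negative log posterior at \omega_0. Sharp minima have
large Hessian eigenvalues, hence large \det(H) and low evidence. On the batch-size
claim: if one assumes (an assumption not stated in this abstract) that the SGD
noise scale behaves like g \approx \epsilon N / B for learning rate \epsilon,
training set size N, and batch size B, then holding g at a fixed optimal value
gives B \propto \epsilon N, i.e. the linear scaling reported above.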