Title: Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Abstract: This paper proposes a new optimization algorithm called Entropy-SGD for
training deep neural networks that is motivated by the local geometry of the
energy landscape at solutions found by gradient descent. Local extrema with low
generalization error have a large proportion of almost-zero eigenvalues in the
Hessian, with very few large positive or negative eigenvalues. We leverage this
observation to construct a local entropy based objective that favors
well-generalizable solutions lying in the flat regions of the energy landscape,
while avoiding poorly-generalizable solutions located in the sharp valleys. Our
algorithm resembles two nested loops of SGD, where we use Langevin dynamics to
compute the gradient of local entropy at each update of the weights. We prove
that incorporating local entropy into the objective function results in a
smoother energy landscape and use uniform stability to show improved
generalization bounds over SGD. Our experiments on competitive baselines
demonstrate that Entropy-SGD leads to improved generalization and has the
potential to accelerate training.
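
The nested-loop structure described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the inner loop runs stochastic gradient Langevin dynamics (SGLD) around the current weights to estimate the mean of the local Gibbs measure, and the outer loop moves the weights toward that mean, which corresponds to a gradient step on the negative local entropy. All hyperparameter names and values (`gamma`, `sgld_lr`, `L`, `eps`, `alpha`) are assumptions chosen for the example.

```python
import numpy as np

def entropy_sgd_step(w, grad_fn, lr=0.1, gamma=0.03, sgld_lr=0.01,
                     L=20, eps=1e-4, alpha=0.75, rng=None):
    """One outer step of an Entropy-SGD-style update (illustrative sketch).

    Inner loop: L iterations of SGLD sample approximately from the local
    Gibbs measure  exp(-f(x) - gamma/2 * ||x - w||^2)  and keep an
    exponential moving average mu of the iterates.
    Outer loop: the local-entropy gradient is gamma * (w - mu), so the
    weights are moved a step of size lr along -gamma * (w - mu).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = w.copy()    # inner SGLD iterate, started at the current weights
    mu = w.copy()   # running estimate of the local Gibbs mean
    for _ in range(L):
        noise = np.sqrt(sgld_lr) * eps * rng.standard_normal(w.shape)
        # Langevin step on the loss plus the quadratic coupling to w
        x = x - sgld_lr * (grad_fn(x) - gamma * (w - x)) + noise
        mu = (1.0 - alpha) * mu + alpha * x
    # Outer update: descend the negative local entropy
    return w - lr * gamma * (w - mu)
```

On a toy quadratic loss f(x) = ||x||^2 / 2 (so `grad_fn` is the identity), repeated outer steps drive the weights toward the minimizer, since the inner average `mu` sits between the current weights and the loss minimum.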