Gradient Descent as Exponential Weights

Wouter M. Koolen

2016-02-21

Introduction

We start by reviewing two common strategies in online learning: gradient descent and exponential weights. We then see that gradient descent can be regarded as an instance of exponential weights. We conclude by discussing why that viewpoint is useful.

Prediction proceeds in rounds. In round \(t\) the learner produces a prediction \({\boldsymbol x}_t \in {\mathbb R}^d\), encounters a loss vector \({\boldsymbol {\ell}}_t \in {\mathbb R}^d\), and incurs loss \({\boldsymbol x}_t^{\intercal}{\boldsymbol {\ell}}_t\) given by the dot product (we are sweeping the gradient trick under the carpet here). We first look at two strategies to choose the predictions \({\boldsymbol x}_t\). For the purpose of this post we take a fixed learning rate \(\eta > 0\).
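For concreteness, the two strategies can be sketched as follows: gradient descent updates a point in \({\mathbb R}^d\) additively, while exponential weights updates a probability vector over \(d\) experts multiplicatively. This is a minimal illustration, not code from the post; the function names and toy losses are my own.

```python
import numpy as np

eta = 0.1  # fixed learning rate, as in the post

# Gradient descent: predictions live in R^d, updated additively.
def gd_update(x, loss_vec):
    return x - eta * loss_vec

# Exponential weights: predictions are probability vectors over d "experts",
# updated multiplicatively and renormalised.
def ew_update(w, loss_vec):
    w = w * np.exp(-eta * loss_vec)
    return w / w.sum()

x = np.zeros(3)       # GD starting point
w = np.ones(3) / 3    # uniform prior weights
for ell in [np.array([1.0, 0.0, -1.0]), np.array([0.5, -0.5, 0.0])]:
    x = gd_update(x, ell)
    w = ew_update(w, ell)
```

Note the structural contrast: GD is additive in the losses, while EW is multiplicative, which is exactly the gap the reduction below bridges.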

Reduction

The main point of this post is to render gradient descent as an instance of exponential weights. This will work as follows. We presented exponential weights for a finite set of dimensions or “experts”. Here we generalise that slightly. We will consider all points in \({\mathbb R}^d\) as experts, and maintain weights on these in the form of a density. We start out with a Gaussian prior density, and show that each subsequent “posterior” distribution is again Gaussian. Moreover, and this is the crux, we will see that the resulting posterior mean evolves exactly according to the GD update equation \(\eqref{eq:gd.weigths}\).
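This coincidence is easy to check numerically in one dimension. Assuming a unit-variance Gaussian prior centred at the GD starting point (one choice of prior variance that makes the learning rates match), the exponential weights posterior after linear losses \(\ell_1, \dots, \ell_t\) has density proportional to \(\exp(-\tfrac12 (u - x_1)^2)\exp(-\eta\, u \sum_s \ell_s)\), and its mean should equal the GD iterate \(x_1 - \eta \sum_s \ell_s\). A sketch (grid integration; all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.3
x1 = 0.5                       # prior mean and GD starting point
losses = rng.normal(size=5)    # five rounds of scalar losses

# Gradient descent iterate after all rounds: x_1 - eta * sum of losses.
x_gd = x1 - eta * np.sum(losses)

# Exponential weights over all "experts" u in R, Gaussian prior N(x1, 1):
# posterior density w(u) proportional to
#   exp(-(u - x1)^2 / 2) * exp(-eta * u * sum(losses)).
u = np.linspace(-10.0, 10.0, 200001)       # fine grid for integration
log_w = -0.5 * (u - x1) ** 2 - eta * u * np.sum(losses)
w = np.exp(log_w - log_w.max())            # unnormalised, numerically stabilised
posterior_mean = np.sum(u * w) / np.sum(w)

print(x_gd, posterior_mean)   # the two coincide up to grid error
```

Completing the square in the exponent shows why: the posterior is again Gaussian with unit variance and mean \(x_1 - \eta \sum_s \ell_s\), which is exactly the GD trajectory.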

Discussion

This perspective of gradient descent as an instance of exponential weights is not new by itself. Yet it provides a bridge to transport extensions developed in the exponential weights world to the gradient descent world. I am thinking of, for example, constructions involving specialists (Freund et al. 1997; Chernov and Vovk 2009; Koolen, Adamskiy, and Warmuth 2012) — this is the route taken by Luo and Schapire (2015, Section 5.1) — and methods to learn the learning rate \(\eta\), which we considered a fixed constant in this post (Koolen and Van Erven 2015).

This perspective also hints at a possible shortcoming of gradient descent: it does not learn the covariance. In fact, this is precisely what more advanced methods like Online Newton Step by Hazan, Agarwal, and Kale (2007) and AdaGrad by Duchi, Hazan, and Singer (2011) do.
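To see what adapting the (co)variance buys, here is a minimal sketch of the diagonal AdaGrad update, which scales each coordinate by the accumulated squared gradients (toy losses and names are my own; the usual small epsilon in the denominator is omitted for clarity):

```python
import numpy as np

eta = 0.5
x = np.zeros(2)
g_sq = np.zeros(2)   # running sum of squared gradients, per coordinate

# Coordinate 0 sees gradients 100x larger than coordinate 1.
for ell in [np.array([1.0, 0.01]), np.array([1.0, 0.01])]:
    g_sq += ell ** 2
    x = x - eta * ell / np.sqrt(g_sq)   # adaptive per-coordinate step size
```

Despite the hundredfold difference in gradient scale, both coordinates take the same effective steps, which plain gradient descent with a single fixed \(\eta\) cannot do.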