Exploiting Curvature Using Exponential Weights

Wouter M. Koolen

2016-09-06

Introduction

There are many algorithms in Online Learning: Gradient Descent, Exponential Weights, Mirror Descent, Follow the Regularized Leader and more (see e.g. (Shalev-Shwartz 2012)). It is therefore always good to see how these algorithms relate, and to judge how different they really are. In this post we look at three basic algorithms and show that they are in fact obtained from the same master algorithm. That is, we show that

Gradient Descent (tuned for convex loss functions),

Gradient Descent (tuned for strongly convex loss functions), and

Online Newton Step (for exp-concave loss functions)

are all instances of Exponential Weights with certain so-called surrogate losses. We give a uniform derivation of the algorithm and its regret analysis, and recover the known bounds in each case.

I addressed the first case above in this earlier post, and encountered the connection with the third case while working on (Erven and Koolen 2016). A similar connection between Exponential Weights and the Mirror Descent family of algorithms is studied by (Hoeven 2016).

Setting: Online Optimisation

We look at online optimisation over a convex set \({{\mathcal U}}\subseteq {{\mathbb R}}^d\). Learning proceeds in rounds. In round \(t\) the learner plays a point \({{\boldsymbol w}}_t \in {{\mathcal U}}\). Then the adversary reveals a loss function \(f_t : {{\mathcal U}}\to {{\mathbb R}}\), upon which the learner incurs loss \(f_t({{\boldsymbol w}}_t)\). We evaluate the learner by its so-called regret. After \(T\) rounds, the regret compared to a point \({{\boldsymbol u}}\in {{\mathcal U}}\) is defined by \begin{equation}\label{eq:regret}
R_T^{{\boldsymbol u}}~{:=}~ \sum_{t=1}^T {\big(f_t({{\boldsymbol w}}_t) - f_t({{\boldsymbol u}})\big)}
.
\end{equation} The goal of the learner is to ensure small regret compared to all \({{\boldsymbol u}}\in {{\mathcal U}}\).
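To make the protocol concrete, here is a minimal Python sketch of the regret computation. The particular losses, plays and comparator are my own toy choices for illustration, not anything from the analysis:

```python
def regret(plays, losses, u):
    """Regret of plays w_1..w_T against a fixed comparator u,
    following the definition: sum_t f_t(w_t) - f_t(u)."""
    return sum(f(w) - f(u) for w, f in zip(plays, losses))

# Toy example: quadratic losses f_t(w) = (w - x_t)^2 in one dimension.
xs = [0.0, 1.0, 0.5]
losses = [lambda w, x=x: (w - x) ** 2 for x in xs]
plays = [0.0, 0.0, 0.5]  # an arbitrary sequence of plays by the learner
print(regret(plays, losses, u=0.5))  # → 0.5
```

Note that the regret against a single comparator can be negative; the challenge is keeping it small simultaneously for all \({{\boldsymbol u}}\in {{\mathcal U}}\).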

Now, how do we control that surrogate regret \(\tilde R_T^{{\boldsymbol u}}\)? Let’s play standard Exponential Weights! Hold on, why not play Exponential Weights on the original losses \(f_t\) directly? Well, the cool thing about Exponential Weights on the quadratic surrogate losses \({\ell}_t\) (and with a Gaussian prior) is that all quantities of interest can be computed exactly in closed form. This is a major computational advantage, and it also provides more explicit insight into what the algorithm “thinks”.

Exponential Weights

We now analyze the behavior of the standard Exponential Weights (technically the exponentially weighted average forecaster) applied to the surrogate loss \({\ell}_t\). The Exponential Weights strategy will produce a sequence of distributions \(P_1, P_2, \ldots\) on \({{\mathbb R}}^d\). It will be convenient to maintain weights on all of \({{\mathbb R}}^d\) instead of only on \({{\mathcal U}}\), but we will ensure that each \(P_t\) has its mean in \({{\mathcal U}}\). We start out with the Gaussian prior \[P_1 ~{:=}~ {{\mathcal N}}({{\boldsymbol 0}}, {{\boldsymbol \Sigma}})
,\] with zero mean (here we assume that \({{\boldsymbol 0}}\in {{\mathcal U}}\)) and covariance \({{\boldsymbol \Sigma}}\succ {{\boldsymbol 0}}\). Then in each round \(t\) we play the mean of \(P_t\): \[{{\boldsymbol w}}_t ~{:=}~ \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} {\left[{{\boldsymbol u}}\right]}
.\] Playing the mean is a better idea than randomisation since \({\ell}_t({{\boldsymbol w}}_t) \le \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})}{\left[{\ell}_t({{\boldsymbol u}})\right]}\) by Jensen’s inequality. Exponential Weights subsequently updates the distribution to \[P_{t+1}
~{:=}~
\operatorname*{argmin}_{P \in \mathcal P}~
\operatorname*{KL}(P\|P_t) + \eta \operatorname*{\mathbb E}_{P({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol u}})\right]}
,\] where \[\mathcal P ~{:=}~ {\left\{P\middle|\operatorname*{\mathbb E}_{P({{\boldsymbol u}})}[{{\boldsymbol u}}] \in {{\mathcal U}}\right\}}\] is the set of distributions on \({{\mathbb R}}^d\) with mean in \({{\mathcal U}}\), and \(\eta > 0\) is the learning rate (a parameter we will tune below). It is customary and convenient to equivalently decompose the update into the following two steps \[P_{t+1}
~=~
\operatorname*{argmin}_{P \in \mathcal P}~
\operatorname*{KL}(P\|\tilde P_{t+1})
\qquad
\text{where}
\qquad
\tilde P_{t+1}
~{:=}~
\operatorname*{argmin}_{P}~
\operatorname*{KL}(P\|P_t) + \eta \operatorname*{\mathbb E}_{P({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol u}})\right]}
.\] We then find that both \(P_{t+1}\) and \(\tilde P_{t+1}\) are Gaussian, in particular \[P_{t+1} ~=~ {{\mathcal N}}({{\boldsymbol w}}_{t+1}, {{\boldsymbol \Sigma}}_{t+1})
\qquad
\text{and}
\qquad
\tilde P_{t+1} ~=~ {{\mathcal N}}(\tilde {{\boldsymbol w}}_{t+1}, {{\boldsymbol \Sigma}}_{t+1})
,\] where \begin{align*}
{{\boldsymbol \Sigma}}_{t+1}^{-1}
&~=~
{{\boldsymbol \Sigma}}_t^{-1} + \eta {{\boldsymbol M}}_t
\\
\tilde {{\boldsymbol w}}_{t+1}
&~=~
{{\boldsymbol w}}_t - \eta {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t
\\
{{\boldsymbol w}}_{t+1}
&~=~
\operatorname*{argmin}_{{{\boldsymbol w}}\in {{\mathcal U}}}~
({{\boldsymbol w}}- \tilde {{\boldsymbol w}}_{t+1})^{\intercal}{{\boldsymbol \Sigma}}_{t+1}^{-1} ({{\boldsymbol w}}-\tilde {{\boldsymbol w}}_{t+1})
.
\end{align*} So far so good. Even though the Exponential Weights algorithm maintains distributions \(P_t\) on \({{\mathbb R}}^d\) (which could have arbitrarily high complexity), it “collapses”, meaning that it can be implemented by maintaining just \({{\boldsymbol w}}_t\) and \({{\boldsymbol \Sigma}}_t\), which require at most \(d\) and \(d^2\) parameters respectively. This is the big advantage of working with a quadratic loss and a Gaussian prior.

Applications

We now recover the three algorithms and their analysis from the introduction. Throughout we take spherical prior covariance \({{\boldsymbol \Sigma}}= \sigma^2{{\boldsymbol I}}\) and use \({{\boldsymbol g}}_t = \nabla f_t({{\boldsymbol w}}_t)\). So the three cases only differ in their choice of \({{\boldsymbol M}}_t\). Let us also make the standard boundedness assumptions \({\|{{\boldsymbol g}}_t\|} \le G\) and \({\|{{\boldsymbol \nu}}\|} \le D\) for each \({{\boldsymbol \nu}}\in {{\mathcal U}}\).
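To illustrate how the choice of \({{\boldsymbol M}}_t\) shapes the algorithm, here is a small sketch of the degenerate case. Taking \({{\boldsymbol M}}_t = {{\boldsymbol 0}}\) (my assumption here, corresponding to a surrogate without curvature) the covariance never moves from the prior, and the mean update is exactly Gradient Descent with step size \(\eta\sigma^2\):

```python
import numpy as np

# With M_t = 0 (assumed, the curvature-free case), Sigma_t stays at
# the prior sigma^2 I forever, since Sigma^{-1} is incremented by
# eta * M_t = 0 each round. The mean update then reads
# w_{t+1} = w_t - eta * Sigma g_t, i.e. Gradient Descent with
# effective step size eta * sigma^2.
d, eta, sigma2 = 3, 0.1, 2.0
Sigma = sigma2 * np.eye(d)  # constant throughout
w = np.zeros(d)
for g in [np.ones(d), -0.5 * np.ones(d)]:  # two illustrative gradients
    w = w - eta * Sigma @ g
# Two GD steps of size 0.2: each coordinate ends at -0.2 + 0.1 = -0.1.
```

This also makes the over-parametrisation noted in the discussion visible: only the product \(\eta\sigma^2\) enters the update.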

Discussion

We can see a few interesting things:

We find that Gradient Descent and Online Newton Step are instances of Exponential Weights. I find this intriguing, as in the past I regarded them as belonging to the “Squared Euclidean” world, not to the “Shannon Entropy” world. Yet once we employ Gaussian weights it turns out the distinction vanishes. It would be interesting to see which tricks from the Entropy world can be ported over in this way.

The algorithm is over-parametrised: we might as well have fixed the product of the learning rate \(\eta\) and the prior covariance \({{\boldsymbol \Sigma}}\), as only this product enters the updates.

I was curious where the factor \(d\) difference in the regret rate between the strongly convex and exp-concave cases comes from. I was thinking the answer had to be in one of the pieces \(\eqref{eq:expcomp}\), \(\eqref{eq:cummixloss}\) or \(\eqref{eq:kl}\). But that is not quite right, as the “dangerous terms” in \(\eqref{eq:kl}\) (those involving \({{\boldsymbol \Delta}}\)) actually cancel with fragments from both \(\eqref{eq:cummixloss}\) and \(\eqref{eq:expcomp}\). With the generic bound \(\eqref{eq:mainbound}\) in place, I now see it comes from the difference in behaviour of the summands \[{{\boldsymbol g}}_t^{\intercal} {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t
~=~
{{\boldsymbol g}}_t^{\intercal} ({{\boldsymbol \Sigma}}_t^{-1} + \eta {{\boldsymbol M}}_t)^{-1} {{\boldsymbol g}}_t
.\] This is the place where the difference in the form of \({{\boldsymbol M}}_t\) kicks in. With \({{\boldsymbol M}}_t \propto {{\boldsymbol I}}\) the norm of \({{\boldsymbol g}}_t\) matters, which is independent of \(d\). But with \({{\boldsymbol M}}_t \propto {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}\) the log-determinant naturally arises, and we unavoidably pick up a factor of \(d\).
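This difference can be checked numerically. Under my illustrative assumptions of random unit-norm gradients and \(\eta = 1\), the cumulative sum of \({{\boldsymbol g}}_t^{\intercal} {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t\) behaves like \(\log T\) for \({{\boldsymbol M}}_t = {{\boldsymbol I}}\), but like \(d \log T\) for \({{\boldsymbol M}}_t = {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 10, 2000, 1.0
S_iso = np.eye(d)  # Sigma_t^{-1} under M_t = I
S_ons = np.eye(d)  # Sigma_t^{-1} under M_t = g_t g_t^T
tot_iso = tot_ons = 0.0
for _ in range(T):
    g = rng.standard_normal(d)
    g /= np.linalg.norm(g)  # keep ||g_t|| = 1, so G = 1
    S_iso += eta * np.eye(d)
    S_ons += eta * np.outer(g, g)
    tot_iso += g @ np.linalg.solve(S_iso, g)  # behaves like log(T)/eta
    tot_ons += g @ np.linalg.solve(S_ons, g)  # behaves like d*log(T)/eta
# tot_ons comes out several times larger than tot_iso: the rank-one
# curvature spreads over all d directions, and the log-determinant
# telescoping brings in the dimension.
```

With \({{\boldsymbol M}}_t = {{\boldsymbol I}}\) the inverse covariance is \((1+\eta t){{\boldsymbol I}}\), so each summand is exactly \(1/(1+\eta t)\) regardless of the gradient direction; with \({{\boldsymbol M}}_t = {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}\) the sum tracks \(\frac{1}{\eta}\log\det {{\boldsymbol \Sigma}}_{T+1}^{-1}\), which is \(d\) times bigger.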