The Advantage of Full AdaGrad

Wouter M. Koolen

2017-01-21

Introduction

Online (or Stochastic) Gradient Descent is a popular method for optimisation. It is conceptually simple and can often be efficiently implemented. Yet it has a parameter (the step size) that needs to be tuned. Various schemes have been proposed to automate this tuning. One such successful scheme is AdaGrad (Duchi, Hazan, and Singer 2011). AdaGrad comes in two versions. Diagonal AdaGrad (which is the one used in practice) maintains and adapts one learning rate per dimension. Full AdaGrad maintains one learning rate per direction (i.e. a full PSD matrix). In this post we consider the benefit of this additional adaptive power for optimising linear functions.

Setting

To keep things simple and focused, I will work in the setting of online linear optimisation without projections. Each round \(t\) the learner plays a point \({\boldsymbol{w}}_t \in {\mathbb R}^d\), the environment plays a loss coefficient vector \({\boldsymbol{g}}_t \in {\mathbb R}^d\) and the learner incurs loss \({\boldsymbol{w}}_t^{\intercal}{\boldsymbol{g}}_t\). The goal of the learner is to minimise the cumulative regret \[R_T^{{\boldsymbol{u}}}
~=~
\sum_{t=1}^T ({\boldsymbol{w}}_t - {\boldsymbol{u}})^{\intercal}{\boldsymbol{g}}_t\] compared to any point \({\boldsymbol{u}}\) in some pre-specified convex domain. In this post we will consider a simple yet typical domain, namely the Euclidean ball of radius \(D\), i.e. \({\|{\boldsymbol{u}}\|} \le D\).
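The protocol is easy to state in code. Below is a minimal NumPy sketch (the loss vectors, the comparator, and the trivial always-play-zero learner are placeholder choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, D = 3, 100, 1.0

def regret(plays, losses, u):
    """Cumulative regret of the plays w_1..w_T against comparator u."""
    return sum((w - u) @ g for w, g in zip(plays, losses))

# The environment's loss vectors g_1..g_T (random placeholders here)
losses = [rng.standard_normal(d) for _ in range(T)]
# A trivial learner that always plays the origin
plays = [np.zeros(d) for _ in range(T)]
# A comparator u on the Euclidean ball of radius D
u = rng.standard_normal(d)
u *= D / np.linalg.norm(u)

print(regret(plays, losses, u))
```

Every algorithm considered below only changes the rule producing the plays.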

Follow the Regularised Leader / Online Gradient Descent

Before we discuss AdaGrad, let us first review a simpler algorithm for online optimisation, namely Follow the Regularised Leader with squared Euclidean regularisation. After \(t\) rounds, it plays \begin{equation}\label{eq:ftrl}
{\boldsymbol{w}}_{t+1} ~=~ \operatorname*{argmin}_{{\boldsymbol{w}}}~ \sum_{s=1}^t {\boldsymbol{w}}^{\intercal}{\boldsymbol{g}}_s + \frac{{\|{\boldsymbol{w}}\|}^2}{2 \eta}
,
\end{equation} or, equivalently, it sequentially updates as \[{\boldsymbol{w}}_{t+1} ~=~ {\boldsymbol{w}}_t - \eta {\boldsymbol{g}}_t\] which is called Online Gradient Descent (without projections and with a fixed learning rate \(\eta\), FTRL and OGD are identical). The learning rate parameter \(\eta\) needs to be tuned carefully to get a good bound. It can be shown that the scheme \(\eqref{eq:ftrl}\) guarantees \[R_T^{{\boldsymbol{u}}} ~\le~ \frac{{\|{\boldsymbol{u}}\|}^2}{2 \eta} + \frac{\eta}{2} \sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2
\qquad
\text{for all ${\boldsymbol{u}}$}.\] The right-hand side expression is optimised in the learning rate \(\eta\) at \begin{equation}
\label{eq:GD.tune}
\eta ~=~
\frac{
{\|{\boldsymbol{u}}\|}
}{
\sqrt{
\sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2
}
}
\end{equation} which results in the bound \begin{equation}
\label{eq:GD.bound}
R_T^{{\boldsymbol{u}}}
~\le~
{\|{\boldsymbol{u}}\|}
\sqrt{
\sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2
}
.
\end{equation} In practice this tuning \(\eqref{eq:GD.tune}\) is inaccessible, as we do not have prior knowledge of \({\|{\boldsymbol{u}}\|}\) or \(\sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2\). But we can get pretty close with actual algorithms, for example by tuning for a known upper bound \({\|{\boldsymbol{u}}\|} \le D\) and by using the doubling trick on \(\sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2\). For smoother methods see e.g. (Orabona and Pál 2016).
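As a numerical sanity check, here is a sketch (on random data; the oracle tuning assumes knowledge of \({\boldsymbol{u}}\) and of the gradient norms, which an actual algorithm would not have) verifying that Online Gradient Descent with the step size \(\eqref{eq:GD.tune}\) stays within the bound \(\eqref{eq:GD.bound}\):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 5, 200
g = rng.standard_normal((T, d))          # loss vectors, revealed one per round
u = rng.standard_normal(d)               # comparator (oracle knowledge)

# Oracle step size: eta = ||u|| / sqrt(sum_t ||g_t||^2)
eta = np.linalg.norm(u) / np.sqrt((g ** 2).sum())

w = np.zeros(d)
regret = 0.0
for t in range(T):
    regret += (w - u) @ g[t]             # incur loss of current play
    w = w - eta * g[t]                   # OGD update (no projection)

bound = np.linalg.norm(u) * np.sqrt((g ** 2).sum())
print(regret, bound)                     # regret should not exceed bound
```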

Adapting the Regularisation

The idea of AdaGrad (see (Hazan 2016 Section 5.6)) is that the regulariser \(\frac{1}{2} {\|{\boldsymbol{w}}\|}^2\) in \(\eqref{eq:ftrl}\) may not be the best choice for the problem at hand. So let’s replace the scaled squared Euclidean norm \(\eta^{-1} {\|{\boldsymbol{w}}\|}^2 = \eta^{-1} {\boldsymbol{w}}^{\intercal}{\boldsymbol{w}}\) by the more general squared norm \({\|{\boldsymbol{w}}\|}_{{\boldsymbol{X}}}^2 = {\boldsymbol{w}}^{\intercal}{\boldsymbol{X}}^{-1} {\boldsymbol{w}}\), where \({\boldsymbol{X}}\succ {\boldsymbol{0}}\) is now a positive definite matrix parameter (so that \({\boldsymbol{X}}^{-1}\) exists). Online Gradient Descent is recovered by setting \({\boldsymbol{X}}= \eta {\boldsymbol{I}}\). With this upgrade, the learner plays \begin{equation}\label{eq:BentDescent}
{\boldsymbol{w}}_{t+1}
~=~
\operatorname*{argmin}_{{\boldsymbol{w}}}~
\sum_{s=1}^t {\boldsymbol{g}}_s^{\intercal}{\boldsymbol{w}}+ \frac{1}{2} {\boldsymbol{w}}^{\intercal}{\boldsymbol{X}}^{-1} {\boldsymbol{w}}~=~
- {\boldsymbol{X}}\sum_{s=1}^t {\boldsymbol{g}}_s
\end{equation} or, equivalently, sequentially updates \[{\boldsymbol{w}}_{t+1} ~=~ {\boldsymbol{w}}_t - {\boldsymbol{X}}{\boldsymbol{g}}_t
.\] The standard FTRL analysis gives, writing \({\boldsymbol{G}}= \sum_{t=1}^T {\boldsymbol{g}}_t {\boldsymbol{g}}_t^{\intercal}\) for the sum of outer products of the loss vectors, the regret bound \begin{equation}\label{eq:regretbd}
R_T^{{\boldsymbol{u}}}
~\le~
\frac{1}{2} {\boldsymbol{u}}^{\intercal}{\boldsymbol{X}}^{-1} {\boldsymbol{u}}+ \frac{1}{2} \sum_{t=1}^T {\boldsymbol{g}}_t^{\intercal}{\boldsymbol{X}}{\boldsymbol{g}}_t
~=~
\frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal})
\end{equation} for all \({\boldsymbol{u}}\). To see how we should tune \({\boldsymbol{X}}\), let us minimise this bound.
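The equivalence between the closed-form solution and the sequential update can be checked numerically (a minimal sketch with a random positive definite \({\boldsymbol{X}}\)):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 50
g = rng.standard_normal((T, d))

# A generic positive definite matrix step size X
A = rng.standard_normal((d, d))
X = A @ A.T + np.eye(d)

# Sequential update w_{t+1} = w_t - X g_t, started at w_1 = 0
w = np.zeros(d)
for t in range(T):
    w = w - X @ g[t]

# Closed form of the regularised leader: w_{T+1} = -X sum_t g_t
w_closed = -X @ g.sum(axis=0)
print(np.allclose(w, w_closed))
```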

Optimal tuning

Tuning \({\boldsymbol{X}}\) optimally in \(\eqref{eq:regretbd}\) results in regret at most \begin{align*}
R_T^{{\boldsymbol{u}}}
~\le~
\min_{{\boldsymbol{X}}} ~
\frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal})
&~=~
\sqrt{{\boldsymbol{u}}^{\intercal}{\boldsymbol{G}}{\boldsymbol{u}}}
\end{align*} where the optimiser is \(
{\boldsymbol{X}}= \frac{
{\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}}{
\sqrt{{\boldsymbol{u}}^{\intercal}{\boldsymbol{G}}{\boldsymbol{u}}}
}
\) (a rank-one matrix, so \({\boldsymbol{X}}^{-1}\) is to be read as the pseudo-inverse). This makes perfect sense. If you want to compare to \({\boldsymbol{u}}\), you should ignore all other directions and only play in the \({\boldsymbol{u}}\) direction. The problem becomes one-dimensional. This is not quite satisfying, as we need to know \({\boldsymbol{u}}\) itself (and not just its norm) to be able to tune in this way. In the next sections we discuss tunings with a less severe dependence on \({\boldsymbol{u}}\).
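A quick numerical check of this claim, reading \({\boldsymbol{X}}^{-1}\) as the pseudo-inverse since the optimiser is rank one (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
g = rng.standard_normal((30, d))
G = g.T @ g                              # G = sum_t g_t g_t^T
u = rng.standard_normal(d)

s = np.sqrt(u @ G @ u)                   # sqrt(u^T G u)
X = np.outer(u, u) / s                   # rank-one optimiser

# tr(X^{-1} u u^T) = u^T X^{-1} u, with X^{-1} the pseudo-inverse
value = 0.5 * (np.trace(X @ G) + u @ np.linalg.pinv(X) @ u)
print(value, s)                          # both terms contribute s/2
```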

Full AdaGrad tuning

Starting from the regret bound \(\eqref{eq:regretbd}\), we may use that \({\|{\boldsymbol{u}}\|} \le D\) so that \({\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}\preceq D^2 {\boldsymbol{I}}\) and hence the regret is bounded by \[R_T^{{\boldsymbol{u}}}
~\le~
\frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal})
~\le~
\frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ D^2 {\boldsymbol{X}}^{-1})
.\] Now tuning results in \[R_T^{{\boldsymbol{u}}}
~\le~
\min_{{\boldsymbol{X}}}~
\frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ D^2 {\boldsymbol{X}}^{-1})
~=~
D \operatorname{tr}{\big({\boldsymbol{G}}^{1/2}\big)}\] with optimiser \(
{\boldsymbol{X}}=
D {\boldsymbol{G}}^{-1/2}
\). As before, this optimal value for \({\boldsymbol{X}}\) is not known a priori. Yet AdaGrad is an actual algorithm that achieves the above bound up to a factor \(\sqrt{2}\) (see (Duchi, Hazan, and Singer 2011 Corollary 11) or (Hazan 2016 Theorem 5.9)). But is this bound desirable? How else could we tune?
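The following sketch verifies numerically that \({\boldsymbol{X}}= D {\boldsymbol{G}}^{-1/2}\) attains the value \(D \operatorname{tr}({\boldsymbol{G}}^{1/2})\), and that perturbed choices of \({\boldsymbol{X}}\) are no better (random data; the matrix square root is taken via an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(4)
d, D = 4, 2.0
g = rng.standard_normal((30, d))
G = g.T @ g                              # positive definite with probability 1

# Matrix square root of G via eigendecomposition
lam, V = np.linalg.eigh(G)
G_half = V @ np.diag(np.sqrt(lam)) @ V.T

X = D * np.linalg.inv(G_half)            # optimiser X = D G^{-1/2}
value = 0.5 * np.trace(X @ G + D ** 2 * np.linalg.inv(X))
print(value, D * np.trace(G_half))       # the two should coincide

# The objective is convex in X, so PSD perturbations can only increase it
for _ in range(5):
    B = rng.standard_normal((d, d)) * 0.1
    Y = X + B @ B.T
    assert 0.5 * np.trace(Y @ G + D ** 2 * np.linalg.inv(Y)) >= value - 1e-9
```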

Minimax tuning

In the previous section we used \({\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}\preceq D^2 {\boldsymbol{I}}\). This upper bound replaces the single non-zero eigenvalue of \({\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}\) (which is at most \(D^2\)) by \(d\) eigenvalues equal to \(D^2\). So intuitively, it may result in a suboptimal dependence on the dimension. To tighten the tuning, let’s take a minimax approach instead. That is, we tune \(\eqref{eq:regretbd}\) as follows: \begin{align*}
R_T^{{\boldsymbol{u}}}
&~\le~
\min_{{\boldsymbol{X}}} \max_{{\boldsymbol{u}}: {\|{\boldsymbol{u}}\|} \le D}~
\frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal})
\\
&~=~
\min_{{\boldsymbol{X}}}~
\frac{1}{2} {\left(
\operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}})
+ D^2 \lambda_{\max}({\boldsymbol{X}}^{-1})
\right)}
\\
&~=~
\min_{\eta}~
\frac{1}{2} {\left(
\eta \operatorname{tr}({\boldsymbol{G}})
+ \frac{D^2}{\eta}
\right)}
\\
&~=~
D \sqrt{ \operatorname{tr}({\boldsymbol{G}})}
\end{align*} where the third line is obtained by observing that the optimal matrix \({\boldsymbol{X}}\) will have all its eigenvalues equal, i.e. \({\boldsymbol{X}}= \eta {\boldsymbol{I}}\), and the fourth line is obtained by plugging in the optimiser \(\eta = \frac{D}{\sqrt{\operatorname{tr}({\boldsymbol{G}})}}\). Surprisingly, since \(\operatorname{tr}({\boldsymbol{G}}) = \sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2\), this is exactly the Online Gradient Descent tuning \(\eqref{eq:GD.tune}\) and its bound \(\eqref{eq:GD.bound}\) with \({\|{\boldsymbol{u}}\|} \le D\). Hence Online Gradient Descent (with the doubling trick) achieves this bound up to a small constant factor.
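Numerically, the minimax computation can be checked as follows (random data; note that for \({\boldsymbol{X}}= \eta {\boldsymbol{I}}\) every comparator on the sphere of radius \(D\) attains the worst case):

```python
import numpy as np

rng = np.random.default_rng(5)
d, D = 4, 2.0
g = rng.standard_normal((30, d))
G = g.T @ g

# Minimax optimiser X = eta I with eta = D / sqrt(tr G)
eta = D / np.sqrt(np.trace(G))
X = eta * np.eye(d)
inner = 0.5 * (np.trace(X @ G)
               + D ** 2 * np.max(np.linalg.eigvalsh(np.linalg.inv(X))))
print(inner, D * np.sqrt(np.trace(G)))   # the two should coincide

# No comparator u with ||u|| <= D exceeds the minimax value
for _ in range(5):
    u = rng.standard_normal(d)
    u *= D / np.linalg.norm(u)
    val = 0.5 * (np.trace(X @ G) + u @ np.linalg.inv(X) @ u)
    assert val <= inner + 1e-9
```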

Discussion

Online Gradient Descent has regret of order \(D \sqrt{\operatorname{tr}({\boldsymbol{G}})}\) (e.g. using a doubling trick). Full AdaGrad has regret of order \(D \operatorname{tr}(\sqrt{{\boldsymbol{G}}})\). The latter is always at least as high as the former, since by subadditivity of the square root \(\sqrt{\operatorname{tr}({\boldsymbol{G}})} = \sqrt{\sum_i \lambda_i({\boldsymbol{G}})} \le \sum_i \sqrt{\lambda_i({\boldsymbol{G}})} = \operatorname{tr}(\sqrt{{\boldsymbol{G}}})\), and it may be a factor \(\sqrt{d}\) higher (in case \({\boldsymbol{G}}\propto {\boldsymbol{I}}\)).
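The comparison between the two bounds is quick to verify numerically (a sketch; the bounds are computed from the eigenvalues \(\lambda_i({\boldsymbol{G}})\)):

```python
import numpy as np

rng = np.random.default_rng(6)
d, D = 9, 1.0

def bounds(G):
    """OGD bound D sqrt(tr G) and Full AdaGrad bound D tr(sqrt G)."""
    lam = np.linalg.eigvalsh(G)
    return D * np.sqrt(lam.sum()), D * np.sqrt(lam).sum()

# Generic G: Full AdaGrad's bound is never smaller
g = rng.standard_normal((30, d))
ogd, adagrad = bounds(g.T @ g)
assert ogd <= adagrad + 1e-9

# Isotropic G: the gap is exactly a factor sqrt(d)
ogd, adagrad = bounds(np.eye(d))
print(adagrad / ogd, np.sqrt(d))         # the two should coincide
```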

So what is the benefit of Full AdaGrad over Online Gradient Descent? For linear losses (the simplest case) I do not know. It may exist. But I have not found any regret analysis bringing it out. If you do, please let me know. I would be very interested.