Matrix iProd and Matrix Squint

Wouter M. Koolen

2016-11-10

Introduction

A great many algorithms for online learning maintain a vector parameter: Gradient Descent, Exponential Weights, Mirror Descent, etc. Many of these online learning algorithms have a natural matrix counterpart. A random sample of papers: (Tsuda, Rätsch, and Warmuth 2005), (Warmuth and Kuzmin 2006), (Warmuth and Kuzmin 2008), (Warmuth and Kuzmin 2010), (Koolen, Kotłowski, and Warmuth 2011), (Hazan, Kale, and Shalev-Shwartz 2012). A variety of settings are considered in the literature, differing in the specific matrix domains considered and in the choice of loss function. In this post we look at the Hedge setting and the matrix Hedge setting. The matrix Hedge setting is interesting from the perspective of applications (e.g. PCA, see (Warmuth and Kuzmin 2008)), but also from a mathematical point of view, to see which ideas survive and which new tricks are needed. In this post we upgrade the recent iProd and Squint algorithms by (Koolen and Van Erven 2015) to matrices. These algorithms have so-called second-order quantile regret bounds, and the goal will be to derive these for the matrix setting.

Matrix Hedge Setting

We start by reviewing the matrix Hedge setting. Learning proceeds in rounds \(t=1, 2, \ldots\). In round \(t\) the learner plays a density matrix \({\boldsymbol{W}}_t\). A density matrix is a positive semi-definite matrix with unit trace. Density matrices generalise probability mass functions. Their eigenvalues form a probability distribution, but in addition they carry an orthogonal transformation in the eigenvectors. The adversary then reveals a symmetric loss matrix \({\boldsymbol{L}}_t\). We assume that the eigenvalues of \({\boldsymbol{L}}_t\) are in \([0,1]\). The loss is \(\operatorname{tr}({\boldsymbol{W}}_t {\boldsymbol{L}}_t)\), i.e. the default dot product of matrices. We define the instantaneous regret matrix \({\boldsymbol{R}}_t\) by \[{\boldsymbol{R}}_t
~:=~
\operatorname{tr}({\boldsymbol{W}}_t {\boldsymbol{L}}_t) {\boldsymbol{I}}- {\boldsymbol{L}}_t
.\] So \({\boldsymbol{R}}_t\) is symmetric with eigenvalues falling in \([-1,1]\). The goal of the learner is to keep the cumulative regret matrix \(\sum_{t=1}^T {\boldsymbol{R}}_t\) as small as possible. The traditional regret bound for the Matrix Hedge algorithm by (Warmuth and Kuzmin 2006) in \(d\) dimensions is \[\sum_{t=1}^T \operatorname{tr}({\boldsymbol{W}}_t {\boldsymbol{L}}_t)
- \min_{{\boldsymbol{W}}} \sum_{t=1}^T \operatorname{tr}({\boldsymbol{W}}{\boldsymbol{L}}_t)
~=~
\lambda_{\max} {\left(\sum_{t=1}^T {\boldsymbol{R}}_t\right)}
~\le~
\sqrt{\tfrac{T}{2} \ln d}
.\] This regret bound is equal to the regret bound for the standard (vector) Hedge setting. In this sense predicting with density matrices is not any harder than predicting with probability distributions. The central point of this post is to replace the time horizon \(T\) by a more sophisticated measure of the complexity of the data.
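To make the objects in play concrete, here is a minimal numpy sketch of one round: a density matrix play, a loss matrix with eigenvalues in \([0,1]\), and the resulting instantaneous regret matrix. The dimension and the maximally mixed play \({\boldsymbol{I}}/d\) are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# A density matrix: positive semi-definite with unit trace (here I/d, the
# matrix analogue of the uniform distribution).
W = np.eye(d) / d

# A symmetric loss matrix with eigenvalues in [0, 1]: random eigenbasis Q,
# random eigenvalues in [0, 1].
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
L = Q @ np.diag(rng.uniform(0, 1, d)) @ Q.T

loss = np.trace(W @ L)        # tr(W L), the default dot product of matrices
R = loss * np.eye(d) - L      # instantaneous regret matrix

# R is symmetric with eigenvalues in [-1, 1].
assert np.allclose(R, R.T)
assert np.all(np.abs(np.linalg.eigvalsh(R)) <= 1 + 1e-12)
```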

Implementation

Let us now think about computing the Matrix iProd prediction \(\eqref{eq:MiProd.w}\). As argued by (Koolen and Van Erven 2015), it suffices to compute the integral over \(\eta\) on a grid of \(\ln T\) exponentially spaced values. If we maintain the matrix \[\sum_{t=1}^T \ln {\left({\boldsymbol{I}}+ \eta {\boldsymbol{R}}_t\right)}\] for each \(\eta\) from the grid then we have to do \(O(d^3)\) work per round per learning rate to compute the matrix exponentials. Unfortunately, the eigensystem of the above matrix, which is the argument of the matrix exponential, is a function of \(\eta\), meaning that we cannot re-use the eigendecomposition between learning rates. So all in all this algorithm can be implemented in \(O(d^3 \ln T)\) time per round with \(O(d^2 \ln T)\) space. We now consider a different algorithm that might fare better in this regard.

Matrix Squint

In this section we look at the upgrade of Squint by (Koolen and Van Erven 2015) to the matrix setting. In the vector setting, Squint is a slight weakening of iProd with the computational advantage that the integral over \(\eta\) can be computed in closed form. Let’s see if this remains useful in the Matrix setting. Whereas for iProd the bound \(\eqref{eq:prod}\) appears in the analysis, for Squint it is incorporated in the algorithm.

Implementation

Let us now think about computing the Matrix Squint prediction \(\eqref{eq:MSquint}\). Following (Koolen and Van Erven 2015) we can compute the integral over \(\eta\) on a grid of \(\ln T\) values. If we maintain the matrices \(\sum_{t=1}^T {\boldsymbol{R}}_t\) and \(\sum_{t=1}^T {\boldsymbol{R}}_t^2\) then we have to do \(O(d^3)\) work per round per learning rate to compute the matrix exponentials. Unfortunately, the eigensystem of \[\eta \sum_{t=1}^T {\boldsymbol{R}}_t - \eta^2 \sum_{t=1}^T {\boldsymbol{R}}_t^2,\] which is the argument of the exponential, is a function of \(\eta\), meaning that we cannot re-use the eigendecomposition between learning rates. In particular, it seems we cannot apply the closed-form expression of the classical vector Squint potential spectrally.
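The contrast with Matrix iProd shows up clearly in a sketch: the state is just two matrices, independent of the grid size, but the combination step still needs a fresh eigendecomposition per \(\eta\). As before, the class name, grid, and uniform mixture over the grid are placeholder choices; the true prediction \(\eqref{eq:MSquint}\) has its own \(\eta\)-weighting:

```python
import numpy as np

def sym_expm(A):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

class MatrixSquintSketch:
    """Bookkeeping for the two statistics sum_t R_t and sum_t R_t^2."""

    def __init__(self, d, etas):
        self.etas = etas
        self.S1 = np.zeros((d, d))   # sum_t R_t
        self.S2 = np.zeros((d, d))   # sum_t R_t^2

    def update(self, R):
        self.S1 += R
        self.S2 += R @ R             # O(d^2) space total, unlike iProd's grid

    def weights(self):
        # The argument eta*S1 - eta^2*S2 changes with eta, so each grid point
        # costs its own O(d^3) eigendecomposition.
        W = sum(sym_expm(eta * self.S1 - eta ** 2 * self.S2)
                for eta in self.etas)
        return W / np.trace(W)
```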

Quest for an Even Sharper Matrix Algorithm

iProd is sharper than Squint because it delays the prod bound \(\eqref{eq:prod}\) to the analysis. It is very tempting to also try and delay Golden-Thompson \(\eqref{eq:gt}\) to the analysis. Two intuitively reasonable candidates are \[{\boldsymbol{F}}_T^\eta ~=~ \prod_{t=1}^T ({\boldsymbol{I}}+ \eta {\boldsymbol{R}}_t)
,\] and \[{\boldsymbol{F}}_T^\eta
~=~
\sqrt{{\boldsymbol{I}}+ \eta {\boldsymbol{R}}_T}
\cdots
\sqrt{{\boldsymbol{I}}+ \eta {\boldsymbol{R}}_1}
\sqrt{{\boldsymbol{I}}+ \eta {\boldsymbol{R}}_1}
\cdots
\sqrt{{\boldsymbol{I}}+ \eta {\boldsymbol{R}}_T}
.\] The first is the most direct analogue of the vector iProd potential. Unfortunately the resulting weights are not symmetric. The second expression solves this, and the associated weights keep the potential constant. But now the problem reappears in another form, namely that small potential is not necessarily useful.
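A quick numerical check with random regret matrices (a hypothetical example, not part of any analysis) illustrates the symmetry point: the plain left-to-right product is generically not symmetric, while the square-root-symmetrized product is symmetric and positive semi-definite by construction:

```python
import numpy as np

def sym_sqrtm(A):
    """Square root of a symmetric PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

rng = np.random.default_rng(3)
d, T, eta = 3, 4, 0.5

def random_regret():
    """Symmetric matrix with eigenvalues in [-1, 1]."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ np.diag(rng.uniform(-1, 1, d)) @ Q.T

Rs = [random_regret() for _ in range(T)]

# First candidate: plain product prod_t (I + eta R_t).
F1 = np.eye(d)
for R in Rs:
    F1 = F1 @ (np.eye(d) + eta * R)
asym = np.linalg.norm(F1 - F1.T)   # generically nonzero: F1 is not symmetric

# Second candidate: half = sqrt(I+eta R_T) ... sqrt(I+eta R_1), F2 = half half^T.
half = np.eye(d)
for R in Rs:
    half = sym_sqrtm(np.eye(d) + eta * R) @ half
F2 = half @ half.T                 # symmetric PSD by construction

assert np.allclose(F2, F2.T)
assert np.all(np.linalg.eigvalsh(F2) >= -1e-10)
```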

I have not succeeded in finding a Golden-Thompson free potential. Let me know if you do.