Proximal point algorithm revisited, episode 3. Catalyst acceleration

This is episode 3 of the three-part series revisiting the classical proximal
point algorithm. See the first post in the series for the
introduction and notation.

Catalyst acceleration

In the previous posts, we looked at the
proximally guided subgradient method
and the prox-linear algorithm.
The final example concerns inertial acceleration in convex optimization.
Setting the groundwork, consider a $\mu$-strongly convex function $f$
with an $L$-Lipschitz gradient map $\nabla f$.
Classically, gradient descent will find a point $x$ satisfying $f(x) - \min f \leq \varepsilon$
after at most

$$O\left(\frac{L}{\mu} \ln\left(\frac{1}{\varepsilon}\right)\right)$$

iterations.
Accelerated gradient methods, beginning with Nesterov (1983),
equip the gradient descent method with an inertial correction. Such
methods have the much lower complexity guarantee

$$O\left(\sqrt{\frac{L}{\mu}} \ln\left(\frac{1}{\varepsilon}\right)\right)$$

which is optimal within the first-order oracle model of computation
(Nemirovsky and Yudin 1983).
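To make the contrast concrete, here is a minimal Python sketch comparing plain gradient descent with Nesterov's accelerated method on a toy quadratic. The constant momentum coefficient $\beta = (\sqrt{L/\mu} - 1)/(\sqrt{L/\mu} + 1)$ is the standard strongly convex choice; the test problem and all names are my own illustration, not from the text.

```python
import numpy as np

def gradient_descent(grad, x0, L, iters):
    """Plain gradient descent with the standard step size 1/L."""
    x = x0.copy()
    for _ in range(iters):
        x = x - grad(x) / L
    return x

def accelerated_gradient(grad, x0, L, mu, iters):
    """Nesterov's method for a mu-strongly convex, L-smooth function:
    a gradient step from the extrapolated point y, followed by an
    inertial correction with a constant momentum coefficient."""
    beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)
    x = x0.copy()
    y = x0.copy()
    for _ in range(iters):
        x_next = y - grad(y) / L
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

# Toy problem: f(x) = 0.5 x^T A x with A = diag(1, 100), so mu = 1, L = 100.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
x0 = np.array([1.0, 1.0])
x_gd = gradient_descent(grad, x0, L=100.0, iters=100)
x_agd = accelerated_gradient(grad, x0, L=100.0, mu=1.0, iters=100)
# After the same iteration budget, the accelerated iterate is far
# closer to the minimizer (the origin) than the gradient descent iterate.
```

On this toy instance the gradient descent error after 100 iterations is governed by $(1 - \mu/L)^{100} \approx 0.37$, while the accelerated error is governed by $(1 - \sqrt{\mu/L})^{100}$, smaller by several orders of magnitude.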

It is natural to ask which other methods, aside from gradient descent,
can be “accelerated”. For example, one may wish to accelerate coordinate
descent or so-called variance reduced methods for finite sum problems; I
will comment on the latter problem class shortly.

One appealing strategy relies on the proximal point method. Güler (1992)
showed that the proximal point method itself can be
equipped with inertial steps leading to improved convergence guarantees.
Building on this work, Lin, Mairal, and Harchaoui (2015; 2017) explained how to derive the total complexity
guarantees for an inexact accelerated proximal point method that take
into account the cost of applying an arbitrary linearly convergent
algorithm to the subproblems. Their Catalyst
acceleration framework is summarized below. The code for
Catalyst is publicly available here.

Catalyst Acceleration

Data: $x_0 \in \mathbb{R}^d$, $\kappa > 0$, algorithm $\mathcal{A}$

Set $q = \frac{\mu}{\mu + \kappa}$, $\alpha_0 = \sqrt{q}$, and $y_0 = x_0$

For $k = 1, 2, \ldots$ do

Use $\mathcal{A}$ to approximately solve:
$$x_k \approx \operatorname*{argmin}_{x} \left\{ f(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \right\}$$

Compute $\alpha_k \in (0, 1)$ from the equation
$$\alpha_k^2 = (1 - \alpha_k)\,\alpha_{k-1}^2 + q\,\alpha_k$$

Compute:
$$y_k = x_k + \frac{\alpha_{k-1}(1 - \alpha_{k-1})}{\alpha_{k-1}^2 + \alpha_k}\,(x_k - x_{k-1})$$
To state the guarantees of this method, suppose that $\mathcal{A}$
converges on the proximal subproblem in function value at a linear rate
$1 - \tau \in (0, 1)$. Then a simple termination policy on the subproblems
$\mathcal{A}$ is asked to solve for $x_k$ yields an algorithm with overall
complexity

$$\widetilde{O}\left(\frac{1}{\tau}\sqrt{\frac{\kappa + \mu}{\mu}} \ln\left(\frac{1}{\varepsilon}\right)\right)$$

That is, the expression
above describes the maximal number of iterations of $\mathcal{A}$
used by the Catalyst algorithm until it finds a point $x$
satisfying $f(x) - \min f \leq \varepsilon$. Typically $\tau$ depends on
$\kappa$; therefore the best choice of $\kappa$ is the one that
minimizes the ratio $\frac{1}{\tau}\sqrt{\frac{\kappa + \mu}{\mu}}$.

The main motivation for the Catalyst framework, and its most potent
application, is the regularized Empirical Risk Minimization (ERM)
problem:

$$\min_{x} \; f(x) := \frac{1}{m}\sum_{i=1}^m f_i(x) + g(x)$$

Such large finite-sum problems are ubiquitous in machine learning and
high-dimensional statistics, where each function $f_i$ typically models
a misfit between predicted and observed data, while $g$ promotes some
low-dimensional structure on $x$, such as sparsity or low rank.

Assume that $g$ is $\mu$-strongly convex and each individual $f_i$ is
$C^1$-smooth with an $L$-Lipschitz gradient. Since $m$ is assumed to be
huge, the complexity of numerical methods is best measured in terms of
the total number of evaluations of the individual gradients $\nabla f_i$. In
particular, fast gradient methods have the worst-case complexity

$$O\left(m\sqrt{\frac{L}{\mu}} \ln\left(\frac{1}{\varepsilon}\right)\right)$$

since each iteration requires evaluating all $m$ of the individual
gradients $\nabla f_i$. Variance reduced algorithms, such as SAG
(Schmidt, Roux, and Bach 2013), SAGA (Defazio, Bach, and Lacoste-Julien
2014), SDCA (Shalev-Shwartz and Zhang 2012), SMART (Davis 2016), SVRG
(Johnson and Zhang 2013; Xiao and Zhang 2014), FINITO (Defazio, Domke,
and Caetano 2014), and MISO (Mairal 2015; Lin, Mairal, and Harchaoui
2015), aim to improve the dependence on $m$. In their raw form, all of
these methods exhibit a similar complexity

$$\widetilde{O}\left(\left(m + \frac{L}{\mu}\right) \ln\left(\frac{1}{\varepsilon}\right)\right)$$

in expectation, and differ only in storage requirements and in whether one
needs to know explicitly the strong convexity constant.
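As an illustration of the variance-reduction idea, here is a minimal SVRG sketch in the spirit of Johnson and Zhang (2013), applied to a toy least-squares finite sum. The problem instance, names, and parameter choices are my own illustration, not from any of the cited codebases.

```python
import numpy as np

def svrg(grad_i, full_grad, m, x0, step, epochs, inner):
    """Minimal SVRG sketch.  Each epoch computes the full gradient once at
    a snapshot point, then takes cheap stochastic steps using the
    variance-reduced estimate
        v = grad_i(x) - grad_i(snapshot) + full_grad(snapshot),
    which is unbiased and whose variance vanishes as the iterates and the
    snapshot both approach the minimizer."""
    rng = np.random.default_rng(0)
    x = x0.copy()
    for _ in range(epochs):
        snapshot = x.copy()
        g_full = full_grad(snapshot)          # one full pass over the data
        for _ in range(inner):
            i = rng.integers(m)
            v = grad_i(i, x) - grad_i(i, snapshot) + g_full
            x = x - step * v
    return x

# Toy least-squares finite sum: f(x) = (1/m) sum_i 0.5 (a_i^T x - b_i)^2,
# constructed so the all-ones vector is the exact solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
b = A @ np.ones(5)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / 50
x_hat = svrg(grad_i, full_grad, m=50, x0=np.zeros(5),
             step=0.01, epochs=30, inner=100)
```

Each epoch costs one full gradient pass ($m$ individual gradients) plus `inner` cheap stochastic steps, which is how these methods achieve the $m + L/\mu$ dependence rather than the $m\sqrt{L/\mu}$ of full-gradient methods.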

It was a long-standing open question to determine if the dependence on
$L/\mu$ can be improved. This is not quite possible in full
generality, and instead one should expect a rate of the form

$$\widetilde{O}\left(\left(m + \sqrt{\frac{mL}{\mu}}\right) \ln\left(\frac{1}{\varepsilon}\right)\right)$$

Indeed, such a rate would be optimal in an appropriate oracle model of
complexity (Woodworth and Srebro 2016; Arjevani 2017; Agarwal and Bottou
2015; Lan 2015). Thus acceleration for ERM problems is only beneficial
in the setting $L/\mu \geq m$.

Early examples for specific algorithms are the accelerated SDCA
(Shalev-Shwartz and Zhang 2015), APPA (Frostig et al. 2015), and RPDG
(Lan 2015). The accelerated SDCA and APPA, in particular, use a
specialized proximal-point construction. Catalyst generic
acceleration makes it possible to accelerate all of the variance reduced
methods above in a single conceptually transparent framework. It is worth noting
that the first direct accelerated variance reduced methods for ERM
problems were recently proposed in Allen-Zhu (2016) and Defazio (2016).

In contrast to the convex setting, the role of inertia for nonconvex
problems is not nearly as well understood. In particular, gradient
descent is black-box optimal for $C^1$-smooth nonconvex minimization
(Carmon et al. 2017b), and therefore inertia cannot help in the worst
case. On the other hand, the recent paper (Carmon et al. 2017a) presents
a first-order method for minimizing $C^2$-smooth functions
that is provably faster than gradient descent. At its core, their
algorithm also combines inertia with the proximal point method. For a
partial extension of the Catalyst framework to weakly convex problems,
see Paquette et al. (2017).

Conclusion

The proximal point method has long been ingrained in the foundations of
optimization. Recent progress in large scale computing has shown that
the proximal point method is not only conceptual, but can guide
methodology. Though direct methods are usually preferable, proximally
guided algorithms can be equally effective and often lead to more easily
interpretable numerical methods. In this blog, I outlined three examples
of this viewpoint, where the proximal-point method guides both the
design and analysis of numerical methods.

Acknowledgements

The author thanks Damek Davis, John Duchi, and Zaid Harchaoui for their
helpful comments on an early draft.