Abstract.

We consider the solution of linear saddle-point problems, using the
alternating direction method-of-multipliers (ADMM) as a preconditioner
for the generalized minimum residual method (GMRES). We show, using
theoretical bounds and empirical results, that ADMM is made remarkably
insensitive to the parameter choice with Krylov subspace acceleration.
We prove that ADMM-GMRES can consistently converge, irrespective of
the exact parameter choice, to an ϵ-accurate solution of
a κ-conditioned problem in O(κ^{2/3} log ϵ^{-1})
iterations. The accelerated method is applied to randomly generated
problems, as well as the Newton direction computation for the interior-point
solution of semidefinite programs in the SDPLIB test suite. The empirical
results confirm this parameter insensitivity, and suggest a slightly
improved iteration bound of O(√κ log ϵ^{-1}).

Department of Electrical Engineering and Computer Science, Massachusetts
Institute of Technology, Cambridge, MA 02139-4307. Email: ryz@mit.edu
and white@mit.edu. Financial support for this work was provided
in part by the Skolkovo-MIT initiative in Computational Mathematics.

Additionally, we assume that efficient solutions (i.e. black-box oracles)
to the two subproblems

(2)

\begin{bmatrix}D & A^{T}\\ A & -\beta^{-1}I\end{bmatrix}\begin{bmatrix}\tilde{x}\\ \tilde{y}\end{bmatrix}=\begin{bmatrix}\tilde{r}_{x}\\ \tilde{r}_{y}\end{bmatrix},\qquad\begin{bmatrix}0 & B^{T}\\ B & -\beta^{-1}I\end{bmatrix}\begin{bmatrix}\tilde{z}\\ \tilde{y}\end{bmatrix}=\begin{bmatrix}\tilde{r}_{z}\\ \tilde{r}_{y}\end{bmatrix},

are available for a fixed choice of β>0.
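To make the role of these black-box oracles concrete, the sketch below builds both coefficient matrices in (2) once and factors them with SciPy's sparse LU, so that each later solve is a pair of cheap triangular substitutions. This is only an illustration under our own assumptions (dense or sparse D, A, B supplied by the caller; a direct factorization as the solver; the function names are hypothetical), not a prescription from the paper.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def make_subproblem_oracles(D, A, B, beta):
    """Hypothetical oracles for the two subproblems in (2): factor each
    saddle-point matrix once, then reuse the factorization for every solve."""
    ny = A.shape[0]
    I = sp.identity(ny, format="csc")
    K1 = sp.bmat([[D, A.T], [A, -I / beta]], format="csc")
    K2 = sp.bmat([[None, B.T], [B, -I / beta]], format="csc")  # zero (1,1) block
    lu1, lu2 = spla.splu(K1), spla.splu(K2)

    def solve1(rx, ry):            # returns (x~, y~) solving the first system
        sol = lu1.solve(np.concatenate([rx, ry]))
        return sol[:D.shape[0]], sol[D.shape[0]:]

    def solve2(rz, ry):            # returns (z~, y~) solving the second system
        sol = lu2.solve(np.concatenate([rz, ry]))
        return sol[:B.shape[1]], sol[B.shape[1]:]

    return solve1, solve2
```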

Saddle-point problems with this structure arise in numerous settings,
ranging from nonlinear optimization to the numerical solution of partial
differential equations (PDEs); the subproblems (2)
are often solved with great efficiency by exploiting application-specific
features. For example, when the data matrices are large-and-sparse,
the smaller saddle-point problems (2) can admit highly
sparse factorizations, based on nested dissection or minimum degree
orderings [31, 30].
Also, the Schur complements D + βA^{T}A and B^{T}B are symmetric
positive definite, and can often be interpreted as discretized Laplacian
operators, for which many fast solvers are available [29, 26, 32, 25].
In some special cases, a triangular factorization or a diagonalization
may be available analytically [28]. The reader
is referred to [4] for a more comprehensive
review of possible applications.

The problem structure has an interpretation of establishing consensus
between the two subproblems. To see this, note that (1)
is the Karush–Kuhn–Tucker (KKT) optimality condition associated
with the equality-constrained least-squares problem

(3)

\begin{aligned}
\underset{x,z}{\text{minimize}}\quad & \tfrac{1}{2}x^{T}Dx - r_{x}^{T}x - r_{z}^{T}z,\\
\text{subject to}\quad & Ax + Bz = r_{y},
\end{aligned}

and the solution to (1) is the unique optimal point.
This problem would be easy to solve if either of the two variables were held fixed.
For instance, holding z fixed, the minimization of (3)
over x to ϵ-accuracy can be accomplished with just a single call
to the first subproblem in (2), taking the parameter
to be β ∈ O(ϵ^{-1}). The difficulty of the overall
problem, then, lies entirely in the need for consensus, i.e. for two
independent minimizations to simultaneously satisfy a single equality
constraint.

The alternating direction method-of-multipliers (ADMM) is a popular
first-order method widely used in signal processing, machine learning,
and related fields, to solve consensus problems like the one posed
in (3); cf. [7] for an extensive
review. Each ADMM iteration calls the subproblems in (2),
with β serving as the step-size parameter for the underlying
gradient ascent. Under the assumptions on the data matrices stated
at the start of the paper, ADMM converges at a linear rate (with error
scaling O(e^{-k}) at the k-th iteration), starting from any
initial point.

The choice of the parameter β heavily influences the effectiveness
of ADMM. Using an optimal choice [15, 19, 13, 14],
the method is guaranteed to converge to an ϵ-accurate solution
in

(4)

O(√κ log ϵ^{-1}) iterations,

where κ is the condition number associated with the rescaled
matrix ~D = (AD^{-1}A^{T})^{-1}. This bound is asymptotically
optimal, in the sense that the square-root factor cannot be improved [18, Thm. 2.1.13].

Unfortunately, explicitly estimating the optimal parameter choice
can be challenging. Picking an arbitrary value, say β = 1,
often results in convergence that is so slow as to be essentially
stagnant, even on well-conditioned problems [14, 19].
A heuristic that works well in practice is to adjust the parameter
after each iteration, using a rule-of-thumb based on keeping the primal
and dual residuals within the same order of magnitude [17, 33, 7].
However, varying the value of β between iterations can substantially
increase the cost of solving the subproblems in (2).

When applied to a least-squares problem, ADMM reduces to a classic
block Gauss-Seidel splitting on the corresponding KKT equations, i.e.
the original saddle-point problem in (1). Viewing
ADMM as the resulting linear fixed-point iterations, convergence can
be optimally accelerated by using a Krylov subspace method
like generalized minimum residual (GMRES) [24, 23].
Or equivalently, viewing ADMM as a preconditioner, it may be
used to improve the conditioning of the KKT equations for a Krylov
subspace method like GMRES. We refer to the GMRES-accelerated version
of ADMM (or the ADMM-preconditioned GMRES) as ADMM-GMRES.

In this paper, we show, using theoretical bounds and empirical results,
that ADMM-GMRES (nearly) achieves the optimal convergence rate in
(4) for every parameter choice. Figure b
makes this comparison for two representative problems. Our first main
result (Theorem 7) conclusively establishes the
optimal iteration bound when β is very large or very small.
Our second main result (Theorem 9) proves a slightly
weaker statement: ADMM-GMRES converges within O(κ^{2/3} log ϵ^{-1})
iterations for all remaining choices of β, subject to a certain
normality assumption. The two bounds give us the confidence to select
the parameter β so as to maximize numerical stability.

To validate these results, we benchmark the performance of ADMM-GMRES
with a randomly selected β in Section 7 against
regular ADMM with an optimally selected β. Two problem classes
are considered: (1) random problems generated by selecting random
orthonormal bases and singular values; and (2) the Newton direction
subproblems associated with the interior-point solution of large-scale
semidefinite programs. Our numerical results suggest that ADMM-GMRES
converges in O(√κ log ϵ^{-1}) iterations for all
values of β, which is a slightly stronger iteration bound than
the one we have proved.

1.2. Related ideas

When the optimal parameter choice β for (regular) ADMM is explicitly
available, we showed in a previous paper [34] that
ADMM-GMRES can consistently converge in just O(κ^{1/4} log ϵ^{-1})
iterations, which is an entire order of magnitude better than the
optimal bound (4). Problems that would otherwise
require thousands of iterations to solve using ADMM are reduced to
just tens of ADMM-GMRES iterations. However, the improved rate cannot
be guaranteed over all problems, and there exist problem classes where
ADMM-GMRES converges in Ω(√κ log ϵ^{-1})
iterations for all choices of β.

More generally, the idea of using a preconditioned Krylov subspace
method to solve a saddle-point system has been explored in-depth by
a number of previous authors [2, 28, 20, 1, 3, 4].
We make special mention of the Hermitian / skew-Hermitian (HSS) splitting
method, first proposed by Bai, Golub & Ng [1]
and used as a preconditioner for saddle-point problems by Benzi &
Golub [3], which also makes use of efficient
solutions to the subproblems in (2). It is curious
to note that HSS has a strikingly similar expression for its optimal
parameter choice and the resulting convergence rate, suggesting that
the two methods may be closely related.

Note that ADMM is an entirely distinct approach from the augmented
Lagrangian / method of multipliers (MM) in optimization, or equivalently,
the Uzawa method in saddle-point problems [10, 8].
In MM, convergence is guaranteed in a small, constant number of iterations,
but each step requires the solution of an ill-conditioned symmetric
positive definite system of equations, often via preconditioned conjugate
gradients. In ADMM, convergence is slowly achieved over a large number
of iterations, but each iteration is relatively inexpensive. We refer
the reader to [7] for a more detailed comparison
of the two methods.

Finally, it remains unknown whether these benefits extend to nonlinear
saddle-point problems (or equivalently, nonlinear versions of the
consensus problem), where the ADMM update equations are also nonlinear.
There are a number of competing approaches to generalize GMRES to
nonlinear fixed-point iterations [22, 9, 11].
Their application to ADMM is the subject of future work.

1.3. Definitions & Notation

Given a matrix M, we use λ_i(M) to refer to its i-th
eigenvalue, and Λ{M} to denote its set of eigenvalues,
including multiplicities. If the eigenvalues are purely real, then
λ_max(M) refers to its most positive eigenvalue, and λ_min(M)
to its most negative eigenvalue. Let ∥·∥ denote the ℓ_2
vector norm, as well as the associated induced norm, also known as
the spectral norm. We use σ_i(M) to refer to the i-th
largest singular value.

Define m = λ_min(~D) and ℓ = λ_max(~D)
as the strong convexity parameter and the gradient Lipschitz constant
for the quadratic form associated with the matrix ~D = (AD^{-1}A^{T})^{-1}.
The quantity κ = ℓ/m is the corresponding condition number.

Beginning with a choice of the quadratic-penalty / step-size parameter
β > 0 and initial points {x^{(0)}, z^{(0)}, y^{(0)}}, the
method generates iterates

Local var. update:

x^{(k+1)}=\arg\min_{x}\;\tfrac{1}{2}x^{T}Dx - r_{x}^{T}x + \tfrac{\beta}{2}\left\|Ax + Bz^{(k)} - c + \tfrac{1}{\beta}y^{(k)}\right\|^{2},

Global var. update:

z^{(k+1)}=\arg\min_{z}\;-r_{z}^{T}z + \tfrac{\beta}{2}\left\|Ax^{(k+1)} + Bz - c + \tfrac{1}{\beta}y^{(k)}\right\|^{2},

Multiplier update:

y^{(k+1)}=y^{(k)}+\beta\left(Ax^{(k+1)}+Bz^{(k+1)}-c\right).
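For intuition, here is a minimal dense-linear-algebra sketch of one such sweep, written directly in terms of the Schur complements D + βA^{T}A and βB^{T}B of the two updates. This is only an illustration under our own assumptions (NumPy arrays, B with full column rank, hypothetical function name); an efficient implementation would instead call the subproblem oracles in (2).

```python
import numpy as np

def admm_sweep(x, z, y, D, A, B, c, rx, rz, beta):
    """One ADMM sweep for (3), solving each argmin via its stationarity condition."""
    # Local update: (D + beta A^T A) x = rx + beta A^T (c - B z - y/beta)
    x = np.linalg.solve(D + beta * A.T @ A,
                        rx + beta * A.T @ (c - B @ z - y / beta))
    # Global update: (beta B^T B) z = rz + beta B^T (c - A x - y/beta)
    z = np.linalg.solve(beta * B.T @ B,
                        rz + beta * B.T @ (c - A @ x - y / beta))
    # Multiplier update (gradient ascent on the dual variable)
    y = y + beta * (A @ x + B @ z - c)
    return x, z, y
```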

Note that the local and global variable updates can each be implemented
by calling one of the two subproblems in (2). Since
the KKT optimality conditions are linear with respect to the decision
variables, the update equations are also linear, and can be written
as the fixed-point iteration

(5)

u^{(k+1)} = G_AD(β)u^{(k)} + b(β),

acting upon the vector of local, global, and multiplier variables, u^{(k)} = [x^{(k)}; z^{(k)}; y^{(k)}].
We will refer to the residual norm in all discussions relating to
convergence.

Definition 1 (Residual convergence).

Given the initial and final iterates u^{(0)} = [x^{(0)}; z^{(0)}; y^{(0)}]
and u^{(k)} = [x^{(k)}; z^{(k)}; y^{(k)}], we say that ϵ residual
convergence is achieved in k iterations if ∥Mu^{(k)} − r∥ ≤ ϵ∥Mu^{(0)} − r∥,
where M and r are the KKT matrix and vector in (1).

2.1. Basic Spectral Properties

Convergence analysis for linear fixed-point iterations is normally
performed by examining the spectral properties of the corresponding
iteration matrix. Using dual feasibility arguments, a block-Schur
decomposition for (6) can be explicitly specified.

where the n_y × n_y inner iteration matrix G_22(β) = ½I + ½K(β)
is defined in terms of the matrix

(9)

K(\beta)=\begin{bmatrix}Q^{T}\\ -P^{T}\end{bmatrix}\left[(\beta^{-1}\tilde{D}+I)^{-1}-(\beta\tilde{D}^{-1}+I)^{-1}\right]\begin{bmatrix}Q & P\end{bmatrix},

and ~D = (AD^{-1}A^{T})^{-1}.

We may immediately conclude that G_AD(β) has n_x + n_z
zero eigenvalues, and n_y nonzero eigenvalues that lie within a
disk on the complex plane centered at +½, with radius
of ½∥K(β)∥. It is straightforward to compute the
radius of this disk exactly.

Lemma 3.

Let ~D = (AD^{-1}A^{T})^{-1}, and define
m = λ_min(~D) and ℓ = λ_max(~D).
Then the spectral norm of K(β) is given by

(10)

\|K(\beta)\|=\frac{\gamma-1}{\gamma+1},\qquad\text{where }\gamma=\max\left\{\frac{\beta}{m},\frac{\ell}{\beta}\right\}.

Also, we see from (8) that each Jordan block
associated with a zero eigenvalue of G_AD(β) is at most of
size 2×2. After two iterations, the behavior of ADMM becomes
entirely dependent upon the inner iteration matrix G_22(β) = ½I + ½K(β).

where c_1 is defined in Lemma 4, and
κ_M = ∥M∥∥M^{-1}∥ with M defined in (1).

2.2. Accelerating Convergence using GMRES

In the context of quadratic objectives, the convergence of ADMM can
be accelerated by GMRES in a largely plug-and-play manner. Given a
specific choice of parameter β > 0 and an initial point u^{(0)} = [x^{(0)}; z^{(0)}; y^{(0)}],
we may task GMRES with the fixed-point equation associated with the
ADMM update equation (5)

(11)

u^⋆ − G_AD(β)u^⋆ = b(β),

which is indeed a linear system of equations when β is held
fixed. It is an elementary fact that the resulting iterates will always
converge to the fixed point faster than regular ADMM (under
a suitably defined metric) [24].

Alternatively, the fixed-point equation (11) is
equivalent to the left-preconditioned system of equations

Note that the ADMM iteration matrix satisfies G_AD(β) = I − P_AD^{-1}(β)M
by definition. In turn, GMRES-accelerated ADMM is equivalent to a
preconditioned GMRES solution to the KKT system, Mu = r, with preconditioner
P_AD(β). Matrix-vector products with P_AD^{-1}(β)
can always be implemented as the composition of an augmentation operation
and a single iteration of ADMM, as seen in the factorization in (13).

GMRES can also be used to solve the right-preconditioned system

(14)

MP_AD^{-1}(β)û − r = 0,

and the solution is recovered via u = P_AD^{-1}(β)û.
The resulting method performs essentially the same steps as the one
above, but optimizes the iterates under a more preferable metric.
Starting from the same initial point u^{(0)}, the k-th iterate
of GMRES as applied to (14), written u_GM^{(k)},
is guaranteed to produce a KKT residual norm that is smaller than
or equal to that of the k-th iterate of regular ADMM, written u_AD^{(k)},
as in ∥Mu_GM^{(k)} − r∥ ≤ ∥Mu_AD^{(k)} − r∥. This property
is preferable as P_AD(β) becomes progressively ill-conditioned
and numerical precision becomes an issue; cf. [34, Sec. 7]
for details.
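The following sketch shows how the right-preconditioned variant (14) might be wired up with SciPy's GMRES. It presumes a user-supplied KKT matrix M and a routine admm_linear_step(u, rhs) returning the stacked linear form G_AD(β)u + P_AD^{-1}(β)·rhs of one ADMM sweep; both names are our own hypothetical placeholders, so this is a harness sketched under stated assumptions rather than the paper's code.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def admm_gmres(M, r, admm_linear_step, maxiter=500):
    """Right-preconditioned ADMM-GMRES sketch for M u = r, cf. (14).
    `admm_linear_step(u, rhs)` must return G_AD(beta) u + P_AD(beta)^{-1} rhs."""
    n = r.shape[0]
    # Applying P_AD^{-1} to a vector v is one ADMM sweep started from u = 0.
    apply_Pinv = lambda v: admm_linear_step(np.zeros(n), v)
    # Right-preconditioned operator: v -> M P_AD^{-1} v.
    op = LinearOperator((n, n), matvec=lambda v: M @ apply_Pinv(v))
    u_hat, info = gmres(op, r, maxiter=maxiter)   # default GMRES tolerances
    return apply_Pinv(u_hat), info                # recover u = P_AD^{-1} u_hat
```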

Throughout this paper, we will refer to both methods as GMRES-accelerated
ADMM, or ADMM-GMRES for short, and reserve the “left-preconditioned”
or the “right-preconditioned” specifications only where the distinctions
are important. The reason is that both methods share a common bound
for the purposes of convergence analysis.

Proposition 6.

Given fixed β > 0, let u^{(k)} be
the iterate generated at the k-th iteration of GMRES as applied
to either (12) or (14). Then the
following bound holds for all k ≥ 2:

\frac{\|Mu^{(k)}-r\|}{\|Mu^{(0)}-r\|}\le c_{1}\kappa_{P}\min_{\substack{p\in\mathcal{P}_{k-2}\\ p(1)=1}}\|p(K)\|,

where K ≡ K(β) is defined in (9), c_1
is defined in Lemma 4, κ_P = ∥P_AD∥∥P_AD^{-1}∥
with P_AD ≡ P_AD(β) defined in (13),
and P_k denotes the space of order-k polynomials.

and the last equality is due to the existence of a bijective affine
map between Λ{G_22} ∪ {1} and Λ{K} ∪ {1}. In the left-preconditioned
system (12), the matrix I − A is I − P_AD^{-1}M = G_AD,
and the κ_P factor arises by bounding the residuals ∥P_AD r^{(k)}∥/∥P_AD r^{(0)}∥ ≤ κ_P∥r^{(k)}∥/∥r^{(0)}∥.
In the right-preconditioned system (14), the matrix
I − A is I − MP_AD^{-1} = P_AD G_AD P_AD^{-1}, and the κ_P
factor arises via ∥p(P_AD G_AD P_AD^{-1})∥ ≤ κ_P∥p(G_AD)∥.
∎

In order to use Proposition 6 to derive useful
convergence estimates, the polynomial norm-minimization problem can
be reduced into a polynomial min-max approximation problem over a
set of points on the complex plane. More specifically, consider the
following sequence of inequalities

which makes the following normality assumption, that is standard within
this context.

Assumption 1 (κX is bounded).

Given fixed
β > 0, the matrix K ≡ K(β), defined in (9),
is diagonalizable with eigendecomposition K = XΛX^{-1}. Furthermore,
the condition number for the matrix-of-eigenvectors, κ_X = ∥X∥∥X^{-1}∥,
is bounded from above by an absolute constant.

We refer to this last problem in (17) as the eigenvalue
approximation problem. Only in very rare cases is an explicit closed-form
solution known, but any heuristic choice of polynomial p(⋅)
will provide a valid upper-bound.

Our main result in this paper is that ADMM-GMRES converges to an ϵ-accurate
solution in O(κ^{2/3} log ϵ^{-1}) iterations for any
value of β > 0, in the sense of the residual. We split the precise
statement into two parts. First, for very large and very small values
of β, we can conclusively establish that ADMM-GMRES converges
in O(√κ log ϵ^{-1}) iterations. This is asymptotically
the same as the optimal figure for regular ADMM.

Theorem 7 (Extremal β).

For any choice of β > ℓ or 0 < β < m,
GMRES-accelerated ADMM generates an iterate u^{(k)} = [x^{(k)}; z^{(k)}; y^{(k)}]
at the k-th iteration that satisfies

\frac{\|Mu^{(k)}-r\|}{\|Mu^{(0)}-r\|}\le2c_{1}\kappa_{P}\left[1+\left(\max\left\{\frac{\beta}{\ell},\frac{m}{\beta}\right\}-1\right)^{-1}\right]\left(\frac{\sqrt{2\kappa}-1}{\sqrt{2\kappa}+1}\right)^{0.317k},

where κ = ℓ/m and the factors c_1, κ_P are polynomial
in β + β^{-1} and defined in Lemma 4
and Proposition 6.

Corollary 8.

GMRES-accelerated ADMM achieves ϵ residual convergence in

O(√κ log ϵ^{-1} + √κ |log β|) iterations

for any choice of β > ℓ or 0 < β < m.

For intermediate choices of β, the same result almost
holds. Subject to the normality assumption in Assumption 1,
ADMM-GMRES converges in O(κ^{2/3} log ϵ^{-1}) iterations.
Accordingly, we conclude that ADMM-GMRES converges within this number
of iterations for every fixed value of β > 0.

Theorem 9 (Intermediate β).

For any choice of m ≤ β ≤ ℓ, GMRES-accelerated
ADMM generates an iterate u^{(k)} = [x^{(k)}; z^{(k)}; y^{(k)}] at
the k-th iteration that satisfies

\frac{\|Mu^{(k)}-r\|}{\|Mu^{(0)}-r\|}\le2c_{1}\kappa_{P}\kappa_{X}\left(\frac{\kappa^{2/3}}{\kappa^{2/3}+1}\right)^{0.209k},

where κ = ℓ/m, the factors c_1, κ_P are polynomial
in β and defined in Lemma 4 and Proposition 6,
and the factor κ_X is defined in Assumption 1.

Corollary 10.

GMRES-accelerated ADMM achieves ϵ residual
convergence in

O(κ^{2/3} log ϵ^{-1} + κ^{2/3} |log β|) iterations

for any choice of β>0.

As we reviewed in Section 2, convergence analysis
for GMRES can be reduced to a polynomial approximation problem over
the eigenvalues of K(β), written in (17). Our
proofs for Theorems 7 & 9 are
based on solving this polynomial approximation problem heuristically.
The discrete eigenvalue distribution of K(β) is enclosed within
simple regions on the complex plane in Section 4
for different values of β. The polynomial approximation problems
associated with these outer enclosures are solved in their most general
form in Section 5. These results are pieced
together in Section 6, yielding proofs to our
main results.

The eigenvalues of the iteration matrix K(β) play a pivotal
role in driving the convergence of both ADMM as well as ADMM-GMRES.
Figure f plots these for a fixed, randomly generated
problem, while sweeping the value of β. Initially, we see two
clusters of purely-real eigenvalues, tightly concentrated about ±1,
that enlarge and shift closer towards the origin and towards each
other with increasing β. As the two clusters coalesce, some
of the purely-real eigenvalues become complex. The combined radius
of the two clusters reaches its minimum at around β = √(mℓ),
at which point most of the eigenvalues are complex. Although not shown,
the process is reversed once β moves past √(mℓ);
the two clusters shrink, become purely-real, break apart, and move
away from the origin, ultimately reverting into two clusters concentrated
about ±1.

Three concrete findings can be summarized from these observations.
First, despite the fact that K(β) is nonsymmetric, its eigenvalues
are purely-real over a broad range of β.

Lemma 11.

Let β > ℓ or β < m. Then K(β)
is diagonalizable and its eigenvalues are purely real. Furthermore,
let κ_X be the condition number for the matrix-of-eigenvectors
as defined in Assumption 1. Then this quantity is
bounded:

(18)

\kappa_{X}\le1+\left(\max\left\{\frac{\beta}{\ell},\frac{m}{\beta}\right\}-1\right)^{-1}.

Furthermore, the eigenvalues are partitioned into two distinct, purely-real
clusters that only become complex once they coalesce.

Lemma 12.

Define the positive scalar γ = max{β/m, ℓ/β},
which satisfies γ ≥ √κ by construction. If γ ∈ [√κ, κ],
then Λ(K) is enclosed within the union of a disk and an interval:

(19)

\Lambda(K)\subset\left\{z\in\mathbb{C}:|z|\le\frac{\kappa}{\gamma+\kappa}-\frac{1}{\gamma+1}\right\}\cup\left[-\frac{\gamma-1}{\gamma+1},\,+\frac{\gamma-1}{\gamma+1}\right].

If γ ∈ (κ, 2κ], then Λ(K) is enclosed within
a single interval:

(20)

\Lambda(K)\subset\left[-\frac{\gamma-1}{\gamma+1},\,+\frac{\gamma-1}{\gamma+1}\right].

Finally, if γ ∈ (2κ, ∞), then Λ(K) is enclosed
within the union of two disjoint intervals:

(21)

\Lambda(K)\subset\left[-\frac{\gamma-1}{\gamma+1},\,-\frac{\gamma-2\kappa}{\gamma+\kappa}\right]\cup\left[+\frac{\gamma-2\kappa}{\gamma+\kappa},\,+\frac{\gamma-1}{\gamma+1}\right].

Furthermore, if 0 < n_z < n_y, then Λ(K) contains at least
one eigenvalue within each interval.

In the limits β→0 and β→∞, the two clusters
in (21) concentrate about ±1, and the spectral
radius of the ADMM iteration matrix converges towards 1.

Corollary 13.

Let β > 2ℓ or β < m/2.
Then for 0 < n_z < n_y, there exists an eigenvalue of the ADMM iteration
matrix λ_i ∈ Λ{G_AD(β)} whose modulus is
lower-bounded:

|λ_i| ≥ (γ−κ)/(γ+κ), where γ = max{β/m, ℓ/β}.

The spectral radius determines the asymptotic convergence rate, so
given Corollary 13, it is unsurprising that ADMM
stagnates if β is poorly chosen. But the situation is different
with ADMM-GMRES, because it is able to exploit the clustering of eigenvalues.
As we will see later, this is the mechanism that allows ADMM-GMRES
to be insensitive to the parameter choice.

4.1. Properties of J-symmetric matrices

Most of our characterizations for the eigenvalues of K(β) are
based on a property known as “J-symmetry”. In the following
discussion, we will drop all arguments with respect to β for
clarity. Returning to its definition in (9), we note
that K has the block structure

(22)

K=\begin{bmatrix}X & Z\\ -Z^{T} & Y\end{bmatrix},

with subblocks X ∈ ℝ^{n_z×n_z}, Y ∈ ℝ^{(n_y−n_z)×(n_y−n_z)},
and Z ∈ ℝ^{n_z×(n_y−n_z)},

(23)

X=Q^{T}\tilde{K}Q,\qquad Y=-P^{T}\tilde{K}P,\qquad Z=Q^{T}\tilde{K}P,

(24)

\tilde{K}=(\beta^{-1}\tilde{D}+I)^{-1}-(\beta\tilde{D}^{-1}+I)^{-1},

and the matrices Q and P with orthonormal columns are defined
as in Lemma 2. From the block structure in (22)
we see that the matrix K is self-adjoint with respect to the indefinite
product (assuming 0 < n_z < n_y) defined by J = blkdiag(I_{n_z}, −I_{n_y−n_z}):

⟨y, Kx⟩_J = ⟨Ky, x⟩_J ⟺ y^{T}JKx = (Ky)^{T}Jx.

Matrices that have this property frequently appear in saddle-point
type problems; cf. [4, 5]
for a more detailed treatment of this subject. Much can be said about
their spectral properties.

Proposition 14.

The J-symmetric matrix K in (22)
has at most 2·min{n_z, n_y−n_z} eigenvalues with nonzero imaginary
parts, counting conjugates. These eigenvalues are contained within
the disk D_a = {z ∈ ℂ : |z| ≤ a} of radius

a=\min_{\eta\in\mathbb{R}}\|K+\eta J\|.

Proof.

Benzi & Simoncini [5] provide a succinct
proof for the first statement. The second statement is based on the
fact that every eigenpair {λ_i, x_i} of K satisfying
Im(λ_i) ≠ 0 must have an eigenvector x_i
that is “J-neutral”, i.e. satisfying x_i^{*}Jx_i = 0;
cf. [5, Thm. 2.1]. Hence, the following
bound holds

(25)

|\lambda_{i}|\le\max_{\|x\|=1}\left\{|x^{*}Kx|:x^{*}Jx=0\right\}

for every λ_i with Im(λ_i) ≠ 0. Taking
the Lagrangian dual of (25) yields the desired statement.
∎
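A quick numerical illustration (our own, not from the paper): any matrix with the block structure (22) and symmetric diagonal blocks is J-symmetric, and the number of its non-real eigenvalues obeys the bound of Proposition 14. The block sizes and random seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
nz, ny = 3, 10
Xb = rng.standard_normal((nz, nz));           Xb = (Xb + Xb.T) / 2   # symmetric X
Yb = rng.standard_normal((ny - nz, ny - nz)); Yb = (Yb + Yb.T) / 2   # symmetric Y
Zb = rng.standard_normal((nz, ny - nz))

K = np.block([[Xb, Zb], [-Zb.T, Yb]])                  # block structure of (22)
J = np.diag(np.r_[np.ones(nz), -np.ones(ny - nz)])

print(np.allclose(J @ K, (J @ K).T))                    # True: K is self-adjoint w.r.t. J
lam = np.linalg.eigvals(K)
print((np.abs(lam.imag) > 1e-9).sum() <= 2 * min(nz, ny - nz))   # True: Prop. 14 bound
```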

Also, we can derive a simple sufficient condition for the eigenvalues
of K to be purely real, based on the ideas described in [5].

Proposition 15.

Suppose that there exists a real scalar η ≠ 0
such that the matrix H = ηJK is positive definite. Then K is diagonalizable
with eigendecomposition K = XΛX^{-1}, its eigenvalues are
purely real, and the condition number of the matrix-of-eigenvectors
satisfies κ_X ≜ ∥X∥∥X^{-1}∥ ≤ √(∥H∥∥H^{-1}∥).

Proof.

It is easy to verify that K is also symmetric with respect to H,
in the sense that HK = K^{T}H. Since H is positive definite, there exists
a symmetric positive definite matrix W = W^{T} satisfying W² = H,
and the H-symmetry implies

W(WKW^{-1})W = W(W^{-1}K^{T}W)W ⟺ WKW^{-1} = (WKW^{-1})^{T} = ~M.

Hence we conclude that K is similar to the real symmetric matrix
~M, with purely-real eigenvalues and eigendecomposition
~M = VΛV^{T}, where V is orthogonal. The corresponding
eigendecomposition for K is K = XΛX^{-1} with X = W^{-1}V.
∎
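As a sanity check (again our own construction), any K of the form K = η^{-1}JH with H ≻ 0 satisfies the hypothesis of Proposition 15, since ηJK = J²H = H. The snippet below confirms the real spectrum and reproduces the bound κ_X ≤ √(∥H∥∥H^{-1}∥) by rebuilding X = W^{-1}V exactly as in the proof; all dimensions and seeds are arbitrary.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
nz, ny, eta = 4, 10, 2.0
J = np.diag(np.r_[np.ones(nz), -np.ones(ny - nz)])

G = rng.standard_normal((ny, ny))
H = G @ G.T + ny * np.eye(ny)          # random positive definite H
K = J @ H / eta                        # then eta*J*K = H > 0 and J*K is symmetric

print(np.max(np.abs(np.linalg.eigvals(K).imag)))   # ~1e-15: spectrum is purely real

W = sqrtm(H).real                      # symmetric W with W^2 = H
S = W @ K @ np.linalg.inv(W)           # similar to K and (numerically) symmetric
_, V = np.linalg.eigh((S + S.T) / 2)   # orthogonal eigenvectors of S
X = np.linalg.solve(W, V)              # eigenvector matrix X = W^{-1} V from the proof
print(np.linalg.cond(X) <= np.sqrt(np.linalg.cond(H)) * (1 + 1e-8))   # True
```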

Finally, we may use the block-generalization of Gershgorin’s circle
theorem to decide when the off-diagonal block Z is sufficiently
“small” that the eigenvalues of K remain close to those of the
block diagonal matrix blkdiag(X, Y).

Proof of Lemma 11.

The cases of n_z = 0 and n_z = n_y are trivial. In the remaining
cases, K is J-symmetric, and we will use Proposition 15
to prove the statement. Noting that JK is unitarily similar to
~K, a simple direct computation reveals that +JK is positive
definite for β > ℓ, and −JK is positive definite for β < m.
Hence, for these choices of β, the eigenvalues of K are
purely real. Some further computation reveals that ∥JK∥ = (γ−1)/(γ+1)
and ∥(JK)^{-1}∥ = (γ+κ)/(γ−κ), so the condition
number of JK is bounded:

(26)

\|JK\|\|(JK)^{-1}\|<\|(JK)^{-1}\|=\frac{\gamma+\kappa}{\gamma-\kappa}=1+\frac{2}{\gamma/\kappa-1}.

Taking the square root and substituting √(1+2x) ≤ 1+x yields
the desired estimate for κ_X in (18).
∎

Proof of Lemma 12.

Again, the cases of n_z = 0 and n_z = n_y are trivial. In all remaining
cases, K is J-symmetric. The single-interval case (20)
is a trivial consequence of our purely-real result in Lemma 11.
The disk-and-interval case (19) arises from Proposition 14,
in which we use the disk of radius

a=\min_{\eta\in\mathbb{R}}\|K-\eta J\|=\min_{\eta\in\mathbb{R}}\|\tilde{K}-\eta I\|=\frac{\kappa}{\gamma+\kappa}-\frac{1}{\gamma+1}

to enclose the eigenvalues with nonzero imaginary parts, and the spectral
norm disk |λ_i(K)| ≤ ∥K∥ to enclose the purely-real eigenvalues.
The two-interval case (21) is a consequence of the
block Gershgorin theorem in Proposition 16. Some
direct computation on the block matrices in (22) yields
the following two statements:

(27)

\Lambda\{X\}\cup\Lambda\{Y\}\subset\left[-\frac{\gamma-1}{\gamma+1},\,-\frac{\gamma-\kappa}{\gamma+\kappa}\right]\cup\left[+\frac{\gamma-\kappa}{\gamma+\kappa},\,+\frac{\gamma-1}{\gamma+1}\right]\subset\mathbb{R},

(28)

\|Z\|\le\frac{\kappa}{\gamma+\kappa},

the latter of which also uses the fact that ∥Q^{T}~KP∥ = ∥Q^{T}(~K−ηI)P∥ ≤ ∥~K−ηI∥.
The projections of the corresponding Gershgorin regions onto the real
line satisfy

\mathrm{Re}\{\mathcal{G}_{-}\}\subset\left[-\frac{\gamma-1}{\gamma+1}-\frac{\kappa}{\gamma+\kappa},\,-\frac{\gamma-2\kappa}{\gamma+\kappa}\right],\qquad\mathrm{Re}\{\mathcal{G}_{+}\}\subset\left[\frac{\gamma-2\kappa}{\gamma+\kappa},\,\frac{\gamma-1}{\gamma+1}+\frac{\kappa}{\gamma+\kappa}\right].

Once γ > 2κ, the regions become separated. Taking the intersection
between each disjoint Gershgorin region, the real line, and the spectral
norm disk |λ_i(K)| ≤ ∥K∥ yields (21).
∎

With the distribution of eigenvalues characterized in Lemma 12,
we now consider solving each of the three accompanying approximation
problems in their most general form.

5.1. Chebyshev approximation over the real interval

The optimal approximation for an interval over the real line has a
closed-form solution due to a classic result attributed to Chebyshev.

Theorem 17.

Let I denote the interval [c−a, c+a]
on the real line. Then, assuming that +1 ∉ I, the polynomial
approximation problem has the closed-form solution

(29)

\min_{\substack{p\in\mathcal{P}_{k}\\ p(1)=1}}\;\max_{z\in I}|p(z)|=\frac{1}{|T_{k}(\frac{1-c}{a})|}\le2\left(\frac{\sqrt{\kappa_{I}}-1}{\sqrt{\kappa_{I}}+1}\right)^{k},

where T_k(z) is the degree-k Chebyshev polynomial of the first
kind, and κ_I = (|1−c|+a)/(|1−c|−a) is the condition number
for the interval. The minimum is attained by the Chebyshev polynomial
p^⋆(z) = T_k((z−c)/a) / |T_k((1−c)/a)|.

Whereas approximating a general κ-conditioned region to ϵ-accuracy
requires an order-O(κ log ϵ^{-1}) polynomial, approximating
a real interval of the same conditioning and to the same accuracy
only requires an order-O(√κ log ϵ^{-1}) polynomial.
This is the underlying mechanism that grants the conjugate gradients
method a square-root factor speed-up over gradient descent; cf. [16, Ch. 3]
for a more detailed discussion. For future reference, we also note
the following identity.

Remark 18.

Given any ζ>+1, define the corresponding
condition number as ν=(ζ+1)/(ζ−1). Then

(30)

|\zeta|^{k}=\left(\frac{\nu+1}{\nu-1}\right)^{k},\qquad\frac{1}{2}\left(\frac{\sqrt{\nu}+1}{\sqrt{\nu}-1}\right)^{k}\le|T_{k}(\zeta)|\le\left(\frac{\sqrt{\nu}+1}{\sqrt{\nu}-1}\right)^{k}.
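As a quick numerical check of (29) and (30) (our own illustration; the values of c, a, and k below are arbitrary), the following NumPy snippet evaluates the scaled Chebyshev polynomial on a sample interval and verifies both the closed-form minimum and the sandwich bound.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

c, a, k = 0.3, 0.25, 12                          # interval I = [c-a, c+a], with +1 outside I
kappa_I = (abs(1 - c) + a) / (abs(1 - c) - a)
Tk = C.Chebyshev.basis(k)                        # degree-k Chebyshev polynomial T_k

# (29): p*(z) = T_k((z-c)/a)/|T_k((1-c)/a)| attains 1/|T_k((1-c)/a)| on I ...
z = np.linspace(c - a, c + a, 20001)
p_star = Tk((z - c) / a) / abs(Tk((1 - c) / a))
print(np.isclose(np.max(np.abs(p_star)), 1 / abs(Tk((1 - c) / a))))
# ... and that value is below 2*((sqrt(kappa_I)-1)/(sqrt(kappa_I)+1))^k.
print(1 / abs(Tk((1 - c) / a))
      <= 2 * ((np.sqrt(kappa_I) - 1) / (np.sqrt(kappa_I) + 1)) ** k)

# (30): with zeta = (1-c)/a and nu = (zeta+1)/(zeta-1) = kappa_I.
zeta, nu = (1 - c) / a, kappa_I
t = (np.sqrt(nu) + 1) / (np.sqrt(nu) - 1)
print(np.isclose(zeta ** k, ((nu + 1) / (nu - 1)) ** k))   # first identity in (30)
print(0.5 * t ** k <= abs(Tk(zeta)) <= t ** k)             # sandwich bound in (30)
```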

5.2. Real intervals symmetric about the imaginary axis

Now, consider the polynomial approximation problem for two real, non-overlapping
intervals with respect to the constraint point +1, illustrated
in Fig. 3, which arises as the eigenvalue distribution
(21) in Lemma 12.

Lemma 19.

Given a≥0 and c≥a, define the two closed
intervals

(31)

\mathcal{I}_{-}=\{z\in\mathbb{R}:|z+c|\le a\},\qquad\mathcal{I}_{+}=\{z\in\mathbb{R}:|z-c|\le a\},

such that +1∉I+ and I−∩I+=∅.
Then the following holds

(32)

\left(\frac{\sqrt{\kappa_{+}}-1}{\sqrt{\kappa_{+}}+1}\right)^{k}\le\min_{\substack{p\in\mathcal{P}_{k}\\ p(1)=1}}\;\max_{z\in\mathcal{I}_{-}\cup\mathcal{I}_{+}}|p(z)|\le2\left(\frac{\sqrt{\kappa_{+}}-1}{\sqrt{\kappa_{+}}+1}\right)^{0.317k},

where κ+=(1−c+a)/(1−c−a) is the condition number for the
segment I+.

Of course, the union of the two intervals, i.e. I−∪I+,
lies within a single real interval with condition number κI=(1+c+a)/(1−c−a),
so Theorem 17 can also be used to obtain an estimate.
However, the explicit treatment of clustering in Lemma 19
yields a considerably tighter bound, because it is entirely possible
for each I− and I+ to be individually
well-conditioned while admitting an extremely ill-conditioned union.
For a concrete example, consider setting a=0 and taking the limit
c→1; the condition number for I+ is fixed at
κ+=1, but the condition number for the union I−∪I+
diverges κI→∞. In this case, Lemma 19
predicts extremely rapid convergence for all values of c, whereas
Theorem 17 does not promise convergence at all.

Lemma 20.

Define f(x) = log[(x−1)/(x+1)] with
domain x ∈ (1, ∞). Then the quotient g(x) = f(x)/f(x²)
is monotonically increasing, with infimum attained at the limit point
g(1) = 1.

Proof.

By definition, we see that both f(x) and f(x²) are nonzero
for all x > 1. Taking the derivatives

(33)

\frac{d}{dx}[f(x)]=\frac{2}{x^{2}-1}=\frac{2x^{2}+2}{x^{4}-1},\qquad\frac{d}{dx}[f(x^{2})]=\frac{4x}{x^{4}-1},

reveals that f(x) is monotonically increasing for all x > 1, so
we also have f(x) < f(x²) (since x² > x). Finally, we observe that
d/dx[f(x)] > d/dx[f(x²)] > 0 for all x > 1. Combining these three
observations with the quotient rule reveals that g(x) is monotonically increasing