Abstract

We propose AD3, a new algorithm for approximate maximum a
posteriori (MAP)
inference on factor graphs based on the alternating
directions method of multipliers.
Like dual decomposition algorithms,
AD3
uses worker nodes to iteratively solve local subproblems and a
controller node to combine these local solutions into a global update.
The key characteristic of AD3 is that
each local subproblem has a quadratic regularizer, leading to a faster
consensus than subgradient-based dual decomposition,
both theoretically and in practice.
We provide closed-form solutions for these AD3 subproblems
for binary pairwise factors and factors imposing first-order logic
constraints.
For arbitrary factors (large or combinatorial),
we introduce an active set method which requires
only an oracle for computing a local MAP configuration,
making AD3 applicable to a wide range of problems.
Experiments on synthetic and real-world problems show that AD3
compares favorably with the state-of-the-art.

Graphical models enable compact
representations of probability distributions, being widely used in
computer vision, natural language processing (NLP),
and computational biology (Pearl, 1988; Lauritzen, 1996; Koller and Friedman, 2009).
When using these models, a central problem is that of inferring the
most probable (a.k.a. maximum a posteriori – MAP) configuration.
Unfortunately, exact MAP inference is an intractable problem
for many graphical models of interest in
applications, such as those involving non-local features and/or
structural constraints.
This fact has motivated a significant research effort on approximate MAP inference.

A class of methods that proved effective for approximate inference
is based on linear programming relaxations of the MAP problem (LP-MAP) (Schlesinger, 1976).
Several message-passing algorithms have been proposed to address the resulting LP problems,
taking advantage of the underlying graph structure (Wainwright et al., 2005; Kolmogorov, 2006; Werner, 2007; Globerson and Jaakkola, 2008; Ravikumar et al., 2010).
In the same line, Komodakis et al. (2007) proposed a method
based on the classical dual decomposition (Dantzig and Wolfe, 1960; Everett III, 1963; Shor, 1985),
which breaks the original problem into a set of smaller subproblems; this set of subproblems,
together with the constraints that they should agree on the shared variables, yields a
constrained optimization problem, the Lagrange dual of which is solved
using a projected subgradient method. Initially proposed for computer vision
(Komodakis et al., 2007; Komodakis and Paragios, 2009), this technique has also been shown effective in
several NLP tasks that involve combinatorial constraints,
such as parsing and machine translation (Koo et al., 2010; Rush et al., 2010; Auli and Lopez, 2011; Rush and Collins, 2011; Chang and Collins, 2011; DeNero and Macherey, 2011).
The major drawback is that the subgradient algorithm is very slow to
achieve consensus in problems with a large number of overlapping components,
requiring O(1/ϵ²) iterations for an ϵ-accurate solution.
Jojic et al. (2010) addressed this drawback by introducing an accelerated method that
enjoys a faster convergence rate O(1/ϵ); however, that method is sensitive
to previous knowledge of ϵ and the prescription of a temperature parameter,
which may yield numerical instabilities.

In this paper, we ally the simplicity of dual decomposition (DD) with
the effectiveness of augmented Lagrangian methods, which have a
long-standing history in optimization (Hestenes, 1969; Powell, 1969).
In particular, we employ the alternating directions method of multipliers (ADMM), a method introduced in the optimization
community in the 1970s (Glowinski and Marroco, 1975; Gabay and Mercier, 1976) which
has seen a recent surge of interest in machine learning and signal
processing (Afonso et al., 2010; Mota et al., 2010; Goldfarb et al., 2010; Boyd et al., 2011).
The result is a new LP-MAP inference algorithm called AD3 (alternating directions dual decomposition),
which enjoys an O(1/ϵ) convergence rate. We show that AD3 is able to solve the LP-MAP
problem more efficiently than other methods. Like the projected subgradient algorithm of Komodakis et al. (2007),
AD3 is an iterative “consensus algorithm,” alternating between a broadcast operation,
where subproblems are distributed across local workers, and a gather operation, where the local
solutions are assembled by a controller. The key difference is that AD3 also broadcasts
the current global solution, allowing the local workers to regularize their subproblems toward
that solution. This has the effect of speeding up consensus, without sacrificing the modularity of DD
algorithms.

Our main contributions are:

We derive AD3 and establish its convergence properties,
blending classical and newer results about ADMM
(Glowinski and Le Tallec, 1989; Eckstein and Bertsekas, 1992; Wang and Banerjee, 2012).
We show that the algorithm has the same form as
the DD method of Komodakis et al. (2007), with the local MAP
subproblems replaced by quadratic programs.

We show that these AD3 subproblems can be solved
exactly and efficiently in many cases of interest, including
Ising models and a wide range of hard factors representing arbitrary
constraints in first-order logic.
Up to a logarithmic term, the asymptotic cost is the same as
that of passing messages or doing local MAP inference.

For factors lacking a closed-form solution of the AD3
subproblems, we introduce a new active set method.
Remarkably, our method requires
only a black box that returns local MAP configurations for each factor
(the same requirement as the projected subgradient algorithm).
This paves the way for using AD3 with large dense or structured
factors, based on off-the-shelf combinatorial algorithms
(e.g., Viterbi, Chu-Liu-Edmonds, Ford-Fulkerson).

We show that AD3 can be wrapped into a branch-and-bound
procedure to retrieve the exact MAP configuration.

AD3 was originally introduced by Martins et al. (2010a, 2011a) (then
called DD-ADMM). In addition to a considerably more detailed presentation, this paper
contains contributions that substantially extend
that preliminary work in several directions: the O(1/ϵ) rate of
convergence, the active set method for general factors, and the
branch-and-bound procedure for exact MAP inference. It also reports
a wider set of experiments and the release of open-source code (available at
http://www.ark.cs.cmu.edu/AD3), which may be useful to other researchers in the field.

This paper is organized as follows. We start by providing background material:
MAP inference in graphical models and its LP-MAP relaxation (Section 2);
the projected subgradient algorithm for the DD of Komodakis et al. (2007) (Section 3).
In Section 4, we derive AD3 and analyze its convergence.
The AD3 local subproblems are addressed in Section 5,
where closed-form solutions are derived for Ising models and
several structural constraint factors. In Section 6,
we introduce an active set method to solve the AD3 subproblems for
arbitrary factors. Experiments with synthetic models, as well as in protein design
and dependency parsing (Section 7) attest to the success of
our approach. Finally, related work is discussed in Section 8,
and Section 9 concludes the paper.

2.1 Factor Graphs

Let Y := (Y_1, \ldots, Y_M) be a tuple of random variables,
where each Y_i takes values in a finite set \mathcal{Y}_i and Y
takes values in the corresponding Cartesian product set \mathcal{Y} := \prod_{i=1}^{M} \mathcal{Y}_i.
Graphical models (e.g., Bayesian networks, Markov random fields)
are structural representations of joint probability distributions, which
conveniently model statistical (in)dependencies among the variables.
In this paper, we focus on factor graphs,
introduced by Tanner (1981) and Kschischang et al. (2001).

Definition 1 (Factor graph)

A factor graph is a bipartite graph G:=(V,F,E),
comprised of:

a set of variable nodes V := \{1, \ldots, M\}, corresponding to the variables (Y_1, \ldots, Y_M);

a set of factor nodes F (disjoint from V);

a set of edges E \subseteq V \times F linking variable nodes to factor nodes.

For notational convenience, we use Latin letters (i,j,...) and Greek letters (α,β,...)
to refer to variable and factor nodes, respectively. We denote by
N(⋅) the neighborhood set of its node argument, whose cardinality is
called the degree of the node. Formally, N(i) := \{\alpha \in F \mid (i,\alpha) \in E\}
for variable nodes, and N(\alpha) := \{i \in V \mid (i,\alpha) \in E\} for factor nodes.
We use the short notation Y_\alpha to denote the tuple of variables linked to factor \alpha,
i.e., Y_\alpha := (Y_i)_{i \in N(\alpha)}, and lowercase to denote instantiations
of random variables, e.g., Y_i = y_i and Y_\alpha = y_\alpha.
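The notation above can be made concrete with a small container class. This is an illustrative sketch (the class and method names are our own, not part of any released AD3 code):

```python
# Minimal sketch of a factor graph matching Definition 1.
# All names (FactorGraph, add_variable, ...) are illustrative.
from collections import defaultdict

class FactorGraph:
    def __init__(self):
        self.domains = {}                        # variable node i -> finite set Y_i
        self.factors = {}                        # factor node alpha -> tuple N(alpha)
        self.var_neighbors = defaultdict(list)   # N(i): factors touching variable i

    def add_variable(self, i, domain):
        self.domains[i] = list(domain)

    def add_factor(self, alpha, variables):
        self.factors[alpha] = tuple(variables)
        for i in variables:
            self.var_neighbors[i].append(alpha)

    def degree(self, node):
        # degree = cardinality of the neighborhood set N(.)
        if node in self.domains:
            return len(self.var_neighbors[node])
        return len(self.factors[node])

g = FactorGraph()
for i in range(3):
    g.add_variable(i, [0, 1])          # three binary variables
g.add_factor("alpha", (0, 1))
g.add_factor("beta", (1, 2))
assert g.degree(1) == 2                # variable 1 belongs to both factors
assert g.degree("alpha") == 2          # factor alpha links two variables
```

The bipartite structure (variables on one side, factors on the other) is all that the algorithms in this paper rely on.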

The joint probability distribution of (Y1,…,YM) is said to factor according to the factor graph
G=(V,F,E) if it can be written as

P(Y_1 = y_1, \ldots, Y_M = y_M) \;\propto\; \exp\Big( \sum_{i \in V} \theta_i(y_i) + \sum_{\alpha \in F} \theta_\alpha(y_\alpha) \Big),

(1)

where θi(⋅) and θα(⋅) are called, respectively,
the unary and higher-order log-potential functions.
To accommodate hard constraints, we allow these functions to
take values in R∪{−∞}. Fig. 1 shows examples
of factor graphs with hard constraint factors (to be studied in detail in
Section 5.3).
For notational convenience, we represent potential functions as
vectors, \theta_i := (\theta_i(y_i))_{y_i \in Y_i}
and \theta_\alpha := (\theta_\alpha(y_\alpha))_{y_\alpha \in Y_\alpha},
where Y_\alpha = \prod_{i \in N(\alpha)} Y_i.
We denote by \theta the vector of dimension
\sum_{i \in V} |Y_i| + \sum_{\alpha \in F} |Y_\alpha|
obtained by stacking all these vectors.


Figure 1: Constrained factor graphs, with soft factors shown as green squares
above the variable nodes (circles) and hard constraint factors as black squares
below the variable nodes. Left: a global factor that constrains the set of admissible
outputs to a given codebook. Right: examples of declarative constraints; one of them is a factor
connecting existing variables to an extra variable, allowing scores that depend on a logical function
of the former.

2.2 MAP Inference

Given a factor graph G, along with all the log-potential functions—which fully specify the joint
distribution of Y1,…,YM—we are interested in finding an assignment with maximal probability
(the so-called MAP assignment/configuration):

ˆy

∈

argmaxy1,…,yMP(Y1=y1,…,YM=yM)

(2)

=

argmaxy1,…,yM∑α∈Fθα(yα)+∑i∈Vθi(yi).

This combinatorial problem, which is NP-hard in general (Koller and Friedman, 2009),
can be transformed into a linear program
by using the concept of marginal polytope (Wainwright and Jordan, 2008),
as next described.
Let \Delta^K denote the K-dimensional probability simplex,

\Delta^K := \big\{ u \in \mathbb{R}^K \;\big|\; u \geq 0, \; \mathbf{1}^\top u = 1 \big\} = \operatorname{conv}\{e_1, \ldots, e_K\},

(3)

where e_j has all entries equal to zero except for the jth one, which is equal to one.
We introduce “local” probability distributions over the variables and factors,
p_i \in \Delta^{|Y_i|} and q_\alpha \in \Delta^{|Y_\alpha|}.
We stack these distributions into vectors p and q, with dimensions
P := \sum_{i \in V} |Y_i| and Q := \sum_{\alpha \in F} |Y_\alpha|,
respectively, and denote by (p, q) \in \mathbb{R}^{P+Q} their concatenation.
We say that (p, q) is globally consistent if the local distributions
\{p_i\} and \{q_\alpha\} are the marginals of some global joint distribution.
The set of all globally consistent vectors (p, q) is called the marginal polytope of G,
denoted MARG(G). There is a one-to-one mapping between the vertices of MARG(G) and the
set of possible configurations Y: each y \in Y corresponds to a vertex (p, q) of MARG(G)
consisting of “degenerate” distributions p_i = e_{y_i} and q_\alpha = e_{y_\alpha},
for each i \in V and \alpha \in F.
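The mapping from a configuration y to its vertex of MARG(G) can be illustrated with a short sketch; the two-variable model below is a toy assumption:

```python
import numpy as np

# Illustrative sketch: map a configuration y to the corresponding vertex (p, q)
# of MARG(G), made of "degenerate" indicators p_i = e_{y_i}, q_alpha = e_{y_alpha}.
# The domains and the single two-variable factor are toy assumptions.
domains = {0: [0, 1], 1: [0, 1, 2]}        # Y_0, Y_1
factors = {"a": (0, 1)}                    # one factor linking both variables

def vertex_of(y):
    p = {i: np.eye(len(dom))[dom.index(y[i])] for i, dom in domains.items()}
    q = {}
    for a, scope in factors.items():
        # enumerate Y_a = product of the member domains, in lexicographic order
        configs = [(u, v) for u in domains[scope[0]] for v in domains[scope[1]]]
        q[a] = np.eye(len(configs))[configs.index(tuple(y[i] for i in scope))]
    return p, q

p, q = vertex_of({0: 1, 1: 2})
assert p[0].tolist() == [0.0, 1.0]             # e_{y_0} with y_0 = 1
assert q["a"].tolist() == [0, 0, 0, 0, 0, 1.0] # indicator of config (1, 2)
```

Every such vertex has 0/1 coordinates, which is what makes the LP over MARG(G) equivalent to the combinatorial MAP problem.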
The MAP problem in (2) is then
equivalent to the following linear program (LP):

MAP: \quad \text{maximize} \;\; \sum_{\alpha \in F} \theta_\alpha^\top q_\alpha + \sum_{i \in V} \theta_i^\top p_i \quad \text{with respect to} \;\; (p, q) \in \mathrm{MARG}(G).

(4)

The equivalence between (2)
and (4) stems from the fact that any linear program on
a bounded convex constraint polytope attains a solution at a vertex
of that polytope (see, e.g., Rockafellar (1970)).
Unfortunately, the linear program (4) is not easier to solve than
(2): for a graph G with cycles (which induce
global consistency constraints that are hard to specify concisely),
the number of linear inequalities that characterize MARG(G)
may grow superpolynomially with the size of G. As a consequence,
approximations of MARG(G) have been actively investigated in recent years.

2.3 LP-MAP Inference

Tractable approximations of MARG(G) can be built
by using weaker constraints that all realizable marginals must satisfy
to ensure local consistency:

Non-negativity: all local marginals pi(yi) and qα(yα) must be non-negative.

Normalization: local marginals must be properly normalized, i.e.,
∑yi∈Yipi(yi)=1, for all i∈V,
and ∑yα∈Yαqα(yα)=1, for all α∈F.

Marginalization: a variable participating in a factor must have a marginal consistent
with the factor marginals, i.e., p_i(y_i) = \sum_{y_\alpha \sim y_i} q_\alpha(y_\alpha),
for all (i,\alpha) \in E and y_i \in Y_i.

Note that some of these constraints are redundant: the non-negativity of
q_\alpha(y_\alpha) and the marginalization constraints imply the non-negativity of
p_i(y_i); the normalization of q_\alpha and the marginalization constraints imply
\sum_{y_i \in Y_i} p_i(y_i) = 1. For convenience, we express these constraints
in vector notation. For each (i,\alpha) \in E, we define a (consistency) matrix
M_{i\alpha} \in \mathbb{R}^{|Y_i| \times |Y_\alpha|} as follows:

M_{i\alpha}(y_i, y_\alpha) = \begin{cases} 1, & \text{if } y_\alpha \sim y_i, \\ 0, & \text{otherwise.} \end{cases}

(5)
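The consistency matrix of (5) is easy to materialize for a toy factor; the sketch below (our own construction, with an arbitrary pair of domains) also checks the marginalization property p_i = M_{iα} q_α:

```python
import numpy as np

# Sketch of the consistency matrix M_{i,alpha} of Eq. (5) for a toy factor.
# "y_alpha ~ y_i" means: the factor configuration y_alpha assigns value y_i
# to variable i.  Domains below are arbitrary illustrative choices.
Yi = [0, 1]                                  # domain of variable i
Yj = [0, 1, 2]                               # domain of the other factor variable
Yalpha = [(u, v) for u in Yi for v in Yj]    # factor configurations, i listed first

M = np.zeros((len(Yi), len(Yalpha)))
for col, (u, v) in enumerate(Yalpha):
    M[Yi.index(u), col] = 1.0                # 1 iff y_alpha is consistent with y_i

# Marginalization: p_i = M_{i,alpha} q_alpha sums factor mass per value of Y_i.
q_alpha = np.full(len(Yalpha), 1.0 / len(Yalpha))   # uniform local distribution
p_i = M @ q_alpha
assert np.allclose(p_i, [0.5, 0.5])
```

Each column of M has exactly one nonzero entry, so M_{iα} q_α is simply the marginal of q_α over variable i.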

The local marginalization constraints can be expressed as M_{i\alpha} q_\alpha = p_i.
Combining all the local consistency constraints, and dropping the redundant ones, yields
the so-called local polytope:

\mathrm{LOCAL}(G) = \left\{ (p, q) \in \mathbb{R}^{P+Q} \;\middle|\; \begin{array}{ll} q_\alpha \in \Delta^{|Y_\alpha|}, & \forall \alpha \in F, \\ M_{i\alpha} q_\alpha = p_i, & \forall (i,\alpha) \in E \end{array} \right\}.

(6)

The number of constraints that define LOCAL(G) grows
linearly with P+Q,
rather than superpolynomially. The elements of LOCAL(G) are called pseudo-marginals.
Since any true marginal vector must satisfy the constraints above, LOCAL(G) is an outer approximation, i.e.,

\mathrm{MARG}(G) \subseteq \mathrm{LOCAL}(G).

(7)
Figure 2: Marginal polytope (in green) and its outer approximation,
the local polytope (in blue).
Each element of the marginal polytope
corresponds to a joint distribution of Y1,…,YM,
and each vertex corresponds to a configuration y∈Y,
having coordinates in {0,1}.
The local polytope may have additional fractional vertices,
with coordinates in [0,1].

This approximation is tight for tree-structured graphs
and other special cases, but not in general (even for small
graphs with cycles) (Wainwright and Jordan, 2008). The LP-MAP problem
results from replacing MARG(G) by LOCAL(G) in (4) (Schlesinger, 1976):

LP-MAP: \quad \text{maximize} \;\; \sum_{\alpha \in F} \theta_\alpha^\top q_\alpha + \sum_{i \in V} \theta_i^\top p_i \quad \text{with respect to} \;\; (p, q) \in \mathrm{LOCAL}(G).

(8)

It can be shown that the points of LOCAL(G) that are integer are
the vertices of MARG(G); therefore, the solution of LP-MAP provides
an upper bound on the optimal objective of MAP.

2.4 LP-MAP Inference Algorithms

While any off-the-shelf LP solver can be used for solving (8),
specialized algorithms have been designed to exploit the graph structure,
achieving superior performance on several benchmarks (Yanover et al., 2006).
Most of these specialized algorithms belong to two classes:
block (dual) coordinate descent, which take the form of message-passing algorithms, and projected
subgradient algorithms, based on dual decomposition.

Block coordinate descent methods address the dual of (8)
by alternately optimizing over blocks of coordinates. Different choices of
blocks lead to different algorithms: max-sum diffusion (Kovalevsky and Koval, 1975; Werner, 2007); max-product sequential tree-reweighted belief propagation (TRW-S)
(Wainwright et al., 2005; Kolmogorov, 2006);
max-product linear programming (MPLP) (Globerson and Jaakkola, 2008).
These algorithms work by passing messages (that require computing
max-marginals) between factors and variables.
Under certain conditions, if the relaxation is tight, one may obtain optimality certificates.
If the relaxation is not tight, it is sometimes possible to reduce the problem size
or use a cutting-plane method to move toward the exact MAP (Sontag et al., 2008).
A disadvantage of block coordinate descent algorithms is that they
may stop at suboptimal solutions, since the objective is non-smooth
(Bertsekas et al. 1999, Section 6.3.4).

The projected subgradient method based on dual decomposition is a classical
technique (Dantzig and Wolfe, 1960; Everett III, 1963; Shor, 1985), which was first proposed
in the context of graphical models by Komodakis et al. (2007) and Johnson et al. (2007).
Due to its strong relationship with the approach pursued in this paper, that
method is described in detail in the next section.

The projected subgradient method for dual decomposition can be seen as an
iterative “consensus algorithm,” alternating between a broadcast operation,
where local subproblems are distributed across worker nodes,
and a gather operation, where the local solutions are assembled by a
master node to update the current global solution.
At each round, the worker nodes perform local MAP inference independently;
hence, the algorithm is completely modular and trivially parallelizable.

For each edge (i,\alpha) \in E, define a potential function
\theta_{i\alpha} := (\theta_{i\alpha}(y_i))_{y_i \in Y_i} that satisfies \sum_{\alpha \in N(i)} \theta_{i\alpha} = \theta_i;
a trivial choice is \theta_{i\alpha} = |N(i)|^{-1} \theta_i.
The objective (8) can be rewritten as

\sum_{\alpha \in F} \theta_\alpha^\top q_\alpha + \sum_{i \in V} \theta_i^\top p_i \;=\; \sum_{\alpha \in F} \Big( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top \theta_{i\alpha} \Big)^{\top} q_\alpha,

(9)

because p_i = M_{i\alpha} q_\alpha.
The LP-MAP problem (8) is then equivalent to the following
primal formulation, which we call LP-MAP-P:

LP-MAP-P: \quad \text{maximize} \;\; \sum_{\alpha \in F} \Big( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top \theta_{i\alpha} \Big)^{\top} q_\alpha \quad \text{with respect to} \;\; q_\alpha \in \Delta^{|Y_\alpha|} \; \forall \alpha \in F, \;\; p \in \mathbb{R}^P \quad \text{subject to} \;\; M_{i\alpha} q_\alpha = p_i, \; \forall (i,\alpha) \in E.

(10)

Note that, although the p-variables do not appear in the
objective of (10),
they play a fundamental role through the constraints in the last line,
which are necessary to
ensure that the marginals encoded in the q-variables
are consistent on their overlaps.
Indeed, it is this set of constraints that complicates the optimization problem,
which would otherwise be separable into independent subproblems, one per factor.
Introducing a Lagrange multiplier λiα(yi) for each of these equality constraints leads
to the Lagrangian function

L(q, p, \lambda) \;=\; \sum_{\alpha \in F} \Big( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top (\theta_{i\alpha} + \lambda_{i\alpha}) \Big)^{\top} q_\alpha \;-\; \sum_{(i,\alpha) \in E} \lambda_{i\alpha}^\top p_i,

(11)

the maximization of which w.r.t. q and p will yield the (Lagrangian) dual objective.
Since the p-variables are unconstrained, we have

\max_{q,p} L(q, p, \lambda) = \begin{cases} g(\lambda) & \text{if } \lambda \in \Lambda, \\ +\infty & \text{otherwise,} \end{cases}

(12)

where g(\lambda) = \sum_{\alpha \in F} g_\alpha(\lambda) is the dual objective function,

\Lambda := \Big\{ \lambda \;\Big|\; \sum_{\alpha \in N(i)} \lambda_{i\alpha} = 0, \;\forall i \in V \Big\},

(13)

and each gα(λ) is the solution of a local problem
(called the α-subproblem):

g_\alpha(\lambda) \;:=\; \max_{q_\alpha \in \Delta^{|Y_\alpha|}} \Big( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top (\theta_{i\alpha} + \lambda_{i\alpha}) \Big)^{\top} q_\alpha

(14)

\;=\; \max_{y_\alpha \in Y_\alpha} \Big( \theta_\alpha(y_\alpha) + \sum_{i \in N(\alpha)} \big( \theta_{i\alpha}(y_i) + \lambda_{i\alpha}(y_i) \big) \Big);

(15)

the last equality is justified by the fact that maximizing a linear objective over the
probability simplex gives the largest component of the score vector.
Finally, the dual problem is

LP-MAP-D: \quad \text{minimize} \;\; g(\lambda) \quad \text{with respect to} \;\; \lambda \in \Lambda.

(16)

Problem (16) is often referred to as the master or controller,
and each α-subproblem (14)–(15) as a slave or worker.
Note that each of these α-subproblems is itself a local MAP problem,
involving only factor α.
As a consequence, the solution \hat{q}_\alpha of (14) is an indicator vector
corresponding to a particular configuration \hat{y}_\alpha (the solution of (15)),
that is, \hat{q}_\alpha = e_{\hat{y}_\alpha}.

The dual problem (16) can be solved with a projected subgradient algorithm. By Danskin’s rule (Bertsekas et al., 1999, p. 717), a subgradient of g_\alpha
is readily given by

\frac{\partial g_\alpha(\lambda)}{\partial \lambda_{i\alpha}} = M_{i\alpha} \hat{q}_\alpha =: \hat{q}_{i\alpha}, \quad \forall (i,\alpha) \in E;

(17)

and the projection onto Λ amounts to a centering operation.
The α-subproblems (14)–(15) can be handled in parallel, and their solutions are then
gathered for computing this projection and updating the Lagrange variables.
Putting these pieces together yields Algorithm 1,
which assumes a black-box procedure ComputeMap that returns
a local MAP assignment (as an indicator vector), given log-potentials as input.
At each iteration, the algorithm broadcasts the current Lagrange multipliers
to all the factors.
Each factor adjusts its internal unary log-potentials (line 6) and
invokes the ComputeMap procedure (line 7). The solutions achieved by each factor are then gathered
and averaged (line 10),
and the Lagrange multipliers are updated with step size ηt (line 11).
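A minimal sketch of Algorithm 1 on a toy chain of binary variables may help fix ideas. The code below is illustrative, not the paper's implementation: ComputeMap is a brute-force argmax, the θ_{iα} use the trivial uniform split, and the step size is η_t = 1/t:

```python
import numpy as np

# Hedged sketch of Algorithm 1 (projected subgradient dual decomposition) on a
# toy model: binary variables, pairwise factors; ComputeMap enumerates configs.
def compute_map(scores):                      # black-box MAP oracle: indicator of argmax
    q = np.zeros_like(scores)
    q[np.argmax(scores)] = 1.0
    return q

def dd_subgradient(theta_i, edges, theta_a, T=200):
    # theta_i: {i: unary scores}, edges: {alpha: (i, j)}, theta_a: {alpha: 2x2 scores}
    deg = {i: sum(i in e for e in edges.values()) for i in theta_i}
    lam = {(i, a): np.zeros(2) for a, e in edges.items() for i in e}   # binary vars
    for t in range(1, T + 1):
        eta = 1.0 / t                         # diminishing, nonsummable step size
        q_local = {}
        for a, (i, j) in edges.items():       # broadcast: local MAP per factor
            s = theta_a[a] \
                + (theta_i[i] / deg[i] + lam[(i, a)])[:, None] \
                + (theta_i[j] / deg[j] + lam[(j, a)])[None, :]
            q_local[a] = compute_map(s.ravel()).reshape(s.shape)
        # gather: average the local unary marginals (this realizes the projection
        # onto Lambda), then update the multipliers with the subgradient (17)
        p = {i: np.mean([q_local[a].sum(axis=1 - e.index(i))
                         for a, e in edges.items() if i in e], axis=0)
             for i in theta_i}
        for a, e in edges.items():
            for i in e:
                q_ia = q_local[a].sum(axis=1 - e.index(i))
                lam[(i, a)] -= eta * (q_ia - p[i])
    return p

theta_i = {0: np.array([0.0, 1.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 0.0])}
edges = {"a": (0, 1), "b": (1, 2)}
theta_a = {"a": np.zeros((2, 2)), "b": np.zeros((2, 2))}
p = dd_subgradient(theta_i, edges, theta_a)
assert np.argmax(p[0]) == 1 and np.argmax(p[2]) == 0   # MAP: y = (1, 1, 0)
```

On this easy instance the workers agree immediately, so the averaged p is integral after the first round; with many overlapping factors, this agreement is exactly what becomes slow.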
The two following propositions establish the convergence
properties of Algorithm 1.

Proposition 2 (Convergence rate)

Let the non-negative step size sequence (\eta_t)_{t \in \mathbb{N}} be diminishing and nonsummable:
\lim_{t \to \infty} \eta_t = 0 and \sum_{t=1}^{\infty} \eta_t = \infty.
Then, Algorithm 1 converges to the solution of LP-MAP-D (16).
Furthermore, T = O(1/\epsilon^2) iterations of Algorithm 1 suffice to achieve a
dual objective that differs from the optimal value by at most \epsilon.

Proposition 3 (Optimality certificate)

If, at some iteration of Algorithm 1, all the local subproblems are in agreement
(i.e., \hat{q}_{i\alpha} = p_i for all (i,\alpha) \in E), then \lambda is a solution of
LP-MAP-D (16) and p encodes the exact MAP configuration.

Proof:
If all local subproblems are in agreement,
then a vacuous update will occur in line 11,
and no further changes will occur. Since the algorithm is guaranteed to converge, the current
λ is optimal.
Also, if all local subproblems are in agreement,
the averaging in line
10
necessarily yields a binary vector p.
Since any binary solution of LP-MAP is
also a solution of MAP, the result follows.

Propositions 2–3 imply that, if the parameters
θ are such that LP-MAP is a tight relaxation, then Algorithm 1 yields
the exact MAP configuration along with a certificate of optimality.
According to Proposition 2, even if the relaxation is not tight,
Algorithm 1 still converges to a solution of LP-MAP.
In some problem domains, the LP-MAP relaxation is often tight (Koo and Collins, 2010; Rush et al., 2010);
for problems with a relaxation gap, techniques to tighten the relaxation have been developed
(Rush and Collins, 2011). However, in large graphs with many overlapping factors,
it has been observed that Algorithm 1 can converge quite slowly in practice
(Martins et al., 2011b). This is not surprising, given that it attempts to reach a consensus
among all overlapping components; the larger this number, the harder it is to achieve consensus.
In the next section, we propose a new LP-MAP inference algorithm that
is more suitable for this class of problems.

4.1 Addressing the Weaknesses of Dual Decomposition with Projected Subgradient

The main weaknesses of Algorithm 1 reside in the following two aspects.

The dual objective function g(λ) is non-smooth,
which is why “subgradients” are used instead of “gradients.”
It is well-known that non-smooth optimization lacks some of the good properties
of its smooth counterpart. Namely, there is no guarantee of monotonic improvement in
the objective (see Bertsekas et al. 1999, p. 611). Ensuring convergence requires
using a diminishing step size sequence, which leads to slow convergence rates.
In fact, as stated in Proposition 2, O(1/ϵ²) iterations
are required to guarantee ϵ-accuracy.

A close look at Algorithm 1 reveals that the consensus is
promoted solely by the Lagrange multipliers (line 6).
In an economic interpretation, these represent “price adjustments”
that lead to a reallocation of resources. However, no “memory” exists about past
allocations or adjustments, so the workers never know how far they are from consensus.
It is thus conceivable that a
smart use of these quantities could accelerate convergence.

To obviate the first of these problems, Johnson et al. (2007) proposed smoothing the objective
function through an “entropic” perturbation (controlled by a “temperature” parameter),
which boils down to replacing the max in (16) by a “soft-max.”
As a result, all the local subproblems become marginal (rather than MAP) computations.
A related and asymptotically faster method was proposed later by
Jojic et al. (2010), who address the resulting smooth optimization problem with an
accelerated gradient method(Nesterov, 1983, 2005).
That approach guarantees an ϵ-accurate solution after
O(1/ϵ) iterations, an improvement over the O(1/ϵ²) bound of
Algorithm 1. However, those methods have some drawbacks.
First, they need to operate at near-zero temperatures; e.g., the O(1/ϵ)
iteration bound of Jojic et al. (2010) requires setting the temperature to O(ϵ),
which scales the potentials by O(1/ϵ) and may lead to numerical instabilities
in some problems.
Second, the solutions of the local subproblems are always dense; although
some marginal values may be low, they are never exactly zero. This contrasts with
the projected subgradient algorithm, for which the solutions of
the local subproblems are MAP configurations. These configurations can be
cached across iterations, leading to important speedups (Koo et al., 2010).

We will show that AD3 also yields an
O(1/ϵ) iteration bound without suffering from the two drawbacks above.
Unlike Algorithm 1, AD3 broadcasts the current global solution
to the workers, thus allowing them to regularize their subproblems toward that solution.
This promotes a faster consensus, without sacrificing the modularity of dual decomposition.
Another advantage of AD3 is that it keeps track of primal and dual residuals, which make it
possible to monitor the LP-MAP optimization process and to stop when a desired accuracy level is attained.

4.2 Augmented Lagrangians and the Alternating Directions Method of Multipliers

Let us start with a brief overview of augmented Lagrangian methods.
Consider the following general convex optimization problem with equality constraints:

\text{maximize} \;\; f_1(q) + f_2(p) \quad \text{with respect to} \;\; q \in \mathcal{Q}, \; p \in \mathcal{P} \quad \text{subject to} \;\; Aq + Bp = c,

(18)

where \mathcal{Q} \subseteq \mathbb{R}^Q and \mathcal{P} \subseteq \mathbb{R}^P
are convex sets and
f_1 : \mathcal{Q} \to \mathbb{R} \cup \{-\infty\} and f_2 : \mathcal{P} \to \mathbb{R} \cup \{-\infty\} are concave functions.
Note that the LP-MAP problem stated in (10) has this form.
For any η ≥ 0, consider the problem

\text{maximize} \;\; f_1(q) + f_2(p) - \frac{\eta}{2} \|Aq + Bp - c\|^2 \quad \text{with respect to} \;\; q \in \mathcal{Q}, \; p \in \mathcal{P} \quad \text{subject to} \;\; Aq + Bp = c,

(19)

which differs from (18) in the extra term penalizing violations
of the equality constraints; since this term vanishes at feasibility, the two problems have the same solution.
The reason to consider (19) is that its objective is
η-strongly concave, even if f1+f2 is not.
The Lagrangian of problem (19),

L_\eta(q, p, \lambda) = f_1(q) + f_2(p) + \lambda^\top (Aq + Bp - c) - \frac{\eta}{2} \|Aq + Bp - c\|^2,

(20)

is the η-augmented Lagrangian of problem (18).
The so-called augmented Lagrangian (AL) methods (Bertsekas et al., 1999, Section 4.2) address problem
(18) by seeking a saddle point of Lηt, for some sequence
(ηt)t∈N. The earliest instance is the method of multipliers (MM)
(Hestenes, 1969; Powell, 1969), which alternates between a joint update of q and p through

(q^{t+1}, p^{t+1}) := \operatorname{argmax}_{q,p} \big\{ L_{\eta_t}(q, p, \lambda^t) \;\big|\; q \in \mathcal{Q}, \; p \in \mathcal{P} \big\}

(21)

and a gradient update of the Lagrange multiplier,

\lambda^{t+1} := \lambda^{t} - \eta_t \big( A q^{t+1} + B p^{t+1} - c \big).

(22)

Under certain conditions, this method is convergent, and converges even superlinearly if
the sequence (\eta_t)_{t \in \mathbb{N}} is increasing (Bertsekas et al. 1999, Section 4.2).
A shortcoming of the MM is that problem (21) may be difficult,
since the penalty term of the augmented Lagrangian couples the variables p and q.
The alternating directions method of multipliers (ADMM) avoids this shortcoming, by
replacing the joint optimization (21)
by a single block Gauss-Seidel-type step:

q^{t+1} := \operatorname{argmax}_{q \in \mathcal{Q}} \; L_{\eta_t}(q, p^{t}, \lambda^{t}),

(23)

p^{t+1} := \operatorname{argmax}_{p \in \mathcal{P}} \; L_{\eta_t}(q^{t+1}, p, \lambda^{t}).

(24)

In general, problems (23)–(24) are
simpler than the joint maximization in (21).
ADMM was proposed by Glowinski and Marroco (1975) and Gabay and Mercier (1976) and
is related to other optimization methods, such as Douglas-Rachford splitting
(Eckstein and Bertsekas, 1992) and proximal point methods (see Boyd et al. 2011 for a
historical overview).

4.3 Derivation of AD3

The LP-MAP-P problem (10) can be cast into the form
(18) by proceeding as follows:

let \mathcal{Q} in (18) be
the Cartesian product of simplices, \mathcal{Q} := \prod_{\alpha \in F} \Delta^{|Y_\alpha|},
and let \mathcal{P} := \mathbb{R}^P;

let f_1(q) := \sum_{\alpha \in F} \big( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top \theta_{i\alpha} \big)^\top q_\alpha and f_2 :\equiv 0;

let A in (18)
be an R \times Q block-diagonal matrix, where
R = \sum_{(i,\alpha) \in E} |Y_i|,
with one block per factor, each block being a vertical concatenation
of the matrices \{M_{i\alpha}\}_{i \in N(\alpha)};

let B be an R \times P matrix of grid-structured blocks, where
the block in the (i,\alpha)th row and
the ith column is the negative identity matrix
of size |Y_i|, and all the other blocks are zero;

let c := 0, so that the constraint Aq + Bp = c reduces to M_{i\alpha} q_\alpha = p_i for every edge (i,\alpha) \in E.

With these choices, the \eta_t-augmented Lagrangian (20) becomes

L_{\eta_t}(q, p, \lambda) = \sum_{\alpha \in F} \Big( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top (\theta_{i\alpha} + \lambda_{i\alpha}) \Big)^{\top} q_\alpha - \sum_{(i,\alpha) \in E} \lambda_{i\alpha}^\top p_i - \frac{\eta_t}{2} \sum_{(i,\alpha) \in E} \| M_{i\alpha} q_\alpha - p_i \|^2.

(25)

This is the standard Lagrangian in (11) plus the Euclidean penalty term. The ADMM updates are

Broadcast:

q^{(t)} := \operatorname{argmax}_{q \in \mathcal{Q}} \; L_{\eta_t}(q, p^{(t-1)}, \lambda^{(t-1)}),

(26)

Gather:

p^{(t)} := \operatorname{argmax}_{p \in \mathbb{R}^P} \; L_{\eta_t}(q^{(t)}, p, \lambda^{(t-1)}),

(27)

Multiplier update:

\lambda_{i\alpha}^{(t)} := \lambda_{i\alpha}^{(t-1)} - \eta_t \big( M_{i\alpha} q_\alpha^{(t)} - p_i^{(t)} \big), \quad \forall (i,\alpha) \in E.

(28)

We next analyze the broadcast and gather steps, and prove a proposition about the multiplier update.

Broadcast step. The maximization (26) can be carried out in parallel at the factors,
as in Algorithm 1. The only difference is that, instead of a local MAP computation,
each soft-factor worker now needs to solve a quadratic program of the form:

\max_{q_\alpha \in \Delta^{|Y_\alpha|}} \Big( \theta_\alpha + \sum_{i \in N(\alpha)} M_{i\alpha}^\top (\theta_{i\alpha} + \lambda_{i\alpha}) \Big)^{\top} q_\alpha \;-\; \frac{\eta}{2} \sum_{i \in N(\alpha)} \| M_{i\alpha} q_\alpha - p_i \|^2.

(29)

The subproblem (29) differs from the linear
subproblem (14)–(15)
in the projected subgradient algorithm by including a Euclidean penalty term,
which penalizes deviations from the global consensus. Sections 5 and
6 propose procedures to solve the local subproblems (29).
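A useful building block for such procedures is the Euclidean projection onto the probability simplex, to which the simplest instances of (29) reduce. The sketch below uses the standard sort-based projection (in the spirit of Duchi et al.'s algorithm); it is a generic routine, not the paper's closed-form factor solvers:

```python
import numpy as np

# Euclidean projection onto the probability simplex, a standard building block
# for quadratic subproblems like (29) in simple cases (sketch; sort-based
# algorithm in the spirit of Held et al. 1974 / Duchi et al. 2008).
def project_onto_simplex(v):
    u = np.sort(v)[::-1]                      # sort scores in decreasing order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)              # threshold enforcing sum-to-one
    return np.maximum(v - tau, 0.0)           # clip shifted scores at zero

x = project_onto_simplex(np.array([0.8, 1.2, -0.5]))
assert np.isclose(x.sum(), 1.0) and np.all(x >= 0)
assert np.isclose(x[2], 0.0)                  # low-scoring entry hits the boundary
```

Unlike a soft-max, this projection produces sparse vectors: entries below the threshold are exactly zero, which is what enables the caching and active set machinery discussed later.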

Gather step. The solution of (27) has a closed form.
Indeed, this problem is separable into independent optimizations, one for each i ∈ V;
defining q_{i\alpha} := M_{i\alpha} q_\alpha, the optimality condition yields

p_i^{(t)} \;=\; \frac{1}{|N(i)|} \sum_{\alpha \in N(i)} \Big( q_{i\alpha}^{(t)} - \eta_t^{-1} \lambda_{i\alpha}^{(t-1)} \Big),

and, since the multipliers satisfy \sum_{\alpha \in N(i)} \lambda_{i\alpha}^{(t-1)} = 0 (they remain in \Lambda throughout), this reduces to a simple averaging:

p_i^{(t)} \;=\; \frac{1}{|N(i)|} \sum_{\alpha \in N(i)} q_{i\alpha}^{(t)}.

(32)
Assembling all these pieces together leads to AD3 (Algorithm 2). Notice
that AD3 retains the modular structure of Algorithm 1: both are
iterative consensus algorithms, alternating between a broadcast operation,
where subproblems are distributed across local workers
(lines 5–9 in Algorithm 2),
and a gather operation, where the local solutions are assembled by a master node, which
updates the global solution (line 10)
and adjusts multipliers to promote a consensus (line 11).
The key difference is that AD3 also broadcasts the current global solution to the workers,
allowing them to regularize their subproblems toward that solution, thus
speeding up the consensus.
This is embodied in the procedure SolveQP (line 7),
which replaces ComputeMAP of Algorithm 1.
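To make the loop concrete, here is a hedged end-to-end sketch of the AD3 iteration on a toy pairwise model. The subproblem solver is a stand-in: instead of the closed-form or active set solvers of Sections 5–6, it runs a few projected-gradient steps on (29), which Proposition 6 licenses as an approximate solution:

```python
import numpy as np

# Hedged sketch of the AD3 loop (Algorithm 2) on a toy pairwise binary model.
# SolveQP is replaced by inner projected-gradient steps on (29): an
# illustrative stand-in, not the paper's closed-form / active set solvers.
def proj_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def solve_qp(score, M_list, p_list, eta, inner=50, lr=0.1):
    # approximately maximize score.q - eta/2 sum_i ||M_i q - p_i||^2 over the simplex
    q = np.full(len(score), 1.0 / len(score))
    for _ in range(inner):
        grad = score - eta * sum(M.T @ (M @ q - p) for M, p in zip(M_list, p_list))
        q = proj_simplex(q + lr * grad)
    return q

def ad3(theta_i, edges, theta_a, eta=1.0, T=100):
    deg = {i: sum(i in e for e in edges.values()) for i in theta_i}
    def Mmat(e, i):                            # consistency matrix of Eq. (5)
        M = np.zeros((2, 4))
        for col, cfg in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
            M[cfg[e.index(i)], col] = 1.0
        return M
    M = {(i, a): Mmat(e, i) for a, e in edges.items() for i in e}
    p = {i: np.full(2, 0.5) for i in theta_i}
    lam = {(i, a): np.zeros(2) for a, e in edges.items() for i in e}
    for _ in range(T):
        q = {}
        for a, e in edges.items():             # broadcast: solve the QP (29)
            score = theta_a[a] + sum(
                M[(i, a)].T @ (theta_i[i] / deg[i] + lam[(i, a)]) for i in e)
            q[a] = solve_qp(score, [M[(i, a)] for i in e], [p[i] for i in e], eta)
        for i in theta_i:                      # gather: average factor marginals
            p[i] = np.mean([M[(i, a)] @ q[a]
                            for a, e in edges.items() if i in e], axis=0)
        for a, e in edges.items():             # multiplier update (28)
            for i in e:
                lam[(i, a)] -= eta * (M[(i, a)] @ q[a] - p[i])
    return p

theta_i = {0: np.array([0.0, 1.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 0.0])}
edges = {"a": (0, 1), "b": (1, 2)}
theta_a = {"a": np.zeros(4), "b": np.zeros(4)}
p = ad3(theta_i, edges, theta_a)
assert np.argmax(p[0]) == 1 and np.argmax(p[2]) == 0   # MAP: y = (1, 1, 0)
```

The only structural change relative to the Algorithm 1 sketch is that the worker receives the current p and solves a regularized QP instead of an unregularized local MAP.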

Although Proposition 5 guarantees convergence for any choice of η,
this parameter may strongly impact the behavior of the algorithm, as
illustrated in Fig. 3. In our experiments, we dynamically adjust
η in the early iterations using the heuristic described in Boyd et al. (2011), and freeze
it afterwards so as not to compromise convergence.


Figure 3: Progress in the dual objective for different values of η
(the factor graph is a binarized 20×20 Potts grid with 8 states; see
Section 5.4 for details). For η=0.1,
progress is too slow; for η=10 there are strong oscillations,
which makes the algorithm slow to take off. In this example, convergence is faster with an
intermediate value, η=1.

Primal and dual residuals

Recall from Proposition 3 that the projected subgradient algorithm yields
optimality certificates, if the relaxation is tight (i.e., if the solution of the LP-MAP
problem is the exact MAP), whereas in problems with a relaxation gap, it is harder to decide when to stop.
It would be desirable to have similar guarantees concerning the relaxed primal. A general weakness of subgradient algorithms is that they lack this capability, and are
usually stopped rather arbitrarily after a given number of iterations.
In contrast, ADMM allows keeping track of primal and dual residuals (Boyd et al., 2011),
based on which it is possible to obtain certificates, not only for the primal solution
(if the relaxation is tight), but also to terminate when a near optimal relaxed primal
solution has been found.
The primal residual r_P^{(t)} is the amount by which the agreement constraints are violated,

r_P^{(t)} = \frac{\sum_{(i,\alpha) \in E} \| M_{i\alpha} q_\alpha^{(t)} - p_i^{(t)} \|^2}{\sum_{(i,\alpha) \in E} |Y_i|},

(34)

where the denominator ensures that r_P^{(t)} \in [0,1].
The dual residual r_D^{(t)},

r_D^{(t)} = \frac{\sum_{(i,\alpha) \in E} \| p_i^{(t)} - p_i^{(t-1)} \|^2}{\sum_{(i,\alpha) \in E} |Y_i|} \;\in\; [0,1],

(35)

is the amount by which a dual optimality condition is violated (Boyd et al., 2011). As stopping criterion, we require both residuals to fall below a threshold, e.g., 10^{-6}.
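Computed on toy quantities, the stopping criterion looks as follows (all numbers below are illustrative stand-ins for a single edge (i, α)):

```python
import numpy as np

# Sketch of the stopping test based on the residuals (34)-(35), for one toy
# edge (i, alpha) with |Y_i| = 2; M, q, p, p_prev are illustrative stand-ins.
M = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
q = np.array([0.1, 0.2, 0.3, 0.4])           # current factor distribution
p, p_prev = np.array([0.4, 0.6]), np.array([0.5, 0.5])

denom = 2.0                                   # sum over edges of |Y_i|
r_primal = np.sum((M @ q - p) ** 2) / denom   # violation of agreement constraints
r_dual = np.sum((p - p_prev) ** 2) / denom    # change in the global solution
converged = max(r_primal, r_dual) < 1e-6
assert not converged                          # far from consensus in this toy state
```

Both residuals are cheap byproducts of quantities the algorithm already computes, which is why monitoring them costs essentially nothing.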

Approximate solutions of the local subproblems

The next proposition states that convergence may still hold, even
if the local subproblems are solved approximately, provided the sequence of
errors is summable. The importance of this result will be
clear in Section 6, where we describe
a general iterative algorithm for solving the local quadratic subproblems.
Essentially, Proposition 6 allows these subproblems to
be solved numerically up to some accuracy without compromising global convergence,
as long as the accuracy of the solutions improves sufficiently fast over ADMM iterations.

Proposition 6 (Inexact subproblems)

Let \eta_t = \eta, let \hat{q}^{(t)} contain the exact solutions of (29) at iteration t, and let \tilde{q}^{(t)} contain those produced by an approximate algorithm. Then Proposition 5 still holds, provided that

\sum_{t=1}^{\infty} \| \hat{q}^{(t)} - \tilde{q}^{(t)} \| < \infty.

(36)

Convergence rate

Although ADMM was invented in the 1970s, its convergence rate was unknown until recently.
The next proposition states the O(1/ϵ) iteration bound of AD3, asymptotically equivalent
to that of the algorithm of Jojic et al. (2010) and therefore better than the O(1/ϵ²)
bound of Algorithm 1.

Proposition 7

Assume the conditions of Proposition 5.
Let $\lambda^*$ be a solution of the dual problem (16),
let $\bar{\lambda}_T := \frac{1}{T}\sum_{t=1}^{T} \lambda^{(t)}$ be the "averaged"
Lagrange multipliers after $T$ iterations of AD3, and let $g(\bar{\lambda}_T)$ be the
corresponding estimate of the dual objective. Then, $g(\bar{\lambda}_T) - g(\lambda^*) \le \epsilon$
after $T \le O(C/\epsilon)$ iterations, where
$C$ satisfies

$$C \le \frac{5\eta}{2} \sum_{i\in V} |N(i)| \left(1 - |Y_i|^{-1}\right) + \frac{5}{2\eta} \|\lambda^*\|^2. \qquad (37)$$

Proof:
The proof (detailed in Appendix A)
uses recent results of He and Yuan (2011) and Wang and Banerjee (2012), concerning convergence of ADMM in a variational inequality setting.

As expected, the bound (37) increases
with the number of overlapping variables,
the size of the sets Yi, and the magnitude of the optimal dual vector λ∗.
Note that if there is a good estimate of ∥λ∗∥, then (37)
can be used to choose a step size η that minimizes the bound.
Moreover, unlike the O(1/ϵ) bound of the accelerated method of
Jojic et al. (2010), which requires specifying ϵ beforehand,
AD3 does not require pre-specifying a desired accuracy.
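As a small illustration of this choice of step size, the bound (37) has the form $A\eta + B/\eta$, which is minimized at $\eta^* = \sqrt{B/A}$. A sketch with illustrative argument names, assuming an estimate of $\|\lambda^*\|$ is available:

```python
import math

def optimal_eta(degrees, n_values, lambda_star_norm):
    """Step size minimizing the bound C(eta) = A*eta + B/eta of Eq. (37),
    with A = (5/2) * sum_i |N(i)| * (1 - 1/|Y_i|) and
         B = (5/2) * ||lambda*||**2.
    Setting dC/deta = A - B/eta**2 = 0 gives eta* = sqrt(B / A).
    `degrees[i]` = |N(i)| and `n_values[i]` = |Y_i| (illustrative names).
    """
    A = 2.5 * sum(d * (1.0 - 1.0 / k) for d, k in zip(degrees, n_values))
    B = 2.5 * lambda_star_norm ** 2
    return math.sqrt(B / A)
```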

The bounds derived so far for all these algorithms are with respect to the dual problem—an open
problem is to obtain bounds related to primal convergence.

Runtime and caching strategies

In practice, considerable speed-ups can be achieved by caching the subproblems,
a strategy which has also been proposed for the projected subgradient algorithm
(Koo et al., 2010).
After a few iterations, many variables $p_i$ reach a consensus
(i.e., $p_i^{(t)} = q_{i\alpha}^{(t+1)}$ for all $\alpha \in N(i)$)
and enter an idle state: they are left unchanged by the $p$-update (32),
as are the Lagrange variables $\lambda_{i\alpha}^{(t+1)}$ in (28).
If at iteration t all variables in a subproblem at factor α are idle,
then q(t+1)α=q(t)α, hence the corresponding subproblem
does not need to be solved. Typically, many variables and subproblems enter this idle state
after the first few rounds.
We will show the practical benefits of caching in the experimental section (Section 7.5).
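The caching logic can be sketched as follows (hypothetical names throughout; `solve_subproblem` and `is_idle` stand in for the actual AD3 machinery):

```python
def ad3_iteration(factors, solve_subproblem, is_idle):
    """Caching sketch: skip the quadratic subproblem of any factor whose
    attached variables were all idle in the previous iteration, reusing
    the cached q_alpha instead (all names are illustrative)."""
    solved, skipped = 0, 0
    for alpha in factors:
        if all(is_idle(i) for i in alpha.neighbors):
            skipped += 1          # reuse alpha.q from the cache
        else:
            alpha.q = solve_subproblem(alpha)
            solved += 1
    return solved, skipped
```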

4.5 Exact Inference with Branch-and-Bound

Recall that
AD3, as just described, solves the LP-MAP relaxation of the actual problem.
In some problems, this relaxation is
tight (in which case a certificate of optimality will be obtained), but this is not always the case.
When a fractional solution is obtained, it is desirable to have a strategy to recover the
exact MAP solution.

Two observations are noteworthy.
First, as we saw in Section 2.3, the optimal value of the relaxed problem LP-MAP provides an upper bound to the
original problem MAP.
In particular, any feasible dual point
provides an upper bound to the original problem’s optimal value.
Second, during execution of the AD3 algorithm, we always keep track
of a sequence of feasible dual points (as guaranteed by Proposition 5, item 4).
Therefore, each iteration
constructs tighter and tighter upper bounds.
In recent work (?), we proposed a
branch-and-bound
search procedure that finds the exact solution of the ILP.
The procedure works recursively as follows:

1. Initialize $L = -\infty$ (our best value so far).

2. Run Algorithm 2. If the solution $p^*$ is integer, return $p^*$
and set $L$ to its objective value.
If along the execution we obtain an upper bound less than $L$, then
Algorithm 2 can be safely stopped and return "infeasible" (this is the bound part).
Otherwise (if $p^*$ is fractional) go to step 3.

3. Find the "most fractional" component of $p^*$ (call it $p_j^*(\cdot)$)
and branch: for every $y_j \in Y_j$,
constrain $p_j(y_j) = 1$ and go to step 2,
eventually obtaining an integer solution $p^*|_{y_j}$ or infeasibility.
Return the $p^* \in \{p^*|_{y_j}\}_{y_j \in Y_j}$
that yields the largest objective value.
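The recursion above can be sketched generically as follows, with a hypothetical `solve_relaxation(fixed, cutoff)` oracle standing in for Algorithm 2: it returns the relaxed value and assignment under the branching constraints `fixed`, or `None` once its upper bound drops below `cutoff` (the bound part):

```python
import math

def branch_and_bound(solve_relaxation, best=-math.inf, fixed=None):
    """Branch-and-bound sketch around a relaxation oracle (hypothetical API).

    `solve_relaxation(fixed, cutoff)` returns (value, assignment) of the
    relaxed problem under `fixed`, with `assignment` mapping variables to
    floats in [0, 1], or None when its upper bound falls below `cutoff`.
    """
    fixed = fixed or {}
    result = solve_relaxation(fixed, best)
    if result is None:                      # pruned by the bound
        return None
    value, assign = result
    frac = {v: x for v, x in assign.items() if 1e-9 < x < 1 - 1e-9}
    if not frac:                            # integer: candidate solution
        return value, assign
    # Branch on the "most fractional" variable (closest to 1/2).
    j = min(frac, key=lambda v: abs(frac[v] - 0.5))
    incumbent = None
    for yj in (0, 1):
        sub = branch_and_bound(solve_relaxation, best, {**fixed, j: yj})
        if sub is not None and sub[0] > best:
            best, incumbent = sub[0], sub
    return incumbent
```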

Although this procedure has worst-case exponential runtime,
it is empirically very effective in many problems whose relaxations are near-exact.
We will see one example in Section 7.4.

5.1 Introduction

This section shows how to solve the AD3 local subproblems (29) exactly and efficiently,
in several cases, including Ising models and logic constraint factors. These results will be complemented
in Section 6, where a new procedure to handle arbitrary factors widens
the applicability of AD3.

By subtracting a constant from the objective of (29), re-scaling, and turning the
maximization into a minimization, the problem can be written more compactly as

$$\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\sum_{i\in N(\alpha)} \|q_{i\alpha} - a_i\|^2 - b_\alpha^\top q_\alpha \\
\text{with respect to} \quad & q_\alpha \in \Delta^{|Y_\alpha|},\; q_{i\alpha} \in \mathbb{R}^{|Y_i|},\; \forall i \in N(\alpha) \\
\text{subject to} \quad & q_{i\alpha} = M_{i\alpha} q_\alpha,\; \forall i \in N(\alpha),
\end{aligned} \qquad (38)$$

where $a_i := p_i + \eta^{-1}(\theta_{i\alpha} + \lambda_{i\alpha})$ and $b_\alpha := \eta^{-1}\theta_\alpha$.
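For concreteness, constructing the subproblem parameters $a_i$ and $b_\alpha$ from the current iterates takes only a few vector operations. All numbers below are made up for illustration, for a binary pairwise factor:

```python
import numpy as np

# Illustrative construction of the parameters of Eq. (38),
#   a_i = p_i + (theta_ialpha + lambda_ialpha) / eta,
#   b_alpha = theta_alpha / eta,
# for a hypothetical binary pairwise factor (made-up numbers).
eta = 1.0
p_i = np.array([0.6, 0.4])                 # current unary posterior p_i
theta_ia = np.array([0.2, -0.2])           # unary log-potentials theta_{i alpha}
lam_ia = np.array([0.1, -0.1])             # Lagrange multipliers lambda_{i alpha}
theta_a = np.array([0.0, 0.3, 0.3, 1.0])   # factor scores for y in {00, 01, 10, 11}

a_i = p_i + (theta_ia + lam_ia) / eta
b_a = theta_a / eta
```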

We show that (38) has a closed-form solution or
can be solved exactly and efficiently in several cases; e.g.,
for Ising models, for factor graphs imposing first-order logic (FOL) constraints,
and for Potts models (after binarization).
In these cases, AD3 and the projected subgradient algorithm have (asymptotically)
the same computational cost per iteration, up to a logarithmic factor.

5.2 Ising Models

Ising models are factor graphs containing only binary pairwise factors.
A binary pairwise factor (say, α) is one connecting two binary variables
(say, Y1 and Y2); thus Y1=Y2={0,1} and Yα={00,01,10,11}.
Given that $q_{1\alpha}, q_{2\alpha} \in \Delta^2$, we can write
$q_{1\alpha} = (1 - z_1, z_1)$ and $q_{2\alpha} = (1 - z_2, z_2)$.
Furthermore, since $q_\alpha \in \Delta^4$ and marginalization requires
that $q_\alpha(1,1) + q_\alpha(1,0) = z_1$ and $q_\alpha(0,1) + q_\alpha(1,1) = z_2$,
we can also write $q_\alpha$