Solving Energies with Higher Order Cliques

Pushmeet Kohli    M. Pawan Kumar    Philip H.S. Torr
Department of Computing
Oxford Brookes University, UK
{pushmeet.kohli, pkmudigonda, philiptorr}@brookes.ac.uk
http://cms.brookes.ac.uk/computervision
Abstract

In this paper we extend the class of energy functions for which the optimal α-expansion and αβ-swap moves can be computed in polynomial time. Specifically, we introduce a class of higher order clique potentials and show that the expansion and swap moves for any energy function composed of these potentials can be found by minimizing a submodular function. We also show that for a subset of these potentials, the optimal move can be found by solving an st-mincut problem. We refer to this subset as the P^n Potts model.

Our results enable the use of powerful move making algorithms, i.e. α-expansion and αβ-swap, for the minimization of energy functions involving higher order cliques. Such functions have the capability of modelling the rich statistics of natural scenes and can be used for many applications in computer vision. We demonstrate their use on one such application, the texture based video segmentation problem.

1. Introduction

In recent years discrete optimization has emerged as an important tool in solving Computer Vision problems. This has primarily been the result of the increasing use of energy minimization algorithms such as graph cuts [5, 11], tree-reweighted message passing [10, 24] and variants of belief propagation (BP) [16, 25]. These algorithms allow us to perform approximate inference (i.e. obtain the MAP estimate) on graphical models such as Markov Random Fields (MRF) and Conditional Random Fields (CRF) [13].

α-expansion and αβ-swap are two popular move making algorithms for approximate energy minimization which were proposed in [5]. They are extremely efficient and have been shown to produce good results for a number of problems [22]. These algorithms minimize an energy function by starting from an initial labelling and making a series of changes (moves) which decrease the energy iteratively. Convergence is achieved when the energy cannot be minimized further. At each step the optimal move (i.e. the move decreasing the energy of the labelling by the largest amount) is computed in polynomial time. However, this can only be done for a certain class of energy functions.

Boykov et al. [5] provided a characterization of clique potentials for which the optimal moves can be computed by solving an st-mincut problem. However, their results were limited to potentials of cliques of size at most two. We call this class of energy functions P^2. In this paper we provide the characterization of energy functions involving higher order cliques, i.e. cliques of size 3 and beyond, for which the optimal moves can be computed in polynomial time. We refer to the class of functions defined by cliques of size at most n as P^n. It should be noted that this class is different from the class F^n of energy functions, which involve only binary random variables [7, 11].

Higher order cliques  Most energy minimization based methods for solving Computer Vision problems assume that the energy can be represented in terms of unary and pairwise clique potentials. This assumption severely restricts the representational power of these models, making them unable to capture the rich statistics of natural scenes [14].

Higher order clique potentials have the capability to model complex interactions of random variables and thus could overcome this problem. Researchers have long recognized this fact and have used higher order models to improve the expressive power of MRFs and CRFs [14, 18, 19]. The initial work in this regard has been quite promising and higher order cliques have been shown to improve results. However, their use has been quite limited due to the unavailability of efficient algorithms for minimizing the resulting energy functions.

Traditional inference algorithms such as BP are computationally expensive for higher order cliques. Lan et al. [14] recently made the first steps in solving this problem. They proposed approximation methods for BP to make efficient inference possible in higher order MRFs. However,
their results indicate that BP gives results comparable to naive gradient descent. In contrast, we provide a characterization of energy functions defined by cliques of size 3 (P^3) or more (P^n) which can be solved using powerful move making algorithms such as α-expansion and αβ-swap. We prove that the optimal α-expansion and αβ-swap moves for this class of functions can be computed in polynomial time. We then introduce a new family of higher order potential functions, referred to as the P^n Potts model, and show that the optimal α-expansion and αβ-swap moves for them can be computed by solving an st-mincut problem. It should be noted that our results are a generalization of the class of energy functions specified by [5].

Outline of the Paper  In section 2, we provide the notation and discuss the basic theory of energy minimization and submodular functions. Section 3 describes the α-expansion and αβ-swap algorithms. Further, it provides constraints on the pairwise potentials which guarantee computation of the optimal move in polynomial time. In section 4, we generalize this class to P^n functions. We also show that the optimal moves for a sub-class of these functions, i.e. the P^n Potts model, can be computed by solving an st-mincut problem. This enables us to address the texture based video segmentation problem (see section 5). We conclude by listing some Computer Vision problems where higher order clique potentials can be used.

2. Preliminaries

Consider a random field X defined over a lattice V = {1, 2, ..., N} with a neighbourhood system N. Each random variable Xi ∈ X is associated with a lattice point i ∈ V and takes a value from the label set L = {l1, l2, ..., lk}. Given a neighbourhood system N, a clique c is specified by a set of random variables Xc such that ∀i, j ∈ c: i ∈ Nj and j ∈ Ni, where Ni and Nj are the sets of all neighbours of the variables Xi and Xj.

Any possible assignment of labels to the random variables will be called a labelling (denoted by x). In other words, x takes values from the set L = L^N. The posterior distribution Pr(x|D) over the labellings of the random field is a Gibbs distribution if it can be written in the form

    Pr(x|D) = (1/Z) exp(-Σ_{c∈C} ψc(xc)),    (1)

where Z is a normalizing constant known as the partition function, and C is the set of all cliques. The term ψc(xc) is known as the potential function of the clique c, where xc = {xi, i ∈ c}. The corresponding Gibbs energy is given by

    E(x) = -log Pr(x|D) - log Z = Σ_{c∈C} ψc(xc).    (2)

The maximum a posteriori (MAP) labelling x_map of the random field is defined as

    x_map = arg max_{x∈L} Pr(x|D) = arg min_{x∈L} E(x).    (3)

2.1. Submodular Energy Functions

Submodular set functions play an important role in energy minimization as they can be minimized in polynomial time [3, 9]. In this paper we will explain their properties in terms of functions of binary random variables, which can be seen as set functions [11].

Definition 1. A projection of a function f: L^n → R on s variables is a function f^p: L^s → R which is obtained by fixing the values of n - s arguments of f(·). Here p refers to the set of variables whose values have been fixed.

Example 1. The function f^p(x2, ..., xn) = f(0, x2, ..., xn) is a projection of the function f(x1, x2, ..., xn).

Definition 2. A function of one binary variable is always submodular. A function f(x1, x2) of two binary variables {x1, x2} is submodular if and only if

    f(0,0) + f(1,1) ≤ f(0,1) + f(1,0).    (4)

A function f: L^n → R is submodular if and only if all its projections on 2 variables are submodular [3, 11].

Minimizing submodular functions using graph cuts  Certain submodular functions can be minimized by solving an st-mincut problem [3]. Kolmogorov et al. [11] showed that all submodular functions of binary variables which can be written in terms of potential functions of cliques of sizes 2 and 3 can be minimized in this manner. Freedman and Drineas [7] extended this result by characterizing the class of functions F^n, involving higher order cliques defined on binary variables, whose minimization can be translated to an st-mincut problem. The class of multi-label submodular functions which can be translated into an st-mincut problem has also been characterized independently by [2] and [8].

2.2. Metric and Semi-metric Potential Functions

In this subsection we provide the constraints for pairwise potentials to define a metric or a semi-metric.

Definition 3. A potential function ψij(a, b) for a pairwise clique of two random variables {xi, xj} is said to be a semi-metric if it satisfies

    ψij(a, b) = 0  ⟺  a = b,    (5)
    ψij(a, b) = ψij(b, a) ≥ 0.    (6)
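Conditions (5) and (6) are easy to verify numerically. As an illustration (the code and the squared-difference potential below are our own example, not part of the paper), the following sketch checks whether a candidate pairwise potential is a semi-metric over a small label set:

```python
# A numerical check of the semi-metric conditions (5) and (6) on a small
# label set, using the squared-difference potential |a - b|^2 (cf. Example 2).

def psi(a, b):
    # pairwise potential psi_ij(a, b) = |a - b|^2
    return (a - b) ** 2

def is_semimetric(psi, labels):
    """Check psi(a,b) = 0 <=> a = b, and psi(a,b) = psi(b,a) >= 0."""
    for a in labels:
        for b in labels:
            if (psi(a, b) == 0) != (a == b):             # condition (5)
                return False
            if psi(a, b) != psi(b, a) or psi(a, b) < 0:  # condition (6)
                return False
    return True

labels = range(4)
print(is_semimetric(psi, labels))  # True
```

An asymmetric function such as ψ(a, b) = a − b fails condition (6) and is rejected by the same check.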
Definition 4. The potential function is a metric if, in addition to the above mentioned constraints, it also satisfies

    ψij(a, d) ≤ ψij(a, b) + ψij(b, d).    (7)

Example 2. The function ψij(a, b) = |a - b|^2 is a semi-metric but not a metric, as it does not always satisfy condition (7).

3. Move Making Algorithms

In this section we describe the move making algorithms of [5] for approximate energy minimization and explain the conditions under which they can be applied.

3.1. Minimizing P^2 functions

Boykov et al. [5] addressed the problem of minimizing energy functions consisting of unary and pairwise cliques. These functions can be written as

    E(x) = Σ_{i∈V} ψi(xi) + Σ_{i∈V, j∈Ni} ψij(xi, xj).    (8)

They proposed two move making algorithms, called α-expansion and αβ-swap, for this problem. These algorithms work by starting from an initial labelling x and making a series of changes (moves) which lower the energy iteratively. Convergence is achieved when the energy cannot be decreased further. At each step, the move decreasing the energy of the labelling by the largest amount is made. We will refer to such a move as optimal.

Boykov et al. [5] showed that the optimal moves for certain energy functions of the form (8) can be computed in polynomial time by solving an st-mincut problem. Specifically, they showed that if the pairwise potential functions ψij define a metric, then the energy function in equation (8) can be approximately minimized using α-expansion. Similarly, if ψij defines a semi-metric, it can be minimized using αβ-swap.

3.2. Binary Moves and Move Energies

The moves of both the α-expansion and αβ-swap algorithms can be represented as a vector of binary variables t = {ti, ∀i ∈ V}. A transformation function T(x, t) takes the current labelling x and a move t and returns the new labelling x̂ which has been induced by the move. The energy of a move t (denoted by Em(t)) is defined as the energy of the labelling x̂ it induces, i.e. Em(t) = E(T(x, t)). The optimal move is defined as t* = arg min_t E(T(x, t)).

As discussed in section 2.1, the optimal move t* can be computed in polynomial time if the function Em(t) is submodular. From definition 2 this implies that all projections of Em(t) on two variables should be submodular, i.e.

    E^p_m(0,0) + E^p_m(1,1) ≤ E^p_m(0,1) + E^p_m(1,0),  ∀p ∈ V × V.    (9)

3.3. The α-expansion algorithm

An α-expansion move allows any random variable to either retain its current label or take the label α. One iteration of the algorithm involves performing expansions for all α in L in some order successively. The transformation function Tα(·) for an α-expansion move transforms the label of a random variable Xi as

    Tα(xi, ti) = { α   if ti = 0,
                 { xi  if ti = 1.    (10)

The optimal α-expansion move can be computed in polynomial time if the energy function Eα(t) = E(Tα(x, t)) satisfies constraint (9). Substituting the value of Eα in (9) we get the constraint

    E^p(α, α) + E^p(xi, xj) ≤ E^p(xi, α) + E^p(α, xj),  ∀p ∈ V × V.    (11)

3.4. The αβ-swap algorithm

An αβ-swap move allows a random variable whose current label is α or β to either take the label α or the label β. One iteration of the algorithm involves performing swap moves for all pairs α, β in L in some order successively. The transformation function Tαβ(·) for an αβ-swap transforms the label of a random variable xi as

    Tαβ(xi, ti) = { α  if xi ∈ {α, β} and ti = 0,
                  { β  if xi ∈ {α, β} and ti = 1.    (12)

The optimal αβ-swap move can be computed in polynomial time if the energy function Eαβ(t) = E(Tαβ(x, t)) satisfies (9). As before, substituting the value of Eαβ in (9) we get the constraint

    E^p(α, α) + E^p(β, β) ≤ E^p(β, α) + E^p(α, β),  ∀p ∈ V × V.    (13)

In the next section we provide a class of energy functions which satisfy equations (11) and (13).

4. Characterizing P^n Functions

We now characterize a class of higher order clique potentials for which the expansion and swap moves can be computed in polynomial time. Recall that P^n functions are defined on cliques of size at most n. From the additivity theorem [11] it follows that the optimal moves for all energy functions composed of these clique potentials can be computed in polynomial time. We constrain the clique potentials to take the form

    ψc(xc) = fc(Qc(⊕, xc)),    (14)

where Qc(⊕, xc) is a functional defined as

    Qc(⊕, xc) = ⊕_{i,j∈c} φc(xi, xj).    (15)
Here fc is an arbitrary function of Qc, φc is a pairwise function defined on all pairs of random variables in the clique c, and ⊕ is an operator applied to these functions φc(xi, xj).

4.1. Conditions for αβ-swaps

We will now specify the constraints under which all αβ-swap moves for higher order clique potentials can be computed in polynomial time. For the moment we consider the case ⊕ = Σ, i.e.

    Qc(xc) = Σ_{i,j∈c} φc(xi, xj).    (16)

Theorem 1. The optimal αβ-swap move for any α, β ∈ L can be computed in polynomial time if the potential function ψc(xc) defined on the clique c is of the form (14), where fc(·) is a concave non-decreasing function (a function f(x) is concave if for any two points a, b and any λ with 0 ≤ λ ≤ 1: λf(a) + (1 − λ)f(b) ≤ f(λa + (1 − λ)b)), ⊕ = Σ, and φc(·,·) satisfies the constraints

    φc(a, b) = φc(b, a)  ∀a, b ∈ L,    (17)
    φc(a, b) ≥ φc(d, d)  ∀a, b, d ∈ L.    (18)

Proof. To prove that the optimal swap move can be computed in polynomial time we need to show that all projections on two variables of any αβ-swap move energy are submodular. From equation (13) this implies that ∀i, j ∈ c the condition

    ψc({α, α} ∪ xc\{i,j}) + ψc({β, β} ∪ xc\{i,j}) ≤ ψc({α, β} ∪ xc\{i,j}) + ψc({β, α} ∪ xc\{i,j})    (19)

should be satisfied. Here xc\{i,j} denotes the labelling of all variables Xu, u ∈ c, except i and j. The cost of any configuration {xi, xj} ∪ xc\{i,j} of the clique can be written as

    ψc({xi, xj} ∪ xc\{i,j}) = fc(Qc({xi, xj} ∪ xc\{i,j}))
        = fc(φc(xi, xj) + Qc\{i,j}(xc\{i,j}) + Σ_{u∈c\{i,j}} φc(xi, xu) + Σ_{u∈c\{i,j}} φc(xj, xu)).    (20)

Let D represent Qc\{i,j}(xc\{i,j}), Dα represent Σ_{u∈c\{i,j}} φc(α, xu), and Dβ represent Σ_{u∈c\{i,j}} φc(β, xu). Using equation (20), equation (19) becomes

    fc(φc(α, β) + Dα + Dβ + D) + fc(φc(β, α) + Dβ + Dα + D)
        ≥ fc(φc(α, α) + 2Dα + D) + fc(φc(β, β) + 2Dβ + D).    (21)

As φc(β, α) = φc(α, β) from constraint (17), this condition transforms to

    2fc(φc(α, β) + Dα + Dβ + D) ≥ fc(φc(α, α) + 2Dα + D) + fc(φc(β, β) + 2Dβ + D).    (22)

To prove (22) we need lemma 1.

Lemma 1. For a non-decreasing concave function f(x):

    2c ≥ a + b  ⟹  2f(c) ≥ f(a) + f(b).    (23)

Proof in the appendix.

Using the above lemma together with the fact that

    2φc(α, β) ≥ φc(α, α) + φc(β, β)  ∀α, β ∈ L    (24)

(see constraint (18)), we see that the theorem holds true.

The class of clique potentials described by theorem 1 is a strict generalization of the class specified by the constraints of [5], which can be obtained by considering only pairwise cliques, choosing fc() as a linear increasing function (all linear functions are concave), and constraining φc(a, a) = 0, ∀a ∈ L.

4.2. Conditions for α-expansions

In this subsection we characterize the higher order clique potentials for which the optimal α-expansion move can be computed in polynomial time for all α ∈ L, x ∈ L.

Theorem 2. The optimal α-expansion move for any α ∈ L can be computed in polynomial time if the potential function ψc(xc) defined on the clique c is of the form (14), where fc(·) is an increasing linear function, ⊕ = Σ, and φc(·,·) is a metric.

Proof. To prove that the optimal expansion move can be computed in polynomial time we need to show that all projections of any α-expansion move energy on two variables of the clique are submodular. From equation (11) this implies that ∀i, j ∈ c the condition

    ψc({α, α} ∪ xc\{i,j}) + ψc({a, b} ∪ xc\{i,j}) ≤ ψc({a, α} ∪ xc\{i,j}) + ψc({α, b} ∪ xc\{i,j})    (25)

is satisfied. Here a and b are the current labels of the variables Xi and Xj respectively.

Let D represent Qc\{i,j}(xc\{i,j}), and Dl represent Σ_{u∈c\{i,j}} φc(l, xu) for any label l. Then, using equation (20), the constraint (25) becomes

    fc(φc(α, b) + Dα + Db + D) + fc(φc(a, α) + Da + Dα + D)
        ≥ fc(φc(α, α) + 2Dα + D) + fc(φc(a, b) + Da + Db + D).    (26)
Let R1 = φc(α, b) + Dα + Db + D, R2 = φc(a, α) + Da + Dα + D, R3 = φc(α, α) + 2Dα + D, and R4 = φc(a, b) + Da + Db + D. Since φc(·,·) is a metric, we observe that

    φc(α, b) + φc(a, α) ≥ φc(α, α) + φc(a, b)    (27)
    ⟹  R1 + R2 ≥ R3 + R4.    (28)

Thus, we require a function f such that

    R1 + R2 ≥ R3 + R4  ⟹  f(R1) + f(R2) ≥ f(R3) + f(R4).    (29)

The following lemma provides us the form of this function.

Lemma 2. For a function f, y1 + y2 ≥ y3 + y4 ⟹ f(y1) + f(y2) ≥ f(y3) + f(y4) if and only if f is linear.

Proof in appendix.

Since fc(·) is linear, this proves the theorem.

It should be noted that the class of clique potentials defined by the above theorem is a small subset of the class of functions which can be used under αβ-swaps. In fact it is the same class of energy functions as defined by [5], i.e. P^2. This can be seen by observing that the potentials of the higher order cliques defined by theorem 2 can be represented as a sum of metric pairwise clique potentials. This raises the question whether we can define a class of higher order clique potentials which cannot be decomposed into a set of P^2 potentials and yet be solved using α-expansions. To answer this we define the P^n Potts model.

4.2.1 P^n Potts Model

We now introduce the P^n Potts model family of higher order clique potentials. This family is a strict generalization of the Generalized Potts model [5] and can be used for modelling many problems in Computer Vision.

We define the P^n Potts model potential for cliques of size n as

    ψc(xc) = { γk    if xi = lk, ∀i ∈ c,
             { γmax  otherwise,    (30)

where γmax > γk, ∀lk ∈ L. For a pairwise clique this reduces to the P^2 Potts model potential, defined as ψij(a, b) = γk if a = b = lk, and γmax otherwise. If we use γk = 0 for all lk, this function becomes an example of a metric potential function.

4.2.2 Going Beyond P^2 for α-expansions

We now show how the class of potential functions characterized in section 4.2 can be extended by using ⊕ = 'max' instead of ⊕ = Σ as in the previous subsection. To this end we define Qc(xc) as

    Qc(xc) = max_{i,j∈c} φc(xi, xj).    (31)

Theorem 3. The optimal α-expansion move for any α ∈ L can be computed in polynomial time if the potential function ψc(xc) defined on the clique c is of the form (14), where fc(·) is an increasing linear function, ⊕ = 'max', and φc(·,·) defines a P^2 Potts model.

Proof. The cost of any configuration {xi, xj} ∪ xc\{i,j} of the clique under ⊕ = 'max' can be written as

    ψc({xi, xj} ∪ xc\{i,j}) = fc(Qc({xi, xj} ∪ xc\{i,j}))    (32)
        = fc(max(φc(xi, xj), Qc\{i,j}(xc\{i,j}), max_{u∈c\{i,j}} φc(xi, xu), max_{u∈c\{i,j}} φc(xj, xu))).    (33)

Substituting this value of ψc in constraint (25), and again using D to represent Qc\{i,j}(xc\{i,j}) and Dl to represent max_{u∈c\{i,j}} φc(l, xu) for any label l, we get

    fc(max(φc(α, b), Dα, Db, D)) + fc(max(φc(a, α), Da, Dα, D))
        ≥ fc(max(φc(α, α), Dα, Dα, D)) + fc(max(φc(a, b), Da, Db, D)).    (34)

As fc is a linear function, from lemma 2 we see that the above condition is true if

    max(φc(α, b), Dα, Db, D) + max(φc(a, α), Da, Dα, D)
        ≥ max(φc(α, α), Dα, Dα, D) + max(φc(a, b), Da, Db, D).

We only consider the case a ≠ α and b ≠ α; it can easily be seen that for all other cases the above inequality is satisfied with equality. As φc is a P^2 Potts model potential, the LHS of the above inequality is always equal to 2γmax. As the maximum value of the RHS is 2γmax, the above inequality is always true.

Note that the class of potentials described in the above theorem is the same as the family of clique potentials defined by the P^n Potts model in equation (30) for a clique c of size n. This proves that for the P^n Potts model the optimal α-expansion move can be computed in polynomial time. In fact we will show that the optimal α-expansion and αβ-swap moves for this subset of potential functions can be found by solving an st-mincut problem.

4.3. Graph Cuts for the P^n Potts Model

We now consider the minimization of energy functions whose clique potentials form a P^n Potts model (see equation (30)). Specifically, we show that the optimal αβ-swap and α-expansion moves can be obtained by solving an st-mincut problem. The graph corresponding to the st-mincut is shown for only one clique. However, the additivity theorem [11] allows us to construct the graph for an arbitrary number of cliques by simply appending the graphs corresponding to single cliques.
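Before turning to the graph construction, the quantity that the st-mincut computes can be illustrated with a brute-force sketch (our own illustration, not the paper's construction): for a single clique with a P^n Potts potential (30), enumerate all 2^n swap moves and keep the cheapest. The γ values and labels below are arbitrary assumptions.

```python
# Brute-force computation of the optimal alpha-beta swap move for one clique
# with a P^n Potts potential (equation (30)); the st-mincut of section 4.3
# finds the same minimizing move without exhaustive enumeration.

from itertools import product

def pn_potts(clique_labels, gamma, gamma_max):
    # Equation (30): gamma_k if every variable takes label l_k, else gamma_max.
    if len(set(clique_labels)) == 1:
        return gamma[clique_labels[0]]
    return gamma_max

def swap_move(clique_labels, t, alpha, beta):
    # Transformation function (12): only variables labelled alpha/beta move.
    return [(alpha if ti == 0 else beta) if x in (alpha, beta) else x
            for x, ti in zip(clique_labels, t)]

def best_swap_move(clique_labels, alpha, beta, gamma, gamma_max):
    n = len(clique_labels)
    return min(product([0, 1], repeat=n),
               key=lambda t: pn_potts(swap_move(clique_labels, t, alpha, beta),
                                      gamma, gamma_max))

gamma = {'a': 1.0, 'b': 2.0, 'c': 0.5}
labels = ['a', 'b', 'a', 'b']            # every variable takes alpha or beta
t_star = best_swap_move(labels, 'a', 'b', gamma, gamma_max=5.0)
print(t_star)                            # (0, 0, 0, 0): move everything to 'a'
```

The enumeration above is exponential in the clique size; the point of section 4.3 is that the same minimizing move can be recovered in polynomial time from a single st-mincut.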
Figure 1. Graph construction for computing the optimal moves for the P^n Potts model.

αβ-swap: Given a clique xc, our aim is to find the optimal αβ-swap move (denoted by t*c). Since the clique potential ψc(xc) forms a P^n Potts model, its value after an αβ-swap move tc = {ti, i ∈ c}, where ti ∈ {0, 1}, is given by

    ψc(Tαβ(xc, tc)) = { γα    if ti = 0, ∀i ∈ c,
                      { γβ    if ti = 1, ∀i ∈ c,
                      { γmax  otherwise.    (35)

Further, we can add a constant κ to all possible values of the clique potential without changing the optimal move t*c. We choose κ = γmax − γα − γβ. Note that since γmax ≥ γα and γmax ≥ γβ, the following hold true:

    γα + κ ≥ 0,  γβ + κ ≥ 0,    (36)
    γα + κ + γβ + κ = γmax + κ.    (37)

Without loss of generality, we assume tc = {t1, t2, ..., tn}. Fig. 1 shows the graph construction corresponding to the above values of the clique potential. Here, the node vi corresponds to ti. In other words, after the computation of the st-mincut, vi is connected to the source (i.e. it belongs to the source set) if ti = 0, and vi is connected to the sink (i.e. it belongs to the sink set) if ti = 1. In addition, there are two extra nodes, denoted by Ms and Mt respectively. The weights of the graph are given by wd = γβ + κ and we = γα + κ. Note that all the weights are positive (see equations (36)). Recall that the cost of an st-mincut is the sum of the weights of the edges included in the st-mincut which go from the source set to the sink set. In order to show that this graph corresponds to the clique potential in equation (35) (plus the constant κ), we consider three cases:

• ti = 0, ∀i ∈ c: In this case, the st-mincut corresponds to the edge connecting Mt with the sink, which has a cost we = γα + κ.

• ti = 1, ∀i ∈ c: In this case, the st-mincut corresponds to the edge connecting the source with Ms, which has a cost wd = γβ + κ.

• All other cases: The st-mincut is given by the edges connecting Mt with the sink and the source with Ms. The cost of the cut is wd + we = γα + κ + γβ + κ = γmax + κ (from equation (37)).

Thus, we can find the optimal αβ-swap move for minimizing energy functions whose clique potentials form a P^n Potts model using an st-mincut operation.

α-expansion: Given a clique xc, our aim is to find the optimal α-expansion move t*c. Again, since the clique potential ψc(xc) forms a P^n Potts model, its value after an α-expansion move tc is given by

    ψc(Tα(xc, tc)) = { γα    if ti = 0, ∀i ∈ c,
                     { γ     if ti = 1, ∀i ∈ c,
                     { γmax  otherwise,    (38)

where γ = γβ if xi = β for all i ∈ c, for some label β ∈ L, and γ = γmax otherwise. Note that the above clique potential is similar to the one defined for the αβ-swap move in equation (35). Therefore, it can be represented using a graph by adding a constant κ = γmax − γα − γ. This proves that the optimal α-expansion move can be obtained using an st-mincut operation.

5. Texture-based Video Segmentation

We now present experimental results which illustrate the advantage of higher order cliques. We observe that higher order cliques provide a probabilistic formulation for exemplar based techniques. Exemplars are used in a wide range of computer vision applications, from 3D reconstruction [17] to object recognition [12, 21]. We consider one such problem, video segmentation, which can be stated as follows. Given a video and a small number of keyframes which have been segmented into k segments, the task is to segment all the frames of the video.

For this work, we do not incorporate the information provided by motion and segment each frame individually. The problem of segmenting a frame D (consisting of pixels Di, i ∈ {1, ..., N}) can be modelled using the CRF framework [13]. A CRF represents the conditional distribution of a set of random variables X = {X1, X2, ..., XN} given the data D. Each of the variables can take one label xi ∈ {1, 2, ..., k}. In our case, each variable Xi represents
Figure 2. Segmented keyframe of the garden sequence. The left image shows the keyframe while the right image shows the corresponding segmentation provided by the user. The four different colours indicate pixels belonging to the four segments, namely sky, house, garden and tree.

a pixel Di, and x = {x1, x2, ..., xN} describes a segmentation of the frame D. The most probable (i.e. maximum a posteriori) segmentation can be obtained by (approximately) minimizing the corresponding Gibbs energy.

Pairwise CRF: For the problem of segmentation, it is common practice to assume a pairwise CRF where the cliques are of size at most two [1, 4, 20]. In this case, the Gibbs energy of the CRF is of the form

    E(x) = Σ_i ψi(xi) + Σ_i Σ_{j∈Ni} ψij(xi, xj),    (39)

where Ni is the neighbourhood of pixel Di (defined as the 8-neighbourhood). The unary potential ψi(xi) is defined using the RGB distributions Ha, a = 1, ..., k, of the segments as

    ψi(xi) = −log p(Di | Ha),  when xi = a.    (40)

The distributions Ha are learnt using the segmented keyframes provided by the user. The pairwise potentials ψij(xi, xj) are defined such that they encourage contiguous segments whose boundaries lie on image edges, i.e.

    ψij(xi, xj) = { λ1 + λ2 exp(−g²(i, j) / (2σ²))  if xi ≠ xj,
                  { 0                               if xi = xj,    (41)

where λ1, λ2 and σ are some parameters. The term g(i, j) represents the difference between the RGB values of pixels Di and Dj. We refer the reader to [4] for details. Note that the pairwise potentials ψij(xi, xj) form a metric. Hence, the energy function in equation (39) can be minimized using both the αβ-swap and α-expansion algorithms.

We use the following values of the parameters for all our experiments: λ1 = 0.6, λ2 = 6 and σ = 5. Fig. 2 shows the segmented keyframe of the well-known garden sequence. Fig. 3 (row 2) shows the segmentation obtained for four frames by minimizing the energy function in equation (39) using the αβ-swap algorithm. Note that these frames are different from the keyframe (see Fig. 3 (row 1)). Fig. 3 (row 3) shows the result when the α-expansion algorithm is used.

Figure 3. The first row shows four frames of the garden sequence. The second row shows the segmentation obtained by minimizing the energy of the pairwise CRF (in equation (39)) using the αβ-swap algorithm. The four different colours indicate pixels belonging to the four segments. The segmentations obtained using α-expansion to minimize the same energy are shown in the third row. The fourth row shows the results obtained by minimizing the energy containing higher order clique terms which form a P^n Potts model (given in equation (42)) using the αβ-swap algorithm. The fifth row shows the results obtained using the α-expansion algorithm to minimize the energy in equation (42). The use of higher order cliques results in more accurate segmentation.

Note that, as expected, the α-expansion algorithm performs better than αβ-swap. Further, the α-expansion algorithm takes an average of 3.7 seconds per frame, compared to the 4.7 seconds required by the αβ-swap algorithm. However, the segmentations obtained by both algorithms are inaccurate due to the use of small clique sizes.

Higher Order Cliques: We overcome the problem of the pairwise CRF framework by developing an exemplar based segmentation technique. Specifically, we use textons to represent the appearance of image patches. Unlike the distributions Ha, which describe the potential for one variable Xi, textons capture the rich statistics of natural images [15, 23]. In this work, we define textons to be n × n patches. Using the segmented keyframes, we obtain a dictionary of textons for each segment s = 1, ..., k (denoted by Ps). Note, however, that our framework is independent of the representation of textons and is applicable wherever a dictionary of exemplars can be learnt. As we will describe later, the likelihood of a patch Dc belonging to the segment s can be computed using the dictionary Ps. The resulting texture-based video segmentation problem can be formulated using a CRF composed of higher order cliques. We define the
Figure 4. The keyframe of the 'Dayton' video sequence segmented into three segments.

Gibbs energy of this CRF as

    E(x) = Σ_i ψi(xi) + Σ_i Σ_{j∈Ni} ψij(xi, xj) + Σ_{c∈C} ψc(xc),    (42)

where c is a clique which represents the patch Dc = {Di, i ∈ c} of the frame D, and C is the set of all cliques. Note that we use overlapping patches Dc such that |C| = N. The unary potentials ψi(xi) and the pairwise potentials ψij(xi, xj) are given by equations (40) and (41) respectively. The clique potentials ψc(xc) are defined such that they form a P^n Potts model, i.e.

    ψc(xc) = { λ3 G(c, s)  if xi = s, ∀i ∈ c,
             { λ4          otherwise.    (43)

Here G(c, s) is the minimum difference between the RGB values of the patch Dc and all textons belonging to the dictionary Ps. Note that the above energy function encourages the patches Dc which are similar to a texton in Ps to take the label s. Since the clique potentials form a P^n Potts model, the energy can be minimized using the αβ-swap and α-expansion algorithms as described in section 4.3.

We use λ3 = 0.6 and λ4 = 6.5 with textons of size 4 × 4. Fig. 3 (row 4) shows the segmentations obtained when the energy function in equation (42) is minimized using αβ-swap. Fig. 3 (row 5) shows the results obtained using the α-expansion algorithm. Again, the α-expansion algorithm performs better than αβ-swap. On average, α-expansion takes 4.42 seconds while αβ-swap takes 5 seconds, which is comparable to the case when the pairwise CRF is used. In both cases the use of higher order cliques provides more accurate segmentation than the pairwise CRF formulation.

Fig. 4 shows another example of a segmented keyframe from a video sequence. The segmentations obtained for four frames of this video are shown in Fig. 5. Note that even though we do not use motion information, the segmentations provided by higher order cliques are comparable to methods based on layered motion segmentation.

Figure 5. Segmentation results of the 'Dayton' sequence. Rows 2 and 3 show the results obtained for the frames shown in row 1 by minimizing the energy function in equation (39) using αβ-swap and α-expansion respectively. Rows 4 and 5 show the segmentations obtained by minimizing the energy in equation (42) using αβ-swap and α-expansion respectively. The use of higher order cliques results in more accurate segmentation.

6. Discussion and Conclusions

In this paper we have characterized a class of higher order clique potentials for which the optimal expansion and swap moves can be computed in polynomial time. We also introduced the P^n Potts model family of clique potentials and showed that the optimal moves for it can be computed using graph cuts. Their use is demonstrated on the texture based video segmentation problem. The above class of clique potentials can be used in various exemplar-based Computer Vision applications, such as object recognition [12] and novel view synthesis [6].

We conclude with the observation that the optimal moves for many interesting clique potentials, such as those that preserve planarity, are NP-hard to compute (proof in appendix). Hence, they do not lend themselves to efficient move making algorithms.

References

[1] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation using an adaptive GMMRF model. In ECCV, pages I: 428-441, 2004.
[2] F. Boris. Strukturelle Bilderkennung (Structural image recognition). Technical report, Fakultät Informatik, Technische Universität Dresden, Germany. Habilitation thesis, in German, 2002.
[3] E. Boros and P. Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 123(1-3):155-225, 2002.
[4] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, pages I: 105-112, 2001.
[5] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 23(11):1222-1239, 2001.
[6] A. W. Fitzgibbon, Y. Wexler, and A. Zisserman. Image-based rendering using image-based priors. In ICCV, pages 1176–1183, 2003.
[7] D. Freedman and P. Drineas. Energy minimization via graph cuts: Settling what is possible. In CVPR, pages II: 939–946, 2005.
[8] H. Ishikawa. Exact optimization for Markov random fields with convex priors. PAMI, 25(10):1333–1336, 2003.
[9] S. Iwata, S. T. McCormick, and M. Shigeno. A strongly polynomial cut canceling algorithm for the submodular flow problem. Lecture Notes in Computer Science, 1610, 1999.
[10] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. PAMI, 28(10):1568–1583, 2006.
[11] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? PAMI, 26(2):147–159, 2004.
[12] M. P. Kumar, P. H. S. Torr, and A. Zisserman. Extending pictorial structures for object recognition. In BMVC, pages II: 789–798, 2004.
[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labelling sequence data. In ICML, 2001.
[14] X. Lan, S. Roth, D. P. Huttenlocher, and M. J. Black. Efficient belief propagation with learned higher-order Markov random fields. In ECCV (2), pages 269–282, 2006.
[15] T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In ICCV, pages 1010–1017, 1999.
[16] T. Meltzer, C. Yanover, and Y. Weiss. Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation. In ICCV, pages 428–435, 2005.
[17] A. Neubeck, A. Zalesny, and L. van Gool. 3D texture reconstruction from extensive BTF data. In Texture, pages 13–18, 2005.
[18] R. Paget and I. D. Longstaff. Texture synthesis via a noncausal nonparametric multiscale Markov random field. IEEE Transactions on Image Processing, 7(6):925–931, 1998.
[19] S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In CVPR, pages 860–867, 2005.
[20] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, pages 309–314, 2004.
[21] F. Schroff, A. Criminisi, and A. Zisserman. Single-histogram class models for image segmentation. In ICVGIP, pages 82–93, 2006.
[22] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. F. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields. In ECCV (2), pages 16–29, 2006.
[23] M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In CVPR, pages 691–698, 2003.
[24] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization for approximate inference on loopy graphs. In NIPS, pages 1001–1008, 2001.
[25] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS, pages 689–695, 2000.

Appendix

Lemma 1. For a non-decreasing concave function f(·),

    2c ≥ a + b  ⟹  2f(c) ≥ f(a) + f(b).    (44)

Proof. Given: c ≥ (a + b)/2.

    ⟹  f(c) ≥ f((a + b)/2)        (f is non-decreasing)    (45)
    ⟹  f(c) ≥ (f(a) + f(b))/2     (f is concave)           (46)
    ⟹  2f(c) ≥ f(a) + f(b)                                 (47)

Lemma 2. For a function f(·),

    y1 + y2 ≥ y3 + y4  ⟹  f(y1) + f(y2) ≥ f(y3) + f(y4)    (48)

if and only if f(·) is linear.

Proof. We only prove the 'only if' part; the 'if' part (that implication (48) holds whenever f(·) is linear) is trivial. For any x and ε, setting y1 = x, y2 = ε, y3 = x − ε, y4 = 2ε in (48):

    x + ε ≥ (x − ε) + 2ε                      (49)
    ⟹  f(x) + f(ε) ≥ f(x − ε) + f(2ε)        (50)
    ⟹  f(x) − f(x − ε) ≥ f(2ε) − f(ε)        (51)

Similarly, starting from x + ε ≤ (x − ε) + 2ε, we get

    f(x) − f(x − ε) ≤ f(2ε) − f(ε).    (52)

From equations (51) and (52) we get:

    f(x) − f(x − ε) = f(2ε) − f(ε).    (53)

Taking the limit ε → 0 we get the condition that the derivative (slope) of the function is constant. Hence, f(·) is linear.

Planarity Preserving Clique Potentials

Most approaches for disparity estimation assume a Potts model prior on the disparities of neighbouring pixels. This prior favours regions of constant disparity and penalizes discontinuities in the disparity map. This has the severe side-effect of assigning a high cost to planar regions in the scene which are not orthogonal to the camera axis.

The above problem can be rectified by using a higher order clique potential which is planarity preserving. We define this potential as follows. Consider a higher order clique consisting of three variables X1, X2, and X3 which can take a disparity label from the set L = {1, 2, ..., l}. As before we will use xi to denote a labelling of variable Xi. We say that the clique potential function ψc(·) is planarity preserving if

    ψc(x1, x2, x3) = { γ1   if x2 − x1 = x3 − x2 = δ for some δ,
                     { γ2   otherwise,                              (54)

where γ1 < γ2. We will now show that the optimal α-expansion moves for this family of clique potentials cannot always be computed in polynomial time.
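The potential of equation (54) can be sketched as a small function. The reading assumed below (following the case analysis used in the proof of Theorem 4) is that a clique receives the low cost γ1 exactly when the disparity gradient is constant across it, i.e. x2 − x1 = x3 − x2; the function name and γ values are illustrative, subject only to γ1 < γ2.

```python
# Sketch of the planarity preserving clique potential of equation (54).
# Illustrative names and values; the only requirement is gamma1 < gamma2.

def planarity_potential(x1, x2, x3, gamma1=1.0, gamma2=2.0):
    # Low cost when the disparity gradient is constant across the clique:
    # x2 - x1 = x3 - x2 (= delta for some delta).
    if x2 - x1 == x3 - x2:
        return gamma1
    return gamma2

# A fronto-parallel plane (constant disparity, delta = 0) and a slanted
# plane (constant gradient, delta = 1) both receive the low cost:
assert planarity_potential(5, 5, 5) == 1.0   # constant disparity
assert planarity_potential(1, 2, 3) == 1.0   # slanted plane
assert planarity_potential(1, 2, 1) == 2.0   # discontinuity
```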
Theorem 4. The optimal expansion move for a clique potential ψc of the form (54) cannot always be computed in polynomial time for all α and configurations x.

Proof. We prove the theorem by exhibiting an α-expansion move whose move energy is non-submodular.

Consider the configuration {x1, x2, x3} = {1, 2, 1} on which we want to perform an α-expansion move with α = 3. The move energy Em^123 is a function of the three move variables t1, t2, and t3 and is defined as:

    Em^123(t1, t2, t3) = | Em(0, 0, 0)  Em(0, 0, 1) |
                         | Em(0, 1, 0)  Em(0, 1, 1) |
                         | Em(1, 0, 0)  Em(1, 0, 1) |
                         | Em(1, 1, 0)  Em(1, 1, 1) |

which can be written in terms of ψc as:

    Em^123 = | ψc(Tα({0, 0, 0}, x))  ψc(Tα({0, 0, 1}, x)) |
             | ψc(Tα({0, 1, 0}, x))  ψc(Tα({0, 1, 1}, x)) |
             | ψc(Tα({1, 0, 0}, x))  ψc(Tα({1, 0, 1}, x)) |
             | ψc(Tα({1, 1, 0}, x))  ψc(Tα({1, 1, 1}, x)) |

For α = 3 and x = {1, 2, 1} this becomes:

    | ψc(1, 2, 1)  ψc(1, 2, 3) |   | γ2  γ1 |
    | ψc(1, 3, 1)  ψc(1, 3, 3) | = | γ2  γ2 |
    | ψc(3, 2, 1)  ψc(3, 2, 3) |   | γ1  γ2 |
    | ψc(3, 3, 1)  ψc(3, 3, 3) |   | γ2  γ1 |

From the definition of submodularity, all projections of the move energy Em^123 need to be submodular. Consider the submodularity constraint of the projection Em(0, t2, t3):

    Em(0, 0, 0) + Em(0, 1, 1) ≤ Em(0, 0, 1) + Em(0, 1, 0),    (55)

i.e. γ2 + γ2 ≤ γ1 + γ2. This constraint is not satisfied since γ1 < γ2.

A similar result can also be proved for αβ-swap moves by considering the configuration x = {1, 2, 1} and the 1,3-swap move.
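The counting in the proof can be checked mechanically. The sketch below (names and γ values are ours, subject only to γ1 < γ2) builds the move energy of the α-expansion move with α = 3 on x = {1, 2, 1} and verifies that constraint (55) is violated.

```python
# Numeric check of Theorem 4: the submodularity constraint (55) of the
# projection Em(0, t2, t3) fails for the planarity preserving potential.

GAMMA1, GAMMA2 = 1.0, 2.0   # illustrative values with GAMMA1 < GAMMA2

def psi_c(x1, x2, x3):
    # planarity preserving potential, equation (54)
    return GAMMA1 if x2 - x1 == x3 - x2 else GAMMA2

def T_alpha(t, x, alpha):
    # expansion transformation: ti = 1 takes label alpha, ti = 0 keeps xi
    return tuple(alpha if ti == 1 else xi for ti, xi in zip(t, x))

def Em(t, x=(1, 2, 1), alpha=3):
    # move energy of the alpha-expansion move on configuration x
    return psi_c(*T_alpha(t, x, alpha))

# Constraint (55): Em(0,0,0) + Em(0,1,1) <= Em(0,0,1) + Em(0,1,0)
lhs = Em((0, 0, 0)) + Em((0, 1, 1))   # gamma2 + gamma2
rhs = Em((0, 0, 1)) + Em((0, 1, 0))   # gamma1 + gamma2
assert lhs > rhs   # constraint violated: the move energy is non-submodular
```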