Probabilistic Inference Using Markov Chain Monte Carlo Methods

Abstract
Probabilistic inference is an attractive approach to uncertain reasoning and
empirical learning in artificial intelligence. Computational difficulties arise, however,
because probabilistic models with the necessary realism and flexibility lead to
complex distributions over high-dimensional spaces.
Related problems in other fields have been tackled using Monte Carlo methods based
on sampling using Markov chains, providing a rich array of techniques that can be
applied to problems in artificial intelligence. The "Metropolis algorithm" has been
used to solve difficult problems in statistical physics for over forty years, and, in the
last few years, the related method of "Gibbs sampling" has been applied to problems
of statistical inference. Concurrently, an alternative method for solving problems
in statistical physics by means of dynamical simulation has been developed as well,
and has recently been unified with the Metropolis algorithm to produce the "hybrid
Monte Carlo" method. In computer science, Markov chain sampling is the basis
of the heuristic optimization technique of "simulated annealing", and has recently
been used in randomized algorithms for approximate counting of large sets.
In this review, I outline the role of probabilistic inference in artificial intelligence,
present the theory of Markov chains, and describe various Markov chain Monte
Carlo algorithms, along with a number of supporting techniques. I try to present a
comprehensive picture of the range of methods that have been developed, including
techniques from the varied literature that have not yet seen wide application in
artificial intelligence, but which appear relevant. As illustrative examples, I use the
problems of probabilistic inference in expert systems, discovery of latent classes from
data, and Bayesian learning for neural networks.
Acknowledgements
I thank David MacKay, Richard Mann, Chris Williams, and the members of my
Ph.D committee, Geoffrey Hinton, Rudi Mathon, Demetri Terzopoulos, and Rob
Tibshirani, for their helpful comments on this review. This work was supported
by the Natural Sciences and Engineering Research Council of Canada and by the
Ontario Information Technology Research Centre.

1. Introduction
Probability is a well-understood method of representing uncertain knowledge and reasoning
to uncertain conclusions. It is applicable to low-level tasks such as perception, and to
high-level tasks such as planning. In the Bayesian framework, learning the probabilistic models
needed for such tasks from empirical data is also considered a problem of probabilistic
inference, in a larger space that encompasses various possible models and their parameter
values. To tackle the complex problems that arise in artificial intelligence, flexible
methods for formulating models are needed. Techniques that have been found useful include
the specification of dependencies using "belief networks", approximation of functions using
"neural networks", the introduction of unobservable "latent variables", and the hierarchical
formulation of models using "hyperparameters".
Such flexible models come with a price, however. The probability distributions they give rise
to can be very complex, with probabilities varying greatly over a high-dimensional space.
There may be no way to usefully characterize such distributions analytically. Often, however,
a sample of points drawn from such a distribution can provide a satisfactory picture of it.
In particular, from such a sample we can obtain Monte Carlo estimates for the expectations
of various functions of the variables. Suppose $X = \{X_1, \ldots, X_n\}$ is the set of random
variables that characterize the situation being modeled, taking on values usually written as
$x_1, \ldots, x_n$, or some typographical variation thereon. These variables might, for example,
represent parameters of the model, hidden features of the objects modeled, or features of
objects that may be observed in the future. The expectation of a function $a(X_1, \ldots, X_n)$
(its average value with respect to the distribution over $X$) can be approximated by
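\begin{align}
\langle a \rangle &= \sum_{\tilde{x}_1} \cdots \sum_{\tilde{x}_n} a(\tilde{x}_1, \ldots, \tilde{x}_n)\, P(X_1 = \tilde{x}_1, \ldots, X_n = \tilde{x}_n) \tag{1.1} \\
&\approx \frac{1}{N} \sum_{t=0}^{N-1} a(x_1^{(t)}, \ldots, x_n^{(t)}) \tag{1.2}
\end{align}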
where $x_1^{(t)}, \ldots, x_n^{(t)}$ are the values for the $t$-th point in a sample of size $N$. (As above, I will
often distinguish variables in summations using tildes.) Problems of prediction and decision
can generally be formulated in terms of finding such expectations.
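
As a rough illustration of equations (1.1) and (1.2), the following Python sketch compares the exact sum with a sample average on a toy problem. Everything specific here is an assumption made for the example and does not come from the text: the three independent binary variables, the probability 0.3, the function `a` that counts the variables equal to 1, and the sample size. The sample is also drawn independently with the standard library, rather than by the Markov chain methods this review is about, simply because the toy distribution is small enough to sample from directly.

```python
import itertools
import random


def exact_expectation(a, probs):
    """Equation (1.1): sum a(x) * P(X = x) over every joint assignment x."""
    return sum(a(x) * p for x, p in probs.items())


def monte_carlo_expectation(a, sample):
    """Equation (1.2): average a over the N points in the sample."""
    return sum(a(x) for x in sample) / len(sample)


# Hypothetical toy model: three independent binary variables,
# each equal to 1 with probability 0.3.
P_ONE = 0.3
probs = {}
for x in itertools.product((0, 1), repeat=3):
    p = 1.0
    for xi in x:
        p *= P_ONE if xi == 1 else 1.0 - P_ONE
    probs[x] = p


def a(x):
    """Illustrative function of the variables: how many of them equal 1."""
    return sum(x)


# Draw N independent points from the joint distribution (fixed seed for
# repeatability), then compare the two ways of computing the expectation.
rng = random.Random(0)
sample = rng.choices(list(probs.keys()), weights=list(probs.values()), k=20000)

exact = exact_expectation(a, probs)
estimate = monte_carlo_expectation(a, sample)
print(f"exact = {exact:.4f}, Monte Carlo estimate = {estimate:.4f}")
```

The exact sum visits all $2^n$ joint assignments, which is feasible only because $n = 3$ here. The sample average costs the same per point regardless of $n$, which is why it remains usable when the exact sum does not.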
Generating samples from the complex distributions encountered in artificial intelligence
applications is often not easy, however. Typically, most of the probability is concentrated
in regions whose volume is a tiny fraction of the total. To generate points drawn from
the distribution with reasonable efficiency, the sampling procedure must search for these
relevant regions. It must do so, moreover, in a fashion that does not bias the results.
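
To give a feel for how small such a region can be, the sketch below measures how often points drawn uniformly from the unit cube land in a sub-cube of side 0.5 as the dimension grows. The setup is entirely hypothetical and chosen only for illustration: the sub-cube stands in for a region of concentrated probability, and the side length, dimensions, and number of points are arbitrary. The volume fraction of the sub-cube is $0.5^n$, so a sampler that does not search for the region finds it less and less often.

```python
import random


def hit_fraction(n_dims, side, n_points, rng):
    """Fraction of uniform points in the unit cube that fall in [0, side)^n_dims."""
    hits = 0
    for _ in range(n_points):
        # A point counts as a hit only if every coordinate is below `side`.
        if all(rng.random() < side for _ in range(n_dims)):
            hits += 1
    return hits / n_points


rng = random.Random(1)
SIDE = 0.5        # assumed width of the relevant region along each dimension
N_POINTS = 20000  # number of uniform points tried per dimension setting

for n_dims in (1, 2, 5, 10, 20):
    volume_fraction = SIDE ** n_dims
    observed = hit_fraction(n_dims, SIDE, N_POINTS, rng)
    print(f"n={n_dims:2d}  volume fraction={volume_fraction:.2e}  "
          f"observed hit rate={observed:.2e}")
```

At twenty dimensions the volume fraction is already below one in a million, so nearly all of the uniformly drawn points are wasted.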
Sampling methods based on Markov chains incorporate the required search aspect in a
framework where it can be proved that the correct distribution is generated, at least in
the limit as the length of the chain grows. Writing $X^{(t)} = \{X_1^{(t)}, \ldots, X_n^{(t)}\}$ for the set of
variables at step $t$, the chain is de