Ensemble MCMC

I’m glad to have managed, before teaching starts again, to have finished a Technical Report (available here or at arxiv.org) with what may be my most unwieldy title ever:

MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression

I wanted the title to mention all three of the nested ideas in the paper. Actually, I wasn’t able to fit in a fourth, most general, idea, of MCMC methods based on caching and mapping (see here). Here is the abstract:

I introduce a Markov chain Monte Carlo (MCMC) scheme in which sampling from a distribution with density π(x) is done using updates operating on an “ensemble” of states. The current state x is first stochastically mapped to an ensemble, x(1),…,x(K). This ensemble is then updated using MCMC updates that leave invariant a suitable ensemble density, ρ(x(1),…,x(K)), defined in terms of π(x(i)) for i=1,…,K. Finally a single state is stochastically selected from the ensemble after these updates. Such ensemble MCMC updates can be useful when characteristics of π and the ensemble permit π(x(i)) for all i in {1,…,K} to be computed in less than K times the amount of computation time needed to compute π(x) for a single x. One common situation of this type is when changes to some “fast” variables allow for quick re-computation of the density, whereas changes to other “slow” variables do not. Gaussian process regression models are an example of this sort of problem, with an overall scaling factor for covariances and the noise variance being fast variables. I show that ensemble MCMC for Gaussian process regression models can indeed substantially improve sampling performance. Finally, I discuss other possible applications of ensemble MCMC, and its relationship to the “multiple-try Metropolis” method of Liu, Liang, and Wong and the “multiset sampler” of Leman, Chen, and Lavine.

I’ve also posted the programs used to produce the results. These haven’t been tested much beyond their use for the paper, but I hope to incorporate them into a general MCMC package in R (also including programs accompanying my review of Hamiltonian Monte Carlo). That’s my next project to do in whatever time I have available after teaching, administration, and a three-year-old daughter, along with more efforts to speed up R, so the that this MCMC package won’t be too slow.

You say in the TR that you find it hard to imagine how ensemble methods would be beneficial without some computational short-cut in computing the density. I was wondering if you might elaborate on this a bit.

It seems to me like it should be possible to exploit the more global picture of a distribution that a set of samples can give to construct good transitions. For a really simple example, imagine using x^(2), …, x^(K) to tune the variance of a Metropolis proposal to update x^(1). Is there a particular reason that you think such schemes wouldn’t be an improvement on updating x^(1) using a fixed proposal?

You might be right. You’ll notice that I didn’t phrase this comment as a theorem.

I’ve thought of trying to get an ensemble MCMC version of the Nelder-Mead simplex method for optimization to do something useful, but haven’t succeeded so far. It might be that there is some useful “adaptive” scheme using ensembles such as you suggest. There is a big overhead to overcome, though, if you’re keeping around K points in your state with no computational short-cut.

So five months later, here’s a link to a method that uses ensembles to adapt the proposal. It uses x^(2), …, x^(K) to tune the variance of a Metropolis proposal to update x^(1) in a particularly clever way: the update rule is to select two members of the ensemble at random and add their difference times a scalar factor (plus a small continuous random jitter) to the element being updated. The intuition is that if the elements of the ensemble were sampled from the target, then the difference of two elements would have mean 0 and covariance proportional to the covariance of the target distribution.

4.Simon | 2011-03-17 at 11:45 am

Just letting you know there’s a typo in equation 6: the argument of the delta function and its centre should be switched.

I don’t think so. The notation is supposed to represent a distribution for the k’th element of the ensemble that is concentrated at the point x that we are mapping from. The sum is due to the possibility of the point x occupying any of the K positions in the ensemble.

My mistake – the two forms are equivalent. Thanks for the quick reply.

7.Jacopo Pantaleoni | 2011-10-12 at 6:52 am

Just wanted to point out another great connection between this and some previous work of yours (in case you didn’t make it already): your Hamiltonian Monte Carlo variant with windowed acceptance of states can also be seen as an ensemble method (in the case of a full window L=W-1 there just happens to be no transition in the ensemble space).
Congratulations for all your excellent work, btw.

Although my personal experience is that on hard, massively modal target distributions (like the ones found in light transport simulation), ensemble methods (like this one or the DE-MC mentioned above) have a hard time to compete with plain Metropolis-Hastings unless, as you say, there is a speed advantage in taking K closely related samples – but of course there might still be exceptions. It’s particularly difficult when one’s task is to estimate marginal distributions, where higher ensemble correlation is directly visible in the results.