When working with Markov chain Monte Carlo (MCMC) to draw inference, we need a chain that mixes rapidly, i.e. one that moves through the support of the posterior distribution rapidly. But I don't understand why we need this property, because from what I understand, the accepted candidate draws should and will be concentrated in the high-density part of the posterior distribution. If that is true, why do we still want the chain to move through the support (which includes the low-density part)?

In addition, if I am using MCMC to do optimization, do I still need to care about rapid mixing and why?

$\begingroup$It is known in the MCMC literature that when a Markov chain is geometrically ergodic, it has exponentially fast alpha-mixing decay. I am unclear how $X_n$ could converge rapidly to the target distribution and yet maintain high correlation between successive samples. Are there any simple examples? Thanks for any inputs!$\endgroup$
– user35260 Nov 24 '13 at 20:34

3 Answers

The ideal Monte Carlo algorithm uses independent successive random values. In MCMC, successive values are not independent, which makes the method converge more slowly than ideal Monte Carlo; however, the faster it mixes, the faster the dependence decays across successive iterations¹, and the faster it converges.

¹ I mean here that the successive values are quickly "almost independent" of the initial state, or rather that given the value $X_n$ at one point, the values $X_{n+k}$ quickly become "almost independent" of $X_n$ as $k$ grows; so, as qkhhly says in the comments, the chain doesn't "keep stuck in a certain region of the state space".

Edit: I think the following example can help

Imagine you want to estimate the mean of the uniform distribution on $\{1, \dots, n\}$ by MCMC. You start with the ordered sequence $(1, \dots, n)$; at each step, you choose $k \ge 2$ elements in the sequence and randomly shuffle them. At each step, the element at position 1 is recorded; this converges to the uniform distribution. The value of $k$ controls the mixing speed: when $k=2$, it is slow; when $k=n$, the successive elements are independent and the mixing is fast.

You can see that with $k=2$, the influence of the initial value is still felt after 100 iterations, and the result is terrible. With $k=50$ it seems OK, though with a larger standard deviation than with $k=99$.
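The experiment above can be sketched in a few lines of Python (my own sketch, not the answer's original code; the function name `shuffle_chain` and the parameter defaults are my choices):

```python
import random

def shuffle_chain(n=100, k=2, steps=100, seed=0):
    """Run the shuffle chain: start from (1, ..., n); at each step,
    pick k distinct positions at random and shuffle the elements
    sitting there. Return the trace of the element at position 1."""
    rng = random.Random(seed)
    seq = list(range(1, n + 1))
    trace = []
    for _ in range(steps):
        pos = rng.sample(range(n), k)      # k distinct positions
        vals = [seq[p] for p in pos]
        rng.shuffle(vals)                  # shuffle only those entries
        for p, v in zip(pos, vals):
            seq[p] = v
        trace.append(seq[0])
    return trace

# k = 2: position 1 is touched rarely, so the trace stays near the
# initial value 1 for a long time (slow mixing).
slow = shuffle_chain(k=2)
# k = n: every step is a full reshuffle, so successive draws are
# independent uniforms on {1, ..., 100} (fast mixing).
fast = shuffle_chain(k=100)
```

Averaging `fast` gives something close to the true mean $(n+1)/2 = 50.5$ after only 100 iterations, while averaging `slow` is still dominated by the starting configuration.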

$\begingroup$I don't think the statement "the faster it mixes, the faster the dependence decays in successive iterations" is correct. Successive iterations will always be dependent using the Metropolis-Hastings algorithm, for example. Mixing has to do with how fast your samples converge to the target distribution, not how dependent successive iterations are.$\endgroup$
– Macro Jan 1 '12 at 20:23

$\begingroup$This is the same: if it converges fast to the target distribution, the dependence on the initial state decays fast... and of course this holds at any point of the chain (which could have been chosen as the initial state). I think the final part of the above example is enlightening in this respect.$\endgroup$
– Elvis Jan 1 '12 at 20:32


$\begingroup$Yes, dependence from the initial state decays, not necessarily dependence between successive iterations.$\endgroup$
– Macro Jan 1 '12 at 22:22

$\begingroup$I think I understand what mixing rapidly means. It's not that the chain moves to every part of the support of the target distribution. Rather, it's about the chain not getting stuck in a certain part of the support.$\endgroup$
– qkhhly Jan 2 '12 at 1:42

To complement both earlier answers, mixing is only one aspect of MCMC convergence. It is indeed directly connected with the speed at which the Markov chain $(X_n)$ forgets its initial value or distribution. For instance, the mathematical notion of $\alpha$-mixing is defined by the measure

$$
\alpha(n) = \sup_{A,B} \left\{\,\left|P(X_0\in A, X_n\in B) - P(X_0\in A)\,P(X_n\in B)\right|\,\right\}\,,\quad n\in \mathbb{N}\,,
$$
whose speed of convergence to zero is characteristic of the mixing. However, this measure is not directly related to the speed with which $(X_n)$ converges to the target distribution $\pi$. One may get very fast convergence to the target and still keep high correlation between the elements of the chain.
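A minimal illustration of this last point (my own sketch, not from the answer): an AR(1) chain started from its stationary distribution is exactly at the target $N(0,1)$ at every iteration, yet successive draws keep correlation close to $\rho$.

```python
import math
import random

# AR(1) chain: X_{n+1} = rho*X_n + sqrt(1 - rho^2)*Z_n, Z_n ~ N(0,1).
# If X_0 ~ N(0,1), then every X_n is exactly N(0,1): the chain has
# "converged" from the very first step, yet neighbouring samples stay
# highly correlated -- fast convergence, slow decorrelation.
rho = 0.95
rng = random.Random(0)
x = rng.gauss(0, 1)                       # start in the stationary law
xs = []
for _ in range(20000):
    x = rho * x + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
    xs.append(x)

# Empirical lag-1 autocorrelation: close to rho = 0.95.
m = sum(xs) / len(xs)
lag1 = (sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
        / sum((v - m) ** 2 for v in xs))
```

Every marginal is on target, but the effective sample size of the 20,000 draws is far smaller than 20,000 because of the high lag-1 correlation.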

Furthermore, independence between the $X_n$'s is only relevant in some settings. When aiming at integration, negative correlation (a.k.a. antithetic simulation) is superior to independence.
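To make the antithetic point concrete, here is a small sketch (my own illustration; the function names and the test integrand $f(u)=u^2$ with $E[f(U)]=1/3$ are my choices): pairing each uniform draw $U$ with $1-U$ makes the two evaluations negatively correlated for monotone $f$, which shrinks the variance below that of independent sampling.

```python
import random
import statistics

def mc_mean(f, n, rng):
    """Plain Monte Carlo estimate of E[f(U)], U ~ Uniform(0,1)."""
    return sum(f(rng.random()) for _ in range(n)) / n

def antithetic_mean(f, n, rng):
    """Antithetic estimate: each U is paired with 1-U, so for
    monotone f the pair f(U), f(1-U) is negatively correlated."""
    total = 0.0
    for _ in range(n // 2):
        u = rng.random()
        total += f(u) + f(1 - u)
    return total / n

f = lambda u: u * u              # E[U^2] = 1/3 under Uniform(0,1)
rng = random.Random(1)
plain = [mc_mean(f, 1000, rng) for _ in range(200)]
anti = [antithetic_mean(f, 1000, rng) for _ in range(200)]
var_plain = statistics.pvariance(plain)
var_anti = statistics.pvariance(anti)    # markedly smaller than var_plain
```

Both estimators use 1000 function evaluations per replicate, but the antithetic one has markedly smaller variance across the 200 replicates.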

About your specific comment that

...the accepted candidate draws should and will be concentrated in the high-density part of the posterior distribution. If what I understand is true, then do we still want the chain to move through the support (which includes the low-density part)?

the MCMC chain explores the target in exact proportion to its height (in its stationary regime) and so indeed spends more time in the higher-density region(s). That the chain must cross lower-density regions matters when the target has several high-density components separated by low-density regions (this is also called a multimodal setting): slow mixing may then prevent the chain from moving between modes. The only regions the chain $(X_n)$ should never visit are those with zero probability under the target distribution.
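The multimodal failure mode is easy to reproduce (a sketch of my own, not from the answer; the target, proposal scales, and function names are my choices): a random-walk Metropolis sampler with a small step size almost never crosses the low-density valley between two well-separated Gaussian modes, while a large step size jumps between them freely.

```python
import math
import random

def metropolis(logp, x0, step, n, rng):
    """Random-walk Metropolis sampler with N(0, step^2) proposals."""
    x, out = x0, []
    for _ in range(n):
        y = x + rng.gauss(0, step)
        # Accept with probability min(1, p(y)/p(x)).
        if rng.random() < math.exp(min(0.0, logp(y) - logp(x))):
            x = y
        out.append(x)
    return out

def logp(x):
    """Log-density (up to a constant) of an equal mixture of
    N(-5, 1) and N(5, 1), computed stably via log-sum-exp."""
    a = -0.5 * (x + 5) ** 2
    b = -0.5 * (x - 5) ** 2
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))

rng = random.Random(0)
narrow = metropolis(logp, -5.0, 0.5, 5000, rng)   # stuck near the -5 mode
wide = metropolis(logp, -5.0, 10.0, 5000, rng)    # visits both modes
```

The `narrow` chain converges quickly *within* the left mode but mixes slowly over the whole target; only the `wide` chain produces a sample whose two halves reflect the correct 50/50 weight of the modes.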

$\begingroup$+1 Many thanks for the comment about antithetic simulation, this is cool$\endgroup$
– Elvis Jan 2 '12 at 14:35

$\begingroup$@Xi'an (+1): this is the first clear definition of ($\alpha$-)mixing that I have found. Two questions: (1) are there other types of mixing besides $\alpha$-mixing, and (2) are there any practically usable measures? I don't see how I can compute the mixing of my chain with this supremum in the definition. Next, I see that $\alpha \to 0$ is not sufficient for convergence; are there measures of convergence?$\endgroup$
– user83346 Sep 4 '16 at 7:53

$\begingroup$There are several types of mixing like $\rho$-mixing and $\beta$-mixing. In connection with MCMC, and quoting from Wikipedia, a strictly stationary Markov process is $\beta$-mixing if and only if it is an aperiodic recurrent Harris chain.$\endgroup$
– Xi'an Sep 4 '16 at 10:34

The presumptions that motivate the desire for a rapidly mixing chain are that you care about computing time and that you want a representative sample from the posterior. The former will depend on the complexity of the problem: if you have a small/simple problem, it may not matter much whether your algorithm is efficient. The latter is very important if you are interested in posterior uncertainty or knowing the posterior mean with high precision. However, if you don't care about having a representative sample of the posterior because you are just using MCMC to do approximate optimization, this may not be very important to you.