Hamiltonian Monte Carlo explained

Dec 19, 2016 • Alex Rogozhnikov •

MCMC (Markov chain Monte Carlo) is a family of methods that are
applied in computational physics and chemistry and also widely used in bayesian machine learning.

It is used to simulate physical systems with
Gibbs canonical distribution:
$$
p(\vx) \propto \exp\left( - \frac{U(\vx)}{T} \right)
$$
Probability `$ p(\vx) $` of a system to be in the state `$ \vx $` depends on the energy of the state `$U(\vx)$` and temperature `$ T $`.

This distribution describes positions and velocities of particles in the gas, for instance.
In bayesian machine learning, it defines distribution of model parameters (such as weights of a neural network).

What is Monte Carlo (and why is it needed)?

Suppose that you want to study the properties of some model with thousands of variables
(for lattice models that's very few!).
For instance, average energy:
$$
< U > = \int U( \vx ) \, p( \vx ) d \vx
$$
is an integral over distribution.
Computing this integral isn't possible, even using numerical approximations (lattice over 1000 variables is incredibly large).

By sampling from `$ p(\vx) $` an empirical estimate can be obtained:
$$
< U > = \dfrac{1}{N} \sum_{i=1}^{N} U( \vx_i ), \qquad \vx_i \sim p ( \cdot )
$$
This is the idea of Monte Carlo: to compute average energy / speed of molecules in the gas,
take random molecules and average their energy / speed.

Integration with distribution pdf (left) is replaced with averaging over samples from distribution (blue points on the right).
Rightmost plot demonstrates true samples on the energy surface, thus we can see corresponding energy `$ U(\vx) $`.

In modern programming languages there are functions to sample from simple distributions: uniform, normal, Poisson...
There is no simple procedure to sample from Gibbs distribution: it is times more complicated, but possible using Markov Chains.

That's our goal: learn to sample from the canonical distribution.

How temperature influences canonical distribution:

unnormalized pdf `$ p(\vx) = p(x_1, x_2)$`

Energy `$ U(\vx) = U(x_1, x_2)$`

Distribution

Show true samplesfrom distribution

temperature T:

Play with a temperature and look at different distributions!

As intuition says, system has higher probability of staying in the states with lower energies.
As temperature goes down, imbalance becomes stronger.
When `$T$` tends to zero, the system stays in the state(s) with the lowest energy.

— So, we minimize energy?

Not really. Minimal-energy state is very important, but it is wrong to think that system can exist only in this state.
To get good estimates of the system properties, a representative sampling of possible states is required
(blue points demonstrate the desired distribution).

All we have is a following ratio of probabilities (which doesn't depend on `$Z$`)
$$
\dfrac{p(\vx_1)}{p(\vx_2)} = \dfrac{ \exp\left( - \frac{U(\vx_1)}{T} \right) }{ \exp\left(- \frac{U(\vx_2)}{T} \right) }
= \exp\left( \frac{U(\vx_2) - U(\vx_1)}{T} \right)
$$
and Markov chains can sample from the distribution using only this ratio. Let's see how.

Metropolis-Hastings algorithm for MCMC

The simplest Markov Chain process that can sample from the distribution
picks the neighbour of the current state and either accepts
it or rejects depending on the change in energy:

Algorithm produces a chain of states: `$ \vx_1, \vx_2, ..., \vx_n $` (green points).
Each time a candidate from a neighbourhood of the last state is selected
`$\vx_n' = \vx_n + noise $` (noise is usually taken to be gaussian with some spread `$\sigma$`).

Note, when energy is lower in new state `$ U(\vx'_n) < U(\vx_n)$` it is always accepted: `$ p = 1 $`.
This way we give preference to the states with lower energies, while not restricting the algorithm to always decrease the energy.
The lower temperature, the lower probability to increase energy.

Why does Metropolis-Hastings work?

The most interesting part is why this trivial algorithm is able to correctly sample from Gibbs distribution?

First thing to check is that single step of Metropolis-Hastings preserves canonical distribution.
Formally, if `$\vx_n \sim p(\cdot)$` is Gibbs-distributed, then on the next step
`$\vx_{n+1} \sim p(\cdot)$` is also Gibbs-distributed.

Problems of Metropolis-Hastings algorithm

MH is very simple and quite general. The only thing required by algorithm is ability to compute energy `$U(\vx)$`.
At the same time

too large step size `$ \sigma $` leads to a large fraction of rejected samples,
while too small `$ \sigma $` makes very small steps, thus it takes long time to 'explore the distribution' (check this in demonstration!)

in high-dimensional spaces (very important use-case),
MH explores the space very inefficiently because of it's random-walk behavior.
Guessing good direction in 1000 dimensions is incomparably harder than doing this for 2 dimensions!

— How can we perform joint sampling?

– Just run physics simulation and get canonical distribution?

Very close: after you initialize the system parameters `$\vx, \vv$`
and let it evolve ('slide') using physics equations
$$
\dot{x}_i = v_i, \qquad m \dot{v_i} = - \dfrac{ \partial U (\vx)}{ \partial x_i }
$$
during a long period of time, you'll not get a canonical distribution by collecting system states,
because energy `$ E $` is conserved in the system.
Physics allows only producing samples `$ (\vx, \vv) $` with the same energy `$ E(\vx, \vv) = E_0 $`,
so we need to add something to 'plain physics' to get correct sampling.

For instance, gas molecules are colliding, what changes their velocity and energy in an unpredictable way.
Similar idea can be used here, however there is a much simpler method – at some points in time velocity is resampled from `$p(\vv)$`,
thus changing the total energy
(this is how 'hit the puck in random direction' step appears).

Sampling from `$p(\vv)$` is very simple, because `$\vv$` is normally distributed.

— Is everything that simple?

Hit the puck, wait, stop it and then hit again.
This recipe sounds too simple to work, but if puck's trajectory could be computed precisely, it would be the solution.

However, the computer simulation is imprecise and this influences the result distribution.
For instance, the energy isn't conserved in the simulation.

Metropolis-Hastings rejections to compensate difference in energy between energy at the start and in the end.
This way (quite few) rejected samples appear in HMC.

Pros and cons of Hamiltonian Monte Carlo

There is no free lunch, and Hamiltonian MC has its price: HMC uses not only energy `$U(\vx)$`, but also it's gradient.
Hence possible applications are limited to the case when gradient exists and can be computed in reasonable time.

The major differences compared to Metropolis-Hastings are:

distances between successive generated points are typically large, so we need less iterations to get representative sampling

'price' of a single iteration is higher, but HMC is still significantly more efficient

Hamiltonian MC in most cases accepts new states (take a look at rejected samples in the demonstrations!)

still, HMC has problems with sampling from distributions with isolated local minimums
Investigate last distribution at low temperatures — 'puck' doesn't have enough energy to jump
from the first minimum to the second over the energy barrier.

Bonus part: Passing energy barriers in Hamiltonian MC

As you probably noticed, that at low temperatures both MH and HMC can't overcome energy barrier
(last distribution has two energy minima, separated by an energy barrier).
Once system reaches some well-isolated local minimum, it gets stuck inside
and probability to get to the different minimum becomes negligible, thus generated sample is 'incomplete'.

One empirical solution is manually restarting the generation from other start point, hoping to get stuck at the other minimum.
Another way is to develop some algorithm to jump over energy barriers. One of such approaches is called tempering during a trajectory:

Play with tempering `$\alpha$` and trajectory length to get to the other minimum point!
Can you make it jumping between minimums frequently? What if you decrease temperature?

Idea behind tempering:
the trajectory is split into two halves, which take equal time.
During the first half,
velocity is increased at each step by factor `$ \alpha $`: `$ \vv_\text{next} = \alpha \vv $`,
during the second half velocity is decreased at each step by the same factor: `$ \vv_\text{next} = \frac{1}{\alpha} \vv $`.

This trick increases the energy of system during first half of trajectory and then decreases it.

While this process is reversible and can jump over energy barriers, the energies at the beginning and at the end are quite different.
To correct this, rejection rule from Metropolis-Hastings algorithm is employed.
Rejections become very frequent at low temperatures, thus amount of 'useless' computations becomes significant.

One needs to (blindly!) guess both 'slide time' and `$\alpha$`.
An algorithm is quite sensible to both, in some cases producing too many rejections, in the other exploring the space inefficiently.
All in all: exploration is hard.