accelerating HMC by learning the leapfrog scale

In this new arXiv submission that was part of Changye Wu’s thesis [defended last week], we try to reduce the high sensitivity of the HMC algorithm to its hand-tuned parameters, namely the step size ε of the discretisation scheme, the number of steps L of the integrator, and the covariance matrix of the auxiliary variables. We calibrate the number of steps of the leapfrog integrator so as to avoid both slow-mixing chains and wasteful computation costs. We do so by learning from the No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014), which already automatically tunes both the step size and the number of leapfrog steps.

The core idea behind NUTS is to pick the step size via primal-dual averaging in a burn-in (warmup, Andrew would say) phase and to build at each iteration a proposal based on following a locally longest path on a level set of the Hamiltonian. This is achieved by a recursive algorithm that, at each call to the leapfrog integrator, requires evaluating both the gradient of the target distribution and the Hamiltonian itself. Roughly speaking, an iteration of NUTS costs twice as much as regular HMC with the same number of calls to the integrator. Our approach is to learn from NUTS the scale of the leapfrog length and use the resulting empirical distribution of the longest leapfrog path to randomly pick the value of L at each iteration of an HMC scheme. This obviously preserves the validity of the HMC algorithm.
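As a rough illustration of the last step, here is a minimal Python sketch of an HMC chain in which L is redrawn at every iteration from an empirical sample of path lengths. It is not the code from the paper: the function names, the unit mass matrix, the standard normal target and the values in `path_lengths` (standing in for what a preliminary NUTS run would produce) are all assumptions made for the example.

```python
import math
import random

def leapfrog(q, p, grad_u, eps, n_steps):
    """Run n_steps leapfrog steps with a unit mass matrix; return (q, p)."""
    q, p = list(q), list(p)
    g = grad_u(q)
    for _ in range(n_steps):
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]  # half step in momentum
        q = [qi + eps * pi for qi, pi in zip(q, p)]        # full step in position
        g = grad_u(q)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]  # half step in momentum
    return q, p

def ehmc_chain(q0, u, grad_u, eps, path_lengths, n_iter, rng):
    """HMC where the number of leapfrog steps is drawn at each iteration
    from the empirical sample `path_lengths`, independently of the state."""
    q, chain = list(q0), []
    for _ in range(n_iter):
        n_steps = rng.choice(path_lengths)
        p = [rng.gauss(0.0, 1.0) for _ in q]
        h_old = u(q) + 0.5 * sum(pi * pi for pi in p)
        q_new, p_new = leapfrog(q, p, grad_u, eps, n_steps)
        h_new = u(q_new) + 0.5 * sum(pi * pi for pi in p_new)
        # Metropolis step on the joint (position, momentum) energy
        if rng.random() < math.exp(min(h_old - h_new, 0.0)):
            q = q_new
        chain.append(list(q))
    return chain

# Toy usage: 2-d standard normal target, made-up path lengths.
rng = random.Random(1)
u = lambda q: 0.5 * sum(x * x for x in q)
grad_u = lambda q: list(q)
path_lengths = [3, 5, 5, 7, 8, 10]  # hypothetical stand-in for NUTS output
chain = ehmc_chain([0.0, 0.0], u, grad_u, 0.3, path_lengths, 4000, rng)
```

Because L is drawn without looking at the current state, each iteration is a standard HMC kernel with a fixed L, and the chain is a mixture of such kernels.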

While a theoretical comparison of the convergence performances of NUTS and this eHMC proposal seems beyond our reach, we ran a series of experiments to evaluate these performances. As criteria, we used an ESS value calibrated by the evaluation cost of the logarithm of the target density function and of its gradient, as this is usually the most costly part of the algorithms, as well as a similarly calibrated expected square jumping distance. Above is one such illustration for a stochastic volatility model, the first axis representing the targeted acceptance probability in the Metropolis step. Some of the gains in either ESS or ESJD are by a factor of ten, which relates to our argument that NUTS somewhat wastes computation effort using a uniformly distributed proposal over the candidate set, instead of being close to its end-points, which automatically reduces the distance between the current position and the proposal.
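For readers who want to see what such cost-calibrated criteria look like, here is a small Python sketch. The ESS estimator below is a deliberately crude one (autocorrelations summed until the first non-positive value); it is an assumption made for the example and is neither the mcmcse nor the Stan estimator, and the function names are hypothetical.

```python
def esjd(chain):
    """Expected squared jumping distance: mean squared Euclidean
    distance between successive states of the chain."""
    jumps = [sum((b - a) ** 2 for a, b in zip(x, y))
             for x, y in zip(chain, chain[1:])]
    return sum(jumps) / len(jumps)

def ess(xs):
    """Crude ESS of a scalar series: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    n = len(xs)
    mean = sum(xs) / n
    c0 = sum((x - mean) ** 2 for x in xs) / n
    if c0 == 0.0:
        raise ValueError("constant series: ESS is undefined")
    tau = 1.0
    for lag in range(1, n):
        c = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag)) / n
        rho = c / c0
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau

def min_ess_per_gradient(chain, n_grad_evals):
    """Smallest per-dimension ESS divided by the number of gradient
    evaluations spent producing the chain."""
    dims = range(len(chain[0]))
    return min(ess([s[d] for s in chain]) for d in dims) / n_grad_evals

def esjd_per_gradient(chain, n_grad_evals):
    """ESJD normalised by the same computational cost."""
    return esjd(chain) / n_grad_evals
```

Taking the minimum over dimensions is one possible summary; the mean, median or maximum can be substituted in the same way.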

9 Responses to “accelerating HMC by learning the leapfrog scale”

1. The log density is essentially free once you’re computing the gradient [you need to carry it forward to compute all the partials anyway], so the cost per leapfrog step is the same in static HMC and NUTS. Stan does reuse the log density values across leapfrog steps, then it recomputes the log density using double values when producing the generated quantities block, but that’s much cheaper than doing the autodiff, so it’s a fraction of a gradient evaluation.

2. The sampling isn’t uniform over the trajectory, but weighted by the log density and biased toward the last doubling. And it does now use Michael Betancourt’s extension to multinomial sampling rather than slice sampling. This also changes how acceptance rates are calculated for warmup. [Aside 1: Andrew prefers warmup because it’s a better electronic analogy—we don’t run for a long time to burn in and see if a part fails, we run to warm it up into its ready-to-go state.] [Aside 2: Michael also updated the warmup stage relative to the original NUTS paper.]

3. The implementation does stop evaluating the trajectory when it hits a divergence. It also stops evaluating when the last doubling has an internal U-turn and it rejects the entire last doubling to maintain reversibility. Stan now outputs the actual number of leapfrog steps evaluated per iteration.

He’s also right that rejections need to be saved and counted to maintain detailed balance, but it looks like that’s been addressed. I don’t understand Changye’s response in Xi’an’s reply above.

I’d be more convinced this is something we should add to Stan if three things were done to the evaluation. First, the diagonal metric adaptation should be used to make sure all the computation isn’t dominated by the choice of a poorly matched unit metric. It shouldn’t be hard as the output of adaptation is available after warmup from Stan.

Second, I’d like to see multiple chains run from diffuse starting points to monitor convergence and I’d like to use Stan’s multi-chain ESS calculations (I didn’t know where the ess() function being used came from). I think these comparisons only make sense when conditioned on convergence.

Finally, I’d like to see the models brought up to our best practices. This may be done anyway to make sure convergence monitoring doesn’t fail. For example, the IRT model is coded with a centered parameterization, which we know won’t even converge properly in even moderately high dimensions unless there’s a lot of data per group (or more technically if the posterior is tightly constrained—it can come from tight priors, too). Also, there isn’t a prior on the coefficients in the Bernoulli-logit example, which is super dangerous given issues of separability, etc., and can also lead to convergence issues. The stochastic volatility model should probably be reparameterized, too—as is, it introduces a lot of posterior correlation; I also don’t understand why the priors (?) are coded as they are.

I’m also worried about the low acceptance rate target. Stan’s default is 80% and that’s low for hard problems in that it leads to too many divergences for effective sampling (divergences can cause HMC to revert to a random walk or can cause it to freeze in some location of the posterior if they’re too bad—the right thing to do asymptotically for detailed balance). We want to get to the point where there are zero divergences to ensure we’re not biasing the posterior.

I’m not sure about all the commented code in things like the SIR model. Given that you do your own transforms without Jacobians, those priors wouldn’t be right as they are inside the comments. But then I see target += statements at the end which seem to include both the Jacobian and some expression for a prior. What’s the motivation for coding things that way?

Why not use minimum effective sample size per gradient eval? I don’t understand the normalization to use just ESS for comparison.

It’d also be nice to vectorize everything, but that’s just an issue of absolute time to make your lives easier (as in these things should run twice as fast or faster if vectorized unless the data sizes are very small). And we have a lot of functions to improve arithmetic stability vs. the handwritten stuff included in the model code.

P.S. Where was the ess() function coming from? Reading R is so confusing because nobody uses namespace qualifiers! Stan’s ESS calculations are much more conservative as they discount for non-convergence. The new ones Aki Vehtari wrote that can deal with anti-correlated draws have been better tested for calibration, so I’d suggest using the latest version (which may only be available in the GitHub version of RStan; we’re having issues getting 2.18 up on CRAN, sorry about that).

P.P.S. It would’ve helped me if all that R wasn’t cut-and-pasted, but rather sourced from one file and reused. It’s impossible to maintain something written this way. (I’m commenting on mattmgraham’s repo here in case his fork and fix differ significantly as far as that goes.)

Thanks for the comments and the evaluation, Bob! We agree that counting U and ∇ U as twice as expensive as ∇ U was not fair, if only because of the dimension of ∇ U. We are going to rerun some of the comparisons in light of your comments, including using a scaled distance matrix. Just pointing out here that, as you may have seen, the part on NUTS is directly off-the-shelf Stan, while using the Stan version of the ESS [why are there different versions of the ESS?!] for our R part did not look as straightforward as using ess from the mcmcse package. And commenting that the R code Changye designed was put on GitHub for open access and reproducibility, not for distribution purposes, hence the raw and undocumented format. Last thing, methinks the prior choices should not be an issue when discussing speed and performances, but a possible confusion stems from our reparameterisation of every constrained parameter into an unconstrained real parameter. Anyway, stay tuned for the incoming revised version!

I have a few comments about the experiments and evaluation. Apologies in advance if I’ve made any errors in what I say below.

It is argued in the paper that each NUTS integration step is twice as costly as in standard HMC as it requires evaluating both the potential energy (negative log target density) and its gradient at each step rather than just the potential energy gradient. This I think is not generally true as in most applications of HMC the potential energy gradient will be calculated using reverse-mode automatic differentiation which will generally require the original function (here the potential energy) to be calculated in a forward pass before the gradients are then calculated in the backwards pass. Each gradient evaluation therefore also gives the original function value for ‘free’ – for example this seems to be the case in Stan from the docstring of the Stan Math library `gradient` function (https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/functor/gradient.hpp).

Even if the gradients are manually coded, there will typically be a lot of shared computation between the original function and gradient calculations, so even with a more optimised implementation, evaluating the value and gradient of a (scalar) function together is unlikely to be twice as costly as evaluating just the gradient. For example in the logistic regression example in the paper the potential energy function and gradient are defined
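To illustrate the kind of shared computation meant here, the following is a generic Python sketch for a logistic regression potential energy without a prior term. It is not the code from the paper or its repository; the function name and the toy data are assumptions made for the example.

```python
import math

def potential_and_grad(beta, xs, ys):
    """Return (U(beta), grad U(beta)) for the negative Bernoulli-logit
    log-likelihood, computing the linear predictor only once per row."""
    u = 0.0
    grad = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        eta = sum(b * xj for b, xj in zip(beta, x))      # shared by value and gradient
        # log(1 + exp(eta)) written in a numerically stable form
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        u += softplus - y * eta
        prob = math.exp(eta - softplus)                  # sigmoid(eta), reusing softplus
        for j, xj in enumerate(x):
            grad[j] += (prob - y) * xj
    return u, grad

# Toy data: intercept plus one covariate (made-up values).
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
ys = [1, 0, 1]
value, gradient = potential_and_grad([0.3, -0.2], xs, ys)
```

The dot products dominate the cost and are needed for the gradient anyway, so returning the value alongside the gradient adds only a handful of scalar operations per observation.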

Further, in the evaluations of run-time it is assumed that the cost of each NUTS transition is `[number of leapfrog steps] * 2 + 1` and that of a standard HMC transition is `[number of leapfrog steps] + 2`. I'm assuming the additive constants account for the potential energy evaluations at the initial / final states for the Metropolis accept step. In reality, however, we have always already calculated the gradient at the initial and final states in the process of running the integrator (or will calculate it in the next iteration), and we can evaluate both gradient and value together for the same cost as just the gradient. So if we always cache the potential energy values when we calculate the gradient (which Stan does, I think), then the number of additional potential energy and gradient evaluations per transition for both NUTS and standard HMC is just the number of leapfrog steps.
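The two accounting conventions can be put side by side in a few lines of Python. The function names are hypothetical; the formulas are the ones quoted above, with "cached" meaning that value and gradient are obtained together and stored.

```python
def cost_paper_nuts(n_leapfrog):
    """Cost of one NUTS transition as assumed in the paper's evaluation."""
    return 2 * n_leapfrog + 1

def cost_paper_hmc(n_leapfrog):
    """Cost of one standard HMC transition as assumed in the paper's evaluation."""
    return n_leapfrog + 2

def cost_cached(n_leapfrog):
    """Cost of one transition (NUTS or HMC) when the potential energy comes
    for free with each gradient and is cached between transitions."""
    return n_leapfrog

# For a transition with 10 leapfrog steps:
n = 10
ratio_paper = cost_paper_nuts(n) / cost_paper_hmc(n)   # 21 / 12 = 1.75
ratio_cached = cost_cached(n) / cost_cached(n)         # 1.0
```

Under the paper's convention a NUTS transition with the same number of leapfrog steps is charged close to twice as much as an HMC one, whereas under the cached convention the two are charged equally.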

On a separate note, the implementations of the eHMC* methods in the accompanying code (https://github.com/wcythh/eHMC) have a potentially incorrect treatment of the HMC transitions which produce `NaN` values for the Metropolis acceptance ratio. Rather than rejecting in these cases, it appears that the samplers 'retry': the iteration counter `i` is not incremented and the sampled state (i.e. equal to the previous value) is not added to the chain. I'm not sure this implementation defines a valid MCMC algorithm (as we are 'downweighting' the states at which such rejections should occur, though I may be wrong here). More importantly from the evaluation perspective, these failed transitions are also not included in the computational cost. As implemented, the integrator still runs the full number of leapfrog steps even if a divergence in an earlier step causes `NaN` values to be generated, so the computed costs do not reflect the actual number of gradient evaluations required. In Stan and other NUTS implementations such as PyMC3, such integrator divergences lead to early termination of the trajectory building and so save on the cost of continuing to integrate once we know we will reject. This could be implemented similarly for the eHMC methods, but even so the cost of such transitions would still be non-zero (such shortened trajectories are reflected in the Stan `n_leapfrog__` statistics used to calculate the NUTS computational cost in the code).
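A minimal Python sketch of the treatment suggested here, for a one-dimensional target: a divergence ends the trajectory early, counts as a rejection, repeats the current state in the chain, and its gradient evaluations are still charged. This is an illustration under assumptions (hypothetical function names, a quartic toy potential, the initial gradient taken as cached and so not counted), not the eHMC or Stan code.

```python
import math
import random

def hmc_transition(q, u, grad_u, eps, n_steps, rng):
    """One 1-d HMC transition; returns (next_state, n_grad_evals)."""
    p = rng.gauss(0.0, 1.0)
    h_old = u(q) + 0.5 * p * p
    q_new, p_new, n_grad = q, p, 0
    g = grad_u(q_new)  # assumed cached from the previous transition, not counted
    for _ in range(n_steps):
        p_new -= 0.5 * eps * g
        q_new += eps * p_new
        g = grad_u(q_new)
        n_grad += 1
        p_new -= 0.5 * eps * g
        if not (math.isfinite(q_new) and math.isfinite(p_new)):
            return q, n_grad  # divergence: stop early, reject, keep the cost
    log_ratio = h_old - (u(q_new) + 0.5 * p_new * p_new)
    if math.isnan(log_ratio):
        return q, n_grad      # NaN acceptance ratio is treated as a rejection
    accept = rng.random() < math.exp(min(0.0, log_ratio))
    return (q_new if accept else q), n_grad

def run_chain(q0, u, grad_u, eps, n_steps, n_iter, seed=0):
    """Every iteration appends a state and adds its cost, rejected or not."""
    rng = random.Random(seed)
    q, chain, total_cost = q0, [], 0
    for _ in range(n_iter):
        q, n_grad = hmc_transition(q, u, grad_u, eps, n_steps, rng)
        chain.append(q)
        total_cost += n_grad
    return chain, total_cost

# Toy potential U(q) = q^4 / 4, written with products to avoid overflow errors.
quartic_u = lambda q: 0.25 * q * q * q * q
quartic_grad = lambda q: q * q * q

# A step size this large makes the integrator blow up from q0 = 3.
stuck_chain, stuck_cost = run_chain(3.0, quartic_u, quartic_grad, 1.0, 20, 5)
```

In this toy run every transition diverges, so the chain repeats its starting value while the cost keeps accumulating, which is exactly the behaviour a 'retry' scheme would hide.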

Running the adjusted code there for a single target acceptance rate of 0.75 and for just 4 independent repeats (running the original script with 40 repeats in parallel wasn't feasible on my laptop due to requiring too much memory!) I get the following cost-normalised minimum ESS estimates (mean over 4 runs +/- standard error) for the four methods:

The German-credit `.Rdata` files used in the original script were not available in the repository so I created new versions (with the feature normalisation specified in the paper) using the Python script `download_and_preprocess_data.py` I added to the `GermanCredit` directory in the same GitHub fork. The data file for the fixed chain start state used in the script also was not available so I used the standard Stan random initialisation.

Although this is just for a single target accept rate and the standard errors are quite high due to the small number of repeats, this suggests a much smaller difference in performance between NUTS and the eHMC* methods than shown in the paper for this Bayesian logistic regression case at least.

In the conclusion it is claimed that NUTS samples uniformly from the trajectory of states generated, which is suggested to be part of the reason for its relatively poorer performance. Although this is the case for the 'naive' implementation of NUTS given in Algorithm 2 in Hoffman and Gelman (2014), which is used to motivate the initial discussion of the algorithm, in practice a more efficient implementation detailed in Algorithm 3 is used. Rather than sampling *independently* from the uniform distribution on the set of candidate states from the trajectory which are within the slice, this variant instead uses a Markov transition kernel which leaves this uniform distribution invariant while favouring moves towards states near to the end-points of the trajectory. The NUTS implementation used in Stan (and I think PyMC3) actually no longer uses a slice sampling step but instead uses the multinomial / Rao-Blackwellised version described by Michael Betancourt in 'A conceptual introduction to Hamiltonian Monte Carlo' (https://arxiv.org/abs/1701.02434) and, equivalently to the slice-sampling case, uses an efficient 'progressive' sampling implementation which leaves the multinomial distribution over the candidate states invariant while favouring moves closer to the trajectory end-points.
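A small Python sketch of the two ingredients mentioned here, as I understand them from Betancourt's description: multinomial selection of a trajectory state with weights exp(-H), and a biased progressive update when a doubling adds a new subtree. The function names are hypothetical and this is a simplified illustration, not Stan's or PyMC3's implementation.

```python
import math
import random

def multinomial_pick(energies, rng):
    """Pick a trajectory index with probability proportional to exp(-H_i).
    Energies are shifted by their minimum before exponentiating for stability."""
    h_min = min(energies)
    weights = [math.exp(h_min - h) for h in energies]
    return rng.choices(range(len(energies)), weights=weights, k=1)[0]

def biased_progressive_pick(old_sample, old_weight, new_sample, new_weight, rng):
    """After a doubling, move to the sample drawn from the new subtree with
    probability min(1, w_new / w_old), otherwise keep the old sample.
    The weights are the summed exp(-H) over each subtree."""
    if rng.random() < min(1.0, new_weight / old_weight):
        return new_sample
    return old_sample

rng = random.Random(0)
# With nearly equal energies the multinomial pick is close to uniform.
picks = [multinomial_pick([1.00, 1.01, 0.99, 1.00], rng) for _ in range(2000)]
# A new subtree carrying at least as much weight is always moved into.
moved = biased_progressive_pick("old", 1.0, "new", 2.0, rng)
```

Since each doubling adds a subtree at one end of the trajectory and later doublings are the larger ones, preferring the new subtree is what pushes the selected state away from the starting point.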

and the result is different from yours. I tested the programme twice. I hope that you can rerun it to check the result. Note that apply(R1_4,1,mean); apply(R1_4,1,sd) does not seem correct and gives a similar result to yours. For the above,

My apologies Changye and Christian, you were completely correct to get me to recheck the figures: I misunderstood the ordering of the results when computing the summary statistics, as I had both 4 independent runs and 4 different methods, so the figures I provided were incorrectly averaging over the method dimension, not the runs. Sorry for this silly mistake. It is not really a valid excuse, but I’ve rarely used R before so I am a bit unfamiliar with the data structures! Computing the summaries using the code you provided on a new run of 10 repeats this morning I get minimum ESS figures very close to those in your comment:

Once the adjusted computational cost calculations I proposed in my code are accounted for, as you say, the ~1.5 to 2 times improvement in these figures is concordant with the ~3-4 times improvement in the corresponding values in figure 2 in your paper.

It’s interesting that the relative ordering of the performance of the eHMC* methods is quite different when comparing on the mean ESS (and similarly for the median / max):

NUTS (mean, median, max ESS):

0.1379 +/- 0.0009, 0.1482 +/- 0.0022, 0.2189 +/- 0.0050

eHMCq (mean, median, max ESS):

0.1348 +/- 0.0025, 0.1334 +/- 0.0029, 0.1837 +/- 0.0054

eHMC (mean, median, max ESS):

0.1727 +/- 0.0023, 0.1697 +/- 0.0024, 0.2664 +/- 0.0097

eHMCu (mean, median, max ESS):

0.1825 +/- 0.0013, 0.1879 +/- 0.0027, 0.2759 +/- 0.0068

From this it seems that the relatively poorer performance of eHMCu is due to lower ESSs in just a few components / dimensions. It seems that this might also explain some of the relatively poorer performance of NUTS as the performance improvements of the eHMC* methods are also a bit lower when comparing on the statistics other than the minimum, and intuitively at least eHMCu seems likely to give the most similar distribution of integration times to NUTS.

This might partly be explained by the use of an identity mass matrix as this means there is no adjustment for different relative scaling along the different dimensions, and so the lower minimum ESS for NUTS / eHMCu may be due to using integration times smaller than required for efficiently exploring the dimension(s) with the largest scale while still exploring the smaller scale dimensions well (as we may end up doing multiple traverses of these dimensions for even a partial traverse of the larger dimensions). It would be interesting to see if / how for example using an adaptively tuned diagonal mass matrix changes things.
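The scaling argument can be checked on a toy case with a short Python sketch: a leapfrog integrator with a diagonal mass matrix applied to a Gaussian target whose two dimensions have standard deviations 1 and 10. The function name, the target and the choice of masses equal to the inverse variances are assumptions made for the illustration.

```python
def leapfrog_diag(q, p, grad_u, eps, n_steps, mass):
    """Leapfrog integrator with a diagonal mass matrix given as a list."""
    q, p = list(q), list(p)
    g = grad_u(q)
    for _ in range(n_steps):
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]
        q = [qi + eps * pi / mi for qi, pi, mi in zip(q, p, mass)]
        g = grad_u(q)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]
    return q, p

sigmas = [1.0, 10.0]
grad_u = lambda q: [qi / (s * s) for qi, s in zip(q, sigmas)]

q0, p0 = [1.0, 10.0], [0.0, 0.0]  # start one standard deviation out in each dimension
# Integration time of about pi: half a period for a unit-frequency oscillator.
q_identity, _ = leapfrog_diag(q0, p0, grad_u, 0.01, 314, [1.0, 1.0])
q_tuned, _ = leapfrog_diag(q0, p0, grad_u, 0.01, 314, [1.0 / (s * s) for s in sigmas])
```

With the identity mass matrix the first coordinate has crossed to the opposite side while the second has barely moved (about 9.5 instead of 10), which is the partial traverse described above; with masses matched to the scales both coordinates complete the same half oscillation.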

As follow up to my previous comment I re-ran the same logistic regression experiment with an adaptively tuned diagonal mass matrix in the Stan NUTS runs and used this same diagonal mass matrix in the eHMC runs. Averaging over 10 independent runs gives the following ESS statistics (first table means, second table standard errors of mean)

The range of the per-dimension ESS estimates for the eHMCu runs has decreased from [0.083, 0.2759] to [0.1090, 0.2331] and similarly for NUTS from [0.056, 0.2189] to [0.0595, 0.1097]. Interestingly, however, the performance improvement is more significant for the eHMC method, which now gives a 2.5-fold gain in efficiency over NUTS.

Here are further points from Changye: (1) About the “Retry” triggered by reaching “NaN” (divergence) in the leapfrog steps: as far as we can tell, such divergence events come from large step sizes, which make the leapfrog integrator unstable. While this could induce a resampling step instead, in practice, if such divergence warnings occur too often, the resulting samples will not be reliable and users need to switch to a smaller step size. Since we keep the fraction of “Retry” to a minimum, the corresponding computation cost of “Retry” is negligible. And it should not impact the stationary distribution in fine. When compared with the first two examples, the values of the targeted accept probabilities are larger than 0.45 in the last three examples, values chosen towards preventing too many “Retry”, with a negligible added cost. In our experiments, “Retry” is used to smooth out the code.

(2) Stan uses multinomial sampling, instead of slice sampling, and biased progressive sampling (if our understanding is correct), towards favouring candidates close to the endpoints. Thus, Stan favours large ESJD (and maybe ESS) and shows better performances than the original efficient NUTS in Hoffman and Gelman (2014). However, even though these techniques alleviate the presence of autocorrelation in the samples, they cannot ensure that the current position and the proposal correspond to the endpoints. According to our experiments, NUTS still wastes more computation cost than eHMC*.