Abstract

Algorithms for simulating complex physical systems or solving difficult optimization problems often resort to an annealing process. Rather than simulating the system at the temperature of interest, an annealing algorithm starts at a temperature that is high enough to ensure ergodicity and gradually decreases it until the destination temperature is reached. This idea is used in popular algorithms such as parallel tempering and simulated annealing. A general problem with annealing methods is that they require a temperature schedule. Choosing well-balanced temperature schedules can be tedious and time-consuming. Imbalanced schedules can have a negative impact on the convergence, runtime and success of annealing algorithms. This article outlines a unifying framework, ensemble annealing, that combines ideas from simulated annealing, histogram reweighting and nested sampling with concepts in thermodynamic control. Ensemble annealing simultaneously simulates a physical system and estimates its density of states. The temperatures are lowered not according to a prefixed schedule but adaptively so as to maintain a constant relative entropy between successive ensembles. After each step on the temperature ladder an estimate of the density of states is updated and a new temperature is chosen. Ensemble annealing is highly practical and broadly applicable. This is illustrated for various systems including Ising, Potts, and protein models.

pacs:

Simulational science often involves the generation of configurations from high-dimensional probability distributions as well as the computation of ensemble averages and normalization constants. Numerous applications in statistical physics, biomolecular simulation and Bayesian inference illustrate the ubiquitous need for efficient sampling methods. Challenges are posed by the complexity of the system, its shear size, slow convergence and non-ergodicity.

To address these challenges, algorithms that work with modified versions of the system have been proposed. One idea is to simulate the system at multiple temperatures and utilize the enhanced flexibility at higher temperatures to avoid local free-energy minima at lower temperatures. This idea is the basis of sampling algorithms such as replica-exchange Monte Carlo Swendsen and Wang (1986) and simulated tempering Marinari and Parisi (1992) but also used in popular optimization algorithms such as simulated annealing Kirkpatrick et al. (1983).

Parallel tempering (PT) Geyer (1991), for example, considers a family of canonical ensembles at different temperatures. The ensembles are simulated independently and occasional swaps of configurations between ensembles at nearby temperatures allow the simulation to escape from metastable states. From a PT simulation thermodynamic quantities such as free energies and heat capacities can then be computed with high accuracy. But the success and convergence of a PT run depends critically on the choice of the temperature schedule. To choose a good temperature schedule can be highly non-trivial, especially for systems undergoing phase transitions. A well-balanced schedule entails overlap between ensembles at neighboring temperatures. This means that we have to use more and more replicas with increasing system size because energy is extensive Predescu et al. (2005). Moreover, PT explores temperature space on a fixed ladder. If we want to use multiple temperatures or control parameters as in multi-dimensional PT Sugita et al. (2000) we are suffering from the curse of dimensionality. Another source of inefficiency is the fact that configurations at high temperatures are constantly being produced but no longer needed once the simulation has converged.

Multi-canonical sampling algorithms Berg and Neuhaus (1992) are a powerful alternative to annealing methods. Rather than utilizing a temperature parameter to modify the system, multi-canonical algorithms draw configurations from an ensemble whose weight is inversely proportional to the density of states (DOS) of the system, such that ideally the energy histogram will be constant. However, this requires that we know the DOS before the actual simulation, which is rarely the case.

The Wang-Landau (WL) algorithm Wang and Landau (2001) is an ingenious variant of multi-canonical sampling that sidesteps this problem. The unknown density of states is estimated in the course of a WL run, configurational samples are generated as a by-product. The fact that the correct DOS should produce a flat energy histogram can be used to monitor the convergence of the method. By gradually decreasing the learning rate, the simulation is stabilized and converges.

WL sampling has originally been developed for discrete systems Wang and Landau (2001). Its direct extension to large or continuous systems requires choosing an energy range and binning. But there might be forbidden energy levels that cannot be visited, in which case the corresponding bins remain empty and the energy histogram will never be flat. These problems are aggravated for multi-dimensional DOS over more than one macrovariable because the number of bins grows exponentially in the number of macrovariables. In that case flatness of the energy histogram ceases to be a useful convergence criterion and must be replaced by other criteria Zhou et al. (2006). To apply these modifications in practice remains a challenge and involves parameter tweaking.

This article proposes an algorithm, ensemble annealing, that solves these issues and produces samples and an estimate of the DOS. Ensemble annealing is inspired by the nested sampling method for Bayesian computation Skilling (2006) and can be viewed as a generalization of nested sampling to the canonical or other ensembles. The algorithm applies both to discrete and continuous systems. In contrast to simulated annealing or parallel tempering, ensemble annealing constructs an optimal temperature protocol adaptively and has only few algorithmic parameters.

Ensemble annealing is a sequential algorithm that steps through iterations denoted by k. N non-interacting particles or walkers are employed to explore a series of ensembles typically starting in a high temperature ensemble, then cycling through ensembles at lower and lower temperatures, until the destination ensemble is reached. For each ensemble, the walkers produce configurations xkn where n=1,…,N. In contrast to other annealing and tempering methods only the start and final ensemble have to be chosen. The intermediate ensembles are found during the simulation by placing them such that a constant overlap between successive ensembles is maintained. To implement this approach, we need to agree on various concepts, mainly what kind of ensembles will be considered, and how to measure distances between ensembles.

ii.1 Ensembles

Let us denote the target ensemble from which we aim to generate configurations x by p(x). Often p(x)=π(x)e−E(x) with energy E(x)=−lnp(x). In Bayesian inference, for example, π(x) denotes the prior distribution and E(x) corresponds to negative log likelihood. In a physical simulation of particles in a box, π(x) will be uniform over the box and E will be the interaction energy between all particles. Note that in practice both π(x) and p(x) are often unnormalized.

To draw configurations from p(x) we consider a series of ensembles

pk(x)=1ckqk[E(x)]π(x)

(1)

where ck=∫qk[E(x)]π(x)dx normalizes the k-th ensemble. Here, we assume that ensemble pk depends on the configuration only through the macrovariable E(x), but the method works also for more general ensembles. Typically qk(E)=q(E;λk) where q(E;λ) is a parameterized family and λ a protocol parameter. The distributions qk(E)≥0 are intermediate helper or bridging distributions. In case of the canonical ensemble, configurations are weighted by the Boltzmann factor

qk(E)=exp{−βkE}

(2)

where βk is the inverse temperature and ck is the partition function.

Obviously the canonical ensemble is a widespread choice in annealing methods, but it might also be worthwhile to consider other ensembles. For example, in parallel tempering the use of the Tsallis ensemble

qk(E)=1{1+β(αk−1)(E−Emin)}1/(αk−1)

(3)

with control parameter αk≥11, inverse temperature β and minimum energy Emin≤E(x) has been proposed Hansmann and Okamoto (1997). A multi-parameter combination of the Boltzmann and Tsallis ensemble is used in complex Bayesian data analyses Habeck et al. (2005) to independently control the prior density and the likelihood function.

Another ensemble that is of potential interest is the Fermi distribution

qk(E)=11+exp{βk(E−ϵk)}

(4)

which has two control parameters: the inverse temperature βk and an energy cutoff ϵk. The zero-temperature Fermi ensemble approaches a stepfunction, i.e. configurations with energies greater than ϵk are assigned zero probability:

qk(E)=Θ(ϵk−E)

(5)

where Θ(⋅) is the Heaviside step function. This ensemble is used in the nested sampling method for Bayesian computation Skilling (2006) and also related to the microcanonical ensemble Ray (1991); Martin-Mayor (2007):

qk(E)=Θ(ϵk−E)(ϵk−E)d/2−1

(6)

where d is the dimension of configuration space (number of configurational degrees of freedom) and ϵk the total energy of the system (potential plus kinetic energy).

Note that the target ensemble p(x) does not necessarily need to be a member of the bridging family, i.e. there might be no pk(x) such that p(x)∝pk(x), which is the case, for example, in nested sampling and the microcanonical ensemble.

The density of states (DOS) over the prior or reference distribution π(x) is defined as

g(E)=∫δ(E−E(x))π(x)dx

(7)

with δ(⋅) denoting the delta function. With the help of the DOS it is straightforward to compute how the energies are distributed in the intermediate ensembles:

pk(E)=∫δ(E−E(x))pk(x)dx=1ckg(E)qk(E)

(8)

where pk(E) is a one-dimensional distribution.

ii.2 Relative entropy

When choosing the intermediate distributions that bridge between the initial and final ensemble, it is essential to control the “distance” or overlap between successive ensembles pk(x) and pk+1(x). We use the Kullback-Leibler (KL) divergence Kullback and Leibler (1951) or relative entropy

D(p∥q)=∫p(x)ln[p(x)/q(x)]dx

(9)

for this purpose. The relative entropy satisfies the Gibbs inequality D(p∥q)≥0 with equality only if p and q are identical. Therefore the Kullback-Leibler divergence qualifies as an “entropic distance” between ensembles p and q. However in contrast to a true distance the KL divergence is not symmetric under interexchange of p and q. It is only well-defined if q is “broader” than p, i.e. if the support of p is contained in the support of q, and therefore a directed divergence. In information theory, the KL divergence is used to quantify information gain.

Let us now consider the relative entropy between two members pk and pl of the family of bridging distributions [Eq. (1)]. With the help of the DOS we can reduce the high-dimensional configurational integral [Eq. (9)] to a one-dimensional integral over the energies:

D(pk∥pl)

=

∫pk(E)ln{qk(E)clql(E)ck}dE

(10)

=

⟨ln(qk/ql)⟩k−ln(ck/cl)

where ⟨⋅⟩k denotes an average over the k-th ensemble pk. For the canonical ensemble [Eq. (2)] the relative entropy reduces to

D(pk∥pl)=(βl−βk)⟨E⟩βk+βkF(βk)−βlF(βl)

(11)

where F(β)=−β−1logc(β) is the free energy at inverse temperature β.

Throughout this article, we will use the relative entropy to measure the distance between ensembles pk and pl. Other measures that quantify the overlap between different ensembles might also be useful. For example, we could use the exchange rate of a parallel tempering simulation

R(βk,βl)

=

1ckcl∫min{qk(E1)ql(E2),qk(E2)ql(E1)}

(12)

×g(E1)g(E2)dE1dE2

as a measure to compare ensembles. The Jensen-Shannon divergence Lin (1991), a symmetrized version of the relative entropy, has been used in thermodynamic control Crooks (2007). The Hellinger distance Gelman and Meng (1998) is a widespread distance used mainly in statistics and may also provide a useful measure for comparing ensembles. In this article, however, we have not explored measures for comparing ensembles other than the relative entropy.

Given a continuous bridging family, the optimal annealing protocol would involve infinitely many steps (adiabatic annealing). We want to reach the target ensemble in finitely many steps but produce intermediate ensembles that have a fixed and finite relative entropy D. We will later see that for small D this amounts to cooling with constant thermodynamic speed. As we move from ensemble pk to the next ensemble pk+1 we need to evaluate their relative entropy D(pk+1∥pk). Equation (10) shows that this involves the computation of ensemble averages as well as the estimation of free energy differences. These are challenging computational problems, which can be solved by the methods outlined in the next subsection.

ii.3 Estimation of the relative entropy

Because the relative entropy [Eq. (10)] both involves the normalization constants ck, cl as well as an ensemble average, it is computationally challenging to evaluate D(pk∥pl) accurately. However, if we know the density of states g(E), the configurational integrals can be reduced to low-dimensional integrals. Therefore, ensemble annealing estimates g(E) during the course of the simulation, similar to the Wang-Landau method Wang and Landau (2001) or nested sampling Skilling (2006). The estimation of the DOS relies on histogram methods Ferrenberg and Swendsen (1989); Habeck (2012a).

If we work with N non-interacting walkers at the k-th iteration, the configurations are denoted by xkn (i.e. the first index indicates the ensemble, whereas the second index enumerates the walkers). At each ensemble annealing iteration k, a non-parametric estimate of the DOS

g(k)(E)=N∑n=1k−1∑k′=0g(k)k′nδ(E−Ek′n)

(13)

is updated where Ekn=E(xkn) are the energies of the visited configurations. The discrete DOS g(k)ln assigns a weight to every configuration xln that has been generated by the walkers during the entire simulation up to the current ensemble pk. That is, the vector of all weights expands in each iteration and is constantly updated (which is indicated by the superscript).

With the help of the estimated DOS it is straightforward to compute the relative entropy between two ensembles pk and pl:

D(pk∥pl)≈k−1∑k′=0N∑n=1g(k)k′nqk(Ek′n)cklnqk(Ek′n)ql(Ek′n)−ln(ck/cl)

(14)

where

ck≈k−1∑k′=0N∑n=1g(k)k′nqk(Ek′n).

(15)

These relations are used in histogram methods for estimating free energy differences Habeck (2012b, a). The weights are obtained using the histogram iterations

g(k)k′n∝(k−1∑l=0ql(Ek′n)/cl)−1

(16)

in which each update of the weights g(k)k′n is followed by their normalization and a re-evaluation of the partition functions ck′ according to Eq. (15). We start the iteration from the previous DOS estimate (setting the weights of the new states xkn to zero), which speeds up the convergence of the histogram iterations.

ii.4 Initialization and equilibration of ensembles

The estimated DOS serves two purposes: First, to estimate the relative entropy between two ensembles reliably; second, to initialize the walkers to sample the next ensemble by recycling configurations that have been generated previously, which are then equilibrated in the new ensemble. In the k-th ensemble annealing iteration, ensemble pk(x) is approximated by

pk(x)≈1ckk−1∑k′=0N∑n=1g(k)k′nqk(Ek′n)δ(x−xk′n).

(17)

We use this approximation to generate N initial states for the walkers by the following scheme: First, we draw an energy level according to the probability g(k)k′nqk(Ek′n)/ck. Second, we randomly pick one among all configurations that map to the energy drawn in the first step. In continuous systems, it is very unlikely that two configurations were generated that have exactly matching energies. However, in discrete systems such as the two-dimensional Ising model there are only finitely many energy levels. In this case, we can speed up the DOS estimation [Eqs. (15) and (16)] by working with histograms as explained in Habeck (2012a). Due to the limitation of the approximation (17), the N recycled states need to be equilibrated in the correct ensemble [Eq. (1)] pk(x)∝π(x)qk(E(x)) using Monte Carlo or molecular dynamics simulations.

Figure 1: (Color online) Ensemble annealing of a one-dimensional particle in a Schwefel potential E(x)=−xsin(√|x|) (shown on the right) using 100 walkers. Every stripe marked by a dashed boundary shows the configurations of the walkers after the equilibration step. A random number has been added to the x-coordinates (iteration index) for better visualization. All Particles within each stripe experience the same temperature during equilibration.

ii.5 Algorithm

We have now all tools at hand to formulate the ensemble annealing algorithm. Ensemble annealing is an adaptive sequential Monte Carlo algorithm. The main parameters are the number of walkers N and the relative entropy D between successive ensembles pk and pk+1. Choosing ensembles with a constant relative entropy ensures that the annealing process proceeds at a constant thermodynamic speed. Iteration k comprises the following steps:

Initialization: Using the current estimate of the DOS g(k)(E) [Eq. (13)], the particles are initialized by drawing N energies E′kn from qk(E) [Eqs. (8) and (13)] and finding the corresponding configurations x′kn by a simple lookup in the energy table such that E(x′kn)=E′kn. Because g(k) is only an approximation, the initial states will not be equilibrated.

Equilibration: The states are equilibrated in the new ensemble pk(x) by running Monte Carlo or molecular dynamics simulations starting from x′kn and producing new states xkn. The new configurations and energies Ekn are added to the pool of all states visited so far.

DOS estimation: A new estimate of the DOS, g(k+1), is computed from all energies and temperatures using non-parametric histogram reweighting Habeck (2012a). To speed up the convergence, the previous DOS estimate is used to initialize the iterations.

Annealing: The next ensemble pk+1 is adjusted such that it has a desired relative entropy D with respect to the current ensemble pk, i.e. pk+1 satisfies D(pk+1∥pk)=D.

The algorithm has only few parameters, namely the initial and final ensemble, the number of walkers N and the target relative entropy D between successive ensembles. Evidently, D determines the cooling or compression rate. For smaller D the overlap between successive ensembles is larger and the annealing progresses more slowly. If we allow D to be large, we anneal faster but risk to fail to equilibrate.

ii.6 Application to the harmonic oscillator

Let us illustrate ensemble annealing for a simple system, the one-dimensional harmonic oscillator with energy E(x)=(x−x0)2/2 and ground state x0 in the canonical ensemble:

pk(x)≡p(x|βk)=√βk2πexp{−βk2(x−x0)2}.

The distance between two ensembles at inverse temperatures β and β′, β′≥β, is:

D(β′∥β)=12[(β/β′)−1−ln(β/β′)]

(18)

In this case the KL divergence depends only on the ratio between two successive temperatures. The constant relative entropy criterion yields a constant cooling rate ρ=β/β′,0≤ρ≤1 which is determined by

ρ−lnρ=2D+1.

(19)

This results in the geometric schedule βk=β0ρ−k. For D→0 we reach the adiabatic limit of infinitely slow cooling since ρ→1. Geometric schedules have been proposed for optimal simulated annealing Kirkpatrick et al. (1983).

Alternatively, we could consider the ground state a control parameter, β=x0, and let π(x)∝exp{−x2/2}, E(x)=−x with lnZ(β)=β2/2 and ⟨E⟩=−β. The relative entropy is now according to Eq. (11):

D(β′∥β)=−(β′)22+β22−(β−β′)β′=12(β′−β)2.

Constant steps in the relative entropy lead to a schedule that is linear in the inverse temperature: βk=β0±√2Dk. That is, the ground state is shifted either in the positive or the negative direction depending on the targeted ground state.

These examples highlight that it is not sufficient to prescribe the relative entropy to choose the next ensemble. We must also impose some sense of directionality in order to shift the ensemble closer to the target ensemble. This will become particularly important in multi-dimensional annealing.

We will now apply annealing of the canonical ensemble to various systems including discrete systems such as Ising and Potts models and a continuous protein model.

iii.1 One-dimensional example

To illustrate ensemble annealing, we first apply it to a system with a one-dimensional configuration space and a rugged energy function E(x)=−xsin(√|x|), the one-dimensional Schwefel function. We deliberately choose a large number of walkers for illustrative purposes (N=100); a smaller number of walkers would be sufficient in this one-dimensional example. At every iteration, equilibration is achieved by using a random walk Metropolis Monte Carlo scheme Metropolis et al. (1957) consisting of 10 random steps drawn from a uniform distribution. The relative entropy is set to D=10−3. Figure 1 shows the configurations at the various temperatures obtained by the constant relative entropy criterion. We start at βinitial=0 and target a final inverse temperature βfinal=1. As ensemble annealing progresses the walkers become more and more localized in the dominant modes of the target ensemble. The relative proportions are reproduced accurately.

This example also suggests that it should be possible to prune the number of the walkers during the annealing process. In the course of annealing, the ensemble becomes more and more concentrated (as monitored by a decrease in the entropy Sk=−∫pklnpk), and fewer walkers are needed to explore and represent it. Using the Boltzmann relation S=klogW where W is the number of accessible microstates, we could decrease the number of walkers in each iteration and thereby save computational resources. However, we have not explored this strategy further in this article.

iii.2 Ising and Potts model

We now apply ensemble annealing to the two-dimensional Ising and Potts model. Figure 2 shows simulation results for the 32×32 lattice. N=10 particles were used and the relative entropy was set to D=10−2. The equilibration step consisted of Metropolis Monte Carlo runs that randomly select lattice sites and try to flip the spin (Ising model) or draw a random color (Potts model). Figure 2(a) shows how the algorithm improves the initial DOS estimate. The algorithm starts with N random spin configurations from which the initial DOS covering only a limited energy range is derived. As the algorithm proceeds, lower energy states are generated and the DOS expands into the lower energy region. This process continues until the full energy range has been explored and a highly accurate estimate of the DOS is produced. The accuracy of the estimated DOS is illustrated in Fig. 2(b). The estimation error can be as small as with WL sampling Wang and Landau (2001). A similar accuracy is also obtained for the ten state Potts model which undergoes a first order phase transition. The DOS tends to be more accurate for the low energy states. In most situations this is desirable because one is primarily interested in the thermodynamic properties of the system at finite temperatures, at which the low energy states contribute most strongly.

(a) Ensemble annealing of the 32×32 Ising model

(b) DOS of the 32×32 Ising and Potts model

Figure 2: (Color online) (a) Improvement of the estimated DOS during ensemble annealing. Four equally spaced stages of ensemble annealing were picked; the iteration index k is indicated in the legend. The true DOS Beale (1996) is shown as a thick black curve; the estimated DOS as yellow [gray] area. (b) Estimated DOS for the Ising (left) and the ten state Potts model (right). Again the black line shows the true DOS and the yellow [gray] line the DOS produced during ensemble annealing. The insets show the relative error in the microcanonical entropy s(E)=lng(E).

Figure 3: (Color online) Ensemble annealing of the Gō model. Left panel: Density of states obtained with ensemble annealing (yellow [gray] line) and a parallel tempering simulation (thick black line). Right panel: Fraction of native contacts ⟨Q⟩β as a function of the inverse temperature β (black area). The heat capacity is indicated by the yellow [gray] line and has been scaled to match the ordinate range. It peaks at the folding temperature β≈1.3. The ribbon diagrams show configurations in the unfolded state (black ribbon on the left) and in the folded state (white ribbon on the right).

iii.3 Protein model

Ensemble annealing readily applies to continuous systems such as Gō models that have been used extensively to study protein folding (see e.g. Whitford et al. (2009)). In our version of the Gō model, the dihedral angles are the only conformational degrees of freedom; bond lengths and angles are fixed to ideal values. The energy function is comprised of a generic non-bonded energy potential and the Gō term. The non-bonded energy penalizes atom clashes using the same quartic repulsion term as in Ref. Habeck et al. (2005). The Gō term enforces the native structure by imposing a Lennard-Jones potential on the Cα distances between residues in contact in the native state. The inverse temperature β serves as the control parameter in ensemble annealing runs. As in Ref. Habeck et al. (2005), we used hybrid Monte Carlo Duane et al. (1987) for equilibration. We seeded the simulation with 100 random structures and annealed an ensemble of N=30 structures; the relative entropy was set to D=10−2. For reference, we also ran a parallel tempering simulation of the Gō model using 37 temperatures. The DOS obtained with ensemble annealing was used to optimize the temperatures to produce an exchange rate of 48% on average. 10000 replica transitions were simulated and an estimate of the DOS was obtained by running histogram reweighting.

We studied the Gō model derived for a small protein domain, the 59 amino-acid Fyn-SH3 domain (PDB code 1SHF). Figure 3(a) shows the density of states obtained with ensemble annealing and compares it to the reference computed with an exhaustive parallel tempering simulation. The agreement is very high over the entire energy range. In Figure 3(b) we study the characteristics of the Gō model as revealed by ensemble annealing. Shown is the average number of native contacts Q as a function of the inverse temperature. The folding transition is marked by a sudden increase in the number of native contacts. The heat capacity peaks at β=1.3 indicating a folding temperature of roughly 0.77.

(a) Energy histograms

(b) Temperature schedule and PT swap rates

Figure 4: (Color online) Ensemble annealing of ten state Potts model. (a) Energy histograms at the temperatures found by ensemble annealing. Only every twentieth histogram is shown for clarity. The temperature is indicated by the color. (b) Left: Temperature schedule. The thin yellow [gray] line indicates the schedule found by ensemble annealing. The thick black line corresponds to the protocol based on minimum entropy production Salamon et al. (1988). The dashed gray line indicates the critical temperature. Right: PT swap rates obtained with the temperature schedule. The heat capacity is shown as yellow [gray] curve.

We will now have a closer look at the schedules constructed by ensemble annealing and compare them to other schedules that have been proposed in the literature. Moreover, we discuss the possibility to use ensemble annealing as a numerical method to construct near-optimal thermodynamic paths.

iv.1 Temperature schedule

Figure 4 shows the energy histograms and temperature schedule found by ensemble annealing for the ten state Potts model. By way of construction of the schedule, the energy histograms of successive ensembles have a constant overlap (Fig. 4(a)). The temperature schedule is non-trivial and deviates from the linear, geometric, and logarithmic schedules that have been proposed in the literature Kirkpatrick et al. (1983); Gelman and Meng (1998). Initially, the inverse temperatures grow sublinearly. In this phase, the schedule constructed by ensemble annealing is reminiscent of the logarithmic schedule proposed by Geman and Geman Geman and Geman (1984), i.e. βk=β0ln(1+k). As ensemble annealing approaches the critical temperature, the cooling rate it slowed down automatically such that the system is not quenched and avoids being trapped in a metastable state. Beyond the critical point, the temperatures show a super-exponential increase (Fig. 4(b)).

Salamon and co-workers have proposed an adaptive version of simulated annealing more than two decades ago Salamon et al. (1988); Ruppeiner et al. (1991). Their algorithm finds the temperature schedule by minimizing the entropy production whereupon the temperature changes inversely proportional to the square root of the heat capacity. This rule follows directly from the constant relative entropy criterion. For small changes in inverse temperature, we have

D(β+δβ∥β)≈12(δβ)2∂2βlnc(β)

(20)

where

∂2βlnc(β)=⟨(E−⟨E⟩)2⟩β=C(β)/β2

is proportional to the heat capacity C(β). If the desired relative entropy D is small, the increment in inverse temperature is

δβ=β√2D/C(β).

Integration over the inverse temperature increments

∫δβ≈k∑l=1βl√2D/Cl

(21)

generates a schedule that is very close to the one found by ensemble annealing at finite D (Fig. 4(b)). Comparison with the schedule derived by Salamon et al. shows that √D is proportional to the thermodynamic speed of the annealing process. In the context of Bayesian computation, similar, but independent arguments have been put forward by Skilling Skilling (2006) who uses D to control the rate of compression as the system moves from the prior to the posterior probability.

From a practical point of view, an ensemble annealing run can be used to seed a parallel tempering simulation that has a well-balanced schedule and equilibrated initial states. The right panel in Fig. 4(b) illustrates that the exchange rates are indeed uniform for a PT simulation when using every fifth temperature of the ensemble annealing schedule. A drop in the swap rate is only observed close to the critical temperature where the heat capacity peaks.

iv.2 Minimal dissipation paths

Let us now see if the results of the previous section generalize to multiple temperatures. Although ensemble annealing can be applied to any family of bridging distributions [Eq. (1)], let us focus on parametric families of the form

p(x|λ)=1c(λ)q[E(x);λ]π(x)

where λ denotes the vector of all control parameters. The second order expansion of the relative entropy is Crooks (2007):

D(λ′∥λ)≈12(λ′−λ)TI(λ)(λ′−λ)

(22)

where the zero and first order term vanish because D(λ∥λ)=0 and λ′=λ is the global minimum of D(λ′∥λ) viewed as a function of λ′. Because the Fisher information matrix

I(λ)=∫[∇λlnp(x|λ)][∇λlnp(x|λ)]Tp(x|λ)dx

(23)

is positive definite, it defines a metric on the space of distributions parameterized by λ. Equation (20) is a special case of the general relation (22) for the Boltzmann ensemble with a single temperature, where the Fisher information is simply ∂2βlnc(β).

In statistics, the Fisher metric has been studied since the beginnings of information geometry. The Fisher information can also be used to define a thermodynamic length and action (see Crooks (2007) and references therein). Quasistatic processes that switch between two thermodynamic states follow minimal dissipation paths in λ space. These can be computed by minimizing the thermodynamic length (see, for example, Diósi et al. (1996); Crooks (2007); Sivak and Crooks (2012)). Therefore, the optimal path is a geodesic on the Riemanian manifold equipped with the Fisher information metric. Very similar results have been presented by Gelman and Meng in their work on bridge and path sampling Gelman and Meng (1998).

By taking constant but finite steps in relative entropy followed by an equilibration, ensemble annealing approximates a quasistatic process. After K successful equilibrations, the relative entropy accumulated during ensemble annealing is

KD=K−1∑k=0D(λk+1∥λk)≈12∫˙λTI(λ)˙λdt

(24)

and approximates the thermodynamic action due to Eq. (22). If we aim to optimize the use of computing resources, we have to minimize K, the number of bridging distributions. For a single control parameter this is straightforward: we have to follow the geodesic towards the destination ensemble. In the canonical ensemble, for example, if the destination temperature is lower than the initial temperature (annealing), we have to increase β such that βk+1>βk also for all intermediate temperatures. For ensembles with multiple control parameters the situation is more complicated because minimizing the accumulated relative entropy [Eq. (24)] requires the computation of a discrete geodesic. However the DOS is generally unknown, and we can compute I(λ) only in the vicinity of the current state. It is not possible to evaluate reliably the length of an entire path connecting the initial and the destination ensemble. We can only search locally without any guarantee that the generated path is close to the geodesic.

Figure 5: (color online) Annealing of the 32×32 Ising model in the Tsallis ensemble. (a) Probing different Emin at fixed α=1.0025. The thick black line shows the microcanonical temperature β(E). The dashed lines show the right hand side of Eq. (25) for different Emin. The curve produced with Emin set to the ground state energy E0 intersects β(E) twice corresponding to two peaks in the energy distribution g(E)q(E;α). (b) Comparison of ensemble annealing of the Tsallis ensemble with Emin=E0=−2L2 (dashed red line) and Emin=2E0=−4L2 (yellow solid line). The true DOS is shown as thick black line.

v.1 Tsallis ensemble

A major advantage of using the Boltzmann distribution (2) as bridging family is that many powerful methods to simulate the canonical ensemble exist. We can use these algorithms in the equilibration step. But it can be beneficial to consider also other ensembles, because they might bridge more efficiently between the initial and final ensemble. The Tsallis ensemble has been used previously in combination with parallel tempering Hansmann and Okamoto (1997); Habeck et al. (2005). The motivation for this choice is that due to the heavier tails of the Tsallis ensemble replicas have a larger overlap and can exchange states even if they show large energy differences. As a consequence, the number of intermediate replicas should be smaller than with the Boltzmann ensemble.

This is indeed confirmed by an analysis of the 32×32 Ising model. Test calculations based on the correct DOS show that the canonical ensemble requires 273 βk to reach the destination ensemble (βfinal=1) at a relative entropy of D=10−2, whereas the Tsallis ensemble needs only 85 αk values to bridge between αstart=1.06 (corresponding to a very high canonical temperature) and αfinal=1.0. However, in practice this apparent advantage does not hold up. The reason is that the Tsallis ensemble typically yields multimodal energy distributions at intermediate αk. To see this let us first consider the more general case where a parametric bridging family q(E;λ) is used. According to Eq. (8) the energy distribution at λ is proportional to g(E)q(E;λ) and peaks at ^E solving:

0=s′(^E)+q′(^E;λ)q(^E;λ)=β(^E)+q′(^E;λ)q(^E;λ)

where s(E)=lng(E) and β(E)=s′(E) are the microcanonical entropy and inverse temperature. In case of the canonical ensemble, this equation is simply β(^E)=β, that is the energy distribution peaks at the energy ^E whose microcanonical temperature matches the canonical temperature. In case of the Tsallis ensemble (3), we have:

β(^E)=β1+β(α−1)(^E−Emin)

(25)

This equation can have multiple solutions depending on α and Emin, which is why it is difficult to get annealing of the Tsallis ensemble running in a stable fashion. Not only the control parameter α, but also the minimum energy Emin plays a critical role (see Fig. 5). If Emin is exactly set to the energy of the ground state E0, the energy distribution of the Ising model becomes bimodal with a sharp peak around the ground state energy and a second peak corresponding to high temperatures. That is, in order to generate samples from this ensemble we have to simulate two phases simultaneously. As a consequence the DOS estimate produced by ensemble annealing shows systematic errors (Fig. 5(b)), despite producing an efficient schedule with 103 bridging distributions. If we lower Emin, the phase separation is less dramatic and consequently the DOS estimate is as accurate as with the Boltzmann ensemble. But we also lose the efficiency of the Tsallis ensemble in bridging large energy differences, which is reflected in the larger number of αk: 230 αk for Emin=2E0 which is similar to the 270 temperatures produced by Boltzmann annealing. This shows that Emin is an additional algorithmic parameter which is delicate to choose.

v.2 Microcanonical ensemble and nested sampling

Nested sampling has been invented by Skilling Skilling (2006) to solve Bayesian inference problems. Bayesian inference demands that we draw from a posterior distribution p(x) and compute its normalization constant, which are essentially the tasks that ensemble annealing addresses. In Bayesian inference π(x) is the prior, L(x)=e−E(x) the likelihood function; the destination ensemble that we aim to characterize is the posterior distribution over some inference parameter(s) x.

Nested sampling is based on the microcanonical ensemble q(E;ϵ)=Θ(ϵ−E) [Eq. (5)]; the control parameter ϵ is the maximum energy that the system is allowed to reach Habeck (2015). Therefore nested sampling can be viewed as a special case of ensemble annealing based on a zero-temperature Fermi or the microcanonical ensemble. The relative entropy between two ensembles [Eq. (10)] with energy levels ϵ′≤ϵ simplifies to:

D(ϵ′∥ϵ)=ln[c(ϵ)/c(ϵ′)]

(26)

where the normalization constant

c(ϵ)=∫ϵEming(E)dE

(27)

is the cumulative DOS or configuration space volume. From a Bayesian point of view, c(ϵ) is the prior mass enclosed by the likelihood contour L=e−ϵ. The control parameter is reduced from infinity to the energy of the ground state Emin.

There are several differences between nested sampling and annealing of the ensemble (5) using ϵ as control parameter. These differences result from the fact that all of the features that ensemble annealing aims to implement explicitly are built-in to nested sampling. In fact, nested sampling’s design principles served as a guide to develop the ensemble annealing algorithm.

Ensemble annealing uses histogram methods to estimate the DOS, whereas nested sampling utilizes order statistics due to the special form the of truncated ensemble (5). As a consequence of the truncation, c(E) will be uniformly distributed over pk(E) (defined for E<ϵk), which is clear from Eq. (8) 2. Therefore the configuration space volume associated with the maximum energy state follows the distribution c(Emax)/c(ϵ)∼NtN−1 where 0≤t≤1 and Emax<ϵ is the maximum energy among all N walkers. Based on this result from order statistics, nested sampling estimates c(E) at well-dispersed energy cutoffs ϵk.

Another elegant feature of nested sampling is that if D=1/N, the next ensemble achieving a compression of D is the one in which the energy is bounded by Emax. This results from the fact that ⟨D(Emax||ϵ)⟩=−⟨logt⟩=1/N where the average is over the Beta distribution NtN−1. Therefore the search for the next control parameter will simply yield ϵk+1=Emax, and we only have to resample the state with the highest energy.

Although nested sampling is much more efficient at cooling the truncated ensemble (5), it is also possible to run the ensemble annealing algorithm. Both methods produce comparable sequences of energy levels ϵk for the 16×16 Ising model with N=10 and D=1/N=0.1 (see Fig. 6). Also the estimated DOS is of similar accuracy. For this example, nested sampling runs at a speed that is three orders of magnitude faster than ensemble annealing. This is due to the fact that DOS estimation and annealing (i.e. the choice of the next energy limit) are instantaneous in nested sampling, because they are built-in to the method. Ensemble annealing, on the contrary, needs to run the histogram iterations for every energy contour. The histogram iterations converge only very slowly. Each iteration is dominated by DOS estimation because equilibration of the Ising model is very fast. For other systems such as proteins it will be the equilibration step rather than DOS estimation that consumes most of the computation time. In this situation, the discrepancy between nested sampling and ensemble annealing will not be as dramatic as for the Ising model.

For the d-dimensional harmonic oscillator we have g(E)=2dEd/2−1 and c(E)=Ed/2. As in the canonical ensemble, the relative entropy depends on the ratio of two successive control parameters: D(ϵ∥ϵ′)=d/2ln(ϵ′/ϵ). Therefore nested sampling and ensemble annealing progress geometrically according to ϵk=ϵ0ρk where ρ=e−2D/d. Let us compare this to the thermal approach using the inverse temperature as a control parameter. The compression rate of the canonical distribution is given by ρ−lnρ=2D/d+1 [Eq. (19)]. Therefore ρβ−lnρβ=−lnρϵ+1 where ρβ and ρϵ are the compression rates of thermal and microcanonical annealing. Rewritten we have ln(ρβ/ρϵ)=ρβ−1≤0, and therefore ρβ≤ρϵ≤1. This means that annealing the canonical ensemble compresses faster than annealing the microcanonical ensemble. We observe this for the application to the Ising model (Fig. 6). Canonical annealing with N=10 walkers and a relative entropy of D=0.1 requires only 42 iterations to reach the destination ensemble. Microcanonical annealing and nested sampling, on the contrary, need approximately 1800 iterations until convergence. The reason for this is that states accumulate at the maximum energy ϵ, and therefore nested sampling and microcanonical annealing will produce many intermediate ensembles in order to bridge between the initial and the destination ensemble.

Ensemble annealing is a Monte Carlo algorithm that steps through a sequence of ensembles and generates conformational samples. Along with the samples, it also estimates the density of states using histogram methods. The ensembles are placed in an adaptive manner so as to maintain a constant, pre-chosen relative entropy between successive ensembles. Ensemble annealing can be applied to a variety of bridging distributions, foremost the canonical ensemble but also to non-Boltzmann families such as the Tsallis or the microcanonical ensemble. There is a close connection to the nested sampling algorithm. In fact, ensemble annealing aims to implement the features that are built-in to nested sampling: control of the compression or thermodynamic speed, as well as reliable estimation of the compression based on the DOS or the configuration space volume. Nested sampling is intimately tied to the truncated ensemble (5), whereas ensemble annealing is more general in the choice of the ensemble itself, which can help to speed up the simulation and allows the use of samplers that work efficiently with a particular ensemble (such as, for example, hybrid Monte Carlo in the canonical ensemble).

Ensemble annealing is also related to previous work by Salamon and co-workers Salamon et al. (1988); Ruppeiner et al. (1991) on simulated annealing. Our approach is more general and gives richer results because it not only finds the system’s ground state but reconstructs the entire DOS. That way ensemble annealing can be used to both simulate thermodynamic systems and solve difficult optimization problems. By means of the DOS, all visited states contribute to the computation of ensemble averages making our approach more robust. Moreover, it is possible to work with multiple control parameters, which will be studied in future extensions of ensemble annealing.

Acknowledgements.

Footnotes

Normally, the control parameter of the Tsallis distribution is called qk. Because we already use qk for the bridging ensemble, we denote the Tsallis parameter by αk.

In statistics, the transformation from E to c(E) where c(E) is the cumulative distribution function of p(E) is called probability transform. It is a basic mathematical fact that if E follows p(E), c(E) will be uniformly distributed over [0,1].

In statistics, the transformation from E to c(E) where
c(E) is the cumulative distribution function of p(E) is called probability transform. It is a basic mathematical fact that if E
follows p(E), c(E) will be uniformly distributed over
[0,1].

Normally, the control parameter of the Tsallis distribution is called qk. Because we already use qk for the bridging ensemble, we denote the Tsallis parameter by αk.

In statistics, the transformation from E to c(E) where c(E) is the cumulative distribution function of p(E) is called probability transform. It is a basic mathematical fact that if E follows p(E), c(E) will be uniformly distributed over [0,1].