Full disclaimer: this tutorial is by no means the final word on how to sample mixture models! It also contains some personal biases with respect to the presentation and choice of methodology/technology. But it should provide a good starting point for those who are interested in using a mixture model and sampling it with MCMC. Despite the simplicity of the model/data considered here, the lessons are actually very deep and broadly applicable.

Don’t be scared by the amount of code. This notebook is *self-contained* and has a lot of tests. This is intentional, since I didn’t want to pretend one can write a complicated MCMC analysis without performing detailed tests and checks at all stages of the work.

What are mixture models?

In the realm of probabilistic models, mixture models are very common. They refer to describing a probability distribution as a linear sum of components

The kernels usually belong to some family of probability distributions (e.g., Gaussians). One is usually interested in inferring the amplitude coefficients as well as the parameters of the kernels (e.g., their positions and sizes). Note that to obtain a valid probability distribution function (PDF), both the kernels and the amplitudes must be normalized, e.g., sum to one:

The kernels do not have to be identical or even belong to the same family. Also note that the mixture could be infinite. Both of those extensions are the state of the art of modern statistical modeling, but are advanced topics which will not be covered here.

You might have already realized that this loose definition gives a lot of flexibility. As we will see, this creates numerous technical issues, notably in how mixture models are optimized or sampled when fitting data. In particular, there is the infamous label degeneracy, which refers to the ability to swap components/kernels and leave the total PDF invariant. There is also the mode collapse phenomenon, describing how using too many components might make them describe the data points separately (one component per datum). More generally, the presence components with low amplitudes, infinitely low or high widths, creates various new degeneracies in the posterior distribution. For instance, one can probably add many of those components without changing the quality of the fit to a data set. This is related to model selection, and choosing the number of mixtures. All of those issues are problematic if one attempts to naively sample the mixture parameters with MCMC. Luckily, there are standard tricks to get around them, as we will illustrate below.

One of the main restrictions of this notebook is to focus on one-dimensional mixtures. However, if were multidimensional, most of the discussion below would be unchanged - but many of the calculations and illustrations would be much more difficult.

Density modeling with Gaussian Mixtures

We will consider a set of points generated from the mixture of Gaussians, with amplitudes, means, and standard deviations denoted by . Thus, each is drawn from

We use the notation

For each , we observe a noisy version , with Gaussian noise of standard deviation , such that

which is our likelihood function for each data point.

The variables are such that and , i.e., they live on the -dimensional simplex.

As we can see, our noisy samples span a wider domain than the initial density. Because of the noise, the density is blurried, and we cannot resolve the two components. In fact, we barely guess that there are two components. But, as we will see below, because the noise is known, it can be deconvolved, and we can recover the two components pretty well. Note that this statement is only correct under the assumption of our two-component Gaussian Mixture model! We will see that things are not so simple once this assumption is relaxed and we consider a broader model space.

The posterior distribution(s)

Let us now write the posterior distribution of the parameters of this model. We are interested in recovering from the noisy data and their errors, . Unfortunately, the noiseless values are not observed; they must be modeled and marginalized over. By applying Bayes’ theorem twice and introducing latent ’s, we can write the posterior distribution of interest:

We have dropped the evidence at the denominator, since it is a normalization constant. It would be important if we were comparing models or fitting parameters accross models. For parameter estimation within one model (here with two components), it is unimportant.

By making the distributions explicit, our full posterior distribution reads

However, notice that we are in a very special context: both our distribution model and our likelihood are Gaussian. In other words, the expression for the
posterior probability contains an integral over a product of two
Gaussians. This can be resolved analytically, and leads to

If you are note sure how to do that, please look at the Matrix Cookbook! The reference is above.

This is a very convenient result since we have once and for all removed the latent variables and integrals, and we obtained a simple expression for the posterior distribution of interest. We will refer to this expression as the “simplified” posterior, while the previous one, with the set of latent variables and integrals, is the “full” posterior. Because this analytical elimination of the latent variables is not always possible, and because sometimes one does want to explicitly sample them, the full posterior is of great interest. Below we will see to correctly sample it and deal with the large number of parameters.

Why would we want to explore the full set of parameters, especially if we are able to eliminate a large number of them? This is because we might be interested in recovering the true noiseless values ’s. More precisely, we may want to look at the join posterior distribution on the ’s and the mixture model. We may find that some values are well recovered regardless of the mixture, while some other might not. More generally, this is because of a shrinkage property of Bayesian hierarchical models: by simulatenously recovering the population parameters and the object parameters ’s, we will obtain stronger constraints on than by using the objects individually (i.e., the likelihood function ). This is a natural yet powerful consequence of the hierarchical nature of the model and the parameter inference, which we cannot quite access if we analytically marginalize over the latent variables. We will illustrate this shrinkage property below.

Minor comment: we have integrated each in , but we will adopt bounded priors in the other numerical tests below. This difference does not affect our investigations. If we wanted to be very precise, we could easily include truncated gaussians in the simplified posterior distribution.

We will try sampling both of those posterior distributions below. This is because in many real-world problems this analytic simplification is not possible, and one has to work with the full posterior distribution, which might involve many latent variables and integrals. This would be the case if our population model and/or our noise were not Gaussian. We will work with this toy model because we have the simplified posterior distribution to verify our results.

First MCMC : raw model

Let us sample the simplified posterior distribution (which, for our fake data set, has 6 parameters: the amplitudes, means, and standard deviations of the two Gaussian components) with a very good sampler: emcee. The reasons why emcee is a good sampler are beyond the scope of this tutorial (but scale invariance and speed are two of them!). As mentioned before, we adopt uniform priors in for all parameters. We will manually impose by renormalizing them systematically at each random draw.

param_names = [r'$$\alpha_'+str(i+1)+'$$' for i in range(n_mixture)] +\
[r'$$\beta_'+str(i+1)+'$$' for i in range(n_mixture)] +\
[r'$$\gamma_'+str(i+1)+'$$' for i in range(n_mixture)]
_ = corner.corner(sampler.flatchain, truths=params, labels=param_names)

As we can see, the result is surprisingly nasty! In fact, running MCMC on this distribution is incorrect. The posterior distribution has strong degeneracies and is multimodal, so standard MCMC will not work. As an exercise, you can check that the sampler has not converged. Also, even if it looks like it is finding multiple modes, you can also check that the density of samples is incorrect at those locations, and is not proportional to the posterior distribution.

We briefly mentioned this phenomenon before. The degeneracy between ’s is due to them being normalized and constrained to sum to one. As a result, the posterior distribution, by construction, lives on a thin slice in space.

There are a few ways one could solve this problem. I like that introduced by M. Betancourt (arxiv:1010.3436) since it is simple to implement and it leads to simple derivatives. It establishes a bijective mapping between the hypercube and the simplex, by constructing coefficients ’s with for , and no normalization condition. In simpler terms, this means that we can use the coefficients ( of them, in ]) instead of the coefficients ( of them, on the simplex), which is very convenient. For each draw of ’s in the hypercube, the simplex coefficients (the amplitudes of our mixture components) are found with

There are other possibilities for mapping the simplex onto closes volumes, but not all of them lead to simple derivatives (via a change of variable), which is often critical for some inference methods, like the HMC we will use later on.

Moreover, there is an additional complication that made our sampling failed: the labeling degeneracy. That is, we haven’t imposed a specific order to the components, which means they can be swapped and lead to the same value of the posterior distribution. This degeneracy is well known and leads to a perfectly multimodal distribution (with modes!). Since the mapping between those modes is known, a simple solution is to tune the prior and only explore one of them. In one dimension, this can be achieved by enforcing the order of the components (for example their ’s) in the prior itself. Note that in multidimentional cases, this condition is more difficult to enforce.

Second MCMC run: emcee with reparametrized model

We will now order the component locations and use the hypercube-to-simplex mapping to run the emcee sampler once again. We will also make some tighted assumptions about the component amplitudes and widths. We don’t want those to be too small in order to avoid new sources of degeneracies in the posterior distribution, like the mode collapse. (This point will be illustrated in the next section).

param_names = [r'$$z_'+str(i+1)+'$$' for i in range(n_mixture-1)] +\
[r'$$\beta_'+str(i+1)+'$$' for i in range(n_mixture)] +\
[r'$$\gamma_'+str(i+1)+'$$' for i in range(n_mixture)]
_ = corner.corner(sampler.flatchain, truths=params, labels=param_names)

We now see that we have solved those issues and that we obtain a nicely sampled unimodal posterior distribution, with the true values correctly included.

(If you still see some strange tail and isolated samples that seem to indicate a secondary mode, please ignore this at the moment. This is real, and will be discussed later on.)

Note that our sampling procedure is very quick: the two runs merely took a few seconds. This is because emcee has many nice properties and is parallelized. Yet, as we will discuss below, more efficient approaches exist (in terms of the fraction of random proposed samples that are accepted), and those are essential for scaling up from a few parameters to… many millions!

A nice visualization of the result is in data space, where we evaluate our model for a few hundred samples of the posterior distribution, and see that they are nicely around the truth.

Visualizing the labelling degeneracy with PolyChord

Let us now investigate and try to understand the structure of our simplified posterior distribution. We will use the simplex-to-hypercube mapping, but no longer order the components.

Standard MCMC techniques like Metropolis-Hasting or Ensemble approaches, in their standard implementations, will only correctly sample uni-modal distributions. Some of them can be extended to deal with multimodality. Yet, the most widespead solution for fully mapping a multimodal distribution is nested sampling. PolyChord is one of the most sophisticated implementations of this sampling scheme, and is able to accurately map very complicated distributions.

There is a cost: nested sampling typically requires many more function evaluations than standard MCMC, and also has a few parameters that need to be tuned. We will not cover that last point here, but you should know that running nested sampling may require some tuning via trial and error. Finally, we won’t discuss or make use of the fact that nested sampling delivers the model evidence (or marginalized likelihood), which is essential for performing model comparison.

Let us run PolyChord on our data set, with the same posterior, no ordering constraints, and two components only.

# Save the data to numpy arrays since we will run polychord outside of this notebook.
np.save("yis", yis)
np.save("sigmais", sigmais)

We can now see the two modes caused by the labelling degeneracy! PolyChord nicely finds and map them. Since we have activated the clustering algorithm, we can restrict our attention to one of the modes and plot posterior samples. Indeed, we check that our inference is correct.

Post-processing the chain

You may wonder, since the mapping between the two modes is predictable, why we don’t simply post-process the chains to order the components and manually merge the modes? This is indeed a very good solution to return to a unimodal case. Unsurprisingly, it is more sampled that the plot we just made with PolyChord’s solution for a single cluster, since we’ve just merged all the clusters by reordering.

Note that this trick must be very carefully applied. One cannot just post-processed any MCMC chain to alleviate the multimodality problem (for mixture models only!). This will only work if the MCMC sampler has converged and properly explored the multiple modes it has mapped (though it is insensitive the the relative amplitudes in this exploration, for obvious geometrical reasons). In our case this is valid.

Note that this PolyChord run took a few minutes. This is because nested sampling is generally calls the posterior distribution more times than MCMC, and our calls were not parallelized here. We have also tuned the parameters to sample the posterior distribution in a conservative fashion. One could certainly find better parameters that lead to a more efficient sampling.

More components

Let’s do a more difficult run: perform the inference via PolyChord and three components instead of two. What will we see? Unfortunately, this run will be even slower and take several minutes, but the comments above apply again: this could be made more efficient if necessary.

If we plot the posterior distribution, we immediately see that the labeling degeneracy has gotten worse (since it scales as factorial ), and that the two-component model is clearly preferred (we nicely recover our input parameters). In other words, one of the components is constrained to be zero.

If we try the previous trick and post-process the chains to reorder the components, we see that the result is not satisfactory at all! This is because the parameters of the third (small) component are unconstrained because it doesn’t fit the data. It ends up interfering with the two others.

Unfortunately, we can only conclude that the labeling degeneracy is not that easily killed. We have seen that nested sampling is a natural solution to map out the full posterior distribution. But isn’t there another way to do standard MCMC and map a single of the modes? This is what we will do now with HMC.

HMC with automatic differentiation

On top of our wish to efficiently sample one single mode of the posterior distribution, let’s return to a comment we made at the start: there are few real-world situations where the population and likelihood probabilities are Gaussian and one can integrate over the latent variables analytically. Therefore, we would like to see if we can sample from the full posterior distribution. In other words, we will sample from

By histogramming the samples only in the dimensions, we should recover the previous results from the simplified posterior distribution since this is equivalent to numerically integrating out the ’s. This is a classic MCMC trick: if one draws samples from a distribution and only look at the dimension (by histogramming the values and ignoring ), one is effectively looking at the marginalized posterior .

Since this model has many more parameters ( instead of ), standard MCMC methods like emcee will struggle. This is because those have a typical acceptance rate of (at best), which is the cornerstone of standard MCMC algorithms like Metropolis-Hastings. This can only be increased by using extra information such as gradients, which we will do below. Moreover, due to volume effects, the space to explore is very large, so there is a chance the relevant portions of the full posterior distribution will rarely be explored. This is known as the curse of dimensionality.

A common way to get around this problem is to adopt a more efficient MCMC method: Hamiltonian or Hybrid Monte Carlo. It has high acceptance rate (close to 1), and this method is designed to explore large parameter spaces and deal with nasty degenerate distributions. This is exactly what we need. Note that it cannot deal with multimodal distribution - we will come back to this point later in this tutorial.

One of the peculiarities of HMC is that it requires gradients of the posterior with respect to the parameters of interest. In our case, we could write those analytically. But in real world problems this might be cumbersome and error-prone. Thus, we will use a powerful tool: automatic differentiation. This originates from the observation that differentiation rules are simple and can be propagated via the chain rule. If one can write the posterior distribution from elementary analytical or numerical operations, then finding the gradients is most likely a mechanical, algorithmic task that doesn’t need to be done by hand. And indeed, there are automatic differentiation packages that will go through a piece of code and infer the analytic form of the gradient from elementary functions. Autograd is one of those, and only require that you write your function (here the posterior distribution) with numpy (or rather, its wrapped version of numpy, as you can see from how we have imported numpy and autograd at the start of this notebook).

We will now ask autograd to create a function that computes the analytic gradient of our posterior distribution with automatic differentiation! At the moment, we will focus on our simplified posterior distribution and check that the HMC approach works, before moving to the full posterior distribution.

lnprob_hyperonly1_grad = grad(lnprob_hyperonly1)

We can numerically verify that it is indeed the gradient of our function:

Of course, this gradient works for our original simplified posterior distribution, not the reparametrized one. So it will suffer from the degeneracies and multimodality we’ve already discussed. We want to avoid that and use our reparametrized posterior with the simplex-to-hypercube mapping.

If you try to apply autograd to the latter, you will get an error message saying that ‘numpy.cumprod’ is not yet implemented in autograd. This is the limitation of automatic differentiation: sometimes the operations used in a model are not supported. In our case, we can easily get around this problem by performing the change of variables manually (also explained in arxiv:1010.3436). In most cases, one can use automatic differentiation for large chunks of a model, and implement the rest by hand. Note that there are other frameworks supporting automatic differentiation. Autograd is convenient because it works on top of numpy. For large data sets and complicated likelihoods, I recommend Google’s TensorFlow. It requires rewriting some of your code in a particular format, but is incredibly powerful.

Note that autograd also allows us to extract the Hessian of our distribution of interest. We will use that to tune our HMC run and make it blazingly efficient. This is explained in arxiv:1206.1901 and also many other HMC papers.

Running HMC on the simplified posterior

The code below is an implementation of the classic HMC algorithm with bounds and a mass matrix. It is not parallelized, so we expect it will be slower than emcee. All the HMC runs below will take a few minutes to complete. However, by using analytic derivatives and Hessians, we should hope that it performs well and samples the target distribution very well. We will use random step sizes and lengths, too. Those technical choices are fairly common and discussed in Neal (2006).

The mass matrix is important in HMC, because it informs the sampler of the dimensions of the various parameters, and how constrained they are, so that the random moves are made in the most efficient directions. The most efficient choice for the mass matrix is to use the inverse hessian at the parameter optimum (i.e., the covariance matrix of the parameters). In our case we will also update the Hessian every 500 iterations, evaluated at the posterior mean. This tends to work better (than finding the optimum and computing the Hessian once) for nasty distributions.

Because this approach tells the algorithm about the local curvature, it will implicitly restrict the sample to a single mode in the posterior distribution.

Note that our sampler will throw a warning every time it rejects a sample (which is possible due to numerical approximations involved in the HMC algorithm). But this will be very rare; our acceptance rate will be very close to one.

As we can see, the HMC run was successful, which is good! However, it is slower than emcee. This is because, despite using gradient information to obtain a higher acceptance rate (close to 1, as opposed to 0.3), our HMC sampler is not parallelized. And emcee is generally blazingly fast. We could make HMC much more efficient if we wanted.

However, we will now sample the full posterior distribution, which emcee cannot deal with since it has way too many parameters.

The marginalized posterior distribution looks good. For the population parameters , it matches what we obtained with the simplified posterior distribution. This is remarkable because we have added a ton of parameters to the model, and explored a much bigger parameter volume. Thanks to HMC, this exploration was efficient, and we were able to numerically marginalize over all the latent variables ’s and recover the results from the analytic solution, the simplified posterior distribution.

Hierarchical uncertainty shrinkage

The other panels in the figure above show a comparison of the posterior distribution on each versus the likelihood function . This is a typical illustration of Bayesian shrinkage: by fitting a model to the population of objects, we in turn improve the individual estimates. It doesn’t matter that those parameters are latent - they are connected via a hierarchical probabilistic model, so the information gained in one part of the model might actually affect the rest of the model. This is a theme that we have explored (and will continue to explore) in other tutorials.

Number of components and multimodality

We have not discussed a number of additional issues that may arise, including what happens if the data are very noisy and/or the components significantly overlap. Those are common with mixture models.

Indeed, those can create new degeneracies in the posterior distribution, since components can more easily be interchanged if they overlap or in the presence of significant noise. One can still find a prior that alleviates those degeneracies, but this might not be sufficient. In that case, an HMC sampler that is carefully tuned should still work. But if the distribution is really multimodal, then one could adopt a “tempered” HMC sampler. For example, the approach of https://arxiv.org/abs/1704.03338 is interesting.

(By tuning, I refer to the parameters of the HMC algorithm: the step size, number of steps, and mass matrix. There is a lot of litterature on how to tune this to make the exploration of complicated posterior distribution more efficient. In particular, one wants an acceptance rate close to one, and the maximum step size and minimum number of steps that yield to a good MCMC chain. A good metric for that is the correlation between samples - we want samples that are as independent as possible, and fully explore the posterior distribution.)

Another related question is how to find a suitable prior to alleviate labelling degeneracies in dimensions. This is possible, but may require some careful thinking. Again, the key is to order the components in a quasi-unique fashion.

But the most critical question we have not mentioned so far is the choice of number of components. In most real-world problems, one does not know how many components to use. This is a vast, deep question which is addressed in countless papers, many of them beyond the scope of this tutorial. The classical approach boils down to doing multiple runs with different number of components, and performing model selection (in a Bayesian fashion) or cross-validation (in a frequentist fashion) to select the best number of components. Indeed, the denominator of our full posterior distribution (neglected previously) is the evidence

In other words, one way or another, a huge multidimensional integral is needed. Calculating evidences accurately is a research topic on its own. One can run a very long MCMC chain and perform this integral numerically. Or adopt a more adequate sampler that explicitly targets the evidence (e.g., via nested sampling). Those options are good. For two runs with two different numbers of components and , the evidence ratio (also called Bayes factor) will allow one to perform model selection and pick the preferred number of components. Even better, one could simply consider both models simultaneously, and weight the samples in each model by their respective evidences. All of those could be performed with nested sampling.

Note that with mixture models, we are in a very particular situation: by fixing one or multiple or to zero or one, the number of components has effectively changed. We say that the models are nested. Nested models are very convenient and are worth learning about. Many physical models are nested, and this hugely simplifies model comparison, via the Savage Dickey density ratio, for example.

Equivalently, one could adopt a trans-dimensional approach and explore the space of models with a variable number of components. It is in fact equivalent to multiple-run solution we just discussed. Either way, there are various pieces of technology to achieve that, such as trans-dimensional MCMC (which explicitely adds and removes parameters), which we will not discuss here. But you should immediatly spot that we have already done that: when we ran nested sampling with three components, we found that the posterior distribution was naturaly peaked at two components, with the third one often set to zero. The point is: for nested models, it is not that hard to perform a trans-dimensional sampling. One only needs to write the posterior distribution in a convenient, nested form, and think carefully about what priors would help pinning down (i.e., sampling) the most probably number of components.

This brings us to our final point of discussion and experiment.

Interpretability of mixture models.

Let’s run PolyChord with more components - say four - and show the result in data space, i.e., plot the true density of objects and that infered with our Gaussian mixture model.

In this figure, we have separated the models into their number of components (specifically: any component having an amplitude ten times smaller than the largest component is not counted). Since we randomly picked posterior stamples, the number of curves arriving in each subpanel is also telling us something about the most probably number of components. In fact, we could relate those relative densities to Bayes factors for the nest models with 1, 2, 3 and 4 components. What we are doing is much more generic and straightforward: we have implicitly performed a trans-dimensional sampling since we have allowed some of the compoments to be zero.

One final comment: you can see that except for the one-component case, most of our model samples indeed capture the true bimodal nature of the density. However, why are some of those models made of three and four components? This happens because we have not restriced the components to overlap. This creates yet another degeneracy on top of the labeling one: the fact that any component can be decomposed into sub-components with lower amplitudes. When considering a larger space of models (i.e., trans-dimensional), this is a complicated problem. This is related to the interpretability of mixture models. Luckily for us, it can be solved in two ways: in a fully Bayesian inference setting like we described in this tutorial, one can easily adopt a prior that mitigates this effect, for instance by forcing the components to be separated by some amount. An alternative which is particularly popular in the realm of empirical Bayesian inference and maximum a-posteriori solutions (i.e., finding the optimum of the posterior distribution only) is a set of “split and merge” methods, allowing to litteraty split and merge some of the components when relevant. The paper about extreme deconvolution by Bovy et al (2009) is a good place to learn about that.

If we adopt a large number of components, what do you think will happen? The answer is simple: the components will position themselves at the location of the true values . The phenomenon is known as mode collapse. Worse, the ’s are latent variables, the collapsed components will never properly converge.
Another final comment: if the number of objects is large and the gradients intensive to compute, standard HMC might be less compelling. Luckily, there is a natural solution to this issue: not using all of the data when calculating the gradient. However, a naive implementation of this idea is not recomment. Indeed, it has been shown by https://arxiv.org/abs/1402.4102 that this can yield arbitrarily bad results. The correct approach is presented in that same paper: the HMC must include an extra term to capture the extra noise introduced by the approximate gradient. This idea works beautifully well, even on large models and data sets, such as deep neural networks.