astroABC: ABC SMC sampler for cosmological parameter estimation

“…the chosen statistic needs to be a so-called sufficient statistic in that any information about the parameter of interest which is contained in the data, is also contained in the summary statistic.”

Elise Jenningsa and Maeve Madigan arXived a paper on a new Python code they developed for implementing ABC-SMC, towards astronomy or rather cosmology applications. They stress the parallelisation abilities of their approach which leads to “crucial speed enhancement” against the available competitors, abcpmc and cosmoabc. The version of ABC implemented there is “our” ABC PMC where particle clouds are shifted according to mixtures of random walks, based on each and every point of the current cloud, with a scale equal to twice the estimated posterior variance. (The paper curiously refers to non-astronomy papers through their arXiv version, even when they have been published. Like our 2008 Biometrika paper.) A large part of the paper is dedicated to computing aspects that escape me, like the constant reference to MPIs. The algorithm is partly automated, except for the choice of the summary statistics and of the distance. The tolerance is chosen as a (large) quantile of the previous set of simulated distances. Getting comments from the designers of abcpmc and cosmoabc would be great.

“It is clear that the simple Gaussian Likelihood assumption in this case, which neglects the effects of systematics yields biased cosmological constraints.”

The last part of the paper compares ABC and MCMC on a supernova simulated dataset. Which is somewhat a dubious comparison since the model used for producing the data and running ABC is not the same as the Gaussian version used with MCMC. Unsurprisingly, MCMC then misses the true value of the cosmological parameters and most likely and more importantly the true posterior HPD region. While ABC SMC (or PMC) proceeds to a concentration around the genuine parameter values. (There is no additional demonstration of how accelerated the approach is.)

Thank you very much for your interest in our paper, and for your suggestions and questions.

As the astrophysics community is generally unaware of ABC methods, this paper was very much intended as a simple introduction to ABC.
As my background is in cosmology I am not familiar with the standard tests in statistics which are used to illustrate the speed of the code. I’ll certainly look into this or if you can suggest some tests that would be very much appreciated.

The supernovae example at the end is a toy example meant to illustrate the fact that when a complete analytical model in the likelihood is missing, the MCMC approach can yield bias constraints. We felt this was an important point to make as using a MCMC approach with an incomplete model is currently state of the art in the community.

As mentioned above MPI stands for message passing interface, which allows you to distribute jobs across separate nodes rather than many current implementations of ABC SMC which are confined to a single node and so are limited to using ~16 cores to run. Aggregating results from all the cores is very efficient although there is some slowdown once the number of cores increases substantially (above ~100). This is certainly not the bottleneck in the code when compared with the time each particle takes running the simulation.

To address the real slow down in the algorithm we added another crucial new feature which is the ability to split up nodes and allocate some to the particles in ABC and some to the simulation each particle launches. Depending on the simulation used this can yield a large speed up. So a single particle on a single core could launch its simulation to e.g. 4 separate cores and have a factor 4 speed up in creating the mock dataset and evaluating the metric. This can be done for each particle.
As the speed up would depend on how the simulation can be parallelized it was difficult for me to give representative times here. I will think about including another example.

This was an interesting paper for me. The long passages introducing ABC and MCMC (which I understand from the authors are targeted for an astronomy audience) are quite ‘rambly’ and some parts do not quite parse from a statistical perspective. But the code could indeed be useful as a platform to run simulations efficiently on a cluster, which at present I do in a simplistic way: ie., call 64 jobs, every 5 seconds check the number of running jobs, if <64, start a new job, and hope that somehow the system is allocating them sensibly.

What I would really liked to have seen is a NIPS-style section illustrating the scaling of runtimes vs nodes vs Nparticles for a couple of computationally-challenging standard problem in the field, like inference on a suitably-sized Potts model. Presumably in some regimes the scaling is most constrained by waiting for the last order statistics of N iid run times, and for some regimes it's probably much better.

MPI is “Message Passing Interface” (https://www.open-mpi.org/), the standard for constructing highly distributed applications in supercomputing contexts that can leverage thousands to millions of cores. Still looking at the code on Github, but at first glance this is good stuff, because most ABC codes available online have been parallelized across cores and small clusters but not optimized for very large applications. Thanks for calling out the reference, I missed it in Arxiv when it appeared!