The CRiSM workshop on estimating constants, which took place here in Warwick from April 20 till April 22, was quite enjoyable [says most objectively one of the organisers!], with all speakers present to deliver their talks (!) and around sixty participants, along with 17 posters. It remains an exciting aspect of the field that so many and such different perspectives are available on the “doubly intractable” problem of estimating a normalising constant. Several talks and posters concentrated on Ising models, which always sound a bit artificial to me, but are also perfect testing grounds for approximations to classical algorithms.

On top of [clearly interesting!] talks associated with papers I had already read [and commented here], I had not previously heard about Pierre Jacob’s coupling SMC sequence, whose paper is not yet out [no spoiler then!]. Or about Michael Betancourt’s adiabatic Monte Carlo and its connection with the normalising constant. Nicolas Chopin talked about the unnormalised Poisson process I discussed a while ago, with the feature that the normalising constant itself becomes an additional parameter. And that integration can be replaced with (likelihood) maximisation. The approach, which is based on a reference distribution (and an artificial logistic regression à la Geyer), reminded me of bridge sampling. And indirectly of path sampling, esp. when Merrilee Hurn gave us a very cool introduction to power posteriors in the following talk. She also mentioned the controlled thermodynamic integration of Chris Oates and co-authors I discussed a while ago. (Too bad that Chris Oates could not make it to this workshop!) And she also pointed out that thermodynamic integration could be a feasible alternative to nested sampling.
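
For readers who have not met power posteriors, here is a minimal sketch of thermodynamic integration. It rests on the identity log Z = ∫₀¹ E_t[log L(θ)] dt, where the expectation is taken under the power posterior p_t(θ) ∝ L(θ)^t π(θ). Everything else is an assumption made for illustration and is not taken from any of the talks: the toy model y_i ~ N(θ,1) with prior θ ~ N(0,1) (picked so that each power posterior can be sampled exactly and the evidence is available in closed form), the temperature ladder t_i = (i/K)^5, and the function names.

```python
import math
import random


def log_lik(theta, y):
    """Log-likelihood of i.i.d. N(theta, 1) observations."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)


def thermo_log_evidence(y, n_temps=40, n_draws=2000, seed=1):
    """Thermodynamic integration estimate of log Z for the toy model.

    Assumed toy setting: y_i ~ N(theta, 1), theta ~ N(0, 1), so the power
    posterior at temperature t is N(t*sum(y)/(1+t*n), 1/(1+t*n)).
    """
    rng = random.Random(seed)
    n, s = len(y), sum(y)
    # Temperature ladder concentrated near t = 0, where the integrand moves fastest.
    temps = [(i / n_temps) ** 5 for i in range(n_temps + 1)]
    means = []
    for t in temps:
        var = 1.0 / (1.0 + t * n)
        mu = t * s * var
        draws = [rng.gauss(mu, math.sqrt(var)) for _ in range(n_draws)]
        # Monte Carlo estimate of E_t[log L(theta)] under the power posterior.
        means.append(sum(log_lik(th, y) for th in draws) / n_draws)
    # Trapezoidal rule over the temperature ladder.
    return sum(
        0.5 * (temps[i + 1] - temps[i]) * (means[i + 1] + means[i])
        for i in range(n_temps)
    )


def exact_log_evidence(y):
    """Closed-form log marginal likelihood: y ~ N(0, I + 11^T)."""
    n, s = len(y), sum(y)
    ss = sum(yi * yi for yi in y)
    return (
        -0.5 * n * math.log(2 * math.pi)
        - 0.5 * math.log(1 + n)
        - 0.5 * (ss - s * s / (1 + n))
    )
```

On simulated data the two functions should agree up to Monte Carlo and quadrature error. In a realistic model the exact draws would have to be replaced with MCMC runs at each temperature.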

Another novel aspect was found in Yves Atchadé’s talk about sparse high-dimensional matrices with priors made of mutually exclusive measures and quasi-likelihood approximations. A simplified version of the talk consists in having a non-identified, unconstrained matrix that is later projected onto one of those measure supports. While I was aware of his noise-contrastive estimation of normalising constants, I had not previously heard Michael Gutmann give a talk on that approach (linking to Geyer’s 1994 mythical paper!). And I do remain nonplussed at the possibility of including the normalising constant as an additional parameter [in a computational and statistical sense]! Both Chris Sherlock and Christophe Andrieu talked about novel aspects of pseudo-marginal techniques, Chris on the lack of variance reduction brought by averaging unbiased estimators of the likelihood and Christophe on the case of large datasets, recovering better performance in latent variable models by estimating the ratio rather than taking a ratio of estimators. (With Christophe pointing out that this was an exceptional case when harmonic mean estimators could be considered!)
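
The idea of treating the normalising constant as one more parameter is perhaps easier to assess on a toy case. The sketch below is in the spirit of noise-contrastive estimation and is in no way the speakers' implementation: the unnormalised density exp(-x²/2), the N(0, 2²) noise distribution, the equal sample sizes, the plain gradient ascent, and all names are assumptions of mine. The only free parameter is c = -log Z, fitted by a logistic regression that discriminates data from noise.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nce_log_constant(data, noise, noise_sd, n_iter=200, step=2.0):
    """Estimate c = -log Z for the unnormalised density exp(-x^2 / 2).

    Assumes as many noise points as data points, so that the classifier
    log-odds are log p_model(x) - log p_noise(x) with no extra offset.
    """
    def log_unnormalised(x):
        return -0.5 * x * x

    def log_noise(x):
        return -0.5 * math.log(2 * math.pi * noise_sd ** 2) - 0.5 * (x / noise_sd) ** 2

    # Known part of the log-odds; the unknown constant c is added on top.
    data_offsets = [log_unnormalised(x) - log_noise(x) for x in data]
    noise_offsets = [log_unnormalised(x) - log_noise(x) for x in noise]
    total = len(data) + len(noise)

    c = 0.0
    for _ in range(n_iter):
        # Gradient in c of the logistic log-likelihood (data labelled 1, noise 0).
        grad = sum(1.0 - sigmoid(o + c) for o in data_offsets)
        grad -= sum(sigmoid(o + c) for o in noise_offsets)
        c += step * grad / total
    return c
```

With data simulated from N(0,1), the returned value should be close to -log √(2π) ≈ -0.919, obtained here by maximisation rather than integration.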

Approximate Bayesian computation techniques are the 2000’s successors of MCMC methods, handling new models where MCMC algorithms are at a loss, in the same way the latter were able in the 1990’s to cover models that regular Monte Carlo approaches could not reach. While they first sounded like “quick-and-dirty” solutions, only to be considered until more elaborate solutions could (not) be found, they have been progressively incorporated within the statistician’s toolbox as a novel form of non-parametric inference handling partly defined models. A statistically relevant feature of those ABC methods is that they require replacing the data with smaller-dimension summaries or statistics, because of the complexity of the former. In almost every case when calling ABC is the unique solution, those summaries are not sufficient and the method thus implies a loss of statistical information, at least at a formal level since relying on the raw data is out of the question. This forced reduction of statistical information raises many relevant questions, from the choice of summary statistics to the consistency of the ensuing inference.
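
As a reminder of what the basic algorithm looks like, here is a minimal ABC rejection sampler. The setting is an assumption chosen for brevity (normal observations with unknown mean, a N(0, 3²) prior, the sample mean as summary, an absolute-difference distance) and the names are mine. In this toy the sample mean happens to be sufficient, which is precisely what fails in the settings where ABC is the only option.

```python
import random


def abc_rejection(y_obs, n_sim=20000, eps=0.1, prior_sd=3.0, seed=0):
    """Return prior draws whose simulated summary lands within eps of the observed one."""
    rng = random.Random(seed)
    n = len(y_obs)
    s_obs = sum(y_obs) / n  # observed summary statistic (sample mean)
    accepted = []
    for _ in range(n_sim):
        theta = rng.gauss(0.0, prior_sd)                   # 1. draw from the prior
        y_sim = [rng.gauss(theta, 1.0) for _ in range(n)]  # 2. simulate pseudo-data
        s_sim = sum(y_sim) / n                             # 3. summarise
        if abs(s_sim - s_obs) <= eps:                      # 4. keep if close enough
            accepted.append(theta)
    return accepted
```

The accepted values form a sample from the ABC posterior given the summary, which gets closer to the posterior given the summary as `eps` decreases, at the cost of a lower acceptance rate.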

In this paper of the special MCMSki 4 issue of Statistics and Computing, Stoehr et al. attack the recurrent problem of selecting summary statistics for ABC in a hidden Markov random field, since there is no fixed-dimension sufficient statistic in that case. The paper provides a very broad overview of the issues and difficulties related to ABC model choice, which has been the focus of some advanced research only for a few years. Most interestingly, the authors define a novel, local, and somewhat Bayesian misclassification rate, an error that is conditional on the observed value and derived from the ABC reference table. It is the posterior predictive error rate

$$\mathbb{P}^{\text{ABC}}\left(\hat{m}(Y)\neq m \mid S(y^{\text{obs}})\right)$$

integrating in both the model index m and the corresponding random variable Y (and the hidden intermediary parameter) given the observation. Or rather given the transform of the observation by the summary statistic S. The authors even go further to define the error rate of a classification rule based on a first (collection of) statistic, conditional on a second (collection of) statistic (see Definition 1). A notion rather delicate to validate on a fully Bayesian basis. And they advocate the substitution of the unreliable (estimates of the) posterior probabilities by this local error rate, estimated by traditional non-parametric kernel methods. Methods that are calibrated by cross-validation. Given a reference summary statistic, this perspective leads (at least in theory) to select the optimal summary statistic as the one leading to the minimal local error rate. Besides its application to hidden Markov random fields, which is of interest per se, this paper thus opens a new vista on calibrating ABC methods and evaluating their true performances conditional on the actual data. (The advocated abandonment of the posterior probabilities could almost justify the denomination of a paradigm shift. This is also the approach advocated in our random forest paper.)
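
To give a feel for an error rate that is local to the observed summary, here is a crude sketch. It is not the estimator of Stoehr et al., who rely on kernel methods calibrated by cross-validation: I use plain k-nearest-neighbour averaging, a single statistic playing both the classification and the conditioning roles, and two reference tables (one to build the classifier, one to evaluate it). The two toy models (normal data with standard deviation 1 or 2, summarised by the sample variance) and all names are assumptions.

```python
import random
import statistics


def simulate_table(n_sim, n_obs, rng):
    """ABC reference table of (model index, summary) pairs under a uniform model prior."""
    table = []
    for _ in range(n_sim):
        m = rng.randrange(2)
        theta = rng.gauss(0.0, 1.0)
        sd = 1.0 if m == 0 else 2.0
        y = [rng.gauss(theta, sd) for _ in range(n_obs)]
        table.append((m, statistics.variance(y)))
    return table


def knn_classify(s, table, k):
    """Majority vote on the model index among the k table entries closest to s."""
    nearest = sorted(table, key=lambda row: abs(row[1] - s))[:k]
    votes = sum(m for m, _ in nearest)
    return 1 if 2 * votes > k else 0


def local_error_rate(s_obs, train, valid, k_class=25, k_local=100):
    """Misclassification frequency among the validation entries closest to s_obs."""
    nearest = sorted(valid, key=lambda row: abs(row[1] - s_obs))[:k_local]
    errors = sum(knn_classify(s, train, k_class) != m for m, s in nearest)
    return errors / k_local
```

In this toy the rate is close to zero for an observed sample variance around 0.5, where only the first model is plausible, and much larger around 1.85, where both models fit about equally well. This is the kind of data-dependent diagnostic a global error rate cannot provide.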

Julien Stoehr, Pierre Pudlo, and Lionel Cucala (I3M, Montpellier) arXived yesterday a paper entitled “Geometric summary statistics for ABC model choice between hidden Gibbs random fields”. Julien had presented this work at the MCMSki 4 poster session. The move to a hidden Markov random field means that our original approach with Aude Grelaud does not apply: there is no dimension-reduction sufficient statistic in that case… The authors introduce a small collection of (four!) focussed statistics to discriminate between Potts models. They further define a novel misclassification rate, conditional on the observed value and derived from the ABC reference table. It is the predictive error rate

$$\mathbb{P}^{\text{ABC}}\left(\hat{m}(Y)\neq m \mid S(y^{\text{obs}})\right)$$

integrating in both the model index m and the corresponding random variable Y (and the hidden intermediary parameter) given the observation. Or rather the transform of the observation by the summary statistic S. In a simulation experiment, the paper shows that the predictive error rate decreases quite a lot by including 2 or 4 geometric summary statistics on top of the no-longer-sufficient concordance statistics. (I did not find how the distance is constructed and how it adapts to a larger number of summary statistics.)

“[the ABC posterior probability of index m] uses the data twice: a first one to calibrate the set of summary statistics, and a second one to compute the ABC posterior.” (p.8)

It took me a while to understand the above quote. If we consider ABC model choice as we did in our original paper, it only and correctly uses the data once. However, if we select the vector of summary statistics based on an empirical performance indicator resulting from the data then indeed the procedure does use the data twice! Is there a generic way or trick to compensate for that, apart from cross-validation?

This may sound like a paradoxical title given my recent production in this area of ABC approximations, especially after the disputes with Alan Templeton, but I have come to the conclusion that ABC approximations to the Bayes factor are not to be trusted. When working one afternoon in Park City with Jean-Michel and Natesh Pillai (drinking tea in front of a fake log-fire!), we looked at the limiting behaviour of the Bayes factor constructed by an ABC algorithm, ie by approximating posterior probabilities for the models from the frequencies of acceptances of simulations from those models (assuming the use of a common summary statistic to define the distance to the observations). Rather obviously (a posteriori!), we ended up with the true Bayes factor based on the distributions of the summary statistics under both models!
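
The limiting behaviour is easy to check by simulation on a toy case. In the sketch below, everything is an assumption picked for tractability: two normal models differing only by their standard deviation (1 versus 2), a common N(0,1) prior on the mean, equal prior weights on the models, and the sample mean as the common summary statistic. The ABC Bayes factor is the ratio of acceptance frequencies, and it is compared with the Bayes factor computed from the prior predictive distributions of the summary alone.

```python
import math
import random


def abc_bayes_factor(s_obs, n_obs, n_sim=50000, eps=0.2, seed=0):
    """Ratio of ABC acceptance counts, model 0 over model 1 (equal prior weights)."""
    rng = random.Random(seed)
    accepted = [0, 0]
    for _ in range(n_sim):
        m = rng.randrange(2)             # model index from its uniform prior
        theta = rng.gauss(0.0, 1.0)      # common prior on the mean
        sd = 1.0 if m == 0 else 2.0
        s_sim = sum(rng.gauss(theta, sd) for _ in range(n_obs)) / n_obs
        if abs(s_sim - s_obs) <= eps:
            accepted[m] += 1
    return accepted[0] / accepted[1]


def summary_bayes_factor(s_obs, n_obs):
    """Bayes factor based on the summary only: the sample mean is
    N(0, 1 + 1/n) under model 0 and N(0, 1 + 4/n) under model 1."""
    def normal_pdf(x, var):
        return math.exp(-0.5 * x * x / var) / math.sqrt(2 * math.pi * var)

    return normal_pdf(s_obs, 1 + 1 / n_obs) / normal_pdf(s_obs, 1 + 4 / n_obs)
```

The two quantities agree up to Monte Carlo error, and both stay close to one in this toy because the sample mean carries almost no information about the variance, even though the full sample would discriminate between the two models quite easily.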

“One aim is to extend the approach of Sisson et al. (2007) to provide an algorithm that is robust to implement.”

C.C. Drovandi & A.N. Pettitt

A paper by Drovandi and Pettitt appeared in the Early View section of Biometrics. It uses a combination of particles and of MCMC moves to adapt to the true target, with an acceptance probability

$$\min\left\{1,\frac{\pi(\theta^*)\,q(\theta_c\mid\theta^*)}{\pi(\theta_c)\,q(\theta^*\mid\theta_c)}\right\}$$

where $\theta^*$ is the proposed value and $\theta_c$ is the current value (picked at random from the particle population), while $\pi$ is the prior and q is a proposal kernel used to simulate the proposed value. The algorithm is adaptive in that the previous population of particles is used to make the choice of the proposal q, as well as of the tolerance level $\epsilon$. Although the method is valid as a particle system applied in the ABC setting, I have difficulty gauging the level of novelty of the method (then applied to a model of Riley et al., 2003, J. Theoretical Biology). Learning from previous particle populations to build a better kernel q is indeed a constant feature in SMC methods, from Sisson et al.’s ABC-PRC (2007) to Beaumont et al.’s ABC-PMC (2009). (Note that Drovandi and Pettitt mistakenly believe the ABC-PRC method to include partial rejection control, as argued in this earlier post.) The paper also advances the idea of adapting the tolerance on-line as a quantile of the previous particle population, but this is the same idea as in Del Moral et al.’s ABC-SMC. The only strong methodological difference, as far as I can tell, is that the MCMC steps are repeated “numerous times” in the current paper, instead of once as in the earlier papers. This however partly cancels the appeal of an O(N) order method versus the O(N²) order PMC and SMC methods. An interesting remark made in the paper is that more advances are needed in cases when simulating the pseudo-observations is highly costly, as in Ising models. However, replacing exact simulation [as we did in the model choice paper] with a Gibbs sampler cannot be that detrimental.
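
To fix ideas on the two adaptive ingredients discussed above, namely a tolerance set as a quantile of the previous population's distances and repeated MCMC moves, here is a rough sketch. It is not Drovandi and Pettitt's algorithm: the toy normal-mean model, the N(0, 3²) prior, the symmetric random-walk proposal (so that q cancels in the ratio), the proposal scale set to twice the population standard deviation, and all names are assumptions of mine. A move is accepted only when the simulated distance falls below the current tolerance, which is the usual ABC-MCMC condition.

```python
import math
import random
import statistics


def abc_smc_mcmc(s_obs, n_obs, n_particles=500, n_rounds=6,
                 alpha=0.5, n_mcmc=3, seed=0):
    """Particle ABC with quantile-based tolerances and repeated MCMC moves."""
    rng = random.Random(seed)

    def simulate_distance(theta):
        # Distance between simulated and observed sample means.
        s_sim = sum(rng.gauss(theta, 1.0) for _ in range(n_obs)) / n_obs
        return abs(s_sim - s_obs)

    def log_prior(theta):
        return -0.5 * (theta / 3.0) ** 2  # N(0, 3^2), up to a constant

    # Initial population drawn from the prior, each with its simulated distance.
    particles = []
    for _ in range(n_particles):
        theta = rng.gauss(0.0, 3.0)
        particles.append((theta, simulate_distance(theta)))

    tolerances = []
    for _ in range(n_rounds):
        # Tolerance: alpha-quantile of the current distances.
        dists = sorted(d for _, d in particles)
        eps = dists[int(alpha * n_particles) - 1]
        tolerances.append(eps)
        alive = [p for p in particles if p[1] <= eps]
        # Proposal scale learned from the surviving particles.
        scale = 2.0 * statistics.pstdev(th for th, _ in alive)

        new_particles = []
        for _ in range(n_particles):
            theta_c, dist_c = rng.choice(alive)  # current value picked at random
            for _ in range(n_mcmc):              # repeated MCMC moves
                theta_star = rng.gauss(theta_c, scale)
                dist_star = simulate_distance(theta_star)
                log_ratio = log_prior(theta_star) - log_prior(theta_c)
                accept_prob = math.exp(min(0.0, log_ratio))
                if dist_star <= eps and rng.random() < accept_prob:
                    theta_c, dist_c = theta_star, dist_star
            new_particles.append((theta_c, dist_c))
        particles = new_particles
    return particles, tolerances
```

Each round costs `n_particles * n_mcmc` simulations of pseudo-data, hence remains linear in the number of particles, but the constant grows with the number of repeated moves.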

I was re-reading the recently arXived paper by Toni and Stumpf on ABC based model choice and, besides noticing that their Gibbs random field example (§3.2) is the same as ours (§3.1), down to the prior choice, this led me to wonder about the choice of the ABC distance in those settings. On the one hand, the statistical perspective is to compare the predictive performances of different models and hence use the same distance for all models. On the other hand, the ABC perspective implies using different summary statistics for different models, hence using different distances… In a “true” model there is no issue because we end up comparing (marginal) likelihoods but in an approximation like ABC, given that we replace the data with a summary statistic, then the distribution of a summary statistic with an indicator of proximity, we end up with paradoxes like this, where we compare pseudo-distributions of objects of different dimensions. (Toni and Stumpf made the choice in their paper of pooling all summary statistics together into an overall distance, while in ours we had the special Gibbs property of a sufficient statistic across models…)