“We argue that SAME is beneficial for Gibbs sampling because it helps to reduce excess variance.”

Still, I am a wee bit surprised both at the above statement and at the comparison with a JAGS implementation. First, because SAME augments the number of latent vectors as the number of iterations increases, it should be slower than a regular Gibbs sampler with a single latent vector, by a mere curse of dimension. Second, because I do not get the connection with JAGS either: SAME could be programmed in JAGS, couldn’t it? If the authors mean a regular Gibbs sampler with no latent-vector augmentation, the comparison makes little sense, as one algorithm aims at the MAP (with a modest five replicas) while the other encompasses the complete posterior distribution. But this sounds unlikely when considering that the larger the number m of replicas, the better their alternative fares against JAGS. It would thus be interesting to understand what the authors mean by JAGS in this setup!
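For readers unfamiliar with SAME (state augmentation for marginal estimation), here is a minimal sketch of the replication idea on a toy latent-variable model of my own choosing (not the authors’ setting): a Student-t location parameter written as a scale mixture of normals. With m replicas of the latent precisions, the Gibbs chain on θ targets a distribution proportional to the marginal posterior raised to the power m, hence concentrates near the MAP as m grows, while each sweep indeed costs m times as much as a regular Gibbs sweep:

```python
import numpy as np

def same_t_location(y, nu=3.0, m=5, n_iter=2000, seed=0):
    """SAME-type Gibbs sampler for the MAP of a Student-t location theta.

    Toy model: y_i | lambda_i ~ N(theta, 1/lambda_i), lambda_i ~ Gamma(nu/2, nu/2),
    flat prior on theta. SAME keeps m independent replicas of the latent
    precisions, so the full conditional of theta pools m times as much "data".
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    theta = np.median(y)
    trace = np.empty(n_iter)
    for t in range(n_iter):
        # m replicated sets of latent precisions, each drawn from its
        # Gamma((nu+1)/2, rate (nu + (y_i - theta)^2)/2) conditional
        lam = rng.gamma(shape=(nu + 1.0) / 2.0,
                        scale=2.0 / (nu + (y - theta) ** 2),
                        size=(m, y.size))
        # theta | everything: Gaussian, with precision summed over all replicas
        prec = lam.sum()
        mean = (lam * y).sum() / prec
        theta = rng.normal(mean, 1.0 / np.sqrt(prec))
        trace[t] = theta
    return trace

trace = same_t_location([-1.0, 0.0, 1.0, 8.0], m=5)
theta_hat = trace[1000:].mean()   # concentrates near the t-likelihood mode
```

With the outlier at 8 downweighted by the t likelihood, the chain settles near the marginal mode rather than the sample mean; increasing m sharpens this concentration, at a cost linear in m per iteration.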

Taking advantage of his visit to Paris this month, Shravan Vasishth, from the University of Potsdam, Germany, will give a talk at 10:30am next Friday, October 24, at ENSAE on:

Using Bayesian Linear Mixed Models in Psycholinguistics: Some open issues

With the arrival of the probabilistic programming language Stan (and JAGS), it has become relatively easy to fit fairly complex Bayesian linear mixed models. Until now, the main tool that was available in R was lme4. I will talk about how we have fit these models in recently published work (Husain et al 2014, Hofmeister and Vasishth 2014). We are trying to develop a standard approach for fitting these models so that graduate students with minimal training in statistics can fit such models using Stan.

I will discuss some open issues that arose in the course of fitting linear mixed models. In particular, one issue is: should one assume a full variance-covariance matrix for random effects even when there is not enough data to estimate all parameters? In lme4, one often gets convergence failure or degenerate variance-covariance matrices in such cases and so one has to back off to a simpler model. But in Stan it is possible to assume vague priors on each parameter, and fit a full variance-covariance matrix for random effects. The advantage of doing this is that we faithfully express in the model how the data were generated—if there is not enough data to estimate the parameters, the posterior distribution will be dominated by the prior, and if there is enough data, we should get reasonable estimates for each parameter. Currently we fit full variance-covariance matrices, but we have been criticized for doing this. The criticism is that one should not try to fit such models when there is not enough data to estimate parameters. This position is very reasonable when using lme4; but in the Bayesian setting it does not seem to matter.

I am currently preparing a survey paper on the present state of computational statistics, reflecting on the massive evolution of the field since my early Monte Carlo simulations on an Apple //e, which would take a few days to return a curve of approximate expected squared error losses… It seems to me that MCMC is attracting more attention nowadays than in the past decade, both because of methodological advances linked with better theoretical tools, as for instance in the handling of stochastic processes, and because of new forays in accelerated computing via parallel and cloud computing. The breadth and quality of talks at MCMski IV is testimony to this. A second trend, not unrelated to the first, is the development of new techniques and the rehabilitation of older ones to handle complex models by approximations, witness ABC, Expectation-Propagation, variational Bayes, &tc., with a corollary being a healthy questioning of the models themselves, as illustrated for instance in Chris Holmes’ talk last week. While those simplifications are inevitable when faced with hardly imaginable levels of complexity, I remain wary of turning statistics into an “optimize+penalize” tunnel vision… A third characteristic is the emergence of new languages and meta-languages intended to handle complexity both of problems and of solutions, towards a wider audience of users. STAN obviously comes to mind. And JAGS. But it may be that another scale of language is now required…

If you have any suggestions of novel directions in computational statistics, or conversely of dead ends, I would be most interested in hearing them! So please do comment or send emails to my gmail address bayesianstatistics…

At MCMSki IV, I attended (and chaired) a session where Martyn Plummer presented some developments on cut models. As I was not sure I had gotten the idea [although this happened to be one of those few sessions where the flu had not yet completely taken over!] and as I wanted to check a potential explanation for the lack of convergence discussed by Martyn during his talk, I decided to (re)present the talk at our “MCMSki decompression” seminar at CREST. Martyn sent me his slides and also kindly pointed me to the relevant section of the BUGS book, reproduced above. (Disclaimer: do not get me wrong here, the title is a pun on the infamous “drill, baby, drill!” and not connected in any way to Martyn’s talk or work!)

I cannot say I get the idea any clearer from this short explanation in the BUGS book, although it gives a literal meaning to the word “cut”. From this description I only understand that a cut is the removal of an edge in a probabilistic graph; however, there must/may be some arbitrariness in deciding which conditional distribution gets cut. In the Poisson-binomial case treated in Martyn’s example, I interpret the cut as simulating from

π(φ|z) π(θ|φ,y)

instead of the full posterior

π(φ,θ|y,z) ∝ π(φ|z) m(y|φ) π(θ|φ,y),

hence losing some of the information about φ contained in y… Now, this cut version is a function of φ and θ that can be fed to a Metropolis-Hastings algorithm, assuming we can handle the posterior on φ and the conditional on θ given φ. If we build a Gibbs sampler instead, we face a difficulty with the normalising constant m(y|φ): said Gibbs sampler does not generate from the “cut” target. Maybe an alternative could borrow from the rather large if disparate missing-constant toolbox. (In any case, we do not simulate from the original joint distribution.) The natural solution would then be to make an independent proposal on φ with target the posterior given z, followed by any scheme that preserves the conditional of θ given φ and y; “any” is rather wishful thinking at this stage, since the only practical solution I see is to run a Metropolis-Hastings sampler long enough to “reach” stationarity… I also remain with a lingering although not life-threatening question of whether or not the BUGS code using cut distributions provides the “right” answer. Here are my five slides used during the seminar (with a random walk implementation that did not diverge from the true target…):
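As a concrete (if entirely hypothetical) illustration, here is a toy cut construction in Python; the model, data values, and all names are my own, not Martyn’s. It uses a conjugate binomial module for φ and a conditionally conjugate Poisson module for θ given φ. Because the φ-marginal of the cut target depends on z alone, one can here simulate exactly from the cut distribution by drawing φ from its z-only posterior and then θ from its conditional given φ and y, which is precisely the feedback removal the cut performs:

```python
import numpy as np

def cut_sampler(z, n, y, T=100.0, a=1.0, b=1.0, alpha=1.0, beta=1.0,
                n_draws=5000, seed=0):
    """Exact draws from the cut distribution pi(phi|z) pi(theta|phi, y).

    Toy model: z ~ Binomial(n, phi) informs phi, with phi ~ Beta(a, b);
    y ~ Poisson(theta * phi * T) informs theta, with theta ~ Gamma(alpha, beta).
    The cut removes the feedback from y to phi, so phi is drawn from its
    Beta posterior given z alone, then theta conditionally on (phi, y).
    """
    rng = np.random.default_rng(seed)
    phi = rng.beta(a + z, b + n - z, size=n_draws)        # no feedback from y
    # theta | phi, y ~ Gamma(alpha + y, rate beta + phi * T); numpy uses scale
    theta = rng.gamma(alpha + y, 1.0 / (beta + phi * T))
    return phi, theta

phi, theta = cut_sampler(z=13, n=100, y=25)
```

A Gibbs sampler alternating θ | φ, y and φ | θ, y, z would instead reintroduce the m(y|φ) term in φ’s conditional, which is exactly why it fails to target the cut distribution.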

Already on the final day..! And still this frustration at being unable to attend three sessions at once… Andrew Gelman started the day with a non-computational talk that touched on themes familiar to readers of his blog, on the misuse of significance tests and on recommendations for better practice. I then picked the Scaling and optimisation of MCMC algorithms session organised by Gareth Roberts, with optimal scaling talks by Tony Lelièvre, Alex Théry and Chris Sherlock, while Jochen Voss spoke about the convergence rate of ABC, a paper I already discussed on the blog. A fairly exciting session, showing that MCMC’ory (the name of a workshop I ran in Paris in the late 90’s!) is still alive and well!

After the break (sadly without the ski race!), the software round-table session was something I was looking forward to. The four software packages covered by this round-table were BUGS, JAGS, STAN, and BiiPS, each presented according to the same pattern. I would have liked to see a “battle of the bands”, illustrating the pros & cons of each language on a couple of models & datasets. STAN got the unofficial prize for cool tee-shirts (we should have asked the STAN team for poster-prize tee-shirts). And I had to skip the final session for a flu-related doctor appointment…

I called for a BayesComp meeting at 7:30, hoping for current and future members to show up and discuss the format of the future MCMski meetings, maybe even proposing new locations on other “sides of the Italian Alps”! But (workshop fatigue syndrome?!), no-one showed up. So anyone interested in discussing this issue is welcome to contact me or David van Dyk, the new BayesComp program chair.

Last week in Aachen was the 3rd edition of the Bayes(Pharma) workshop. Its specificity: half-and-half industry/academic participants and speakers, all in pharmaceutical statistics, with great care taken to welcome newcomers to Bayes, so as to spread the love as much as possible where it will actually be used. First things first: all the slides are available online, thanks to the speakers for sharing them. Full disclaimer: being part of the scientific committee of the workshop, I had a strong subjective prior.

3 days, 70 participants, we were fully booked, and we even regretfully had to refuse registrations due to lack of room space (!! German regulations are quite… enforced). Time to size it up for next year, maybe?

My most vivid impression overall: I was struck by the interactivity of the questions and answers after each talk. Rarely fewer than 5 questions per talk (come on, we’ve all attended sessions where the chairman is forced to ask the lone question — no such thing here!), on all points of each talk, with cross-references from one question to another, even from one *talk* to another! Seeing so much interaction and discussion in spite of (or, probably, thanks to?) the diversity of the audience was a real treat: not only did the questions bring up additional details about the talks, they were, more importantly, shedding precious light on the questioners’ mindsets, their practical concerns and needs. Both academics and industry statisticians were learning on all counts — and, having sometimes seen failed marriages of the kind in the past (either a French round-table degenerating into a nasty polemic on “research-induced tax credit”, or just a plain mismatch of interests), I was quite impressed that we were purely and simply all interested in multiple facets of the very same thing: the interface between pharma and stats.

As is now a tradition, the first day was a short course, this time by Prof. Emmanuel Lesaffre: based on his upcoming book on Bayesian biostatistics (Xian, maybe a review someday?), it was meant to be introductory for newcomers to Bayes, but was still packed with enough “tricks of the trade” that even seasoned Bayesians could get something out of it. I very much appreciated the pedagogy in the “live” examples, with clear convergence caveats based on traceplots from common software (WinBUGS). The most vivid memory: his strong spotlight on INLA as “the future of Bayesian computation”. Although my research is mostly on MCMC/SMC, I’m now damn curious to give it a serious try — this was further reinforced by late-evening discussions with Gianluca Baio, who revealed that all his results were obtained in seconds of INLA computing.

Day 2 and half-day 3 consisted of invited and contributed talks, all motivated by top-level applications. No convergence theorems here, but practical issues, with constraints that theoreticians (including myself!) would hardly guess exist: very small sample sizes, regulatory issues, competition with legacy methodology running in mere seconds (impossible to run 1 million MCMC steps!), and sometimes even imposed software due to validation processes! Again, as stated above, the number and quality of questions are really what I will keep from those 2 days.

If I had to state one regret, maybe, it would be this unsatisfactory feeling that, for many newcomers, MCMC = WinBUGS — with its obvious restrictions. The lesson I learned: all the great methodological advances of the last 10 years, especially in adaptive MCMC, have not yet reached most practitioners, since they need *tools* they can use. It may be a sign that, as methodological researchers, we should put a stronger emphasis on bringing software packages forward (for R, of course, but also for JAGS or OpenBUGS!); not only a zip-file with our article’s code, but a full-fledged package, with ongoing support, maintenance, and a forum. That’s a tough balance to find, since the time spent maintaining a package does not count in the holy bibliometry… but doesn’t it have more actual impact? Besides, more packages = fewer papers but also = more citations of the corresponding paper. Some do take this road (Robert Gramacy’s packages were cited last week as examples of great support, Andrew Gelman and Matt Hoffman are working on the much-expected STAN, and I mentioned above Håvard Rue’s R-INLA), but I don’t think it is yet considered “best practice”.

As a conclusion, this Bayes-Pharma 2012 workshop reminded me a lot of the SAMSI 2010 Summer Program: while Bayes-Pharma aims to be much more introductory, they had in common the same success in blending the pharma industry and academia. Could it be a specificity of pharma? In which case, I’m looking forward very much to opening ISBA’s specialized section on biostatistics/pharmaceutical statistics that a few colleagues and I are currently working on (more on this here soon). With such a crowd on both sides of the Atlantic, and a looming Bayes 2013 in the Netherlands, that will be exciting.

Peng Yu sent me an email about the conditions for convergence of a Gibbs sampler:

The following statement mentions convergence. But I’m not familiar what the regularity condition is.

“But it is necessary to have a finite probability of moving away from the current state at all times in order to satisfy the regularity conditions on which the whole MCMC theory depends.”

Slice sampler is discussed in your book Monte Carlo Statistical Methods. I think that the “regularity condition” may have been discussed in your book. If so, would you please let me know where it is? Thanks and look forward to hearing from you!

The quote is from Martyn Plummer and deals with a stopping rule in the JAGS implementation of the slice sampler. (The correct wording should be “strictly positive probability” rather than “finite probability”, I think.) However, this has nothing to do with a “regularity condition” on the irreducibility of a Markov chain: if a slice sampler is implemented for an unbounded density target, say a Beta(1/2,1/2), there is no irreducibility condition connected with the unboundedness of the density. In theory, (a) the chain never visits the “state” where the density is infinite (if only because we are dealing with a continuous state space) and (b) after visiting a value x with a large density f(x), the slice sampler allows for a move away from it since the slice involves a uniform simulation over (0,f(x)). Deeper properties of the slice sampler (like geometric ergodicity) are explored in, e.g., this JRSS B paper by Gareth Roberts and Jeff Rosenthal and this one in the Annals of Statistics by Radford Neal. In practice, the problem is caused by values of f(x) that cannot be computed and hence produce an error message like

Singularity in likelihood found by Slicer.

If those singularities can be localised, a neighbourhood excluding them should be introduced. (More easily said than done, obviously!)

Here is an example of a slice sampler with the Beta(1/2,1/2) distribution:
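(The original code is not reproduced here, so the following is my own minimal Python sketch, not the blog’s original implementation.) For the target f(x) ∝ x^(-1/2)(1-x)^(-1/2), the slice {x : f(x) ≥ u} has the explicit form {x : x(1-x) ≤ 1/u²}, so despite f being unbounded at 0 and 1 every slice has positive length and the chain always moves away with strictly positive probability:

```python
import numpy as np

def beta_half_slice(n_iter=20000, x0=0.5, seed=0):
    """Slice sampler for the Beta(1/2,1/2) target f(x) = x^(-1/2) (1-x)^(-1/2).

    Given x, draw a height u ~ U(0, f(x)); the slice {x' : f(x') >= u} equals
    {x' : x'(1-x') <= 1/u^2}, i.e. either all of (0,1) or the union of two
    equal-length intervals (0, lo] and [1-lo, 1) hugging the endpoints.
    """
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(n_iter)
    for t in range(n_iter):
        u = rng.uniform(0.0, 1.0 / np.sqrt(x * (1.0 - x)))
        c = 1.0 / u ** 2                     # slice: x'(1 - x') <= c
        if c >= 0.25:
            x = rng.uniform(0.0, 1.0)        # slice is the whole unit interval
        else:
            # roots of x(1-x) = c give the two endpoint intervals
            lo = (1.0 - np.sqrt(1.0 - 4.0 * c)) / 2.0
            z = rng.uniform(0.0, 2.0 * lo)   # uniform over the union
            x = z if z < lo else (1.0 - lo) + (z - lo)
        out[t] = x
    return out

draws = beta_half_slice()
```

The sampled mean and variance should match the Beta(1/2,1/2) values of 1/2 and 1/8, and no run ever gets stuck near the endpoints, consistent with point (b) above.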