“From a Bayesian perspective, perhaps the most natural approach is to treat the number of components like any other unknown parameter and put a prior on it.”

Another mixture paper on arXiv! Indeed, Jeffrey Miller and Matthew Harrison recently arXived a paper on estimating the number of components in a mixture model, comparing the parametric with the non-parametric Dirichlet prior approaches. Since priors can be chosen towards agreement between those two approaches, this is an obviously interesting issue, as they are often opposed in modelling debates. The above graph shows a crystal clear agreement between finite component mixture modelling and Dirichlet process modelling. The same happens for classification. However, Dirichlet process priors do not return an estimate of the number of components, which may be considered a drawback if one considers this is an identifiable quantity in a mixture model… But the paper stresses that the number of estimated clusters under the Dirichlet process modelling tends to be larger than the number of components in the finite case. Hence the Dirichlet process mixture modelling is not consistent in that respect, producing parasite extra clusters…
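As a side note on why extra clusters are to be expected a priori, here is a standard property of the Dirichlet process prior, written with a concentration parameter α and a number of observations n (my notation, not necessarily the paper's). This is a statement about the prior, not the consistency argument of the paper. The prior expected number of clusters K_n among n observations is

$$\mathbb{E}[K_n \mid \alpha] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\left(1 + \frac{n}{\alpha}\right),$$

which grows without bound, if slowly, with n, whereas a finite mixture with k components cannot produce more than k clusters.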

In the parametric modelling, the authors assume the same scale is used in all Dirichlet priors, that is, for all values of k, the number of components. This means an incoherence when marginalising from k to (k-p) components. A mild incoherence, in fact, as the parameters of the different models do not have to share the same priors. And, as shown by Proposition 3.3 in the paper, this does not prevent coherence in the marginal distribution of the latent variables. The authors also draw a comparison between the distribution of the partition in the finite mixture case and the Chinese restaurant process associated with the partition in the infinite case. A further analogy is that the finite case allows for a stick breaking representation. A noteworthy difference between both modellings is about the size of the partitions in the finite (homogeneous partitions) and infinite (extreme partitions) cases.
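To get a feeling for this difference in partition sizes, here is a minimal simulation sketch, not taken from the paper. It draws block sizes from a Chinese restaurant process with concentration `alpha`, and from a finite mixture with a fixed number `k` of components and symmetric Dirichlet(`gamma`) weights. The function names, the parameter values and the choice of fixing k (instead of putting a prior on it, as the authors do) are all assumptions made for illustration.

```python
import random
from collections import Counter


def crp_partition(n, alpha, rng):
    """Block sizes of one Chinese restaurant process partition of n items."""
    sizes = []
    for i in range(n):
        # Item i opens a new block with probability alpha / (alpha + i),
        # otherwise it joins an existing block with probability
        # proportional to the current size of that block.
        u = rng.random() * (alpha + i)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        for j, s in enumerate(sizes):
            if u < s:
                sizes[j] += 1
                break
            u -= s
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sorted(sizes, reverse=True)


def finite_mixture_partition(n, k, gamma, rng):
    """Block sizes for n items allocated to k components (k fixed here)."""
    # Unnormalised symmetric Dirichlet(gamma) weights via Gamma variates;
    # rng.choices accepts unnormalised weights.
    weights = [rng.gammavariate(gamma, 1.0) for _ in range(k)]
    labels = rng.choices(range(k), weights=weights, k=n)
    # Only non-empty components count as blocks of the partition.
    return sorted(Counter(labels).values(), reverse=True)


if __name__ == "__main__":
    rng = random.Random(2013)  # arbitrary seed, for reproducibility only
    print("CRP block sizes:   ", crp_partition(500, 1.0, rng))
    print("finite block sizes:", finite_mixture_partition(500, 5, 1.0, rng))
```

Running it a few times with different seeds and comparing the two lists of block sizes gives an informal picture of how the two priors spread observations over clusters. It is only a prior simulation and says nothing about posterior behaviour.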

An interesting entry into the connections between “regular” mixture modelling and Dirichlet mixture models. Maybe not ultimately surprising given the past studies by Peter Green and Sylvia Richardson of both approaches (1997 in Series B and 2001 in JASA).

Today, I attended a “miniworkshop” on Bayesian nonparametrics in Paris (Université René Descartes, now located in an intensely renovated area near the Grands Moulins de Paris), in connection with one of the ANR research grants that support my research, BANHDITS in the present case. Reflecting incidentally that it was the third Monday in a row that I was at a meeting listening to talks (after Hong Kong and Newcastle)… The talks were as follows

While most talks were focussing on contraction and consistency rates, hence far from my current interests, both talks by Judith and Elisabeth held more appeal to me. Judith gave conditions for an empirical Bayes nonparametric modelling to be consistent, with examples taken from Peter Green’s mixtures of Dirichlet, and Elisabeth concluded with a very generic result on the consistent estimation of a finite hidden Markov model. (Incidentally, the same BANHDITS grant will also support the satellite meeting on Bayesian non-parametrics at MCMSki IV on Jan. 09.)

More than a year ago Michael Sørensen (2013 EMS Chair) and Fabrizio Ruggeri (then ISBA President) kindly invited me to deliver the memorial lecture on Thomas Bayes at the 2013 European Meeting of Statisticians, which takes place in Budapest today and the following week. I gladly accepted, although with some worries at having to cover a much wider range of the field than my own research topic. And then set to work on the slides in the past week, borrowing from my most “historical” lectures on Jeffreys and Keynes, my reply to Spanos, as well as getting a little help from my nonparametric friends (yes, I do have nonparametric friends!). Here is the result, providing a partial (meaning both incomplete and biased) vision of the field.

Since my talk is on Thursday, and because the talk is sponsored by ISBA, hence representing its members, please feel free to comment and suggest changes or additions as I can still incorporate them into the slides… (Warning, I purposefully kept some slides out to preserve the most surprising entry for the talk on Thursday!)

Although it is known that Bayesian estimators may be inconsistent if the model is misspecified, it is also a popular belief that a “good” or “close” enough model should have good convergence properties. This paper shows that, contrary to popular belief, there is no such thing as a “close enough” model in Bayesian inference in the following sense: we derive optimal lower and upper bounds on posterior values obtained from models that exactly capture an arbitrarily large number of finite-dimensional marginals of the data-generating distribution and/or that are arbitrarily close to the data-generating distribution in the Prokhorov or total variation metrics; these bounds show that such models may still make the largest possible prediction error after conditioning on an arbitrarily large number of sample data. Therefore, under model misspecification, and without stronger assumptions than (arbitrary) closeness in Prokhorov or total variation metrics, Bayesian inference offers no better guarantee of accuracy than arbitrarily picking a value between the essential infimum and supremum of the quantity of interest. In particular, an unscrupulous practitioner could slightly perturb a given prior and model to achieve any desired posterior conclusions.

The paper is both too long and too theoretical for me to get into it deep enough. The main point however is that, given the space of all possible measures, the set of (parametric) Bayes inferences constitutes a tiny finite-dimensional subset that may lie far far away from the true model. I do not find the result unreasonable, far from it!, but the fact that Bayesian (and other) inferences may be inconsistent for most misspecified models is not such a major issue in my opinion. (Witness my post on the Robins-Wasserman paradox.) I am not so much convinced either about this “popular belief that a “good” or “close” enough model should have good convergence properties”, as it is intuitively reasonable that the immensity of the space of all models can induce non-convergent behaviours. The statistical question is rather what can be done about it. Does it matter that the model is misspecified? If it does, is there any meaning in estimating parameters without a model? For a finite sample size, should we at all bother that the model is not “right” or “close enough” if discrepancies cannot be detected at this precision level? I think the answer to all those questions is negative and that we should proceed with our imperfect models and imperfect inference as long as our imperfect simulation tools do not exhibit strong divergences.

“We congratulate the authors for this very pleasant overview of the type of problems that are currently tackled by Bayesian nonparametric inference and for demonstrating how prolific this field has become. We do share the authors’ viewpoint that many Bayesian nonparametric models allow for more flexible modelling than parametric models and thus capture finer details of the data. BNP can be a good alternative to complex parametric models in the sense that the computations are not necessarily more difficult in Bayesian nonparametric models. However we would like to mitigate the enthusiasm of the authors since, although we believe that Bayesian nonparametrics has proved extremely useful and interesting, we think they oversell the “nonparametric side of the Force”! Our main point is that by definition, Bayesian nonparametrics is based on prior probabilities that live on infinite dimensional spaces and thus are never completely swamped by the data. It is therefore crucial to understand which (or why!) aspects of the model are strongly influenced by the prior and how.

As an illustration, when looking at Example 1 with the censored zeroth cell, our reaction is that this is a problem with no proper solution, because it is lacking too much information. In other words, unless some parametric structure of the model is known, in which case the zeroth cell is related with the other cells, we see no way to infer about the size of this cell. The outcome produced by the authors is therefore unconvincing to us in that it seems to only reflect upon the prior modelling (α,G*) and not upon the information contained in the data. Now, this prior modelling may be to some extent justified based on side information about the medical phenomenon under study, however its impact on the resulting inference is palpable.

Recently (and even less recently) a few theoretical results have pointed out this very issue. E.g., Diaconis and Freedman (1986) showed that some priors could surprisingly lead to inconsistent posteriors, even though it was later shown that many priors lead to consistent posteriors and often even to optimal asymptotic frequentist estimators, see for instance van der Vaart and van Zanten (2009) and Kruijer et al. (2010). The worry about Bayesian nonparametrics truly appeared when considering (1) asymptotic frequentist properties of semi-parametric procedures; and (2) interpretation of inferential aspects of Bayesian nonparametric procedures. It was shown in various instances that some nonparametric priors which behaved very nicely for the estimation of the whole parameter could have disturbingly suboptimal behaviour for some specific functionals of interest, see for instance Arbel et al. (2013) and Rivoirard and Rousseau (2012). We do not claim here that asymptotics is the answer to everything; however, bad asymptotic behaviour shows that something wrong is going on and this helps in understanding the impact of the prior. These disturbing bad results are an illustration that in these infinite dimensional models the impact of the prior modelling is difficult to evaluate and that although the prior looks very flexible it can in fact be highly informative and/or restrictive for some aspects of the parameter. It would thus be wrong to conclude that every aspect of the parameter is well-recovered because some are. This has been a well-known fact for Bayesian parametric models, leading to extensive research on reference and other types of objective priors. It is even more crucial in the nonparametric world. No (nonparametric) prior can be suited for every inferential aspect and it is important to understand which aspects of the parameter are well-recovered and which ones are not.

We also concur with the authors that Dirichlet mixture priors provide natural clustering mechanisms, but one may question the “natural” label as the resulting clustering is quite unstructured, growing in the number of clusters as the number of observations increases and not incorporating any prior constraint on the “definition” of a cluster, except the one implicit and well-hidden behind the non-parametric prior. In short, it is delicate to assess what is eventually estimated by these clustering methods.

These remarks are not to be taken as criticisms of the overall Bayesian nonparametric approach, just the contrary. We simply emphasize (or recall) that there is no such thing as a free lunch and that we need to post the price to pay for potential customers. In these models, this is far from easy and just as far from being completed.”

The third day of this rich Padova workshop was actually a half-day which, thanks to a talk cancellation, I managed to attend completely before flying back to Paris. The first talk by Matteo Bottai was about the appeal of using quantile regression, as opposed to regular (or mean) regression. The talk was highly pedagogical and enthusiastic, hence enjoyable!, but I did not really buy the argument: if one starts modelling more than the conditional mean, the whole conditional distribution should be the target of the inference, rather than an arbitrary collection of quantiles, esp. if those are estimated marginally and not jointly. There could be realistic exceptions, for instance legit 95% bounds/quantiles in medical trials, but they are certainly most rare (as exceptions should be!). This talk however led me to ponder about a possible connection with the g-and-k quantile distributions (whose dedicated monograph I did not really appreciate!) even though I had no satisfactory answer by the end of the talk. The second talk by Eva Cantoni dealt with a fishery problem (an ecological model close to my interests) that had nice hierarchical features and [of course] a possible Bayesian analysis of the random effects. This was not the path followed though and the likelihood analysis had to rely on bootstrap and other approximations. The motivation was provided by the very recent move of the hammerhead shark (among several species of shark) to the endangered species list and the data came from catches reported by commercial fishing vessels. I have always wondered about the reliability of such data, unless there is a researcher on-board the vessel. Indeed, while the commercial catches are presumably checked upon arrival to comply with the quotas (at least in European waters), unintentional catches are presumably thrown away on the spot (maybe not since this is high quality flesh) and not at a time when careful statistics can be saved…

Actually, the whole fishing concept eludes me, even though I can see the commercial side of it: this is the only large-scale remainder of the early hunter-gatherer society and there is no ethical reason it should persist (well, other than feeding coastal populations that rely solely on fish catches, and even then…). The last two centuries have provided many instances of species extinction resulting from unlimited commercial fishing, but fishing is still going on… End of the parenthesis.

The last talk was by Aad van der Vaart, on non-parametric credible sets, i.e. credible sets on curves. Most of the talk was dedicated to the explanation of why there was an issue with those credible sets, that is, why they could be incredibly slow in catching the true curve and in shedding away the impact of the prior. This was most interesting, obviously, if ultimately not that surprising: the prior brings an amount of information that is infinitely larger than the one carried by a finite sample. The last part of the talk showed that the resolution of the difficulty was in selecting priors that avoid over-smoothing (although this depends on an unknown smoothness quantity as well). I liked very much this soft entry to the problem as it showed that all is not that rosy with the Bayesian non-parametric approach, whose foci on asymptotics or computation generally occult this finite sample issue.

Overall, I enjoyed very very much those three days in Padova, from the pleasant feeling of the old city and of the local food (best risottos in the past six months!, and a very decent Valpolicella as well) to the great company of old and new friends—making plans for a model choice brainstorming week in Paris in June—and to the new entries on Bayesian modelling and in particular Bayesian model choice I gathered from the talks. I am thus grateful to my friends Laura Ventura and Walter Racugno for their enormous investment in organising this workshop and in making it such a profitable and rich time. Grazie mille!

More news about MCMSki IV! Remember, the call is still open for contributed sessions for a few more weeks, till March 20 to be precise (make sure to contact me at bayesianstatistics@gmail.com if you are considering putting one session together). To all those who already submitted a session, thanks a lot, please stay tuned, and we will contact you very soon after March 20!

One exciting item is that there will be a satellite workshop on January 9, on Bayesian non-parametrics and semi-parametrics, organised by Judith Rousseau. BNPski, anyone?! Details are not yet available, but anyone registered for MCMSki IV and interested should be free to attend this workshop, free of charges. (It will take place at the conference centre as well.)

Another item is that we managed to get a cheaper offer for the ski race, reaching an entry price of 10 euros. Or less if we manage to find this sponsor… Not that bad when considering the high probability competitors have to win a pair of skis!!!

Last item for today: the list and rate of hotels available thru the conference centre is as follows

However, many other options are available in the vicinity, from hotels to B&B, to rental apartments and chalets, with a wide range of prices if you book early (like now!). See the links on our webpage.