Are you paying attention? Good. If you are not listening carefully, you will miss things. Important things.

With colleagues Stefano Favaro and Bernardo Nipoti from Turin and Yee Whye Teh from Oxford, we have just arXived an article on discovery probabilities. If you are looking for some info on a space shuttle, a cycling team or a TV channel, it’s the wrong place. Instead, discovery probabilities are central to ecology, biology and genomics where data can be seen as a population of individuals belonging to an (ideally) infinite number of species. Given a sample of size , the -discovery probability is the probability that the next individual observed matches a species with frequency in the -sample. For instance, the probability of observing a new species is key for devising sampling experiments.

By the way, why Alan Turing? Because with his fellow researcher at Bletchley Park Irving John Good, starred in The Imitation Game too, Turing is also known for the so-called Good-Turing estimator of the discovery probability

which involves , the number of species with frequency in the sample (ie frequencies frequency, if you follow me). As it happens, this estimator defined in Good 1953 Biometrika paper became wildly popular among ecology-biology-genomics communities since then, at least in the small circles where wild popularity and probability aren’t mutually exclusive.

Simple explicit estimators of discovery probabilities in the Bayesian nonparametric (BNP) framework of Gibbs-type priors were given by Lijoi, Mena and Prünster in a 2007 Biometrika paper. The main difference between the two estimators of is that Good-Turing involves and only, while the BNP involves , (instead of ), and , the total number of observed species. It has been shown in the literature that the BNP estimators are more reliable than Good-Turing estimators.

How do we contribute? (i) we describe the posterior distribution of the discovery probabilities in the BNP model, which is pretty useful for deriving exact credible intervals of the estimates, and (ii) we investigate large asymptotic behavior of the estimators.

We are not aware of any non-asymptotic method for deriving credible interval for the BNP estimators. We fill this gap by describing the posterior distribution of . More specifically, we derive all posterior moments of . Since this random variable has a compact support, , it is characterized by its moments. So one can use a moment-based technique for sampling draws, see e.g. our momentify R package written for another article. We also show that the posterior distribution is explicit in two special cases of Gibbs-type priors known as the two parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior. The posterior distribution is in fact shamelessly simple (once you know it) since it essentially amounts to [[a random] fraction of] a Beta distribution [with random coefficients].

As for large asymptotic behavior of the estimators, we prove the following asymptotic equivalences, denoted by ,

and for ,

where is a parameter of the Gibbs-type prior. These can serve as approximations. In the cases of the two parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior, we provide also a second order term to the asymptotic expansion of the estimators

and for ,

where the second order is either a constant, or a quantity which converges almost surely to a random variable. In both cases, we show that it involves the second (and last) parameter of the priors, whereas the asymptotic equivalence given before involves only . Whether similar asymptotic expansions also hold in the whole Gibbs-type class remains an open problem!

]]>https://statisfaction.wordpress.com/2015/06/18/turing-revisited-in-turin-and-oxford/feed/0julyanarbelAre you paying attention? Good. If you are not listening carefully, you will miss things.Who is Julia ?https://statisfaction.wordpress.com/2015/06/04/who-is-julia/
https://statisfaction.wordpress.com/2015/06/04/who-is-julia/#commentsThu, 04 Jun 2015 13:35:51 +0000http://statisfaction.wordpress.com/?p=2970]]>

Hi there !

Unfortunately this post is indeed about statistics…

If you are randomly walking around the statistics blogs, you probably have certainly heard of this new language called Julia. It is said by the developers to be as easy to write as R and as fast as C (!) which is quite a catchy way of selling their work. After talking with a Julia enthusiastic user in Amsterdam, I decided to give it a try. And here I am sharing my first impressions.

Fist thing first, the installation is as easy as any other language, plus there is a neat Package management that allows you to get started quite easily. In this respect it is very similar to R.
On the minus side I became a big fan of RStudio Julian (… oupsy Julyan) told you about a long time ago. These kind of programs really make your life easier. I thus tried Juno which turned out to be cumbersome and terribly slow. I would have loved to have an IDE for Julia that would be up to the RStudio standard. Nevermind.

No lets talk a little about what is really interesting : “Is their catch phrase false advertising or not?!”.

There is a bunch of relatively good tutorials online which are really helpful to learn the basic vocabulary, but indeed if like me you are use to code in R and/or Python, you should get it pretty fast and can almost copy-paste your favourite code into Julia and with a few adjustments, it will work. So as easy to write as R : quite so.

I then tried to compare computational times for some of my latest codes and there came the good surprise ! A code that would take a handful of minutes to run in R mainly due to unavoidable loops took a couple of seconds to run in Julia, without any other sorts of optimization. The handling of big objects is smooth and I did not ran into memory problems that R was suffering from.

So far so good ! But of course there has to be some drawbacks. The first one is the poor package repository compare to CRAN or even what you can get for Python. This might of course improve in the next few years as the language is still quite new. However, it is bothering to have to re-code something when you are used to simply load a package in R. Another, probably less important problem, is the lack of data visualization methods and especially the absence of ggplot2 that we have grown quite found of around here. There is of course Gadfly, which is quite close but once again, it is up to now very limited compared to what I was used to…

All in all, I am happy to have tried Julia, and I am quite sure that I will be using it quite a lot from now on. However, even if from a efficiency point of view, it is great, and it is way easier to learn than C (which I should have done a while ago), R and its tremendous package repository is far from beaten.

Oh and by the way, it uses PyPlot based on MatplotLib that allow you to make some xkcd-like plots, which can make your presentations a lot more fun.

Bayes factor between two hidden Markov models, against number of assimilated observations. Values near zero support the simpler model while values larger than one support the more complex model.

Hello hello,

I have just arXived a review article, written for ESAIM: Proceedings and Surveys, called Sequential Bayesian inference for implicit hidden Markov models and current limitations. The topic is sequential Bayesian estimation: you want to perform inference (say, parameter inference, or prediction of future observations), taking into account parameter and model uncertainties, using hidden Markov models. I hope that the article can be useful for some people: I have tried to stay at a general level, but there are more than 90 references if you’re interested in learning more (sorry in advance for not having cited your article on the topic!). Below I’ll comment on a few points.

Hidden Markov models are very flexible tools to model time series: the observations are assumed to be noisy measurements of a Markov process. The Markov process can represent the complex dynamics of the underlying phenomenon (in the example of the article, it is a prey-predator model for the population growth of planktons). The noise in the measurements accounts for the error of the measuring devices, the fact that the underlying process is partially observed, etc.

The term “implicit”, introduced in Time series analysis via mechanistic models, refers to models where the latent process is a “black box”: we can simulate it, but that’s it. On the other hand, we assume that we can evaluate the probability density function of the measurement distribution.

Sequential inference refers to the ability to update the estimation as new observations arrive. For instance, the observations might be acquired on a daily basis, and thus we might want to update our predictions every day. If the predictions were obtained using “batch techniques” (e.g. MCMC), we would need to re-run the algorithms “from scratch” every day. With sequential methods (such as SMC), we can assimilate the latest observation, for a hopefully small cost every day. Unfortunately, even the recent techniques reviewed in the article fail to be “truly online”, in the sense that the statistical error will eventually blow up when parameter uncertainty is taken into account. If the parameters are kept as fixed values, then the problem becomes easier and can be dealt with in a truly online way. This is one of the current challenges that I’m discussing in the article.

I also want to comment on model uncertainty: in many scenarios, we have a few possible models to deal with a given time series. People often quote George E. P. Box: “all models are wrong, but some are useful”. Wrong here means that the observations actually are not realizations of one of the models, and this is undoubtedly correct. A common belief is that statistical inference naively assumes that one of the models is true; this is far from correct. A lot of articles have investigated the “misspecified setting” in details, and many statistical procedures (including MLE, Bayesian inference and model comparison techniques) provide justifiable answers without assuming that the model is true. In the article, I discuss the Bayes factor between two models. It has a perfectly reasonable justification as a prior predictive criterion, in other words, it compares models on the grounds of how likely the observations are under the prior distributions. Thus one does not need to assume anything about the data-generating process in order to use Bayes factors.

Another interesting aspect of the Bayes factor, by the way, is Occam’s razor principle. The simpler models are favoured over more complex models until enough data have been gathered to say otherwise. The same principle motivates AIC and BIC criteria, which can be seen as asymptotic approximations of Bayes factors for particular choices of priors. In the figure, we can see that a “wrong model” is better than the true data generating model until about 50 to 100 observations are assimilated (in the prior predictive sense of the Bayes factor). The figure shows the estimated Bayes factors for five independent runs of one the numerical methods reviewed in the article (SMC^2): we see that the runs diverge because the errors accumulate over time, hence the method is not “online”.

]]>https://statisfaction.wordpress.com/2015/05/19/sequential-bayesian-inference-for-time-series/feed/2pierrejacobBayes factor between two hidden Markov modelsReading Bayesian classics — presentationshttps://statisfaction.wordpress.com/2015/04/21/reading-bayesian-classics-presentations/
https://statisfaction.wordpress.com/2015/04/21/reading-bayesian-classics-presentations/#commentsTue, 21 Apr 2015 16:39:04 +0000http://statisfaction.wordpress.com/?p=2901]]>The students did a great job in presenting some Bayesian classics. I enjoyed reading the papers (pdfs can be found here), most of which I hadn’t read before, and enjoyed also the students’ talks. I share here some of the best ones, as well as some demonstrative excerpts from the papers. In chronological order (presentations on slideshare below):

In this paper, we shall consider Markov chain methods of sampling that are generalizations of a method proposed by Metropolis et al. (1953), which has been used extensively for numerical problems in statistical mechanics.

I think that the main point to stress about this interesting and important paper is its significance for the philosophical questions underlying the acceptance of the Bayesian standpoint as the true foundation for inductive reasoning, and in particular for statistical inference. So far as I can remember, the present paper is the first to emphasize the role of the Bayesian standpoint as a logical framework for the analysis of intricate statistical situation. […] I would like to express my warmest congratulations to my friend Lindley and his valiant collaborator Smith.

Where do probability models come from? To judge by the resounding silence over this question on the part of most statisticians, it seems highly embarrassing. In general, the theoretician is happy to accept taht his abstract probability triple was found under a gooseberry bush, while the applied statistician’s model “just growed”. (quoting A. P. Dawid, 1982)

William H. Jefferys and James O. Berger. Ockham’s razor and Bayesian analysis. American Scientist, 64–72, 1992.

William of Ockham, the 14th-century English philosopher, stated the principle thus: “Pluralitas non est ponenda sine necessitate”, which can be translated as: “Plurality must not be posited without necessity.” […] Ironically, whereas Bayesian methods have been criticized for introducing subjectivity into statistical analysis, the Bayesian approach can turn Ockham’s razor into a less subjective and even “automatic” rule of inference.

All 72 [Pillow Problems]are claimed to have been formulated and worked out at night while in bed, mentally, and the answer written down afterward. [C. L. Dodgson, a.k.a. Lewis Carroll] work reflects the nature, standing and understanding of probability within the wider English mathematical community of the time.

James O. Berger. Bayesian analysis: A look at today and thoughts of tomorrow. Journal of the American Statistical Association, 95(452):1269– 1276, 2000.

Life was simple when I became a Bayesian in the 1970s; it was possible to track virtually all Bayesian activity. Preparing this paper on Bayesian statistics was humbling, as I realized that I have lately been aware of only about 10% of the ongoing activity in Bayesian analysis.

Xian blogged recently on the incoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. Following the last JRSS B read paper by one of us! The data that are used in the paper (and can be downloaded here) are quite fascinating for us, academics fascinated by academic rankings, for better or for worse(ironic here). They consist in cross citations counts for 47 statistics journals (see list and abbreviations page 5): is the number of citations from articles published in journal in 2010 to papers published in journal in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include Bayesian Analysis (published from 2006), The Annals of Applied Statistics (published from 2007).

I looked at the ratio of Total Citations Received by Total Citations made. This is a super simple descriptive statistic which happen to look rather similar to Figure 4 which plots Export Scores from Stigler model (can’t say more about it, I haven’t read in detail). The top five is the same modulo the swap between Annals of Statistics and Biometrika. Of course a big difference is that the Cited/Citation ratio isn’t endowed with a measure of uncertainty (below, left is my making, right is Fig. 4 in the paper).

I was surprised not to see a graph / network representation of the data in the paper. As it happens I wanted to try the gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:

for the data, I used the citations matrix renormalized by the total number of citations made, which I denote by . This is a way to account for the size (number of papers published) of the journal. This is just a proxy though since the actual number of papers published by the journal is not available in the data. Without that correction, CSDA is way ahead of all the others.

the node size represents the Cited/Citing ratio

the edge width represents the renormalized . I’m unsure of what gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the largest of the two edges and .

for a better visibility I kept only the first decile of heaviest edges.

the clusters identified by four colors are modularity classes obtained by the Louvain method.

Some remarks

The two software journals included in the dataset are quite outliers:

the Journal of Statistical Software (JSS) is disconnected from the others, meaning it has no normalized citations in the first decile. Except from its self citations which are quite big and make it the 4th Impact Factor from the total list in 2010 (and apparently the first in 2015).

the largest is the self citations of the STATA Journal (StataJ).

Centrality:

CSDA is the most central journal in the sense of the highest (unweighted) degree.

Some further thoughts

All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed, meaning that just a few papers account for much of the citations of a journal while most of the papers account for few citations. As a matter of fact, the total of citations received is mostly driven by a few super-cited papers, and also is the Cited/Citations matrix that I use throughout for building the graph. A reason one could put forward about why JRSS B makes it so well is the read papers: for instance, Spiegelhalter et al. (2002), DIC, received alone 11.9% of all JRSS B citations in 2010. Who’d bet the number of citation this new read paper (JRSS A though) will receive?

This week I’ll start my Bayesian Statistics master’s course at the Collegio Carlo Alberto. I realized that some of last year students got PhD positions in prestigious US universities. So I thought that letting this year’s students have a first grasp of some great Bayesian papers wouldn’t do harm. The idea is that in addition to the course, the students will pick a paper from a list and present it (or rather part of it) to the others and to me. Which will let them earn some extra points for the final exam mark. It’s in the spirit of Xian’s Reading Classics Seminar (his list here).

Inspired by established blogs, such as the popular Statistical Modeling, Causal Inference, and Social Science or Xi’an’s Og, each of us began blogging as a way to diarize our learning adventures, to share bits of R code or LaTeX tips, and to advertise our own papers and projects. Along the way we’ve come to a new appreciation of the world of academic blogging: a never-ending international seminar, attended by renowned scientists and anonymous users alike. Here we share our experiences by weighing the pros and cons of blogging from the point of view of young researchers.

[To blog…]

At least at face value blogging has some notable advantages over traditional academic communication: publication is instantaneous and thus proves efficient in sparking discussions and debates; it allows all sorts of technological sorcery (hyperlinks, animations, applications), while many journals are still adapting to grayscale plots; and it allows for humorous and colourful writing styles, freeing the writer from the constraints of the impersonal academic prose. Last but not least, it is acceptable to blog about almost any topic, from office politics to funding bodies, from complaints about the absurdity of p-values to debates on the net profits of publishing companies, not to mention quarrels about the term “data science”.

For young researchers, some aspects are particularly appealing. By putting academics directly in touch with one another through comments and replies, young researchers are given the opportunity to “talk” directly on technical subjects to some of the most renowned names in their fields—and indeed a surprising number of senior researchers are avid blog readers! This often proves much more efficient than trying to awkwardly stalk the same professors at conferences. Through such interactions, young academics can show off their many interests and skills, which can do much to fill out the picture painted by their academic CV.

Beyond those low and careerist considerations, we see blogging as a good tool to learn and to share scientific ideas. According to popular belief, only a third of all started research projects end up in a publication; but all of them can at least end up on a blog. So if you indulge in a bit of off-topic study or burn a few hours playing around with a new methodology it need not fuel your performance anxiety: a blog post explaining it will still feel like a delivered product. And you will very likely get some interesting feedback—though rarely to the depth given in journal reviews.

Finally, using blogs to advertise articles and packages seems particularly useful at the early stage of a career, where you might not be invited to that many conferences, or might only be given some dark corner of a giant poster session to talk about your work.

[… or not to blog ?]

Some cautionary notes now, blogging can be risky! As the adage goes, “better to keep your mouth shut and appear a fool than to open it and remove all doubt”. Beyond the quality of the content being shared, blogs are also sometimes disregarded by academics as a frivolous medium; there is a risk that your colleagues will see your blogging hobby as a pure waste of time.

A second risk is to disclose too much information about promising research leads. There should be some balance between ideas shared and ideas kept secret, so that blogging does not jeopardize publication. Other platforms that formally establish precedence (such as arXiv) might be better suited for the initial presentation of new and exciting work. For this reason it seems wisest to blog a posteriori, though the interest of these blogs will be less than their potential to function as real-time research diaries.

A third risk is genuine time-wasting. For those who have never tried, it can be surprising to discover how many hours are needed to write each post. It can be frustrating in the beginning when reader statistics indicate an audience of just one or two spam-bots and some curious relatives. On the other hand there are still a limited number of academic blogs on statistics so far, so the market is far from saturation: any new blog can quickly garner a decent amount of attention. Of course it can be hard to keep a regular posting schedule, which is necessary to maintain a stable reading base.

To conclude, blogging can be a clever way to bypass the hierarchical structure of academia. It gives everyone a direct and fast access to everyone else. In some respects it helps to alleviate key problems affecting young researchers, such as the lengthy reviewing process of top journals and the lack of communication space.

]]>https://statisfaction.wordpress.com/2014/12/11/meta-blogging-as-young-researchers/feed/0pierrejacobSummary of the article.Unfortunate typos in read paperhttps://statisfaction.wordpress.com/2014/12/03/unfortunate-typos-in-read-paper/
https://statisfaction.wordpress.com/2014/12/03/unfortunate-typos-in-read-paper/#commentsWed, 03 Dec 2014 16:07:51 +0000http://statisfaction.wordpress.com/?p=2879]]>Mathieu and I have just realised that the version of our SQMC paper made available on the RSS web site contains several unfortunate typos. In particular, the symbol for “small o” has been replaced by a “big O” by editors. For instance, Theorem 9 should state the QMC beats standard SMC; i.e. the MSE (mean square error) of an SQMC estimator is

but in the RSS version, it reads

.

Well, that’s a bummer. For now, I recommend anyone to read instead the arxiv version (updated on Monday).

Almost 10 months since my latest post? I guess bloggin’ ain’t my thing… In my defense, Mathieu Gerber and I were quite busy revising our SQMC paper. I am happy to announce that it has just been accepted as a read paper in JRSSB. If all goes as planned, we should present the paper at the RSS ordinary meeting on Dec 10. Everybody is welcome to attend, and submit an oral or written discussion (or both). More details soon, when the event is officially announced on the RSS web-site.

What is SQMC? It is a QMC (Quasi-Monte Carlo) version of particle filtering. For the same CPU cost, it typically generates much more accurate estimators. Interested? consider reading the paper here (more recent version coming soon), checking this video where I present SQMC, or, even better, attending our talk in London!

]]>https://statisfaction.wordpress.com/2014/10/29/sqmc-read-paper/feed/4nicolaschopinthe typical RSS meeting momentify R package at BAYSM14https://statisfaction.wordpress.com/2014/09/20/momentify/
https://statisfaction.wordpress.com/2014/09/20/momentify/#commentsSat, 20 Sep 2014 00:19:43 +0000http://statisfaction.wordpress.com/?p=2842]]>I presented an arxived paper of my postdoc at the big success Young Bayesian Conference in Vienna. The big picture of the talk is simple: there are situations in Bayesian nonparametrics where you don’t know how to sample from the posterior distribution, but you can only compute posterior expectations (so-called marginal methods). So e.g. you cannot provide credible intervals. But sometimes all the moments of the posterior distribution are available as posterior expectations. So morally, you should be able to say more about the posterior distribution than just reporting the posterior mean. To be more specific, we consider a hazard (h) mixture model

where is a kernel, and the mixing distribution is random and discrete (Bayesian nonparametric approach).

We consider the survival function which is recovered from the hazard rate by the transform

and some possibly censored survival data having survival . Then it turns out that all the posterior moments of the survival curve evaluated at any time can be computed.

The nice trick of the paper is to use the representation of a distribution in a [Jacobi polynomial] basis where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.

I’ve wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as an input, the package does two things:

evaluates the approximate density

samples from it

A package example for a mixture of beta and 2 to 7 moments gives that result: