“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that dares not spell out statistics, as if it were a ten-letter word (p.5), is not new, if apt. In this paper, David Donoho dispels the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), and jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for them…

“A call to action, from a statistician who feels `the train is leaving the station’.” (p.12)

One point of the paper is to read John Tukey’s 1962 “The Future of Data Analysis” as prophetic of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

- formal statistics
- advanced computing and graphical devices
- the ability to face ever-growing data flows
- its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate Leo Breiman’s 2001 Two Cultures paper. Which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with the “common task framework” as the “secret sauce” of machine learning, because it allows for objective comparison of methods on a testing dataset. Which does not seem to me to explain the current (real or perceived) disaffection for statistics and the correlated attraction to more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing about the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the underlying principle able to handle Go at the same level? Plus, I remain worried about the (screaming) absence of a model (or models) in predictive approaches. Or at least sceptical. For the same reason, it does not help in producing a generic approach to problems. Nor an approximation of the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will be solely empirical (p.37). This leaves little ground, if any, for probability and uncertainty quantification, as reflected by their absence from the paper.

Bayesian Data Analysis advocates in Chapter 6 using posterior predictive checks as a way of evaluating the fit of a potential model to the observed data. There is a no-nonsense feeling to it:

“If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution.”

And it aims at providing an answer to the frustrating (frustrating to me, at least) issue of Bayesian goodness-of-fit tests. There are however issues with the implementation, from deciding which aspect of the data or of the model is to be examined, to the sin of “using the data twice”. Obviously, this is an exploratory tool with little decisional backup, and it should be understood as a qualitative rather than quantitative assessment. As mentioned in my tutorial on Sunday (I wrote this post at Duke during O’Bayes 2013), it reminded me of Ratmann et al.’s ABCμ, in that both give reference distributions against which to calibrate the observed data. Most likely with a multidimensional representation. And the “use of the data twice” can be argued for or against, once a data-dependent loss function is built.
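As a concrete sketch of the mechanics (my own toy setup, not an example from the book, with all numbers hypothetical): for a normal sample with known variance and a conjugate normal prior on the mean, a posterior predictive check draws replicated datasets from the posterior predictive distribution and compares a chosen test quantity with its observed value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed sample: 50 points from N(1, 1)
x = rng.normal(1.0, 1.0, size=50)
n, sigma2, tau2 = len(x), 1.0, 10.0  # known variance, N(0, tau2) prior on the mean

# Conjugate normal posterior on the mean
post_var = 1.0 / (1.0 / tau2 + n / sigma2)
post_mean = post_var * x.sum() / sigma2

# Replicate datasets under the posterior predictive distribution
reps = 10_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=reps)
x_rep = rng.normal(theta[:, None], np.sqrt(sigma2), size=(reps, n))

# Compare a test quantity (here the sample maximum, one choice among many)
T_rep = x_rep.max(axis=1)
ppp = np.mean(T_rep >= x.max())
print(f"posterior predictive p-value for the maximum: {ppp:.3f}")
```

If the model fits, the observed maximum should look like a typical draw from the replicated maxima; the arbitrariness lies entirely in the choice of test quantity.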

“One might worry about interpreting the significance levels of multiple tests or of tests chosen by inspection of the data (…) We do not make [a multiple test] adjustment, because we use predictive checks to see how particular aspects of the data would be expected to appear in replications. If we examine several test variables, we would not be surprised for some of them not to be fitted by the model-but if we are planning to apply the model, we might be interested in those aspects of the data that do not appear typical.”

The natural objection that a multivariate measure of discrepancy runs into multiple testing is answered within the book by the reply that the idea is not to run formal tests. I still wonder how one should behave when faced with a vector of posterior predictive p-values (ppps).
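To make that quandary concrete, here is a toy sketch (my own setup, not from the book, with hypothetical data and prior): with a normal sample and a conjugate posterior, each choice of test quantity produces its own ppp, and nothing in the construction says how to combine or threshold the resulting vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed sample and conjugate normal posterior on the mean
x = rng.normal(1.0, 1.0, size=50)
n, sigma2, tau2 = len(x), 1.0, 10.0
post_var = 1.0 / (1.0 / tau2 + n / sigma2)
post_mean = post_var * x.sum() / sigma2

# Replicated datasets from the posterior predictive
reps = 10_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=reps)
x_rep = rng.normal(theta[:, None], np.sqrt(sigma2), size=(reps, n))

# One ppp per discrepancy: a vector, with no obvious joint calibration
stats = {"mean": lambda d: d.mean(axis=-1),
         "sd":   lambda d: d.std(axis=-1),
         "max":  lambda d: d.max(axis=-1)}
ppps = {name: float(np.mean(T(x_rep) >= T(x))) for name, T in stats.items()}
print(ppps)
```

Each entry answers its own question about the fit; an extreme value for one statistic and a central value for another is exactly the situation the book declines to adjust for.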

The above picture is based on a normal mean/normal prior experiment I ran where the ratio of prior to sampling variance increases from 100 to 10⁴. The ppp is based on the Bayes factor against a zero mean as a discrepancy. It thus grows away from zero very quickly and then levels off around 0.5, reaching values close to 1 only for very large values of x (i.e., never in practice). I find the graph interesting because if instead of the Bayes factor I use the marginal (the numerator of the Bayes factor), then the picture is the exact opposite. Which, I presume, does not make a difference for Bayesian Data Analysis, since both extremes are considered equally toxic… Still, still, still, we are in the same quandary as when using any kind of p-value: what is extreme? What is significant? Do we again have to select the dreaded 0.05?! To see how things go, I then simulated the behaviour of the ppp under the “true” model for the pair (θ,x). And ended up with the histograms below:

which show that, under the true model, the ppp does concentrate around 0.5 (surprisingly, the range of ppps hardly exceeds 0.5, and I have no explanation for this). While the corresponding ppp does not necessarily pick out any wrong model, discrepancies may be spotted by moving away from 0.5…
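For readers wanting to replay the experiment, here is a sketch under my own reading of the unstated details (in particular the direction of the tail probability, which is a convention): x ~ N(θ, σ²) with θ ~ N(0, τ²), and since the Bayes factor against a zero mean is monotone in |x|, the ppp for that discrepancy reduces to the posterior predictive probability that |x_rep| does not exceed |x|.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0

def ppp(x_obs, tau2, reps=100_000):
    # x ~ N(theta, sigma2), theta ~ N(0, tau2); the Bayes factor against a
    # zero mean is monotone in |x|, so the ppp for that discrepancy reduces
    # to P(|x_rep| <= |x_obs|) under the posterior predictive (one reading
    # of the tail direction)
    w = tau2 / (tau2 + sigma2)  # shrinkage factor
    theta = rng.normal(w * x_obs, np.sqrt(w * sigma2), size=reps)
    x_rep = rng.normal(theta, np.sqrt(sigma2))
    return float(np.mean(np.abs(x_rep) <= abs(x_obs)))

# ppp as a function of x: grows away from zero, then levels off near 0.5
curve = [ppp(v, tau2=100.0) for v in (0.0, 1.0, 3.0, 10.0)]
print("ppp along x:", curve)

# behaviour under the "true" model: simulate (theta, x) pairs and collect
# the resulting ppps, which concentrate around 0.5
theta_true = rng.normal(0.0, np.sqrt(100.0), size=300)
x_sim = rng.normal(theta_true, np.sqrt(sigma2))
ppps = np.array([ppp(xi, tau2=100.0, reps=20_000) for xi in x_sim])
print("quantiles of ppp under the true model:", np.quantile(ppps, [0.1, 0.5, 0.9]))
```

The curve starts at 0 at x = 0 and flattens near 0.5, with values close to 1 requiring astronomically large x; flipping the tail direction (or replacing the Bayes factor by its numerator) mirrors the picture, as described above.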

“The p-value is to the u-value as the posterior interval is to the confidence interval. Just as posterior intervals are not, in general, classical confidence intervals, Bayesian p-values are not generally u-values.”

Now, Bayesian Data Analysis also carries this warning that ppps are not uniform under the true model (i.e., are not u-values), which is just as well considering the above example, but I cannot help wondering whether the authors intended a sort of subliminal message that they are not that far from uniform. And this brings back to the forefront the difficult interpretation of the numerical value of a ppp. That is, of its calibration. For evaluating the fit of a model. Or for decision-making…