More about BayesFactor

Sunday, March 1, 2015

To Beware or To Embrace The Prior

In this guest post, Jeff Rouder reacts to two recent comments skeptical of Bayesian statistics, and describes the importance of the prior in Bayesian statistics. In short: the prior gives a Bayesian model the power to predict data, and prediction is what allows the evaluation of evidence. Far from being a liability, Bayesian priors are what make Bayesian statistics useful to science.

Bayes' Theorem is about 250 years old. For just about as long, there has been this one never-ending criticism—beware the prior. That is: priors are too subjective or arbitrary. In the last week I have read two separate examples of this critique in the psychological literature. The first comes from Savalei and Dunn (2015) who write,

…using Bayes factors further increases ‘researcher degrees of freedom,’ creating another potential QRP, because researchers must select a prior—a subjective expectation about the most likely size of the effect for their analyses. (Savalei and Dunn, 2015)

The second comes from Trafimow and Marks (2015), who write,

The usual problem with Bayesian procedures is that they depend on some sort of Laplacian assumption to generate numbers where none exist. (Trafimow and Marks, 2015)

The focus should be on the last part—generating numbers where none exist—which I interpret as questioning the appropriateness of priors. Though the critiques are subtly different, they both question the wisdom of Bayesian analysis for its dependence on a prior. Because of this dependence, researchers holding different priors may reach different conclusions from the same data. The implication is that ideally analyses should be more objective than the subjectivity necessitated by Bayes.

The critique is dead wrong. The prior is the strength rather than the weakness of the Bayesian method. It gives the method all of its power to predict data, to embed theoretically meaningful constraint, and to adjudicate evidence among competing theoretical positions. The message here is to embrace the prior. My colleague Chris Donkin has adapted Kubrick's subtitle to say it best: “How I learned to stop worrying about and love the prior.” Here goes:

Classical and Bayesian Models

Let's specify a simple model in both classical and Bayesian form. Consider, for example, the case where the data, denoted \(Y_1,\ldots,Y_N\), are distributed as normals with a known variance of 1. The conventional frequentist model is
\[ Y_i(\mu) \sim \mbox{Normal} (\mu,1). \]
There is a single parameter, \(\mu\), which is the center of the distribution. Parameter \(\mu\) is a single fixed value which, unfortunately, is not known to us. I have made \(Y_i\) a function of \(\mu\) to make the relationship explicit. Clearly, the distribution of each \(Y_i\) depends on \(\mu\), so this notation is reasonable.

The Bayesian model consists of two statements. The first one is the data model:
\[ Y_i | \mu \sim \mbox{Normal} (\mu,1), \]
which is very similar to the conventional model above. The difference is that \(\mu\) is no longer a constant but a random variable. Therefore, we write the data model as a conditional statement—conditional on some value of \(\mu\), the observations follow a normal at that mean. The data model, though conceptually similar to the frequentist model, is not enough for a Bayesian. It is incomplete because it is specified as a conditional statement. Bayesians need a second statement, a model on the parameter \(\mu\). A common specification is
\[ \mu \sim \mbox{Normal}(a,b),\]
where \(a\) and \(b\), the mean and variance, are set by the analyst before observing the data.

From a classical perspective, Bayesians make an extra model specification, the prior on parameters, that is unnecessary and unwarranted. Two researchers can have the same model, that the data are normal, but have very different priors if they choose very different values of \(a\) and/or \(b\). With these different choices, they may draw different conclusions. From a Bayesian perspective, the classical perspective is incomplete because it only models the phenomena up to a function of unknown parameters. Classical models are good models if you know \(\mu\) (say, as God does) but not so good if you don't; and, of course, mortals don't. This disagreement, whether Bayesian models have an unnecessary and unwise specification or whether classical models are incomplete, is critical to understanding why the priors-are-too-subjective critique is off target.

Bayesian Models Make Predictions, Classical Models Don't

One criterion that I adopt, and that I hope you do too, is that models should make predictions about data. Prediction is at the heart of deductive science. Theories make predictions, and then we check whether the data have indeed conformed to these predictions. This view is not too alien; in fact, it is the stuff of grade-school science. Prediction to me means the ability to make probability statements about where data will lie before the data are collected. For example, if we agree that \(\mu=0\) in the above model, we can now make such a statement: the probability that \(Y_1\) is between -1 and 1 is about 68%.
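
The 68% figure can be checked in a few lines. What follows is a minimal Python sketch, assuming \(\mu=0\) and a variance of 1 as in the example; the helper name `interval_probability` is mine and is not part of the original post.

```python
from statistics import NormalDist

def interval_probability(mu, sd, lower, upper):
    """Probability that a Normal(mu, sd^2) observation falls in [lower, upper]."""
    dist = NormalDist(mu=mu, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

# With mu agreed to be 0 and the variance known to be 1,
# the statement can be made before any data are collected.
prob = interval_probability(mu=0.0, sd=1.0, lower=-1.0, upper=1.0)
print(f"P(-1 < Y_1 < 1) = {prob:.4f}")
```

The calculation needs a definite value of \(\mu\) as an input; without one, there is nothing to pass to the function.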

This definition of prediction, while common sense, is quite disruptive. Do classically-specified models predict data? I admit a snarky thrill in posing this question to my colleagues who advocate classical methods. Sometimes they say “yes,” and then I remind them that the parameters remain unknown except in the large-sample limit. Since we don't have an infinite amount of data, we don't know the parameters. Sometimes they say they can make predictions with the best estimate of \(\mu\), and I remind them that they need to see the data first to estimate \(\mu\), and as such, it is not a prediction (not to mention the unaccounted sampling noise in the best estimate). It always ends a bit uneasily, with awkward smiles, and with the unavoidable conclusion that classical models do not predict data, at least not in the usual definition of “predict.”

The reason classical models don't predict data is that they are incomplete. They are missing the prior, a specification of how the parameters vary. With this specification, the predictions are a straightforward application of the Law of Total Probability (http://en.wikipedia.org/wiki/Law_of_total_probability):
\[Pr(Y_i) = \int Pr(Y_i|\mu) Pr(\mu) d\mu. \]
The respective probabilities (densities) \(Pr(Y_i|\mu)\) and \(Pr(\mu)\) are derived from the model specifications. Hence, the \(Pr(Y_i)\) is computable. We can state the probability that an observation lies in any interval before we see the data. Bayesian specifications predict data; classical specifications don't.
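
To make the integral concrete, here is a hedged Python sketch for the normal prior \(\mu \sim \mbox{Normal}(a,b)\) introduced earlier. The values \(a=0\) and \(b=4\) are hypothetical choices for illustration only, and the function names are mine. The sketch evaluates the integral numerically and compares it with the standard closed-form result that, under a normal prior, the marginal is \(Y_i \sim \mbox{Normal}(a, 1+b)\).

```python
from statistics import NormalDist

def predictive_density_numeric(y, a, b, grid_points=20001, width=10.0):
    """Approximate p(y) = integral of p(y | mu) p(mu) d(mu) by the trapezoid rule."""
    prior = NormalDist(mu=a, sigma=b ** 0.5)   # b is the prior variance
    unit = NormalDist()                        # data model has variance 1
    lo = a - width * prior.stdev
    hi = a + width * prior.stdev
    step = (hi - lo) / (grid_points - 1)
    total = 0.0
    for k in range(grid_points):
        mu = lo + k * step
        weight = 0.5 if k in (0, grid_points - 1) else 1.0
        total += weight * unit.pdf(y - mu) * prior.pdf(mu)
    return total * step

def predictive_density_closed(y, a, b):
    """For a normal prior the marginal is itself normal: Y ~ Normal(a, 1 + b)."""
    return NormalDist(mu=a, sigma=(1.0 + b) ** 0.5).pdf(y)

# Hypothetical prior settings, chosen only for illustration.
a, b = 0.0, 4.0
numeric = predictive_density_numeric(0.5, a, b)
closed = predictive_density_closed(0.5, a, b)
print(f"numeric = {numeric:.6f}, closed form = {closed:.6f}")

# Probability of an interval, stated before any data are seen.
marginal = NormalDist(mu=a, sigma=(1.0 + b) ** 0.5)
prob = marginal.cdf(1.0) - marginal.cdf(-1.0)
print(f"P(-1 < Y_i < 1) = {prob:.3f}")
```

Changing \(a\) or \(b\) changes the prediction, which is the point: the prior is part of what the model asserts about data.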

Priors Instantiate Meaningful Constraint

The prior is not some nuisance that one must begrudgingly specify. Instead, it is a tool for instantiating theoretically meaningful constraint. Let's take a problem near and dear to my children: whether the candy Smarties makes children smarter. If so, my kids have a very convincing argument for why I should buy them Smarties. I have three children, and these three don't agree on much. So let's assume the eldest thinks Smarties makes you smarter, the middle thinks Smarties makes you dumber if only to spite his older brother, and the youngest thinks it's wisest to steer a course between her brothers. She thinks Smarties have no effect at all. They decide to run an experiment on 40 schoolmates where each schoolmate first takes an IQ test, then eats a Smartie, and then takes the IQ test again. The critical measure is the change in IQ, and for the sake of this simple demonstration, we discount any learning or fatigue confounds.

All three kids decided to instantiate their position within a Bayesian model. All three start with the same data model:
\[ Y_i | \mu \sim \mbox{Normal}(\mu,\sigma^2)\]
where \(Y_i\) is the difference score for the \(i\)th kid, \(\mu\) is the true effect of Smarties, and \(\sigma^2\) is the variance of this difference across kids. For simplicity, let's treat \(\sigma=5\) as known, say as the known standard deviation of test-retest IQ score differences. Now each of my children needs a model on \(\mu\), the prior, to instantiate their position. The youngest had it easiest. With no effect, her model on \(\mu\) is
\[ M_0: \mu=0. \]
Next, consider the model of the oldest. He believes there is a positive effect, and knowing what he does about Smarties and IQ scores, he decides to place equal probability on every value of \(\mu\) between a 0-point and a 5-point IQ effect, i.e.,
\[M_1: \mu \sim \mbox{Uniform}(0,5).\]
The middle one, being his brother's perfect contrarian, comes up with the mirror-symmetric model:
\[M_2: \mu \sim \mbox{Uniform}(-5,0).\]

Predictions Are The Key To Evidence

Now a full-throated disagreement among my children will inevitably result in one of them yelling, “I'm right; you're wrong.” This proclamation will be followed by, “You're so stupid.” The whole thing will go on for a while with hurled insults and hurt feelings. And if you think this juvenile behavior is limited to my children or children in general, then you may not know many psychological scientists. What my kids need is a way of using data to inform theoretically-motivated positions.

In a previous post, Richard Morey demonstrated, in the context of Bayesian t tests, how predictions may be used to state evidence. I restate the point here for the problem my children face. Because my children are Bayesian, they may compute their predictions about the sample mean of the difference scores. Here they are for a sample mean across 40 kids:

My daughter with Model \(M_0\) most boldly predicts that the sample mean will be small in magnitude, and her predictive density is higher than that of her brothers for \(-1.15<\bar{Y}<1.15\). If the sample mean is in this range, she is more right than they are. Likewise if the sample mean is above 1.15, the oldest child is more right (Model \(M_1\)), and if the sample mean is below -1.15, the middle child is more right (Model \(M_2\)).
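
For readers who want the three predictive curves in closed form, here is a sketch, assuming \(\sigma=5\) and \(N=40\) as above. The symbols \(s\), \(\phi\), and \(\Phi\) (the standard error of the mean, the standard normal density, and the standard normal distribution function) are my notation, not the original post's. Conditional on \(\mu\), the sample mean is distributed as \(\bar{Y} | \mu \sim \mbox{Normal}(\mu, s^2)\) with \(s=\sigma/\sqrt{N}\approx 0.79\). Averaging over each prior gives
\[ p(\bar{y} | M_0) = \frac{1}{s}\phi\left(\frac{\bar{y}}{s}\right), \]
\[ p(\bar{y} | M_1) = \frac{1}{5}\int_0^5 \frac{1}{s}\phi\left(\frac{\bar{y}-\mu}{s}\right) d\mu = \frac{1}{5}\left[\Phi\left(\frac{\bar{y}}{s}\right) - \Phi\left(\frac{\bar{y}-5}{s}\right)\right], \]
\[ p(\bar{y} | M_2) = \frac{1}{5}\int_{-5}^0 \frac{1}{s}\phi\left(\frac{\bar{y}-\mu}{s}\right) d\mu = \frac{1}{5}\left[\Phi\left(\frac{\bar{y}+5}{s}\right) - \Phi\left(\frac{\bar{y}}{s}\right)\right]. \]
The null model concentrates its predictions near zero, whereas each uniform model spreads the same total probability over a five-point range on one side of zero.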

With this Bayesian setup, we as scientists can hopefully rise above the temptation to think in terms of right and wrong. Instead, we can state fine-grained evidence as ratios. For example, suppose we observe a mean of -1.4, which is indicated with the vertical dashed line. The most probable prediction comes from Model \(M_2\), and it is almost twice as probable as Model \(M_0\). This 2-to-1 ratio serves as evidence for a negative effect of Smarties relative to a null effect. The prediction for the negative model is 25 times as probable as that for the positive model, and thus the evidence for a negative-effects model is 25-to-1 compared to the positive-effects model. These ratios of marginal predictions are Bayes factors, which are an intuitive measure of evidence. Naturally, the meaning of the Bayes factor is bound to the model specifications.
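
These ratios can be reproduced with a short calculation. The Python sketch below is my own illustration, not code from the original post; it assumes \(\sigma=5\), \(N=40\), and an observed sample mean of -1.4, and the function and variable names are hypothetical. Each model's predictive density for the sample mean is evaluated at the observed value, and the Bayes factors are the ratios of those densities.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 5.0            # known SD of the difference scores (from the post)
N = 40                 # number of schoolmates (from the post)
SE = SIGMA / sqrt(N)   # standard error of the sample mean
STD_NORMAL = NormalDist()

def density_null(ybar):
    """Predictive density of the sample mean under M0: mu = 0."""
    return NormalDist(mu=0.0, sigma=SE).pdf(ybar)

def density_uniform(ybar, lower, upper):
    """Predictive density of the sample mean when mu ~ Uniform(lower, upper).

    Integrating the normal density of ybar over mu gives a difference
    of two normal distribution functions, divided by the prior width.
    """
    mass = STD_NORMAL.cdf((ybar - lower) / SE) - STD_NORMAL.cdf((ybar - upper) / SE)
    return mass / (upper - lower)

ybar = -1.4  # observed sample mean used in the example

m0 = density_null(ybar)                 # youngest: no effect
m1 = density_uniform(ybar, 0.0, 5.0)    # oldest: positive effect
m2 = density_uniform(ybar, -5.0, 0.0)   # middle: negative effect

bf_20 = m2 / m0   # negative-effect model versus the null
bf_21 = m2 / m1   # negative-effect model versus the positive-effect model
print(f"BF(M2 vs M0) = {bf_20:.2f}")
print(f"BF(M2 vs M1) = {bf_21:.1f}")
```

Under these assumptions the two ratios should come out close to the 2-to-1 and 25-to-1 figures quoted above. Changing `ybar` shows how the evidence shifts as the observed mean moves.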

Take Home

The prior is not some fudge factor. Different theoretically motivated constraints on data may be specified gracefully through the prior. With this specification, not only do competing models predict data, but stating evidence for positions is as conceptually simple as comparing how well each model predicts the observed data. Embrace the prior.

10 comments:

Aside from all the good serious stuff, "My colleague Chris Donkin has used the Kubrick's subtitle to say it best: 'How I learned to stop worrying about and love the prior.'" Umm, are you sure that's a good analogy? Kubrick was being sarcastic -- loving the bomb led to global thermonuclear war. Or is that the analogy you're going for? :-)

(reposted from Frontiers) Hi Jeff, in my view, a QRP is not an acceptable--but rather, as the name implies, "questionable"--use of a method. The related (synonymous?) term "researcher degrees of freedom" (Simmons, Nelson, and Simonsohn, 2011) is defined as data analytic decisions that researchers make after they have looked at the data. Thus, setting a prior beforehand and using Bayesian methods appropriately is of course not a QRP (and it is bad writing on our part if somehow we implied that in your view). It's the opportunity to meddle with the prior once the data have arrived that would be a QRP, as pointed out by Simmons et al. The more complicated a method is, the more such opportunities there are. This of course does not invalidate the method; it is simply a pragmatic point we were making, relevant to the causes of (and hence the solutions for) the replicability crisis. Victoria