For many psychology students, Bayesian statistics remains shrouded in mystery. At the undergraduate level, Bayes’ theorem may be taught as part of probability theory, but the link between probability theory and scientific inference is almost never made. This is unfortunate, as this link—first made almost a century ago—provides a mathematically elegant and robust basis for the quantification of scientific knowledge.

As argued by Wrinch and Jeffreys (1921) and later in the works of Harold Jeffreys, probability theory is extended logic. Jaynes (2003) calls it “the logic of science.” Indeed, it is easy to see how probability theory maps directly to propositional logic if all statements are fully “true” or “false” – that is, all probabilities are either 0 or 1. Take for example the statement P(A|B) = 1. “If B is true, then the probability of A is 1” is simply another way of saying that “B implies A” (B → A). Similarly, P(A|B) = 0 is the same as B → ¬A. Probability theory extends this concept to include uncertainty, but the rules of probability have the same status as the rules of logic: they can be used to derive statements that are guaranteed to be correct if the premises are correct. Paraphrasing Edwards, Lindman, and Savage (1963, p. 194): Probability is orderly uncertainty and inference from data is revision of uncertainty in the light of relevant new information. Bayesian statistics, then, is nothing more—and nothing less—than the application of probability theory to real problems of inference.

The close relationship of probability theory and logic leads to further fertile insights. For example, a common misunderstanding regarding Bayesian methods is that they are somehow invalidated by the fact that conclusions may depend on the prior probabilities assigned to parameter values or hypotheses. Translated to the terminology of formal logic, this claim is that logical deduction is somehow invalidated because conclusions depend on premises. Clearly, an inferential procedure is not pathological because its conclusions depend on its assumptions – rather, the inverse is true. Conclusions that do not depend on assumptions may be robust, but they cannot be rational any more than conclusions that do not depend on observations.

However, the dependence on prior probabilities involves another dimension that is often misunderstood: At first glance, it appears that the prior introduces the analyst’s beliefs—an element of subjectivity—into the inference, and this is clearly undesirable if we are to be objective in our scientific inquiries. Two observations address this issue. First, it is important to emphasize—lest we forget—that “subjective” is not synonymous with “arbitrary.” Rather than beliefs, we may think of probability as conveying information. It is not at all peculiar to say that relevant information may be subjective – after all, not all humans have access to the same information. Accordingly, the information that is encoded in probability distributions may be subjective, but that does not mean it is elective. Belief—in the sense in which it is used in probability theory—is not an act of will, but merely a state in which the individual passively finds themselves. It follows that different scientists using different sources of information can rationally reach different conclusions.

The second observation regarding the subjectivity of the prior follows from inspection of Bayes’ theorem:

P(Θ|y) = P(y|Θ)P(Θ)/P(y).

In the numerator of the right-hand side appears the product P(y|Θ)P(Θ): likelihood and prior side by side determine the relative density of all possible values of Θ. In a typical cognitive-modeling scenario, researchers will specify these distributions with some care – much reasoning and defense will often go into the selection of the prior, possibly drawing on arguments from the previous literature and on graphical exploration of the prior predictive distribution; criticism of prior decisions is common and expected. The likelihood is defined also.

The way these components of Bayes’ theorem are specified is somewhat reminiscent of the Biblical description of the creation of the heavens, in which “God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also” (Gen 1:16, KJV). Much like how in this verse the billions upon trillions of stars are created as an afterthought, far less argument is usually deemed necessary for the definition of the likelihood function, even though it is usually much more consequential than the definition of the prior – after all, given even moderate amounts of data the prior will typically wash out in favor of the likelihood. It is not typical to see any argument at all for the choice of likelihood, and the tacit assumptions of sequential independence and normally distributed residuals are ubiquitous. Jaynes (2003) writes that “if one fails to specify the prior information, a problem of inference is just as ill-posed as if one had failed to specify the data” (p. 373), but the emphasis can apply to both factors in the numerator on the right-hand side of Bayes’ theorem: if we fail to question the likelihood, it is as if we fail to question the prior.

In some contexts, however, questioning the likelihood is common: we ask whether this or that is the “right model for the data.” For example, in the reaction time modeling world, we might wonder if a set of observations is best described by a standard linear ballistic accumulator or by some stochastic variant. In more conventional scenarios, we sometimes worry if a t test with equal variances is appropriate, or an unequal-variance procedure should be used instead. This invites a question: What if we want to estimate the magnitude of some manipulation effect but are unwilling to commit to model E (equal variance) or U (unequal variance)? Perhaps unsurprisingly, probability theory has an answer. If the posterior distribution of the effect size assuming some model M (M ∈ {E,U}) is p(δ|y,M) and the posterior probability that E is the correct model of the two is P(E|y) = 1 – P(U|y), then the posterior distribution of δ, averaged over these two models, is immediately given by the sum rule of probability:

p(δ|y) = p(δ|y,E)P(E|y) + p(δ|y,U)P(U|y).

One interpretation of this equation is that the exact identity of the model is a nuisance variable, and we can “integrate it out” by taking an average weighted by the posterior probability of each model. It provides a posterior distribution of δ that does not assume that model E is true or that model U is true, only that one of them is. This technique of marginalizing over models is a direct consequence of probability theory that is often called Bayesian model averaging. It can be applied in a staggering variety of circumstances.
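As a minimal sketch of this averaging step (in Python, with entirely hypothetical numbers: in practice the posterior samples would come from fitting each model, and the model probabilities from comparing them), the sum rule amounts to drawing each model-averaged sample from E or U with the corresponding posterior probability:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior samples of delta under each model; in practice
# these would come from fitting the equal- and unequal-variance models.
delta_E = rng.normal(0.40, 0.10, size=10_000)   # samples from p(delta | y, E)
delta_U = rng.normal(0.55, 0.15, size=10_000)   # samples from p(delta | y, U)

# Hypothetical posterior model probabilities, with P(E|y) + P(U|y) = 1.
p_E = 0.7

# Mixture sampling implements the sum rule:
# p(delta|y) = p(delta|y,E) P(E|y) + p(delta|y,U) P(U|y).
use_E = rng.random(10_000) < p_E
delta_avg = np.where(use_E, delta_E, delta_U)

print(delta_avg.mean())  # near 0.7 * 0.40 + 0.3 * 0.55 = 0.445
```

The histogram of delta_avg then approximates the model-averaged posterior of δ without committing to either E or U.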

While most psychologists readily draw conclusions that are based on an often arbitrary and tenuously appropriate likelihood, one who is uncomfortable with any of the assumptions can apply Bayesian model averaging to assuage their concerns. This way, we can avoid having to commit to any particular likelihood function by averaging over likelihood functions – and so it goes with priors also.

[Note: This post is co-authored with Eric-Jan Wagenmakers and was originally published on The Winnower.]

Many scientific disciplines find themselves in the midst of a “crisis of confidence,” where key empirical findings turn out to reproduce at an alarmingly low rate [e.g., 1,2,3,4]. The causes for the crisis are multifaceted and there does not appear to be a single silver-bullet solution. Nevertheless, some insight can be gained by considering two rules proposed by Charles Sanders Peirce almost 150 years ago [5,6]. These rules are prerequisites for the proper evaluation of any scientific hypothesis using empirical data.

The first rule concerns the need for strictly confirmatory research [7]:

Peirce’s first rule. The hypothesis should be distinctly put as a question, before making the observations which are to test its truth. In other words, we must try to see what the result of predictions from the hypothesis will be ([6], emphasis ours).

In yet other words: hypotheses cannot be tested using the same data that were used to generate the hypotheses in the first place [8].

The second rule we consider here concerns the need to publish findings independently of their outcome:

Peirce’s second rule. The failures as well as the successes of the predictions must be honestly noted. The whole proceeding must be fair and unbiased ([6], emphasis ours).

The contrast between Peirce’s rules and current scientific practice is striking. In violation of the first rule, researchers often do not indicate in advance what specific predictions are to be tested. This means that reviewers and readers cannot assess the extent to which the data constitute a true test (i.e., prediction) or a false test (i.e., postdiction). Hindsight bias and confirmation bias make such an assessment problematic even for the original authors themselves; as Richard Feynman famously observed: “you must not fool yourself—and you are the easiest person to fool” [9].

In violation of the second rule, statistically nonsignificant findings are widely suppressed and underrepresented [10]. This phenomenon, known as publication bias, is endemic at multiple levels. At the institutional level, editors may explicitly instruct authors to develop a compelling narrative and avoid the apparent contradiction that results from a mixed set of significant and nonsignificant results. At the individual level, authors may anticipate or share editors’ preferences and suppress such ambiguous results. The presence of publication bias is sometimes assessed post hoc with statistical tests [11]; the issue is both contentious and complicated.

Most researchers will agree that Peirce’s rules are sensible, and that violating them will bias the published literature in such a way that honest replication attempts will often fail to reproduce the original findings. Nevertheless, these same researchers may be unaware that their own work almost always violates Peirce’s rules.

To increase awareness we propose a simple heuristic that individual researchers can use to determine for themselves whether their work adheres to or violates Peirce’s rules. The following heuristic provides researchers with a correct intuition about exploratory analyses and publication bias, and may help reduce their influence at the individual level:

The No More Bets heuristic. For every set of observations, the publication decision must be made prior to inspection of the observations themselves.

This heuristic applies both to data points and outcomes of hypothesis tests, and requires only that authors “buy in” to their experimental methods in advance (i.e., failure to find effects cannot be due to faulty methods or flawed design). Consistent with Peirce’s first rule, a personal commitment to publish the outcome of a specific hypothesis test is made a priori; consistent with Peirce’s second rule, the decision to publish is made regardless of the outcome. Deciding to publish, in this context, means that the researchers commit to making these data and hypothesis tests visible to the academic public—typically as part of a manuscript or other scientific communication.

Once individual researchers start to adopt the “no more bets” heuristic for personal use, it is only a small step to claim credit for one’s predictions by preregistering data analysis plans (e.g., on the Open Science Framework or on AsPredicted.org) or engaging in a registered report [12]. In registered reports, articles receive “in-principle acceptance” based on an initial review of the methods in advance of data collection. A second phase of review serves only to decide whether the proposed methods were followed.

Presently, the no more bets heuristic is applied implicitly only in fields that enforce preregistration (e.g., certain medical clinical trials). Strict application of the heuristic alerts researchers to the fact that their work may violate key scientific desiderata, and may pave the way to preregistration and the publication of results independent of the outcome. In line with Peirce’s old rules, such a new way of conducting research will prove an essential component in the struggle for research that is both informative and reproducible.

Thinking like a Bayesian is often different from thinking like an orthodox frequentist statistician. To be a frequentist is to think about long-run frequency distributions under certain assumptions: What would the data X look like if T were true? Often, no attention is paid to the possibility that T might not be true, and the focus is exclusively on false alarm rates: How often will I conclude that an effect exists when there really is none?

To be a Bayesian is completely different in this respect. Bayesians are rarely interested in such long-run behaviors, and will never condition their conclusions on the existence or nonexistence of an effect. While frequentists ask what is the probability of observing data X given truth T, Bayesians will instead ask what is the probability of truth T given that data X are observed? I contend that the latter is an interesting question for scientists in general, and the former is only of niche interest to a small minority — process engineers, perhaps.

This difference in focus shines through when we perform simulation studies of the behavior of statistical algorithms and procedures. While it is often more straightforward to assume a certain truth T (say, the null hypothesis) and generate data from that to study its long-run behavior, this leads inevitably to frequentist questions for which few researchers have any use at all. A more interesting simulation study is an attempt to emulate the situation real scientists are in: having the data as given and drawing conclusions from them.

The standard example

The distinction between conditioning on truth and conditioning on data is emphasized in nearly every first introduction to Bayesian methods, often through a medical example. Consider the scenario of a rare but serious illness, with an incidence of perhaps one in a million, and the development of a new test, with a sensitivity of 99% but a false alarm rate of 2%. What is the probability that a person with a positive diagnosis has the illness? We can study this issue with a simple simulation:
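A minimal sketch of such a simulation in Python, assuming the parameters stated above (the counts in the table below sum to 10^8 simulated patients; the sketch uses 10^7 to keep it fast, so its counts will differ):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10_000_000        # simulated patients (the table below used 1e8)
incidence = 1e-6      # base rate: one in a million
sensitivity = 0.99    # P(positive | ill)
false_alarm = 0.02    # P(positive | healthy)

T = rng.random(n) < incidence                      # true illness status
p_positive = np.where(T, sensitivity, false_alarm)
D = rng.random(n) < p_positive                     # diagnosis

# The Bayesian question: P(ill | positive diagnosis).
p_ill_given_pos = T[D].mean()
print(p_ill_given_pos)  # tiny, despite the "99% accurate" test
```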

We can summarize the outcome of this simulation study in a small two-by-two table:

            T = 1       T = 0
D = 1         101     1999093
D = 0           1    98000805

At this point, there are two ways of looking at this table. The frequentist view considers each column of the table separately: if the illness is present (T = 1, left column), then in approximately 99% of cases the diagnosis will be positive. If the illness is not present (T = 0, right column), then in approximately 2% of cases the diagnosis will be positive. Now that’s all well and good, but it does not address the interesting question: what is the probability that the illness is present given a positive diagnosis? For this, we need the Bayesian view, which considers only one row of the table, namely where D = 1. In that row, only about 0.005% of cases are true, and this is the correct answer.

This answer is slightly counterintuitive because the probability is so low (after all, the test is supposedly 99% accurate); this impression comes from a cognitive fallacy known as base rate neglect. Consider that the probability of the illness being present is about 50 times larger after the diagnosis than before, but it remains small. The base rate is a necessary component to address the question of interest; we could not have answered the question without it.

Of course any Bayesian worth their salt would scoff at all the effort we just put in to demonstrate a basic result of probability theory. After all,

.99 * 1e-6 / (.99 * 1e-6 + .02 * (1-1e-6))
ans = 4.9498e-05

On to bigger things

We can apply the same analysis to significance testing! Let’s generate some data for a one-sample t test with n = 15. Some proportion of the data are generated from the null hypothesis:
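A sketch in Python of one way to set this up (half of the simulated experiments are null; the distribution of true effects when the null is false, here δ ~ N(0, 1), is an assumption, and a different choice changes the exact proportions):

```python
import numpy as np

rng = np.random.default_rng(1)

n_sims, n = 20_000, 15
t_crit = 2.1448  # two-sided .05 critical value of the t distribution, 14 df

# T = 1 when a true effect exists; half of the experiments are null.
T = rng.random(n_sims) < 0.5
# Effect size: 0 under the null, drawn from N(0, 1) otherwise (an assumption).
delta = np.where(T, rng.normal(0.0, 1.0, n_sims), 0.0)

# Simulate n = 15 observations per experiment and run one-sample t tests.
data = rng.normal(delta[:, None], 1.0, size=(n_sims, n))
tstat = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))
D = np.abs(tstat) > t_crit  # D = 1 when the test is significant

# Frequentist check: P(D = 1 | T = 0) is near .05 by construction.
print(D[~T].mean())
# Bayesian question: P(T = 0 | D = 1), the rate of mistaken discoveries.
print((~T)[D].mean())
```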

And now we can look at the outcomes with our Bayesian goggles: the distribution of true effects conditional on the outcome measure (in this case, whether the test was significant). Let’s plot the two histograms together, with the “nonsignificant” one flipped upside-down. Because I saw this for the first time in one of Jeff’s talks, I think of this as a Rouder plot:

I like this graph because you can read off interesting conditional distributions: if p > .05, many more true effects are 0 than when p < .05, which is comforting. But how many more?

Well, we can make that same table as above (now with proportions):

            T = 1    T = 0
D = 1        0.17     0.02
D = 0        0.33     0.47

Note here that, looking at the right column only, we can see that the frequentist guarantee is maintained: if the null is true, we see a significant result only roughly 5% of the time. However, if we decide the null is false on the basis of a significant effect (i.e., looking only at the top row), we will be mistaken at a different rate, found by dividing the value in the upper right (0.02) by the sum of the top row (0.19): roughly 10% of the time.
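Both conditional rates follow directly from the proportions in the table; in Python:

```python
# Proportions from the two-by-two table above.
p_D1_T1, p_D1_T0 = 0.17, 0.02   # top row: significant results
p_D0_T1, p_D0_T0 = 0.33, 0.47   # bottom row: nonsignificant results

# Frequentist guarantee: P(D = 1 | T = 0), conditioning on the right column.
false_alarm_rate = p_D1_T0 / (p_D1_T0 + p_D0_T0)
# Bayesian question: P(T = 0 | D = 1), conditioning on the top row.
false_discovery_rate = p_D1_T0 / (p_D1_T0 + p_D1_T1)

print(round(false_alarm_rate, 3))      # 0.02 / 0.49 -> 0.041
print(round(false_discovery_rate, 3))  # 0.02 / 0.19 -> 0.105
```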

The continuous domain

So far, we’ve looked only at discrete events: the effect is either zero or it is not (T) and the effect is either significant or it is not (D). If we call x the standardized sample means and t the true means, we can plot the joint distribution p(x, t) of their exact values:

Here, too, we can inspect the results of our simulation in a frequentist fashion by inspecting only columns: we can condition on a particular model (e.g., t = 0) and study the behavior of the data x under this model. Alternatively, and more interestingly, we can be Bayesian and see what distribution of true values t is associated with a particular observation x.

In making figures, we can again make it explicit that you can condition on the model (the frequentist idea; left) or on the data (the Bayesian idea; right):

The distribution on the left is a vertical slice from the joint distribution above — right out of the middle. It is the expected distribution of the observed data if the null hypothesis is true (t = 0). This is commonly known as the t distribution (here, with 14 degrees of freedom), which I’ve underlaid in red for emphasis. The distribution on the right is a horizontal slice: the distribution of true values that generated a particular observation (in this case, the ones that generated x values close to 0). This distribution is commonly called the posterior distribution.

I can think of no real-world use for the distribution on the left, but the distribution on the right tells you what you want to know: the distribution of the true effect size, given a possible data outcome. Obviously, you can make this figure for many possible outcomes, including the one from your experiment.
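The slicing logic can be reproduced numerically. A sketch in Python, assuming a particular distribution of true effects (half exact nulls, half drawn from N(0, 1); an assumption) and taking x to be the t statistic, so that the null slice follows a t distribution with 14 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(1)

n_sims, n = 200_000, 15

# True means: a mixture of exact nulls and effects drawn from N(0, 1).
t = np.where(rng.random(n_sims) < 0.5, 0.0, rng.normal(0.0, 1.0, n_sims))

# Observed t statistics from samples of size n = 15.
data = rng.normal(t[:, None], 1.0, size=(n_sims, n))
x = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))

# Vertical slice (frequentist): distribution of x given t = 0,
# which follows a t distribution with 14 degrees of freedom.
x_null = x[t == 0.0]
# Horizontal slice (Bayesian): distribution of t given x near 0.
t_post = t[np.abs(x) < 0.1]

print(x_null.std())  # near sqrt(14/12) = 1.08, the sd of a t(14) variate
print(t_post.std())  # much narrower than the spread of true effects
```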