Bayes factors vs p-values

Bayesian analysis and Frequentist analysis often lead to the same conclusions by different routes. But sometimes the two forms of analysis lead to starkly different conclusions.

The following illustration of this difference comes from a talk by Luis Pericchi last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extrasensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is p = 0.5. The alternative hypothesis was that p is not 0.5. There were N = 104,490,000 bits emitted during the experiment, and s = 52,263,471 were 1’s. The p-value, the probability of an imbalance this large or larger under the assumption that p = 0.5, is 0.0003. Such a tiny p-value would be regarded as extremely strong evidence in favor of ESP given the way p-values are commonly interpreted.

The Bayes factor, however, is 18.7, meaning that the null hypothesis appears to be about 19 times more likely than the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).

So given the data and assumptions in this example, the Frequentist concludes there is very strong evidence for ESP while the Bayesian concludes there is strong evidence against ESP.

The following Python code shows how one might calculate the p-value and Bayes factor.
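What follows is a minimal sketch of that calculation using only the standard library. It assumes a normal approximation to the binomial for the two-sided p-value, which should be very accurate at this sample size, and it works with the log of the Bayes factor to avoid underflow. The names `log_beta`, `p_value`, and `bayes_factor` are illustrative choices.

```python
from math import erfc, exp, lgamma, log, sqrt

N = 104490000   # bits emitted
s = 52263471    # bits equal to 1

def log_beta(a, b):
    """Log of the beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Two-sided p-value under H0: p = 0.5.
# Assumption: normal approximation to the binomial tail.
z = (s - N / 2) / (sqrt(N) / 2)
p_value = erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))
print(p_value)

# Bayes factor in favor of H0 against H1: p ~ Beta(0.5, 0.5).
# The binomial coefficient cancels between the two marginal likelihoods,
# leaving 0.5^N for H0 and B(s + 0.5, N - s + 0.5) / B(0.5, 0.5) for H1.
log_bf = N * log(0.5) - log_beta(s + 0.5, N - s + 0.5) + log_beta(0.5, 0.5)
bayes_factor = exp(log_bf)
print(bayes_factor)
```

The two printed values should land close to the 0.0003 and 18.7 quoted above. An exact binomial tail would serve equally well for the p-value.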

Take any study that claims to support ESP, search and replace ESP with something people find plausible, and there will be no criticism. You could say this is confirmation bias. Or you could say everyone is really a Bayesian, and most of us have a highly informative prior belief that ESP is bunk. The former argument says people are irrational, the latter says they are rational, and yet both arguments are largely the same!

Pericchi’s argument was that p-values behave paradoxically for large samples and should be adjusted for sample size. He shows how to do this adjustment so that p-values and Bayes factors agree asymptotically.

Right. My point was simply that the Bayes factor can change dramatically when a theoretically meaningful alternative prior is used. In the Pericchi example, the proportion of “heads” in the sample was barely bigger than 50%. If this were a real domain of research, apparently with enormous sample sizes, then the researchers would know in advance, from previous experience, that the phenomenon in question is barely bigger than chance. If the alternative prior is supposed to represent the alternative hypothesis, then the alternative prior should express the prior knowledge. The researchers’ hypothesis might, therefore, be better expressed by a beta(5010,4990) prior than by a beta(0.5,0.5) prior. Then the Bayes factor is 0.15 for the null; that is, against the null. [By the way, I get a Bayes factor of 18.69 for the beta(0.5,0.5) prior.] This does not contradict anything you’ve said; it’s just a reminder that Bayes factors are only as meaningful as the hypotheses being compared.
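The sensitivity to the alternative prior described in this comment can be checked with a short sketch. It assumes the same conjugate beta-binomial setup as the post; the function name `bayes_factor_null` is hypothetical.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Log of the beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor_null(N, s, a, b):
    """Bayes factor for H0: p = 0.5 against H1: p ~ Beta(a, b)."""
    log_m0 = N * log(0.5)                                 # marginal under H0
    log_m1 = log_beta(s + a, N - s + b) - log_beta(a, b)  # marginal under H1
    return exp(log_m0 - log_m1)

N, s = 104490000, 52263471

bf_jeffreys = bayes_factor_null(N, s, 0.5, 0.5)    # diffuse Jeffreys prior
bf_informed = bayes_factor_null(N, s, 5010, 4990)  # prior concentrated near 0.501
print(bf_jeffreys, bf_informed)
```

Only the prior on the alternative changes between the two calls, yet the Bayes factor moves from favoring the null to favoring the alternative.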

Isn’t the p-value issue one of the amount of data being so large that even tiny differences become significant, whether practical or not? I ran a one-proportion test in Minitab and found that against the null of true p = 0.5, the p-value was indeed at or near zero and the 95% confidence interval for the true proportion is (0.500081, 0.500273). I think if I were the frequentist researcher I would not be celebrating my discovery of ESP. :-)
But nonetheless, great warning to be careful no matter what tool one is using.
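The interval quoted in this comment can be reproduced approximately with a normal-approximation (Wald) interval. That choice is an assumption: Minitab may use a different method by default, although at this sample size the common methods should agree to many decimal places.

```python
from math import sqrt

N, s = 104490000, 52263471

p_hat = s / N                        # observed proportion of 1's
se = sqrt(p_hat * (1 - p_hat) / N)   # standard error of the proportion
z_95 = 1.959964                      # two-sided 95% normal quantile

lower = p_hat - z_95 * se
upper = p_hat + z_95 * se
print(f"p_hat = {p_hat:.6f}, 95% CI = ({lower:.6f}, {upper:.6f})")
```

The interval excludes 0.5, but only by about one part in ten thousand, which is the commenter's point about statistical versus practical significance.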

You may have mentioned it and I missed it, but I believe that the technical term is that p-values for a point hypothesis are not consistent. Not sure if this is (as much of) a problem for a one-tailed hypothesis or not.

Thanks for presenting this example. Lindley’s paradox (Wikipedia) has a nice discussion of why the results differ and are not contradictory:
The Frequentist finds that the null hypothesis is a poor explanation for the observation, where the Bayesian finds that the null hypothesis is a far better explanation for the observation than the alternative.

First time I’ve come across Lindley’s paradox. I can follow the numbers but can’t quite get my head around how increasing sample size can increase the probability of falsely accepting an alternative hypothesis… seems completely contrary to my understanding of hypothesis testing. Any chance of a follow up post?
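One way to see the sample-size effect numerically is to hold the z-score fixed, so the p-value stays essentially the same, and let N grow. The sketch below assumes the Beta(0.5, 0.5) alternative from the post and uses z = 3.6, close to the value in the example and chosen so that s comes out as a whole number. The function name is hypothetical.

```python
from math import exp, lgamma, log, sqrt

def log_beta(a, b):
    """Log of the beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor_null(N, s):
    """Bayes factor for H0: p = 0.5 against H1: p ~ Beta(0.5, 0.5)."""
    return exp(N * log(0.5)
               - log_beta(s + 0.5, N - s + 0.5)
               + log_beta(0.5, 0.5))

z = 3.6   # fixed z-score, so the two-sided p-value is roughly constant
bfs = []
for N in (10**2, 10**4, 10**6, 10**8):
    # Number of 1's that sits z standard deviations above N/2.
    s = N // 2 + round(z * sqrt(N) / 2)
    bfs.append(bayes_factor_null(N, s))
    print(N, s, bfs[-1])
```

With the p-value held roughly constant, the Bayes factor for the null grows with N. For the small samples it favors the alternative and for the largest it favors the null, which is the pattern Lindley’s paradox describes.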

I don’t understand what’s the problem with the p-value here. Seems like such a sample would in fact be strong evidence for ESP, because such samples under the null are really unlikely. What’s the problem? Isn’t that how it should be?

“The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is p = 0.5.” – isn’t this false? The null hypothesis is ONLY that p = 0.5, not also that the individual had no influence. Therefore, there is only strong evidence in favour of p ≠ 0.5 under the conditions observed during the experiment, not necessarily strong evidence for ESP.

Given there’s no mention of a control (what happens when nobody is around to influence the bit stream?), the experimental set-up doesn’t tell you anything about ESP at all.

As for the Bayesian approach, Beta(0.5,0.5) is an incredibly conservative prior for a process that should (presumably) strongly be considered p=0.5. No wonder the null is favoured.

Rather than illustrating differing conclusions made by frequentist and Bayesian approaches, the experimental setup appears to simply show how not to do either.

I disagree with some of the above. The null hypothesis has NOTHING to do with the probability of ‘1’ or ‘0’; the null hypothesis is that the “subject’s attempt to influence the distribution did not change the distribution.”

The likely reason that the p-value was so small is that random number generators do not output perfectly random numbers, and large populations yield very sensitive p-value tests.

IMO, a better experimental design would assume nothing about the underlying distribution of ‘0’ and ‘1’ but simply measure the frequencies for a control run, then measure the frequencies when “ESP is being attempted”, and do something like a chi-squared test on those two distributions.

RE: “The p-value, the probability of an imbalance this large or larger under the assumption that p = 0.5, is 0.0003. Such a tiny p-value would be regarded as extremely strong evidence in favor of ESP given the way p-values are commonly interpreted.”

The probability that an infinite population would yield a value different from 0.5 (given the actual sample size of 104+ million bits) is only 0.03%! Therefore, the hypothesis of an ESP effect would be rejected at a confidence level of 99.5% by frequentist theory (i.e., Fisher’s theory, treating confidence level as a matter of choice, not fixed at 95%). Moreover, if there had been, in fact, an ESP effect, it was so small as to be of negligible importance for most ordinary situations.