In March 2011 the highly respected Journal of Personality and Social
Psychology published a paper by the distinguished
psychologist Daryl Bem, of Cornell University in the USA. The paper
reports a series of experiments which, Bem claims, provide evidence
for some types of extra-sensory perception (ESP). These can occur only
if the generally accepted laws of physics are not all true. That's a
pretty challenging claim. And the claim is based largely on the
results of a very common (and very commonly misunderstood) statistical
procedure called significance testing. Bem's experiments
provide an excellent way into looking at how significance testing
works and at what's problematic about it.

Bem's experiments and what he did with the data

Bem's article reports the results of nine different
experiments, but for our purposes it's enough to look only at Experiment 2. This is based on well-established
psychological knowledge about perception. Images that are flashed up
on a screen for an extremely short time, so short that the conscious
mind does not register that they have been seen at all, can still
affect how an experimental subject behaves. Such images are said to be
presented subliminally, or to be subliminal
images. For instance people can be
trained to choose one item rather than another by presenting them
with a pleasant, or at any rate not unpleasant (neutral), subliminal
image after they have made the "correct" choice and a very unpleasant
one after they have made the "wrong" choice.

Bem, however, did something rather different in Experiment
2. As in a standard experiment of this sort his participants had to
choose between two closely matched pictures (projected clearly on a
screen, side by side). Then they were presented with a neutral subliminal image if they
had made the "correct" choice, and an unpleasant subliminal image if
they had made the "wrong" choice. The process was then repeated with a
different pair of pictures to choose between. Each participant made
their choice 36 times, and there were 150 participants in all.

The ESP controversy

Bem's paper sparked a considerable amount of debate. The Journal of Personality and Social Psychology recognised that the statistical aspects are crucial and took the unusual step of publishing an editorial alongside Bem's paper, explaining their reasons for publishing it, and also including in the same journal another paper by the Dutch psychologist Eric-Jan Wagenmakers criticising Bem's work. Bem and colleagues provided a response to Wagenmakers, who in turn responded. Several other researchers and commentators have joined in too. This one will run and run. The New York Times has also published two articles on the matter.

But the new feature of Bem's experiment was that when the
participants made their choice between the two pictures in each pair,
nobody — not the participants, not the experimenters — could know which
was the "correct" choice. The "correct" choice was determined by a
random mechanism, after a picture had been chosen by the
respondent.

If the experiment was working as designed, and if the laws of
physics relating to causes and effects are as we understand them,
then the subliminal images could have no effect at all on the
participants' choices of picture. This is because at the time they made their
choice there was no "correct" image to choose. Which image was
"correct" was determined afterwards. Therefore, given the way the
experiment was designed (I have not explained every detail) one
would expect each participant to be "correct" in their choice half the
time, on average. Because of random variability, some would get more
than 50% right, some would get less, but on average, people would make
the right choice 50% of the time.
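
To get a feel for what "random variability" means here, the following is a minimal simulation sketch, not anything taken from Bem's paper. It assumes that every choice is an independent fair coin flip, which is what the null situation described above amounts to. The function name simulate_hit_rates and the seed value are my own choices for illustration.

```python
import random

random.seed(2011)  # arbitrary fixed seed so that the run is repeatable

N_PARTICIPANTS = 150  # number of participants, from the article
N_TRIALS = 36         # choices made by each participant, from the article


def simulate_hit_rates(n_participants, n_trials, p_correct=0.5):
    """Return each simulated participant's proportion of "correct" choices.

    Assumes every choice is an independent coin flip that comes up
    "correct" with probability p_correct.
    """
    rates = []
    for _ in range(n_participants):
        hits = sum(random.random() < p_correct for _ in range(n_trials))
        rates.append(hits / n_trials)
    return rates


hit_rates = simulate_hit_rates(N_PARTICIPANTS, N_TRIALS)
mean_rate = sum(hit_rates) / len(hit_rates)
n_above = sum(rate > 0.5 for rate in hit_rates)
n_below = sum(rate < 0.5 for rate in hit_rates)

print(f"average proportion correct: {mean_rate:.3f}")
print(f"participants above 50%: {n_above}, below 50%: {n_below}")
```

Individual simulated participants scatter quite widely around 50%, but the average across all 150 stays close to it. Changing the seed gives a different scatter with the same overall pattern.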

What Bem found was that the average percentage of correct choices, across his
150 participants, was not 50%. It was slightly higher: 51.7%.

There are several possible explanations for this finding, including the following:

1. The rate was higher than 50% just because there is random variability, both in the way people respond and in the way the "correct" image was selected. That is, nothing very interesting happened.

2. The rate was higher than 50% because the laws of cause and effect are not as we understand them conventionally, and somehow the participants could know something about which picture was "correct" before the random system had decided which was "correct".

3. The rate was higher than 50% because there was something wrong with the experimental setup and the participants could get an idea about which picture was "correct" when they made their choice, without the laws of cause and effect being broken.

4. More subtly, these results are not typical, in the sense that actually more experiments were done than are reported in the paper, and the author chose to report the results that were favourable to the hypothesis that something happened that casts doubt on the laws of cause and effect, and not to report the others. Or perhaps more and more participants kept being added to the experiment until the results happened to look favourable to that hypothesis.

I won't consider all these in detail. Instead I'll concentrate on how and why Bem decided that Explanation 1 (the first in the list above) was not a likely
explanation.

Bem carried out a significance test. Actually he carried out
several different significance tests, making slightly different
assumptions in each case, but they all led to more or less the same
conclusion, so I'll discuss only the simplest. The resulting p value was
0.009. Because this value is small, he concluded that Explanation 1
was not an adequate explanation
and that the result was statistically significant.

This is a standard statistical procedure which is very commonly used. But
what does it actually mean and what is this p value?

What's a p value?

All significance tests involve a null hypothesis, which
(typically) is a statement that nothing very interesting has
happened. In a test comparing the effects of two drugs the usual null
hypothesis would be that, on average, the drugs do not differ in their
effects. In Bem's Experiment 2 the null hypothesis is that Explanation 1
is true: the true average proportion of "correct"
answers is 50% and any difference from 50% that is observed is simply
due to random variability.

The p value for the test is found as follows. One assumes
that the null hypothesis is really true. One then calculates the
probability of observing the data that were actually observed, or
something more extreme, under this assumption. That probability is the
p value. So in this case, Bem used standard methods to
calculate the probability of getting an average proportion correct of
51.7%, or greater, on the assumption that all that was going on was
chance variability. He found this probability to be 0.009. (I'm
glossing over a small complication here, that in this case the
definition of "more extreme" depends on the variability of the data as
well as their average, but that's not crucial to the main ideas.)
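
The recipe in the paragraph above can be sketched in a few lines of code. This is a deliberately simplified version and not the test Bem actually used: it assumes that all 150 x 36 = 5400 choices are independent fair coin flips, and it takes the observed number of "correct" choices to be 2792, a figure I have reconstructed from the rounded 51.7% rather than taken from the paper. The function name binomial_upper_tail is my own.

```python
from math import comb

N_CHOICES = 150 * 36  # 5400 choices in total, from the article
# Assumed count: reconstructed from the rounded 51.7%, not taken from the paper.
OBSERVED_HITS = round(0.517 * N_CHOICES)


def binomial_upper_tail(n, k_min):
    """Return P(X >= k_min) for X ~ Binomial(n, 1/2), using exact integers."""
    term = comb(n, k_min)  # number of outcomes with exactly k_min hits
    total = term
    for k in range(k_min, n):
        # Step from C(n, k) to C(n, k + 1); the division is always exact.
        term = term * (n - k) // (k + 1)
        total += term
    return total / 2**n  # each of the 2**n outcomes is equally likely


p_value = binomial_upper_tail(N_CHOICES, OBSERVED_HITS)
print(f"observed hits (assumed): {OBSERVED_HITS} of {N_CHOICES}")
print(f"one-sided p value under the null hypothesis: {p_value:.4f}")
```

Because this sketch ignores the participant-to-participant variability mentioned in the parenthesis above, it should not be expected to reproduce the reported 0.009 exactly, only to give a small probability of the same order.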

The p value is not the probability that the laws of physics don't apply.

Well, 0.009 is quite a small probability. So we have two
possibilities here. Either the null hypothesis really is true but
nevertheless an unlikely event has occurred. Or the null hypothesis
isn't true. Since unlikely events do not occur often, we should at
least entertain the possibility that the null hypothesis isn't
true. Other things being equal (which they usually aren't), the
smaller the p value is the more doubt it casts on the null
hypothesis. How small the p value needs to be in order for us
to conclude that there's something really dubious about the null
hypothesis (and hence that, in the jargon, the result is
statistically significant and the null hypothesis is
rejected) depends on circumstances. Sometimes the values of
0.05 or 0.01 are used as boundaries and a p value less than
that would be considered a significant result.

This line of reasoning, though standard in statistics, is not at
all easy to get one's head round. In an experiment like this, one often sees the p value interpreted as follows:

(a) "The p value is 0.009. Given that we've got these results, the
probability that chance alone is operating is 0.009." That is WRONG.

The correct way of putting it is

(b) "The p value is 0.009. Given that chance alone is operating, the
probability of getting results like these is 0.009."

(That's not quite
the whole picture, because it doesn't include the part about "results
at least as extreme as these", but it's close enough for most
purposes.)

So the difference between (a) and (b) is that the "given" part and the part
that has probability of 0.009 are swapped round.

It may well not be obvious why that matters. The point is that the
answers to the two questions "Given A, what's the probability of B?"
and "Given B, what's the probability of A?" might be quite
different. An example is to imagine that you're picking a random
person off the street in London. Given that the person is a Member of
(the UK) Parliament, what's the probability that they are a British
citizen? Well that probability would be high.

What about the other way round? Given that this random person is a
British citizen, what's the probability that they are an MP? I hope
it's clear to you that that probability would be very low. The great
majority of the British citizens in London are not MPs. So it's fairly obvious (I hope!) that swapping the "given" part and
the probability changes things.
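
The MP example can be made concrete with a small calculation. The head counts below are invented purely for illustration (they are not real statistics about London), and the helper name conditional is my own. The point is only that the same table of counts gives very different answers depending on which event is the "given".

```python
# Hypothetical head counts for people you might stop on a London street.
# These figures are made up for illustration; they are not real statistics.
counts = {
    ("mp", "citizen"): 600,
    ("mp", "not_citizen"): 5,
    ("not_mp", "citizen"): 6_000_000,
    ("not_mp", "not_citizen"): 3_000_000,
}


def is_mp(key):
    return key[0] == "mp"


def is_citizen(key):
    return key[1] == "citizen"


def conditional(table, event, given):
    """Return P(event | given) from a table of counts keyed by (role, status)."""
    given_total = sum(n for key, n in table.items() if given(key))
    both_total = sum(n for key, n in table.items() if given(key) and event(key))
    return both_total / given_total


p_citizen_given_mp = conditional(counts, is_citizen, is_mp)
p_mp_given_citizen = conditional(counts, is_mp, is_citizen)

print(f"P(British citizen | MP) = {p_citizen_given_mp:.4f}")
print(f"P(MP | British citizen) = {p_mp_given_citizen:.6f}")
```

With these made-up counts the first probability is close to 1 and the second is tiny, even though both are computed from exactly the same people.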

Nevertheless, Bem is interpreting his significance test in the
commonly used way when he deduces, from a p value of 0.009,
that the result is significant and that there may well be more going
on than simply the effects of chance. But, just because this is
common, that doesn't mean it's always correct.

Out of the nine experiments that Bem reports, he found significant
p values in all but one (where he seems to be taking
"significant" as meaning "p less than 0.05" - the values
range from 0.002 to 0.039, with the one he regards as non-significant
having a p value of 0.096). Conventionally, these would
indeed be taken as Bem interprets them, pointing in the direction
that most of the null hypotheses are probably not true, which in the
case of these experiments means that there is considerable evidence of
the laws of physics not applying everywhere. Can that really be the
case?

About the author

Kevin McConway is Professor of Applied Statistics at the Open University. As well as teaching statistics and health science, and researching the statistics of ecology and evolution and of decision making, he works in promoting public engagement in statistics and probability. Kevin is an occasional contributor to the Understanding Uncertainty website.