Thursday, July 30, 2015

First off, let me say that I have no strong feelings about p-values, and some positive feelings
towards Bayes Factors. There are personal reasons for this: During my PhD, p-values set me off on a wild goose chase
(see my previous blog post). A few months before my thesis submission, Bayes
Factors saved the day by informing me that there was mostly no evidence for the
effect I was chasing in the first place. This would justify strong negative feelings
towards p-values on my part, but then
again, in situations of personal conflict it is always worth taking a step back
to consider what went wrong, whether the blame lies at least in part with
myself, and what I can learn from the experience for personal growth. This blog
post is my attempt to make peace with p-values:
my conclusion, so far, is that perhaps they are not to blame for everything,
and that there is a lot to learn from my mistakes.

P-values seem to
elicit a lot of strong feelings with a lot of people. They have received a
great deal of negative publicity lately, and they have been blamed for the
replicability crisis in psychology (e.g., Cumming, 2014; Halsey, Curran-Everett, Vowler, & Drummond,
2015).
The replicability crisis refers to the outcry that most research findings
published in scientific journals are false. Basically, what seems to be
happening is that journals prefer to publish papers which report sensational,
unexpected effects, that are likely to receive a lot of attention from the
general public. In contrast, journals prefer to not publish studies that
replicate an experiment, especially if these report a null-result. The consequences
of such an incentive system are clear: the literature gets filled with
sensational papers that report type-I errors (i.e., they claim the presence of an
effect when in reality the null hypothesis is true).

Given this explanation, the problem leading to the
replication crisis is not with p-values
per se, but rather with the way they
are used and interpreted. As the main argument of the current blog post, I propose that it is a grave mistake to draw conclusions
based on a single p-value (i.e., a
single published study). P-values are
a frequentist statistic, where the outcome of a single experiment can only be
considered in relation to a hypothetical infinite chain of events (or, in
practice, a large number of events). Take the classical coin example that
most psychology majors will remember from their introductory statistics course:
if you want to determine whether a coin is fair or not, you toss it 10 000
times to check whether around 50% of the outcomes are heads. Now imagine that
you want to determine whether the coin is fair or not, but you only toss it
once. What can you conclude about the fairness of the coin? Nothing. You toss
it twice, and get a head and a tail. Still, you cannot conclude anything. If
you toss it three times and get heads each time, you might raise an eyebrow.
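To put numbers on the coin example, here is a minimal sketch of an exact two-sided binomial test in Python. The function name binomial_two_sided_p is my own invention for illustration, and the code is only meant for a handful of tosses like the ones above, not for 10 000.

```python
from math import comb

def binomial_two_sided_p(heads, tosses, p_heads=0.5):
    """Exact two-sided binomial p-value under the null P(heads) = p_heads.

    Sums the probability of every outcome that is no more likely
    than the observed number of heads.
    """
    def pmf(k):
        return comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)

    observed = pmf(heads)
    # Small tolerance so that ties are not lost to floating-point rounding.
    return sum(pmf(k) for k in range(tosses + 1) if pmf(k) <= observed * (1 + 1e-9))

# One toss, one head and one tail, and three heads in a row.
for heads, tosses in [(1, 1), (1, 2), (3, 3)]:
    print(heads, "heads in", tosses, "tosses: p =", binomial_two_sided_p(heads, tosses))
```

One toss, or one head and one tail, gives p = 1. Three heads out of three gives p = 0.25: worth an eyebrow, perhaps, but nowhere near the conventional 5% threshold.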

The same holds true for an experiment: If the null
hypothesis of a given effect is true, and we run hundreds of experiments to
test for the presence of this effect, we would expect, by definition, that
around 5% of all p-values will be
significant at the 5% level (p < 0.05). If we run a single experiment, and get a p-value (i.e., the probability of obtaining
data at least as extreme as those observed if the null hypothesis were true) of 0.03, we can conclude nothing. If we
replicate this experiment and get a p-value
of 0.3, we still cannot conclude anything. However, if we run a series of
experiments and get low p-values in a
majority of them, we may raise an eyebrow and consider the possibility that the
effect is actually real. Thus, p-values
can be interpreted within a chain of events, but it is theoretically impossible
to draw conclusions from a single observation.
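The 5% figure is easy to check by simulation. The following is a rough sketch under assumptions of my own: each "experiment" draws 30 observations from a standard normal distribution (so the null hypothesis of a zero mean is true), and it is analysed with a two-sided z-test with known standard deviation. The function names and all default values are hypothetical choices for illustration.

```python
import math
import random

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, assuming a known sd (sigma)."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_experiments=5000, n_per_experiment=30, alpha=0.05, seed=1):
    """Share of experiments with p < alpha when the null is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_experiment)]
        if z_test_p(sample) < alpha:
            hits += 1
    return hits / n_experiments

print("share of significant results under the null:", false_positive_rate())
```

The printed share should land close to 0.05. Any one of those significant experiments, looked at in isolation, would be indistinguishable from a real finding.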

Some related caveats about p-values are explained very nicely by Schmidt (1992).
In a set of experiments, due to sampling error, the observed effect size will
be approximately normally distributed around the true population effect size. If a
population effect is real (δ > 0), it is nevertheless possible to obtain an
observed effect size of exactly zero (d = 0). Conversely, if the null
hypothesis is true, it is possible to obtain a significant p-value and a reasonably-sized observed effect. Therefore,
obtaining one significant (p < 0.05)
and one non-significant (p > 0.05)
result in two experiments tells us very little about whether the effect may be
real or not. Even more detrimental is the conclusion that the two p-values are intrinsically different and
must come from different populations. Such a result is often interpreted to
mean that there must be some unknown moderator determining whether the effect
is present or not, when it is likely that both observations simply reflect
sampling noise around a single true population parameter value (which may or
may not be zero).
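To get a feeling for how much observed effect sizes bounce around a single true value, here is a small simulation sketch. It is not taken from Schmidt (1992); the true effect of δ = 0.3, the 20 participants per group, and the function names are all assumptions I made up for illustration.

```python
import math
import random
import statistics

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)

def simulate_observed_d(true_delta=0.3, n_per_group=20, n_experiments=2000, seed=1):
    """Observed d in repeated experiments that share one true effect size."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n_experiments):
        treatment = [rng.gauss(true_delta, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        observed.append(cohens_d(treatment, control))
    return observed

ds = simulate_observed_d()
print("mean observed d:", round(statistics.mean(ds), 2))
print("smallest and largest d:", round(min(ds), 2), round(max(ds), 2))
print("share of experiments with d <= 0:", sum(d <= 0 for d in ds) / len(ds))
```

Every simulated experiment here samples from the same population, so there is no moderator at all. Even so, some experiments show an effect in the wrong direction, while others show one several times larger than the true value.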

So much for p-values
(and other frequentist statistics, such as confidence intervals or power
calculations): they are only meaningful when numerous studies are available,
and if we can be sure that the studies that are available are truly random
repeated samples. Given the current publication system, as described above,
neither of the two conditions is met: (1) Regardless of whether replications
yield significant p-values or not,
their publication is discouraged. Therefore, for some effects, there is only
one p-value available in the
literature. (2) Replications are especially difficult to publish if they yield
a non-significant result. This means that positive-result replications are
over-represented in the literature. In the worst-case scenario, a “consistent
effect” in the literature simply represents the 5% false positives (Rosenthal, 1979). In defence of p-values, then: their distribution can tell us a lot. It can give us information about the presence
or absence of an effect, and even about questionable research practices for a given research area (Simonsohn, Nelson, & Simmons, 2014a; 2014b).
However, this becomes difficult or impossible in a system where a large
proportion of experiment outcomes are never reported because they do not meet
an arbitrary criterion, and where conclusions are generally drawn from a single
p-value.
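As a rough illustration of why the distribution of p-values is informative, the sketch below compares significant p-values from experiments where the null is true with those from experiments where there is a real effect. This is only the intuition, not the p-curve procedure of Simonsohn and colleagues; the z-test with known standard deviation, the sample size of 20, the true mean of 0.5, and the function names are all assumptions of mine.

```python
import math
import random

def simulate_p_values(true_mean, n_per_experiment=20, n_experiments=10000, seed=1):
    """Two-sided z-test p-values (known sd = 1) for repeated experiments."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_experiments):
        mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n_per_experiment)) / n_per_experiment
        z = mean * math.sqrt(n_per_experiment)
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values

def share_below_half_alpha(p_values, alpha=0.05):
    """Among significant p-values, the share that falls below alpha / 2."""
    significant = [p for p in p_values if p < alpha]
    return sum(p < alpha / 2 for p in significant) / len(significant)

null_share = share_below_half_alpha(simulate_p_values(true_mean=0.0))
effect_share = share_below_half_alpha(simulate_p_values(true_mean=0.5))
print("null true:   share of significant p-values below 0.025 =", round(null_share, 2))
print("real effect: share of significant p-values below 0.025 =", round(effect_share, 2))
```

When the null is true, significant p-values are spread evenly between 0 and 0.05, so about half of them fall below 0.025. With a real effect, very small p-values are over-represented. Telling the two patterns apart requires seeing many p-values, which is exactly what selective publication prevents.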

But are Bayes Factors better, and if so, how? Unlike
frequentist theorists, Bayesians do not start off with the assumption of a
hypothetical infinite chain of events. Instead, one starts off with a prior
belief about what kind of effect one might expect. If research on this question
is already available, one can use that to guide the expected effect size;
otherwise, an educated guess will do, with an adjustment in the prior
distribution to reflect a high degree of uncertainty. After the data is
collected, the prior belief is combined with the data to yield a posterior
distribution. Thus, a Bayesian analysis allows one to update one’s a priori beliefs in light of incoming
data. More comprehensive explanations of the Bayesian framework are available
in Dienes (2011)
and van de Schoot et al. (2014).

Back to the coin example: if a Bayesian wants to check
whether a coin is fair, she starts off with an a priori belief. If there is no reason to believe that the coin is
rigged, the belief would be that in a coin toss, heads and tails are equally
likely outcomes. Say the first two coin tosses provide two heads. While this is
not overly strong evidence against this prior belief, she might be somewhat
inclined to consider the possibility that the coin is biased towards heads. If the
third coin toss provides tails, the degree of belief shifts back towards the “unbiased”-hypothesis.
Thus, each consecutive coin toss allows the Bayesian to update her belief about
the coin. The more data is available, the more confident the Bayesian will be
that her belief is correct.
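The updating in the coin example can be sketched with a Beta prior on the probability of heads, which is a standard textbook choice for coin tosses. The specific prior, Beta(10, 10), is an assumption of mine to represent a moderately firm belief that the coin is fair, and update_beta is a hypothetical helper name.

```python
def update_beta(prior_heads, prior_tails, tosses):
    """Posterior mean of P(heads) before any toss and after each toss.

    The belief is a Beta(prior_heads, prior_tails) distribution; each
    observed head or tail adds one to the corresponding parameter.
    """
    a, b = prior_heads, prior_tails
    means = [a / (a + b)]
    for toss in tosses:
        if toss == "H":
            a += 1
        else:
            b += 1
        means.append(a / (a + b))
    return means

# Two heads followed by one tail, starting from a Beta(10, 10) prior.
print([round(m, 3) for m in update_beta(10, 10, "HHT")])
# [0.5, 0.524, 0.545, 0.522]
```

The expected probability of heads creeps up after the two heads and moves back towards 0.5 after the tail. A weaker prior, such as Beta(1, 1), would swing much more on the same three tosses.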

If a large amount of unbiased data is available, a
frequentist and a Bayesian studying the same question would probably always
converge in their conclusions. This is a big “if”, though, that does not seem
to be met in the available published literature. As such, a Bayesian approach
might be more suited to drawing conclusions based on a small amount of available data: even
though it would be optimal to have as much data as possible within any
theoretical framework, a Bayesian approach, but not a frequentist one, allows for conclusions based on
few studies. It is important to bear in mind that, in a Bayesian
framework, there is a large degree of uncertainty in one’s beliefs if the data
on which they are based is sparse. Furthermore, Bayesian analyses are not immune to publication bias: if one considers only papers that report type-I errors, one is less likely to arrive at the conclusion that the null-hypothesis is true, even if this would be evident from a set of experiments using truly random samples.

In conclusion, it seems that the main problem is not intrinsic to p-values per se, but lies rather in the data that is available in the literature,
and in the common practice of drawing conclusions based on a single study. The former may be a consequence of the latter: if researchers believe that the p-value of a single study can provide a convincing answer to the question "Is there an effect?", there indeed seems to be very little use in publishing replications. However, we know that a single p-value cannot provide such an answer. Furthermore, regardless of the theoretical framework that one adopts for statistical inference, it is always necessary to have multiple studies in order to be confident about the conclusions that we draw.

Friday, July 17, 2015

I grew up in a house of mathematicians. Among other things,
this means that throughout my childhood, I heard a lot of jokes about
physicists (the mathematicians’ equivalent of blonde jokes). As a child, I used
to find those jokes mildly amusing. When I started learning about research
methods in psychology, I started finding them really funny, but sad at the same
time – after all, if a hard science like physics is a laughing stock because of
its dodgy scientific practices, what does this mean for a soft science like
psychology?

Here’s an example of a physicist joke:

A physicist is doing a talk at a
conference. He is holding up a graph, and is explaining what the data on it
means. After half an hour, a graduate student timidly raises her hand and
politely notes that the graph is held upside down. The physicist stops, looks
at the graph, turns it the right way up, and says: “Why, you’re right! Well, in
this case the data is even easier to explain!”

The moral of the story is that, for a reasonably intelligent
and creative person, it is almost always possible to come up with a plausible-sounding
explanation for any set of results. This, of course, is already well-known: for
this reason, the scientific method entails explicitly stating a hypothesis
before the data is collected and analysed. However, in psychological research
this principle is not straightforward, for a simple reason: it is very rare
for the data to behave in the way that was anticipated.

Like probably everyone else in the field, I learnt this the
hard way during my PhD. I conducted a study to look at an effect in three
conditions (let’s call the size of the effect in the three conditions A, B, and
C, respectively). Two theories (let’s call them X and Y) made opposing
predictions:

If X is true, A < B < C.

If Y is true, A > B > C.

The result? B = 0 < A = C.

I think this scenario is familiar to anyone who has ever
done an experiment in the so-called soft sciences, and it’s a PhD student’s
worst nightmare. What does one do with a set of results like these? One of my advisors
said I can do one of the following: (1) Figure out why we got this unexpected
set of results, (2) write up a paper with our initial predictions in the
introduction, our results, and conclude that ‘more research is needed’ to
understand this unexpected set of results, or (3) forget about the whole thing.

In retrospect, I should have done (3), but due to my
stubbornness I went for (1) instead. I had numerous meetings with my advisors
to discuss how any theory could account for the obtained results – but in this
case, we could not even come up with a reasonable-sounding explanation. Then I
decided to collect some more data. I conducted four more studies with larger
samples, and eventually performed a meta-analysis of all the data on this
effect that I could get my hands on (which, aside from the data that I had
collected, was not much). Thus having maximised the power to detect a
potentially true effect, I found that A = B = 0, and C is only slightly larger
than zero. On the bright side, this finally allowed me to conclude that the
data is more compatible with Theory X than Theory Y, but at this stage I had
wasted hours of my advisors’ time and most of my PhD trying to understand the
results of the first experiment, which were basically just random noise.

This is where Hyman’s Maxim comes in. I came across it in this
blogpost by chance, after I had already submitted my PhD thesis. The maxim
says: “Do not try to explain something until you are sure there is something to
be explained.” Ray Hyman
started off as a magician, but later became a skeptic and a psychologist. Aside
from the blogpost, I have not found any publications on Hyman’s Maxim, but in
my opinion, this is the most important principle in psychological science, and
possibly any science that involves drawing inferences from data. As a
scientist’s main job is to obtain data that can support or refute theories, it
is easy to get carried away with drawing the link between data and theory, and
to forget how important it is to ensure that the data actually tells you what
you think it tells you. In psychology, with generally small effects and noisy
data, the non-zero probability that a statistically significant effect reflects
random noise is often forgotten. Consequently, any statistically significant
result is in danger of being interpreted as ‘meaningful’: if the a priori theory did not predict it, we
must be missing something, there must be some explanation, or moderating
factor, which should explain this unexpected result.

In conclusion, unless we have ensured that an unexpected
result is replicable, drawing inferences from a single study with a statistically
significant result that was not predicted a
priori is a lot like telling someone’s future from the stars or their tea
leaves. In fact, if the null hypothesis happens to be true, it is literally
like telling someone’s future from the stars or their tea leaves. On some
level, everyone knows this already, but perhaps it is easy to forget this point.
My proposed solution to this problem: Create motivational posters starring
Hyman’s Maxim. Put them up in every psychological scientist’s office and
bathroom.