Wednesday, May 29, 2013

WHY PUBLISHED SIGNIFICANCE VALUES ARE (MOSTLY) LIES

In my recent post advocating the abandonment of NHST (null-hypothesis significance testing), I skimmed over two important issues that I think need more elaboration: the multiple-test problem and publication bias (also called the “file-drawer” bias). The two are deeply related and ought to make everyone profoundly uncomfortable about the true meaning of achieved significance levels reported in the scientific literature.

The multiple-test problem is wonderfully illustrated by this cartoon, xkcd's 'Significant', by the incomparable Randall Munroe. The cartoon was the subject of Monday's post here on MT (so you can scroll down to see it), but I thought it worthy of further reflection. (If you don’t know Munroe's cartoons, check ‘em out. In combination they’re like a two-dimensional “Big Bang Theory”, only deeper. I especially like “Frequentists vs. Bayesians”.)

If you don’t get the "Significant" cartoon, you need to study up on the multiple-test problem. But briefly stated, here it is: You (in your charming innocence) are doing the Neyman-Pearson version of NHST. As required, you preset an α value, i.e. the highest probability of making a Type 1 error (rejecting a true null hypothesis) that you’re willing to accept as compatible with a rejection of the null. Like most people, you choose α = 0.05, which means that if your test were to be repeated 20 times on different samples or using different versions of the test/model, you’d expect to find about one “significant” test even when your null hypothesis is absolutely true. Well, okay, most people seem to find 0.05 a sufficiently small value to proceed with the test and reject the null whenever p < 0.05 on a single test. But what if you do multiple tests in the course of, for example, refining your model, examining confounding effects, or just exploring the data-set? If, for example, you do 20 tests on different colors like the jellybean scientists, then there’s a quite high probability (about 64% if the tests are independent, since 1 − 0.95^20 ≈ 0.64) of getting at least one result with p < 0.05 even if the null hypothesis is eternally true. If you own up to the multiplicity of tests – or, indeed, if you’re even aware of the multiple-test problem – then there are various corrections you can do to adjust your putative p values (the Bonferroni correction is undoubtedly the best-known and most widely used of these). But if you don’t and if you only publish the one “significant” result (see the cartoon), then you are, whether consciously or unconsciously, lying to your readers about your test results. Green jellybeans cause acne!! (Who knew?)
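To put a number on that "quite high probability", here is a minimal Python sketch. It assumes the 20 tests are independent and that every null hypothesis is true, so each p value is uniformly distributed between 0 and 1. The function names and the number of simulated studies are illustrative choices, not anything from the cartoon.

```python
import random


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one p < alpha among n_tests independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_jellybean_studies(alpha: float = 0.05, n_colors: int = 20,
                               n_studies: int = 20000, seed: int = 1) -> float:
    """Fraction of simulated studies in which at least one color looks 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under a true null, each p value is a uniform draw on [0, 1).
        if any(rng.random() < alpha for _ in range(n_colors)):
            hits += 1
    return hits / n_studies


analytic = familywise_error_rate(0.05, 20)
simulated = simulate_jellybean_studies()
print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

Under these assumptions, roughly two out of three jellybean studies will turn up at least one "significant" color even though no color does anything at all.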

If you do multiple tests and include all of them in your publication (a very rare practice), then even if you don’t do the Bonferroni correction (or whatever) your readers can. But what if you do a whole bunch of tests and include only the “significant” result(s) without acknowledging the other tests? Then you’re lying. You’re publishing the apparent green jellybean effect without revealing that you tested 19 other colors, thus invalidating the p < 0.05 you achieved for the greenies. Let me say it again: you’re lying, whether you know it or not.
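Here is a rough Python sketch of the Bonferroni arithmetic a reader could do, provided the total number of tests is disclosed. The reported p value of 0.03 is hypothetical (the cartoon only says p < 0.05), and the function names are my own.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the overall Type 1 rate at or below alpha."""
    return alpha / n_tests


def bonferroni_adjusted_p(p: float, n_tests: int) -> float:
    """Equivalent view: inflate the p value by the number of tests, capped at 1."""
    return min(1.0, p * n_tests)


reported_p = 0.03   # hypothetical p value for the green jellybeans
n_tests = 20        # 20 colors tested, only one of them reported

threshold = bonferroni_threshold(0.05, n_tests)
adjusted = bonferroni_adjusted_p(reported_p, n_tests)
still_significant = reported_p < threshold

print(f"per-test threshold: {threshold:.4f}")
print(f"adjusted p value:   {adjusted:.2f}")
print(f"still significant:  {still_significant}")
```

The catch is that the reader can only do this if `n_tests` is reported. If the 19 other colors are hidden, the correction is impossible.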

This problem is greatly compounded by publication bias. Understandably wanting your paper to be cited and to have an impact, you submit only your significant results for publication. Or your editor won’t accept a paper for review without significant results. Or your reviewers find “negative” results uninteresting and unworthy of publication. Then significant test results end up in print and non-significant ones are filtered out – thus the bias. As a consequence, we have no idea how to interpret your putative p value, even if we buy into the NHST approach. Are you lying? Do you even know you’re lying?

This problem is real and serious. Many publications have explored how widespread the problem is, and their findings are not encouraging. I haven’t done a thorough review of those publications, but a few results stick in my mind. (If anyone can point me to the sources, I’d be grateful.) One study (in psychology, I think) found that the probability of submitting significant test results for publication was about 75% whereas the probability of submitting non-significant results was about 5%. (The non-significant results are, or used to be, stuck in a file cabinet, hence the alternative name for the bias.) Another study of randomized clinical trials found that failure to achieve significance was the single most common reason for not writing up the results of completed trials. Another study of journals in the behavioral and health sciences found that something like 85-90% of all published papers contain significant results (p < 0.05), which cannot even remotely reflect the reality of applied statistical testing. Again, please don’t take these numbers too literally (and please don’t cite me as their source) since they’re popping out of my rather aged brainpan rather than from the original publications.
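To see how hard the file drawer can distort things, here is a back-of-the-envelope Python sketch. It plugs in the from-memory submission rates quoted above (75% and 5%) purely as illustrative inputs, and it assumes the worst case, in which every null hypothesis tested is true, so that a fraction α of tests come out "significant" by chance alone. The function name is my own.

```python
def significant_share_of_submissions(alpha: float,
                                     p_submit_sig: float,
                                     p_submit_nonsig: float) -> float:
    """Share of submitted results that are 'significant' when every null is true."""
    submitted_sig = alpha * p_submit_sig                # false positives that get written up
    submitted_nonsig = (1.0 - alpha) * p_submit_nonsig  # null results that escape the file drawer
    return submitted_sig / (submitted_sig + submitted_nonsig)


# Illustrative inputs only: alpha = 0.05 and the recalled 75% / 5% submission rates.
share = significant_share_of_submissions(0.05, 0.75, 0.05)
print(f"share of submissions reporting p < 0.05: {share:.1%}")
```

Under those assumptions, close to half of all submissions would report a significant result even though not one of the tested effects is real. Different inputs give different shares, but the filter pushes the share in the same direction whenever significant results are the more likely ones to be submitted.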

The multiple-test/publication-bias problem is increasingly being seen as a major crisis in several areas of science. In research involving clinical trials, it’s making people think that many – perhaps most – reported results in epidemiology and pharmacology are bogus. Ben Goldacre of “Bad Science” and “Bad Pharma” fame has been especially effective in making this point. There are now several groups advocating the archiving of negative results for public access; see, for example, the Cochrane Collaboration. But in my own field of biological anthropology, the problem is scarcely even acknowledged. This is why N. T. Longford, the researcher cited in my previous post, called the scientific literature based on NHST a “junkyard” of unwarranted positive results. Yet more reason to abandon NHST.

7 comments:

Nice post. I've seen that number (likelihood of publishing) somewhere, and I'm in psychology, and follow the problems with it. Some of the usual suspects would be Greg Francis, Uri Simonsohn, Leslie John.

I also checked Keith Laws's blog, as he talks about this and similar questions, esp. in this particular post.

I also collect a lot about these questions in my own two blogs. (Especially on Åse Fixes Science).

The discussion needs to be spread across disciplines, because I think one of the reasons nothing has changed is that there just has not been enough power to shake things out of equilibrium. (I see the Crowd wisdom economics link to the right as I speak, where they mention that problem.)

The hope is if things are shook up enough, perhaps we can move out of this space and into a more robust space.

Thanks for the links. Interestingly, active and ardent discussion of this problem (as well as the general NHST problem) seems to be largely confined to psychology. So you're absolutely right that the discussion needs to spread across disciplines.

We have dealt with this issue on Mermaid's Tale as well, over the past couple of years. Actually, the problem is worse. First of all, you usually have poked around in the area, or in your data, before you actually do your tests. You don't put your analytic plans, specified in very precise terms, in a vault before collecting data and then carry them out.

Secondly, if you do enough tests, you can find what looks interesting and publish that without the multiple testing criterion. For example, it's not just the color of the jellybeans that needs multiple-test correction. Suppose in doing the color tests, you notice that there may be some pattern of their shape, so you then do a test of the 'spherical hypothesis' and reject it because (as you had noticed incidentally) the jellybeans were rather oblate.

Instead of using statistics as data exploration and simply saying that such-and-such seems interesting, and then designing a specific study to address that, we want the semblance of rigor when it just doesn't apply.

I think genetics has faced all of these issues in its study design history as genomic technology has advanced, including the NHST issue, but the latter is deeply rooted in epidemiological thinking, and for me the bottom line is that investigators will do whatever it takes to be able to report 'important' results.

Actually, the next stage in your "analysis" after showing that green jellybeans have a significant effect on the risk of acne is to cook up an ex post facto explanation for that effect. No doubt one compound in the green dye has an adverse effect on the sebaceous glands. Worthy of another press release!

Yes, well that's the gig! Sadly, it is always possible that surprise results are true and could lead to progress, but too often it's post-hoc justifying or the effect is so weak as to be uninteresting even if 'significant'.

Regarding the above cartoon, Bayesian Andrew Gelman has some interesting things to say:

"I am happy to see statistical theory and methods be a topic in popular culture, and of course I’m glad that, contra Feller, the Bayesian is presented as the hero this time, but . . . . I think the lower-left panel of the cartoon unfairly misrepresents frequentist statisticians."

and,

"I think the cartoon as a whole is unfair in that it compares a sensible Bayesian to a frequentist statistician who blindly follows the advice of shallow textbooks."
