The aim of science is to establish facts, as accurately as possible. It is therefore crucially important to determine whether an observed phenomenon is real, or whether it’s the result of pure chance. If you declare that you’ve discovered something when in fact it’s just random, that’s called a false discovery or a false positive. And false positives are alarmingly common in some areas of medical science.

In 2005, the epidemiologist John Ioannidis at Stanford caused a storm when he wrote the paper ‘Why Most Published Research Findings Are False’, focusing on results in certain areas of biomedicine. He’s been vindicated by subsequent investigations. For example, a recent article found that repeating 100 different results in experimental psychology confirmed the original conclusions in only 38 per cent of cases. It’s probably at least as bad for brain-imaging studies and cognitive neuroscience. How can this happen?

The problem of how to distinguish a genuine observation from random chance is a very old one. It’s been debated for centuries by philosophers and, more fruitfully, by statisticians. It turns on the distinction between induction and deduction. Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.

What matters to a scientific observer is how often you’ll be wrong if you claim that an effect is real, rather than being merely random. That’s a question of induction, so it’s hard. In the early 20th century, it became the custom to avoid induction, by changing the question into one that used only deductive reasoning. In the 1920s, the statistician Ronald Fisher did this by advocating tests of statistical significance. These are wholly deductive and so sidestep the philosophical problems of induction.

Tests of statistical significance proceed by calculating the probability of making our observations (or the more extreme ones) if there were no real effect. This isn’t an assertion that there is no real effect, but rather a calculation of what would be expected if there were no real effect. The postulate that there is no real effect is called the null hypothesis, and the probability is called the p-value. Clearly, the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. But that turns out to be very difficult.

The problem is that the p-value gives the right answer to the wrong question. What we really want to know is not the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there is a real effect – that the hypothesis is true – given the observations. And that is a problem of induction.

Confusion between these two quite different probabilities lies at the heart of why p-values are so often misinterpreted. It’s called the error of the transposed conditional. Even quite respectable sources will tell you that the p-value is the probability that your observations occurred by chance. And that is plain wrong.

Suppose, for example, that you give a pill to each of 10 people. You measure some response (such as their blood pressure). Each person will give a different response. And you give a different pill to 10 other people, and again get 10 different responses. How do you tell whether the two pills are really different?

The conventional procedure would be to follow Fisher and calculate the probability of making the observations (or the more extreme ones) if there were no true difference between the two pills. That’s the p-value, based on deductive reasoning. P-values of less than 5 per cent have come to be called ‘statistically significant’, a term that’s ubiquitous in the biomedical literature, and is now used to suggest that an effect is real, not just chance.
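To make the p-value calculation concrete, here is a minimal sketch of one way to get it for the two-pill example: an exact permutation test on the difference in means. This is an illustrative choice, not necessarily the test a given study would use, and the function name and the blood-pressure figures below are invented for the example.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    If the pills do not differ (the null hypothesis), the group labels are
    arbitrary, so every way of splitting the pooled responses into two
    groups is equally likely. The p-value is the fraction of splits whose
    difference in means is at least as large as the one observed.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    as_extreme = 0
    n_splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        split_sum = sum(pooled[i] for i in indices)
        # Small tolerance so rounding error cannot exclude exact ties.
        if abs_mean_diff(split_sum) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical changes in blood pressure (mmHg) for 10 people on each pill.
pill_a = [-12, -8, -15, -3, -9, -11, -6, -14, -2, -10]
pill_b = [-5, -1, -7, 2, -4, -9, 0, -6, 3, -2]
print(f"p = {permutation_p_value(pill_a, pill_b):.4f}")
```

Note what the function returns: the probability of data this extreme if there were no difference between the pills. It says nothing directly about the probability that the pills differ.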

But the dichotomy between ‘significant’ and ‘not significant’ is absurd. There’s obviously very little difference between the implication of a p-value of 4.7 per cent and of 5.3 per cent, yet the former has come to be regarded as success and the latter as failure. And ‘success’ will get your work published, even in the most prestigious journals. That’s bad enough, but the real killer is that, if you observe a ‘just significant’ result, say P = 0.047 (4.7 per cent) in a single test, and claim to have made a discovery, the chance that you are wrong is at least 26 per cent, and could easily be more than 80 per cent. How can this be so?

For one, it’s of little use to say that your observations would be rare if there were no real difference between the pills (which is what the p-value tells you), unless you can say whether or not the observations would also be rare when there is a true difference between the pills. Which brings us back to induction.

The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since.

Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right before any observations have been made (the ‘prior probability’). Bayes’s theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’.
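In symbols, the theorem can be written as follows, where H stands for the hypothesis that there is a real effect and D for the observations (both labels are introduced here only for illustration):

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D \mid H)\, P(H) + P(D \mid \neg H)\, P(\neg H)}
```

Here P(H) is the prior probability and P(H | D) is the posterior probability. A significance test works only with the behaviour of the data under the no-effect hypothesis, the P(D | ¬H) side of the formula, and never supplies P(H).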

These intangible probabilities persuaded Fisher that Bayes’s approach wasn’t feasible. Instead, he proposed the wholly deductive process of null hypothesis significance testing. The realisation that this method, as it is commonly used, gives alarmingly large numbers of false positive results has spurred several recent attempts to bridge the gap.

There is one uncontroversial application of Bayes’s theorem: diagnostic screening, the tests that doctors give healthy people to detect warning signs of disease. They’re a good way to understand the perils of the deductive approach.

In theory, picking up on the early signs of illness is obviously good. But in practice there are usually so many false positive diagnoses that it just doesn’t work very well. Take dementia. Roughly 1 per cent of the population suffer from mild cognitive impairment, which might, but doesn’t always, lead to dementia. Suppose that the test is quite a good one, in the sense that 95 per cent of the time it gives the right (negative) answer for people who are free of the condition. That means that 5 per cent of the people who don’t have cognitive impairment will test, falsely, as positive. That doesn’t sound bad. It’s directly analogous to tests of significance which will give 5 per cent of false positives when there is no real effect, if we use a p-value of less than 5 per cent to mean ‘statistically significant’.

But in fact the screening test is not good – it’s actually appallingly bad, because 86 per cent, not 5 per cent, of all positive tests are false positives. So only 14 per cent of positive tests are correct. This happens because most people don’t have the condition, and so the false positives from these people (5 per cent of 99 per cent of the people) outweigh the number of true positives that arise from the much smaller number of people who have the condition (80 per cent of 1 per cent of the people, if we assume 80 per cent of people with the disease are detected successfully). There’s a YouTube video of my attempt to explain this principle, or you can read my recent paper on the subject.
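The arithmetic behind the 86 per cent figure is short enough to check directly. The sketch below uses the numbers given above (1 per cent prevalence, 95 per cent of unaffected people correctly testing negative, 80 per cent of affected people detected). The function and parameter names are chosen here for illustration.

```python
def false_positive_fraction(prevalence, sensitivity, specificity):
    """Fraction of all positive test results that are false positives."""
    # People who have the condition and are correctly flagged.
    true_positives = prevalence * sensitivity
    # People who are free of the condition but test positive anyway.
    false_positives = (1 - prevalence) * (1 - specificity)
    return false_positives / (true_positives + false_positives)


fraction = false_positive_fraction(prevalence=0.01, sensitivity=0.80, specificity=0.95)
print(f"{fraction:.0%} of positive tests are false positives")
# prints: 86% of positive tests are false positives
```

Out of every 10,000 people screened, that is 495 false positives against only 80 true positives.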

Notice, though, that it’s possible to calculate the disastrous false-positive rate for screening tests only because we have estimates for the prevalence of the condition in the whole population being tested. This is the prior probability that we need to use Bayes’s theorem. If we return to the problem of tests of significance, it’s not so easy. The analogue of the prevalence of disease in the population becomes, in the case of significance tests, the probability that there is a real difference between the pills before the experiment is done – the prior probability that there’s a real effect. And it’s usually impossible to make a good guess at the value of this figure.

An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.

In general, though, we don’t know the real prevalence of true effects. So, although we can calculate the p-value, we can’t calculate the number of false positives. But what we can do is give a minimum value for the false positive rate. To do this, we need only assume that it’s not legitimate to say, before the observations are made, that the odds that an effect is real are any higher than 50:50. To do so would be to assume you’re more likely than not to be right before the experiment even begins.

If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent. Any lower prevalence will result in an even higher false positive rate.
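One way to see how figures of this order can arise is to compare how probable a result of exactly P = 0.047 is with and without a real effect, and then weight the two by the prior. The sketch below is a rough illustration under assumptions that are not stated in the text: the test statistic is treated as normally distributed, the test is two-sided, and experiments are assumed to have about 80 per cent power at the 0.05 level. The function name is invented for the example. Under these assumptions the results come out close to, though not identical with, the 26 and 76 per cent quoted above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def false_positive_risk(p_observed, prior, alpha=0.05, power=0.80):
    """Probability that there is no real effect, given a result with exactly this p-value.

    Assumptions (illustrative only): a two-sided z-test, and a real effect
    large enough to give roughly the stated power at significance level alpha.
    """
    # Test statistic corresponding to the observed two-sided p-value.
    z_obs = STD_NORMAL.inv_cdf(1 - p_observed / 2)
    # Approximate shift of the statistic when the effect is real.
    shift = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    # How probable is a statistic of this size under each hypothesis?
    density_no_effect = 2 * STD_NORMAL.pdf(z_obs)
    density_real_effect = STD_NORMAL.pdf(z_obs - shift) + STD_NORMAL.pdf(z_obs + shift)
    # Bayes's theorem: weight each density by its prior probability.
    weight_no_effect = (1 - prior) * density_no_effect
    weight_real_effect = prior * density_real_effect
    return weight_no_effect / (weight_no_effect + weight_real_effect)


for prior in (0.5, 0.1):
    risk = false_positive_risk(p_observed=0.047, prior=prior)
    print(f"prior {prior:.0%}: false positive risk {risk:.0%}")
```

Lowering the prior always raises the result, which is why the 50:50 case serves as a minimum.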

The upshot is that, if a scientist observes a ‘just significant’ result in a single test, say P = 0.047, and declares that she’s made a discovery, that claim will be wrong at least 26 per cent of the time, and probably more. No wonder then that there are problems with reproducibility in areas of science that rely on tests of significance.

What is to be done? For a start, it’s high time that we abandoned the well-worn term ‘statistically significant’. The cut-off of P < 0.05 that’s almost universal in biomedical sciences is entirely arbitrary – and, as we’ve seen, it’s quite inadequate as evidence for a real effect. Although it’s common to blame Fisher for the magic value of 0.05, in fact Fisher said, in 1926, that P = 0.05 was a ‘low standard of significance’ and that a scientific fact should be regarded as experimentally established only if repeating the experiment ‘rarely fails to give this level of significance’.

The ‘rarely fails’ bit, emphasised by Fisher 90 years ago, has been forgotten. A single experiment that gives P = 0.045 will get a ‘discovery’ published in the most glamorous journals. So it’s not fair to blame Fisher, but nonetheless there’s an uncomfortable amount of truth in what the physicist Robert Matthews at Aston University in Birmingham had to say in 1998: ‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.’

The underlying problem is that universities around the world press their staff to write whether or not they have anything to say. This amounts to pressure to cut corners, to value quantity rather than quality, to exaggerate the consequences of their work and, occasionally, to cheat. People are under such pressure to produce papers that they have neither the time nor the motivation to learn about statistics, or to replicate experiments. Until something is done about these perverse incentives, biomedical science will be distrusted by the public, and rightly so. Senior scientists, vice-chancellors and politicians have set a very bad example to young researchers. As the zoologist Peter Lawrence at the University of Cambridge put it in 2007:

hype your work, slice the findings up as much as possible (four papers good, two papers bad), compress the results (most top journals have little space, a typical Nature letter now has the density of a black hole), simplify your conclusions but complexify the material (more difficult for reviewers to fault it!)

But there is good news too. Most of the problems occur only in certain areas of medicine and psychology. And despite the statistical mishaps, there have been enormous advances in biomedicine. The reproducibility crisis is being tackled. All we need to do now is to stop vice-chancellors and grant-giving agencies imposing incentives for researchers to behave badly.

David Colquhoun is a professor of pharmacology at University College London and a Fellow of the Royal Society. He is the author of Lectures on Biostatistics (1971) and blogs at DC’s Improbable Science.