While the authorities are distracted by mass disorder, we can do some statistics. You’ll have seen plenty of news stories telling you that one part of the brain is bigger, or smaller, in people with a particular mental health problem, or even a specific job. These are generally based on real, published scientific research. But how reliable are the studies?

One way of critiquing a piece of research is to read the academic paper itself, in detail, looking for flaws. But that might not be enough if some sources of bias exist outside the paper, in the wider system of science.

By now you’ll be familiar with publication bias: the phenomenon where studies with boring negative results are less likely to get written up, and less likely to get published. Normally you can estimate this using a tool like, say, a funnel plot. The principle behind these is simple: big expensive landmark studies are harder to brush under the carpet, but small studies can disappear more easily. So you split your studies into “big ones”, and “small ones”: if the small studies, averaged out together, give a more positive result than the big studies, then maybe some small negative studies have gone missing in action.
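The small-versus-big comparison can be sketched as a toy simulation. Everything here is an assumption for illustration: the true effect, the study sizes, and the 80% chance that a small study with a non-positive result goes unpublished are invented numbers, not figures from any real literature.

```python
import random
import statistics

def published_estimates(n_per_group, n_studies, true_effect, drop_prob, rng):
    """Simulate studies, then drop a share of the ones with non-positive results."""
    # Standard error of a difference in means, assuming unit variance per group.
    se = (2 / n_per_group) ** 0.5
    kept = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        if estimate <= 0 and rng.random() < drop_prob:
            continue  # this "boring" study never gets written up
        kept.append(estimate)
    return kept

rng = random.Random(0)
# Hypothetical scenario: small studies lose most of their negative results,
# big studies are all published.
small = published_estimates(20, 2000, true_effect=0.1, drop_prob=0.8, rng=rng)
big = published_estimates(2000, 2000, true_effect=0.1, drop_prob=0.0, rng=rng)

small_mean = statistics.mean(small)
big_mean = statistics.mean(big)
print(f"small studies: {len(small)} published, mean effect {small_mean:.3f}")
print(f"big studies:   {len(big)} published, mean effect {big_mean:.3f}")
```

In this made-up world the big studies average out close to the true effect, while the surviving small studies look noticeably more positive, which is the asymmetry a funnel plot is designed to show.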

Sadly this doesn’t work for brain scan studies, because there’s not enough variation in size. So Professor John Ioannidis, a godlike figure in the field of “research about research”, took a different approach. He collected a large representative sample of these anatomical studies, counted up how many positive results they got, and how positive those results were, and then compared this to how many similarly positive results you could plausibly have expected to detect, simply from the sizes of the studies.

This can be derived from something called the “power calculation”. Everyone knows that bigger is better when collecting data for a piece of research: the more you have, the greater your ability to detect a modest effect. What people often miss is that the size of sample needed also changes with the size of the effect you’re trying to detect: detecting a true 0.2% difference in the size of the hippocampus between two groups, say, would need more subjects than a study aiming to detect a huge 25% difference.
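To make the point about effect size concrete, here is a rough sketch of a sample size calculation for comparing two groups, using the standard normal approximation. The 10% between-subject standard deviation is a hypothetical figure chosen only so the percentages can be turned into standardised effects; it is not taken from any study, and the approximation is crude when the answer is only a handful of people.

```python
import math
from statistics import NormalDist

def n_per_group(effect_pct, sd_pct, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-sample comparison."""
    d = effect_pct / sd_pct  # standardised effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

SD_PCT = 10.0  # assumed spread of hippocampal volume between people (hypothetical)

tiny_effect_n = n_per_group(0.2, SD_PCT)   # a true 0.2% difference
huge_effect_n = n_per_group(25.0, SD_PCT)  # a true 25% difference
print(f"0.2% difference: about {tiny_effect_n} subjects per group")
print(f"25% difference:  about {huge_effect_n} subjects per group")
```

Because the required number scales with one over the square of the effect, shrinking the effect by a factor of 125 multiplies the sample needed by more than fifteen thousand.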

By working backwards and sideways from these kinds of calculations, Ioannidis was able to determine, from the sizes of effects measured, and from the numbers of people scanned, how many positive findings could plausibly have been expected, and compare that to how many were actually reported. The answer was stark: even being generous, there were twice as many positive findings as you could realistically have expected from the amount of data reported on.
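The logic of comparing observed with expected positives can be sketched in a few lines. This is a simplified illustration, not the published method: the study sizes, the assumed true effect and the count of positive studies below are all invented, and the final step uses a plain normal approximation.

```python
from statistics import NormalDist

def power_two_sample(n_per_group, d, alpha=0.05):
    """Approximate two-sided power to detect standardised effect d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical literature: subjects per group in each study.
sample_sizes = [15, 20, 20, 25, 30, 30, 40, 50]
assumed_d = 0.5          # assumed plausible true effect (hypothetical)
observed_positives = 6   # studies reporting a significant result (hypothetical)

powers = [power_two_sample(n, assumed_d) for n in sample_sizes]
expected_positives = sum(powers)  # each study "should" be positive with prob = power
variance = sum(p * (1 - p) for p in powers)

# One-sided check: are there more positives than the power allows for?
z_score = (observed_positives - expected_positives) / variance ** 0.5
p_excess = 1 - NormalDist().cdf(z_score)

print(f"expected positives: {expected_positives:.1f}")
print(f"observed positives: {observed_positives}")
print(f"one-sided p for excess: {p_excess:.3f}")
```

If the observed count sits well above the expected one, as it does in this toy example, something other than the data is probably producing positive findings.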

What could explain this? Inadequate blinding is an issue: a fair amount of judgement goes into measuring the size of a brain area on a scan, so wishful nudges can creep in. And boring old publication bias is another: maybe whole negative papers aren’t getting published.

But a final, more interesting explanation is also possible. In these kinds of studies, it’s possible that many brain areas are measured, to see if they’re bigger or smaller, and maybe, then, only the positive findings get reported, within each study.

There is one final line of evidence to support this. In studies of depression, for example, 31 studies report data on the hippocampus, 6 on the putamen, and 7 on the prefrontal cortex. Maybe, perhaps, more investigators really did focus solely on the hippocampus. But given how easy it is to measure the size of another area – once you’ve recruited and scanned your participants – it’s also possible that people are measuring these other areas, finding no change, and not bothering to report that negative result in their paper, alongside the positive ones they’ve found.
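A quick simulation shows how much selective reporting within a study could matter. It assumes there is no real effect anywhere, that each region gives an independent test (real brain regions are correlated, so this is a simplification), and that a study counts as "positive" if any one region crosses p < 0.05. The number of regions is a hypothetical choice.

```python
import random

def share_of_positive_studies(n_regions, n_studies=20000, alpha=0.05, seed=0):
    """Fraction of null studies with at least one 'significant' region."""
    rng = random.Random(seed)
    positive = 0
    for _ in range(n_studies):
        # Under the null hypothesis, p-values are uniform between 0 and 1.
        p_values = [rng.random() for _ in range(n_regions)]
        if min(p_values) < alpha:
            positive += 1
    return positive / n_studies

one_region = share_of_positive_studies(1)
six_regions = share_of_positive_studies(6)
print(f"1 region measured:  {one_region:.1%} of null studies look positive")
print(f"6 regions measured: {six_regions:.1%} of null studies look positive")
```

Measuring six regions and writing up only the winner turns a one-in-twenty false alarm rate into something closer to one in four.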

There’s only one way to prevent this: researchers would have to publicly pre-register what areas they plan to measure, before they begin, and report all findings. In the absence of that process, the entire field might be distorted, by a form of exaggeration that is – we trust – honest and unconscious, but more interestingly, collective and disseminated.

“The newly emerging field of Social Neuroscience has drawn much attention in recent years, with high-profile studies frequently reporting extremely high (e.g., >.8) correlations between behavioral and self-report measures of personality or emotion and measures of brain activation obtained using fMRI. We show that these correlations often exceed what is statistically possible assuming the (evidently rather limited) reliability of both fMRI and personality/emotion measures. The implausibly high correlations are all the more puzzling because social-neuroscience method sections rarely contain sufficient detail to ascertain how these correlations were obtained. We surveyed authors of 54 articles that reported findings of this kind to determine the details of their analyses. More than half acknowledged using a strategy that computes separate correlations for individual voxels, and reports means of just the subset of voxels exceeding chosen thresholds. We show how this non-independent analysis grossly inflates correlations, while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the vast majority of the implausibly high correlations in our survey sample. In addition, we argue that other analysis problems likely created entirely spurious correlations in some cases. We outline how the data from these studies could be reanalyzed with unbiased methods to provide the field with accurate estimates of the correlations in question. We urge authors to perform such reanalyses and to correct the scientific record.”
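The non-independent analysis described in that abstract can be illustrated with pure noise. In this sketch nothing is correlated with anything: the subject count, voxel count and threshold are hypothetical, and the "fresh data" check is just one simple way of doing an independent test, not the reanalysis the authors propose.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n_subjects, n_voxels, threshold = 16, 5000, 0.6  # hypothetical values

# Two independent sets of behavioural scores: one for selection, one for testing.
behaviour_a = [rng.gauss(0, 1) for _ in range(n_subjects)]
behaviour_b = [rng.gauss(0, 1) for _ in range(n_subjects)]

selected_r, fresh_r = [], []
for _ in range(n_voxels):
    voxel_a = [rng.gauss(0, 1) for _ in range(n_subjects)]  # noise, no signal
    voxel_b = [rng.gauss(0, 1) for _ in range(n_subjects)]  # fresh noise
    r = pearson(behaviour_a, voxel_a)
    if r > threshold:
        selected_r.append(r)                              # circular: same data
        fresh_r.append(pearson(behaviour_b, voxel_b))     # independent data

mean_selected = sum(selected_r) / len(selected_r)
mean_fresh = sum(fresh_r) / len(fresh_r)
print(f"voxels passing threshold: {len(selected_r)}")
print(f"mean r on the data used for selection: {mean_selected:.2f}")
print(f"mean r for the same voxels on fresh data: {mean_fresh:.2f}")
```

Averaging only the voxels that passed the threshold guarantees a high reported correlation even though the true correlation is zero; on independent data the effect collapses.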

casimiro said,

As a practicing psychiatrist I am worried with the lack of conceptual thinking of the profession and the empirical excesses, which I see day in and day out, committed by researchers who have no direct relationship with patients.

I believe that what we are seeing is the conflation of method (statistics) with actual science. Further, as Ernst Mayr said many years ago, we are still thinking in terms of the “hegemonic” science, that is, physics.

Ioannidis’s work is welcome, although it is paradoxical/ironic that in order to reveal the limits and errors of statistics (which is about ascertaining errors) it has to utilize these same tools!

LS said,

The reasons are technical.
First, APA style, which is mandatory for almost any significant publication, has clear guidelines on how to report positive results. It’s decades since they mandated p values, and a few years ago they started requiring effect size too. In contrast, the negative results are not standardized, as far as I remember.
Second, reporting negative results requires power analysis and much larger samples. It is easier to just gloss over the fact that conditions B and C might or might not be different by saying the classical “p>.05” than to scan 30 more people at 500 dollars an hour. (That is, of course, if conditions A and B differ significantly, so that one has something to communicate.)

cellocgw said,

@ Casimiro: Can you please re-write your comment in a style that does not reek of Sokal’s submission to Social Text?
I (foolishly) prefer to think you simply lack writing skills than to accept that you believe science is somehow hindered by statistical analysis.

casimiro said,

I am sorry you did not like my style. English is not my first language. However, I am delighted that you compare me to Sokal (although I would have preferred to be compared to his text in Lingua Franca).

Up to what point have you dealt with the conceptual issue rather than with an “ad hominem” remark?

As a doctor and (occasional) researcher I know of the power of statistics; I take issue with mistaking the methods of science with Science. Further, medicine is not only “episteme” but also “techne” and “phronesis.”

I suspect you’re right, it’s most likely (p<0.05) the positive bias of the authors and their desire to satisfy the positive biases of the journals. I have this debate with my neuroimaging colleagues more often than I care to admit, and it's quite dispiriting.

Here's another thought. (Disclaimer: I'm on the acquisition/physics side of this ball-game so I've only rudimentary stats experience.) It's quite common to have a study look at half a dozen regions of (potential) interest and use a single control region. As best I recall, there aren't many studies with a balanced set of ROIs and control regions. Thus, surely one must expect to see more (false) positive in the considerably larger number of regions intentionally tested for changes, versus those selected for their presumed invariance. (Isn't this Bonferroni entering the picture?)

I think I'm restating your final point about measuring other brain areas than just the one(s) expected to be involved in the effect of interest, Ben. Statistically speaking, should we not be insisting on an equal number of controls to ROIs?
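On the Bonferroni question, the arithmetic is short enough to write out. This is a minimal sketch assuming six regions of interest (a hypothetical number taken from "half a dozen") and independent tests; correlated regions would behave somewhat differently.

```python
def familywise_error(n_tests, alpha):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the familywise error at or below alpha."""
    return alpha / n_tests

n_rois = 6  # hypothetical: "half a dozen" regions of interest
uncorrected = familywise_error(n_rois, 0.05)
per_test = bonferroni_threshold(n_rois)
corrected = familywise_error(n_rois, per_test)

print(f"uncorrected familywise error: {uncorrected:.3f}")
print(f"Bonferroni per-test threshold: {per_test:.4f}")
print(f"familywise error after correction: {corrected:.3f}")
```

The correction works by making each individual region clear a stricter bar, which is the price of testing several regions at once.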

It doesn’t matter what statistical tools you use, if your implicit sample bias (a nice way of saying the data are edited or selected for significance) is high then the tool is invalid. (formally, one would say that the prior distribution of errors and samples is too non-normal for statistical measures derived via central limit theorems to be valid, and then go on to do some form of Bayesian analysis or bootstrap or…) Even in “pure physics”, control experiments are used to calibrate the machines. Shouldn’t the imaging people be doing something similar?

More seriously, if normalization or controls are a problem, I would expect a good study would use several different approaches and show that the results were independent from the normalization. It wouldn’t be surprising that the p-value might change, but if the results are completely different when different controls are used, then something is fishy.

One of the clearest signs of a “problem child” graduate student is that all of their results are positive and great.

Toenex said,

As someone who spent a few years working in fMRI research (developing data analysis techniques) I can’t say I’m surprised at how these studies appear to be over-interpreted. I’ve personally witnessed psychiatrists cooing over maps of activation in the brain whilst happily discounting the regions of statistically equivalent activation in the surrounding air-space. fMRI is essentially the study of neural blood flow and its correlation with an external stimulus. The assumption being that as areas of the brain become more active they demand more energy which will in turn increase the blood flow. Quite a leap to make, a bit like trying to monitor your electricity usage by measuring your water pressure. Seeing the visual cortex light up when a subject watches some flashing images is one thing, finding those regions involved in ‘forgiveness’ really is something else (I shit you not). I’m happily working in musculoskeletal x-ray imaging now.

bladesman said,

Ben raises a hugely important point here. It’s all too easy to be impressed/bamboozled by technical language and overlook the basic science. Researchers need to be more upfront about their studies so we can be sure that their a priori analysis isn’t really post-hoc data dredging. If it is exploratory analysis, this should be made clear; indeed, this can serve as a useful starting point for further research. Or, as suggested, simply publish the trial protocol (complete with methods for statistical analysis) beforehand so we can all be sure of the legitimacy of the findings.

@stinkychemist – great quote!

88HUX88 said,

when I played pool in America, I had to name my pocket for every ball. If it went in a different pocket or any (of my) other balls went down I couldn’t continue. In this country if I get a lucky shot I can keep playing. I’m sure you see the analogy.

But I’ll change water pressure to water usage and add a few nuances! There’s all manner of activity within the house, but if all one knows is a temporal plot of electrical usage and a (slower sampled) temporal plot of water usage then the transfer function is unknown. I love it!!!!

Can we find correlations that might explain taking a shower? (Strongly correlated 5 min sustained activity in water as well as electricity, the latter due to the extract fan being on.) How about cooking? (Electrical activity – the cooker – for 30 mins precedes a 5 min water usage – washing up.) The correlations and confusions are endless, just like brain activity. I owe you!!

killary45 said,

Stinky Chemist: My OED tells me “critique” has been used as a verb in English since 1751. There is a difference between “criticise” and “critique” and one of the joys of the English language is that we have so many words with slightly different meanings.

As for “party” as verb, well the OED shows it has nearly a hundred years of reputable use. The “verbing” of nouns has been a constant feature of English since the days of Chaucer and will continue to enrich the language despite the laments of those who think that the users of a language should never innovate.

Since there are clearly some people here who understand statistics far better than I, may I ask a question?

Let’s say I have a population of 30,000 and select a representative sample of 5,000 of these. I mail them a questionnaire and 1,500 reply.

What confidence can I have in the results? I suspect the answer is pretty low but the people running the survey claim it’s 95% confidence. It seems to me that, at best, the results will indicate the needs of 30% (1500/5000) of the population with any reasonable confidence.

thom said,

Yes and no and no. The results generalize to the 30,000 (given that the 5,000 are representative). This assumes that you use a finite-population correction (if you don’t it would generalize to an infinite population of people like those in your representative sample). However, this assumes that there is no response bias – that the responders are like the non-responders. If not, there are ways of correcting (e.g., weighting) for known differences. Last (but not least) the ‘confidence’ has a technical meaning here. It means that 95 out of 100 confidence intervals will contain the true population parameter (e.g., mean) that you are interested in. In many cases it is reasonable to treat the confidence interval (or +/- margin of error) as likely to contain the true population value – but this is different from its technical definition (and requires further assumptions).

That said, a sample of 1500 is a pretty decent size and likely to give sensible estimates if response bias is low. (Good surveys measure characteristics likely to be or known to be associated with response bias to be able to check this or correct for it.)
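For the survey question, the margin of error with a finite-population correction can be sketched as follows. It assumes a yes/no style question with the worst-case proportion of 50%, and it assumes no response bias, which is the big "if" in the answer above; the function name and defaults are illustrative only.

```python
from statistics import NormalDist

def margin_of_error(n, population, p=0.5, confidence=0.95, use_fpc=True):
    """Half-width of a confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = (p * (1 - p) / n) ** 0.5
    if use_fpc:
        # Finite-population correction: sampling a real share of the population
        # shrinks the uncertainty a little.
        se *= ((population - n) / (population - 1)) ** 0.5
    return z * se

moe_fpc = margin_of_error(1500, 30000)
moe_plain = margin_of_error(1500, 30000, use_fpc=False)
print(f"with correction:    +/- {moe_fpc:.1%}")
print(f"without correction: +/- {moe_plain:.1%}")
```

Either way, 1,500 replies give a margin of roughly two and a half percentage points, provided the people who replied resemble the people who did not.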

garymac said,

With regards to psychiatry, I heard a lecture from a professor of nursing telling us how bad schizophrenia was, and that it “shrinks your brain”. I asked a very simple question: “how”. He then went on to say that SPET scans have proved it, without once referring to any scientific evidence for his statement, in front of a large audience. He also neglected to say that those scans that have been carried out have probably been carried out on individuals who may have been taking powerful antipsychotics for some years, and that it is these that may have accounted for any atrophy that may have been identified, if any. A true evidence-based lecture that was not.

Architectonic said,

The idea of pre-trial registration is a worthy and increasingly common practice.

Unfortunately, many academics and industry researchers alike still strongly deviate from published protocols, including regularly withholding data on some of the most interesting measures.

Perhaps the problem is the competitive culture of science. Researchers are compelled to produce significant/interesting/novel findings and so they are compelled to cherry-pick which measures they publish or over-hype their results. Otherwise they risk an end to their careers.

Scientists need to be rewarded based on the quality of the hypothesis and experimental design and NOT on the results of these experiments.
