The dilemma of weak neuroimaging papers

Over this week, there has been a striking debate in the blogosphere and on Twitter concerning the flaws in many published neuroimaging studies. This was sparked off on Monday by Dorothy Bishop’s brutal, insightful highlighting of the methodological holes in a paper published in the prominent journal Proceedings of the National Academy of Sciences in 2003. The next day, one of the authors of this paper, Russ Poldrack, admirably held up his hands in submission, and agreed with every one of Bishop’s attacks. His partial explanation was that this was in a different age, with more lax conventions (and admittedly he was only a minor author on the paper himself). Late Tuesday night, Neurocritic posted a provocative blog article in response to this, asking the question: “How Much of the Neuroimaging Literature Should We Discard?” This initiated a lively debate on Twitter yesterday between me, Jon Simons, Dorothy Bishop and others, in answer to this question. Two key issues quickly surfaced: first, is there any mileage in retracting published results, if they are later found to be seriously flawed; and second, do these flawed studies have a generally positive worth, especially when bolstered by independent replication.

I thought it might help in this discussion to explain one of the main statistical issues that this debate is pinned on, that of corrected versus uncorrected statistics, and how this applies to brain-scanning. I then want directly to address Neurocritic’s question as to whether these problematic papers should be discarded or retracted. Related to this, I’ll then discuss whether a published, though deeply flawed neuroimaging study can do more harm than good. And if many published imaging papers are so flawed, I want to try to explain how the literature became so sloppy. I’ll end this blog entry by coming up with a few suggestions for how the situation can be improved, and then how a layperson can sift through the stories, and decide whether a neuroimaging study is of good quality or not.

Edit: Just to flag up that this blog is addressing two audiences. I wanted to explain the context of the debate to a general audience, which occurs in the next two sections, and suggest how they can assess neuroimaging stories in the light of this (in the last small section). The middle sections, although hopefully understandable (and maybe even of some interest) to all, are directed more at fellow scientists. And the comments at the end have become a little dominated by technical points, which is great, but if any non-academic wants to air an opinion or ask a question, I just wanted to emphasise that I’d be delighted to have these as comments too.

So what are corrected and uncorrected statistics?

Imagine that you are running some experiment, say, to see if corporate bankers have lower empathy than the normal population, by giving them and a control group an empathy questionnaire. Lo and behold, the bankers do have a lower average empathy score, but it’s only a little bit lower. How can you tell whether this is just some random result, or that bankers really do have lower empathy? This is the point where statistical testing enters the frame.

Classically, a statistical test will churn out a probability that you would have got the same result, just by chance. If it is lower than some threshold, commonly probability (or p) =0.05, or a 1 in 20 chance, then because this is really very unlikely, we’d conclude that the test has passed, the result is significant, and that bankers really do have a lower empathy score than normal people. All well and good, but what if you also tested your control group against politicians, estate agents, CEOs and so on? In fact, let’s say you tested your control group against 20 different professions, and the banker group was the only one that was “significant”. Now we have a problem, because if we rerun a test 20 times, it is likely to be positive (under this p=0.05 threshold at least) one of those times, just by chance.
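To put a rough number on that last claim, here is a minimal Python sketch. It assumes the 20 tests are independent and that no profession truly differs from the control group; both are simplifying assumptions of this sketch, and the function name is made up for illustration.

```python
def chance_of_false_alarm(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests
    dips under the alpha threshold when there is no real effect."""
    # Each test stays above the threshold with probability (1 - alpha),
    # so all n_tests stay above it with probability (1 - alpha) ** n_tests.
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20):
    print(n, round(chance_of_false_alarm(n), 2))
```

With 20 tests this comes out at roughly 0.64, so a lone “significant” profession is closer to the expected outcome than to a surprise.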

As an analogy, say Joe Superstitious flips a coin 4 times in a row, willing it with all his might to fall on heads 4 times in a row (with 1 in 16 odds, so pretty close to p=0.05). But the first time it’s just a mix of heads and tails. Oh he was just getting warmed up, so let’s ignore this round. So he tries again, and this time it’s three heads and a tail – oh so nearly there. His mojo must be building! The third time it’s almost all tails, well that was because he was a bit distracted by a car horn outside. So he tries again, and again and again. Then, as if by magic, on the 20th attempt, he gets all 4 heads. Joe Superstitious proudly concludes that he is in fact very skilled at telekinesis, puts the coin in his pocket and saunters off.

Joe Superstitious was obviously flawed in his thinking, and the underlying reason is that he was using uncorrected statistics, just as the empathy study would have been if it concluded that bankers are less empathic than normal people. If you do multiple tests, you normally have to apply some mathematical correction to take account of how many tests you ran. One simple yet popular method of correction (known as a Bonferroni correction) involves dividing your significance threshold by the number of tests you’ve done in total (or, equivalently, multiplying the probability each test outputs by the number of tests). So for the bankers to be significantly lower than the control at a p=0.05 criterion, the statistical test would have had to output a probability of p=0.0025 or lower (p=0.05/20), which only occurs 1 in 400 times by chance.
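The arithmetic of a Bonferroni correction can be sketched in a few lines of Python. The p-values below are invented purely for illustration (they are not from any real study), and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return one True/False per test: True if that test survives
    a Bonferroni correction at the overall level alpha."""
    threshold = alpha / len(p_values)  # e.g. 0.05 / 20 = 0.0025
    return [p <= threshold for p in p_values]

# Made-up p-values: bankers first, then 19 other professions.
p_values = [0.03] + [0.40] * 19

# Uncorrected: is any test under the usual 0.05 threshold?
print(any(p < 0.05 for p in p_values))

# Corrected: does any test survive the stricter 0.0025 threshold?
print(any(bonferroni_significant(p_values)))
```

In this invented example the bankers pass the uncorrected threshold but not the corrected one, which is exactly the situation described above.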

How does this apply to brainscanning?

Moving on to neuroimaging, the data is far more complex and inordinately larger, but in essence exactly the same very common statistical test one might have used for the empathy study, a t-test, is also used here in the vast majority of studies. However, whereas in the empathy study 20 t-tests were run, in a typical neuroimaging study, a t-test is separately carried out for each 3 dimensional pixel (known as a voxel) of a subject’s brain-scan, and they might well have 100,000 of these! So there is a serious risk that some of these voxels will be classed as significantly active, just by chance, unless you are careful to apply some kind of correction for the number of tests you ran.

One historical fudge was to keep to uncorrected thresholds, but instead of a threshold of p=0.05 (or 1 in 20) for each voxel, you use p=0.001 (or 1 in 1,000). This is still in relatively common use today, but it has been shown, many times, to be an invalid attempt at solving the problem of just how many tests are run on each brain-scan. Poldrack himself recently highlighted this issue by showing a beautiful relationship between a brain region and some variable using this threshold, even though the variable was entirely made up. In a hilarious earlier version of the same point, Craig Bennett and colleagues fMRI scanned a dead salmon, with a task involving the detection of the emotional state of a series of photos of people. Using the same standard uncorrected threshold, they found two clusters of activation in the deceased fish’s nervous system, though, like the Poldrack simulation, proper corrected thresholds showed no such activations.
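A toy simulation gives a feel for the scale of the problem. It is a rough sketch under strong assumptions: the brain contains no real signal at all, and the 100,000 voxels are treated as independent, which real voxels are not (neighbouring voxels are correlated). Bonferroni is used here only because it is the simplest correction to write down; it is generally considered too strict for imaging data and is not the method imaging packages typically use. The function name is invented.

```python
import random

def count_false_positives(n_voxels, threshold, seed=0):
    """Count voxels passing `threshold` when there is no real signal.
    Under the null hypothesis a p-value is uniform between 0 and 1,
    so each voxel is modelled as one uniform random draw."""
    rng = random.Random(seed)
    return sum(rng.random() < threshold for _ in range(n_voxels))

n_voxels = 100_000
for label, threshold in [("p<0.05 uncorrected", 0.05),
                         ("p<0.001 uncorrected", 0.001),
                         ("Bonferroni corrected", 0.05 / n_voxels)]:
    print(label, count_false_positives(n_voxels, threshold))
```

On average one would expect around 5,000 spuriously “active” voxels at p=0.05 and around 100 at p=0.001, whereas the corrected threshold should usually leave none.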

So the take home message is that we clearly need to be applying effective corrections for the large quantities of statistical tests we run for each and every brain activation map produced. I’m willing to concede that in a few special cases, for instance with a very small, special patient group, corrected statistics might be out of reach and there is some value in publishing uncorrected results, as long as the author heavily emphasises the statistical weakness of the results. But in almost all other circumstances, we should all be using corrected significance, and reviewers should be insisting on it.

Should we retract uncorrected neuroimaging papers?

Surprisingly, there is a vast quantity of published neuroimaging papers, even including some in press, which use uncorrected statistics. But in response to Neurocritic, and siding to some degree with Ben Goldacre, who also chipped in on the Twitter debate, it’s almost certainly impractical to retract these papers, en masse. For one thing, some might have found real, yet weak, results, which might now have been independently replicated, as Jon Simons pointed out. Many may have other useful clues to add to the literature, either in the behavioural component of the study, or due to an innovative design.

But whether a large set of literature should now be discarded is a quite separate question from whether these papers should have been published in the first place. Ideally, the authors should have been more aware of the statistical issues surrounding neuroimaging, and the reviewers should have barred uncorrected significance. More of this later.

Can any neuroimaging paper do more harm than good?

Another point, often overlooked, is the clear possibility that a published study can do more harm than good. Dorothy Bishop already implied this in her blog article, but I think it’s worth expanding on this point. If a published result is wrong, but influential and believed, then this can negatively impact on the scientific field. For instance, it can perpetuate an erroneous theory, thus diluting and slowing the adoption of better models. It can also make other scientists’ progress far less efficient.

A good proportion of scientific research involves reading a paper, getting excited by its results, and coming up with an idea to extend it in a novel way, with the added benefit that we have to perform an independent replication to support the extension – and everyone agrees that independent replication is a key stage in firmly establishing a result.

On a personal level, not only in neuroimaging, but also in many behavioural results, I and my research students have wasted many soul-destroying months failing to replicate the results of others. Perhaps a fifth of all experiments I’ve been involved in have been of this character, which if you include the work of research students as well, easily adds up to multiple man-years of wasted work. And I’m actually probably more critical than most, sneer at uncorrected statistics, and tend to go through papers with a fine-tooth comb. But still I’ve been caught out all these times. For others who view scientists less suspiciously, the situation must be worse.

For the specifics of an fMRI study that fails to replicate another, the scanning costs can easily top $10,000, while the wage hours of radiographers, scientists, and so on that contributed to this study might add another $50,000-100,000. These costs, which may well have been funded by the taxpayer, are only one component of the equation, though. It can easily take 6-12 months to run a study. If the researcher carrying out the work is studying for their PhD, or in the early phase of their post-doctoral position, such a failed experiment, in the current ultra-competitive research climate, might turn a talented budding scientist away from an academic career, when those vital papers fail to get published.

The implications multiply dramatically when the study has a clinical message. One particularly tragic example in science more generally comes from the book, Baby and Childcare, by Dr Spock. Recently on the BBC Radio 4 programme, The Life Scientific, Iain Chalmers pointed out that this book, with its order that mothers put babies to sleep on their front, was probably responsible for 10,000 avoidable deaths in the UK alone.

I would therefore argue that scientists, particularly within the neuroimaging field, where experimental time and costs are substantial, and especially when this combines with a clinical message, have a duty to try, as far as possible, to publish papers that are as rigorous as they can be, with corrected statistics an obvious component of this.

Is there a culture of sloppy neuroimaging publications?

Effective corrected statistics are by no means a new addition to the neuroimaging methodology. The first (and still most popular) common correction method was published in 1992 by Keith Worsley and colleagues, while a second was published in 2002 by Tom Nichols and colleagues. When I began my PhD in 1998, and almost immediately started my first neuroimaging study, it was already frowned upon in our department even to consider publishing uncorrected results. So how can uncorrected statistics be published so widely, even, to some extent, today?

I believe there are two components to the answer, one involving education, but the other, more worryingly, relating to certain cultural features of the cognitive neuroscience community.

Analysing neuroimaging data, especially if it’s of the fMRI flavour, is very complex; there’s no getting around that. Datasets are vast, and there are many issues to address in arriving at a clean, robust set of results. There are competing schools of thought for how to analyse the data, and a dizzying level of maths to absorb in order to fully understand even the standard analysis steps. Thriving neuroimaging centres, such as the Cambridge community, where I carried out most of my imaging studies, invest much time in running seminars, writing webpages and so on, to disseminate the current state of play in imaging methods. More isolated neuroimaging centres, which are the norm rather than the exception, have a far greater challenge getting up to speed. The community as a whole does a reasonable job, both online and using onsite courses, in educating any scientists that need help. But clearly they could do far more, and I have a few ideas about this, which I’ll leave for a later article.

But this is only half the story – a paper can normally only be published if a set of reviewers approve it. If a paper is methodologically flawed, the reviewer should explain the flaw and suggest improvements. It is highly problematic if reviewers are either chosen by editors or allow themselves to act as gatekeepers for a paper, when they aren’t qualified to judge its methods.

Dwarfing the issue of lack of education, though, is that of culture. Papers which are obviously methodologically flawed, both in design and statistical analysis, tend to get published in minor journals and make little impact. On the other hand, there is an assumption that if you are published in the most prominent journals, you have produced high quality research, and such a paper is far more likely to be influential. This is where a spate of cultural problems arises.

From the outside, the public assume that almost all scientists have noble, perfectly honest aims when papers are published. I believed this too, until I started my PhD, when I was quickly educated in how some neuroimaging scientists are masters at manipulating data to accord with their theories, and how research politics, in-fighting and many other ugly traits are relatively common. Throughout my academic career, this initial lesson has been heavily reinforced, and I think it’s a particular problem in neuroimaging, which combines a softer science with vast, complex datasets.

An ambitious scientist at the start of their career knows they need a stream of big papers to set them towards that hallowed tenured position, while an ambitious tenured scientist knows the big grants will flow if more big papers have their name on them. In other fields with large complex data sets, such as high energy physics, perhaps the transparency of the process means that you can only progress with scientific talent and genuine results. But in neuroimaging, an only slightly unscrupulous scientist can learn the many tricks hidden in this huge pile of data with its many analysis paths, to dress it up as a bold new set of results, even if the design is flawed and the analyses are invalid. I wouldn’t call this fraud, as the scientist might well have some skewed self-justification for their invalid steps, and it’s definitely not as if they are making up data from scratch – just exploiting the complexities to find some analysis that shows what they want – usually in a heavily statistically uncorrected way (though it might not be so obvious that this is happening when reading the paper).

This is not a hypothetical scientist, by the way. I know of a few senior scientists that employ such “techniques”, and any neuroimaging researcher who’s been in the field for some years could probably name a few too. One huge issue here is that, as long as they can get rewarded for their tricks, by publishing, then they can flourish and progress in the field, and perpetuate these unscientific habits, which can even become general fashions (perhaps using uncorrected stats is one example here). The reviewers and editor should, ideally, stop such publications, but sometimes the reviewer is ignorant about the flaws, some of which can be quite subtle. At other times, though, there are cultural issues that lend a hand.

Some years ago, an editor at Nature Neuroscience – the most prominent specialist journal to publish neuroimaging results – came to give a talk at my old Cambridge department, the Medical Research Council Cognition and Brain Sciences Unit. When discussing what factors help some authors achieve repeated publications in this journal, she described how the author’s careful choice of which people to recommend for review and which reviewers to exclude was an influential component. One striking feature of the review process, which the non-scientific world is probably unaware of, is that in almost all journals, authors get to recommend who should review their manuscript. In principle there needn’t be anything wrong with this – after all, the author is best placed to know who in the field is most able to judge the topic of the paper, and the over-busy editor could use all the help they can get. And there is certainly no guarantee that a recommended reviewer will end up reviewing the manuscript – for one thing, they might just be too busy at that time. In practice, though, an ambitious author can easily exploit this system and recommend friends or even ex-lab members who are sure to review the manuscript favourably, and blacklist those who, perhaps for clear scientific reasons, will not. After all, the friendly reviewer knows that the author will soon be a reviewer for their papers, and the favour will be returned. The fact that the review process is ostensibly anonymous is meant to address this issue, but it can be easily bypassed.

A related trick is to send your manuscript to a journal where your friend and colleague is the main editor, and who will accept your manuscript, almost regardless of what the reviewers say. I should emphasise that these situations, while somewhat uncommon, are certainly not just hypothetical. For instance, for quite prominent journals, I have reviewed papers which were terribly shoddy, methodologically appalling with uncorrected statistics or far worse, and I as well as the other reviewer recommended against publication. I then found a year later that the article had been published anyway, and I knew that the lead author used to be in the same lab as the editor.

Of course, there is a wealth of exciting, valid, rigorous neuroimaging studies published, and the field is slowly becoming more standardised and robust as it matures. But, as I wrote on Twitter, the majority of neuroimaging studies I come across are so flawed, either due to design or statistical errors, that they add virtually nothing to my knowledge.

What can be done?

Okay, so we’re stuck with a series of flawed publications, imperfect education about methods, and a culture that knows it can usually get away with sloppy stats or other tricks, in order to boost publications. What can help solve some of these problems?

Of course as scientists we should strive to be more rigorous. We should consult more widely in forging our design. We should train better in the proper analysis methods, avoiding obvious mistakes like uncorrected statistics (which can usually be fixed by simply testing another half a dozen subjects, to increase the experiment’s statistical power). And we should try to be as honest as possible at every stage, especially by being more open about any residual drawbacks of the study.

But in some ways an even more important area for improvement is the review process. This should be made more transparent in various ways. Some journals, such as the open access Frontiers journals (which I just published in this month), publish the names of the reviewers (who are initially anonymous) towards the top of an accepted paper. This is a good first step, but perhaps the entire review discussion should be available as well somewhere.

Related specifically to neuroimaging, Dorothy Bishop made the suggestion that:

“reviewers and readers would benefit from a simple cribsheet listing the main things to look for in a methods section of a paper in this area. Is there an imaging expert out there who could write such a document, targeted at those like me, who work in this broad area, but aren’t imaging experts? Maybe it already exists, but I couldn’t find anything like that on the web.”

I think this is an excellent, pressing idea, and don’t think it would be too hard for a methodologist to generate such guidelines.

More than this, though, there should be a far greater emphasis generally on ensuring that the reviewer is equipped to judge the manuscript, and if they aren’t, then they should own up to this before reviewing. There was some talk a decade back of each neuroimaging paper having at least one methods expert among its reviewers, which I still think is a solid idea.

I also believe that the review process, as the shield against flawed publications, should generally be taken far more seriously than it currently is. As it stands, reviewing a paper is a thankless task we get no payment for, and usually takes (for me at least) an entire day, when almost all academics are already heavily overworked. Academic publishing is currently undergoing a revolution, amidst the call for open access. To publish in an open access journal, the author (or at least their department) has to pay a large fee, to cover the journal’s costs. Perhaps as part of this revolution, the fee could be increased by some modest amount, and the reviewers paid each time for their expertise. They would then be more likely to do a thorough job.

In addition, there should be a cultural shift in the review process, further towards not publishing a neuroimaging paper unless it’s of real worth, and has valid methods, at the very least by using corrected statistics. On the one hand, a huge amount of work may have gone into a manuscript, easily involving a year of one or more scientists’ lives. And of course it’s a shame if all this work is wasted. But on the other hand, if the study is horribly flawed, the methods are invalid and so on, publishing the paper will merely drag the field down, and make it more likely that future researchers make the same mistake. I would argue that reviewers should put these personal questions entirely aside and be stubborn, tenacious and as critical as they can be, although also very clear about how the study could be improved (or even redone), to give it a later chance of publication.

Then there is the issue of nepotism in the review process. If the author has a conflict of interest, such as that they are funded by the pharmaceutical company whose drug they are testing, then they have to state this in the paper. Perhaps they should do something similar for their suggested reviewers. They could be asked, in addition to their suggested reviewer’s name, whether that person has ever worked in their lab, collaborated with them, or is considered a colleague or friend. This needn’t negate the potential reviewer being chosen, but the editor will have a firmer idea up front of the potential reviewer’s level of objectivity in the matter. And if this information was eventually attached to the potential conflict of interest section of a paper, then that would be another clue for the reader to glean about the level of rigour in the review process. Just knowing that this will happen may cause authors to choose less obviously generous reviewers in the first place.

A further issue relates to independent replication, which was one of the main topics on the Twitter debate. Should a reviewer or editor insist on independent replication of an entire study, for it to be accepted? In an ideal world, this makes some sense, but in practice, it could delay publication by a year or more, and be extremely difficult to implement. One compromise, though, is for the author to submit all their raw imaging data to an independent lab (or some new dedicated group that specializes in re-analysing data?), who can confirm that the analysis and results are sound, perhaps by using different neuroimaging software and so on. I’m not sure of the incentive for such work, beyond co-authorship on the original paper, which carries its own motivational problems. But for the top tier journals, and a particularly ground-breaking result, it’s a policy that may be worth considering.

How can a layperson know what to believe, in response to all these issues?

First off, a healthy dose of skepticism is a universally good idea.

Then for a given story you’re interested in, choose to focus not on the national newspapers (whose coverage is pitifully uncritical), but on blogs that describe it, written by scientists, who are usually quick to describe the flaws in studies. If they haven’t mentioned it, ask them these standard, but vital questions:

Are the stats properly corrected for the multiple tests carried out?

Are the results replicated elsewhere at all?

If these activation areas are linked to a given function, does the blogger know of any other functions previously linked to these brain regions?

Are there any plausible alternative interpretations of the results?

If no such blog exists, find a scientist in the field who blogs regularly and suggest they cover it.

Failing this, why not try to find out for yourself in the original paper? If a paper is behind a paywall, email the corresponding author to send you a free copy (almost all will), and see if they mention uncorrected or corrected statistics in the methods (FWE and FDR are the two main versions of corrected statistics methods), if they mention other studies with similar results, or if the main design fits with their conclusions. Sometimes these are tricky issues, even for others in the field, but other times flaws can be embarrassingly obvious to most intelligent laypeople (who occasionally see more due to a fresh perspective). In my upcoming popular science book, I relate a few examples, where just a little common sense can destroy a well published paradigm.

If you wanted to take this further, by chatting on Twitter, Google plus, blogs and so on, most scientists should be very happy to answer your questions. I, for one, would be delighted to help if I can.

109 comments

– Little of this is specific to neuroimaging but it’s especially bad in neuroimaging because of the size of the data & cost involved. Although there are analogies e.g. large scale genetics studies (GWAS etc.) are pretty much in the same boat – and it’s surely no coincidence that many of them go unreplicated!!!

– Retracting them all… never going to happen, but even if it did, I don’t think it would help at all. Much better would be for readers to educate themselves or be educated to the point where they know how to spot sloppy stats.

– A “crib sheet” is a good idea but would have to be updated regularly because with all the new methods coming out these days, it would be obsolete within a couple of years. It would also risk creating a “box ticking” mentality which we don’t particularly want, although it would be a good thing in some ways.

– Points about reviewing are all good but I wonder how practical it would be. Getting a methods expert to review each paper – how many established neuroimaging methods experts do we have, really? Maybe 100 in the world? That’s a lot of work for them, given that it takes hours to properly review a paper. Making reviews more open… might just create new loopholes for the cunning to exploit (“I won’t be an author on this paper even though I wrote most of it so that I can be your reviewer… then we’ll swap places next time!”) Still, better than nothing…

I totally agree with Neuroskeptic that these problems extend similarly to genome-wide association studies. The difference is that field has grasped the nettle and now typically includes a replication sample in the experimental design, rather than relying solely on any of the various statistical approaches of correcting for multiple tests. (Those clearly do not give p-values that mean what they seem to mean). fMRI studies could be as exploratory as they liked, if they replicated findings with no prior probability in a separate sample.

A related problem in both fields is that it is too easy to come up with a post hoc explanation for why some brain region or some gene is involved in some process or trait. Until a clear mechanism is shown, reserving judgment should be the baseline response and active skepticism is often warranted.

Thanks for this thoughtful piece.
The first point I want to stress is that the issue of correcting data from whole brain analysis is the one I am *least* competent to comment on, as I don’t do imaging analysis. I had cited Russ Poldrack in my blog as an influential figure in the field, who was also an author on the Temple et al study. The blogpost has stimulated conversations with colleagues who are competent and honest imagers, some of whom think Russ Poldrack’s blanket prohibition on uncorrected whole brain analyses is too extreme because it’s known that there are real effects (in the sense of replicable) that don’t survive current methods of correction. So there is this concern re false negatives. Nevertheless, everyone I’ve spoken to does think the table of z-scores in Temple et al adopts a criterion that is far too lenient and therefore likely to include spurious findings. The general view from those I talk to is that if you are going to report this kind of thing, you need to be very cautious, and you really need replication: either good convergence with results from other labs, or some internal replication. And there needs to be much more detailed exposition of what was done and why, with explanation of the limitations: one problem in reading papers in this area is that all too often it is hard to work out what was done and data presentation is incomplete.
The second point is that the uncorrected whole brain analysis by Temple et al occurs in the context of another far more egregious error, i.e., the lack of a control group. The pre-training post-training comparison is taken to indicate the intervention is effective, and, as people on Twitter have noted, this is cited as evidence by those marketing Fast ForWord (FFW). In a paper published in 2003, the same year as Temple et al, Rouse and Krueger stated that over 120,000 children had been through the programme – one can only imagine what the number is now. See http://www.nber.org/papers/w10315
By 2011 there had been 5 properly done randomised controlled trials of FFW, and a systematic review of their results by Strong et al (2011) concluded there was no benefit.
The changes in language seen in the treated children in the Temple et al study are totally in line with changes seen in control, untreated groups in other studies. So it seems safe to conclude that any changes in the brain aren’t a consequence of the intervention – so they have to be either chance effects, OR due to maturation OR the effect of reduced novelty of the task done in the scanner. In this context, the weakness of the whole brain analysis is particularly problematic.
I don’t think it’s realistic to expect a retraction, and certainly not just on the basis of reporting of uncorrected statistics. I think we should disregard the results of this study because of the constellation of methodological problems, starting with the basic error of conducting an intervention study without a control group. It’s interesting to see a handwavy comment about this aspect of the study in the Discussion, but it’s clear that the authors don’t really think it’s important, and they go on to discuss the findings in the conclusions and abstract as if they have demonstrated an intervention effect.
I am concerned at what might be termed ‘pollution’ of the scientific literature by this highly influential paper, but I think the only way to counteract it will be by drawing attention to the weaknesses in both formal and informal publication outlets.

Thanks a lot for this long, detailed comment. And thanks for sparking this debate off! I think it’s wonderful these issues were raised so openly.

I totally agree with the thrust of your point, that there can be many design flaws, independent of issues surrounding how one thresholds the statistics in fMRI analyses. Having appropriate control groups and/or conditions is vital (though sometimes very tricky), and all too often overlooked. It’s again a particular problem in neuroimaging where you want to isolate a given function against a really tightly matched control, since otherwise a large range of processes will be different (including all those executive ones that activate half the brain! e.g. see Duncan and Owen, 2000), and your conclusions about brain-regions/networks associated with your intended single process get watered down or become outright invalid.

I think retraction of papers that report uncorrected statistics is a bit much to ask for; after all, most of the results that were published in the days before rigid statistical corrections were common have turned out to replicate, and indeed large-scale meta-analyses have shown a good degree of consistency, at least in the sets of regions that activate for broad task contrasts. I think that the use of uncorrected statistics, while problematic, is qualitatively different from the lack of a proper control group, which strikes at the heart of a study’s interpretability.

Daniel, I think that the problem was historically just as bad in genetics, but they have done a better job of converging on common standards for the testing and replication of genetic associations. We could certainly learn a lot from them.

I take your point, and wholeheartedly agree, that lack of proper controls is a more serious issue than uncorrected stats. In the latter case, the results may be noise or a real signal, but in the former case it just won’t be meaningful, even with the most significant corrected statistics ever recorded. Most studies I come across (just involving normal groups) don’t require a control group, but nevertheless I commonly find missing or inappropriate control conditions, which is even more frustrating as this means the study was doomed from the start of testing. Perhaps half the studies I personally write off I do so for such design issues, and I would have liked to talk about the specifics of this too, but the article was already rather long and I chose to focus instead on the corrected stats question, as that seemed the main point of discussion, at least on Twitter.

Perhaps in another blog entry I’ll try to describe to a general audience how controls are a particularly tricky, though even more vital issue in neuroimaging, if there would be any interest for this.

Jon Simons

Considering the wider issue of corrected statistics, Dorothy mentions Type II error which doesn’t IMO form part of this kind of discussion often enough, but I think it’s important. It can be argued that we’re limiting ourselves if we only consider Type I error and think p<0.05 solves all problems, because it can lead to the kind of thinking that p=0.049 is real and p=0.051 is noise. This is, of course, nonsense – and Fisher never intended 0.05 to have the magical properties that it seems to have acquired. "No scientific worker has a fixed level of significance at which … in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas." (Fisher, 1956).

Thus, I guess another way of thinking about the issue of corrected or not comes down to what kind of inference you're looking to make on the basis of your data, because it can be argued that it should always be a balance between Type I and Type II errors. If you're looking to claim that a particular intervention is effective and that people should start using it, you may be more concerned about Type I errors than if you're conducting more exploratory studies trying to advance understanding of, for example, the contribution of different brain regions to a particular cognitive ability. Lieberman & Cunningham (2009) have written about this, arguing that this view of cognitive neuroscience "focuses more on the aggregation of data across multiple studies and in meta-analyses." In other words, no single study is considered definitive or "the answer" – instead, each study is a clue or piece of the jigsaw that contributes to progressively advancing knowledge about the problem.

This view encourages a more sophisticated way of considering the literature. No single study is "right" or "wrong" (I'm excluding from this analysis studies with other, as agreed above much more heinous, statistical errors such as invalid control groups, etc). Instead, each study is taken with a certain pinch of salt, the size of that pinch determined by a whole number of factors that relate to the quality of the research. One of those factors may be the statistical threshold used (and I have far more concern about people arbitrarily choosing supposedly a priori regions of interest in order to be able to report apparently corrected statistics than I do about people using a conventional — and therefore comparable across studies — uncorrected threshold like p<0.001 k=5). But there are many other methodological factors that contribute to how much I believe the findings of the studies I read; some of them much more likely to influence the results than stats thresholds.

By this view, as Roediger argued in his recent Psych Science article, replication (both internal and independent) becomes much more important than it currently is, with "signal" progressively getting stronger and "noise" (Type I error) progressively reducing over time. As I said above, this may not be sufficient for intervention studies for which the stakes are different, but such studies are only a part of what cognitive neuroscience is concerned with, and constraining the whole field with rules that may be necessary only for part of it would stifle progress.

Well, I suppose it depends on the context. If your experiment is correlating brain activity to a picture of a pie at age 10 with anorexia diagnosis at age 20, for example, I would be more worried about Type I errors because that’s a long shot.

In fact the same goes for most “brain-behaviour” correlations.

But when your study is basically “Which areas are activated by task X”, I would say the null hypothesis is not going to be “none”, it’s “all of them” (but to different degrees), because the brain is all interconnected.

Everyone knows it’s a myth that we only use 10% of our brain but by adopting p=0.05 cluster correction we may be encouraging the impression that we only use 5% of it at any one time 😉

I agree there is some degree of arbitrariness to p<0.05 corrected, but it certainly doesn't follow from this that we should abandon well established methods of correction for multiple comparisons. I would also think that any reviewer who sees a figure of p=0.051 corrected would be pretty sympathetic to the author over this!

A related issue is sample sizes in fMRI. 15 subjects is about the norm for a study/group, I'd say, but various studies have shown that this is too small (e.g. see Murphy & Garavan (2004) or Desmond & Glover (2008)), and you need at least 20, possibly even 25-30 subjects for sufficient statistical power, on average. Having too few subjects in a study can also increase the chances of getting activations just by chance. Ideally therefore, we should be testing many more subjects (which can be difficult for some labs, given the cost), and if there’s a real effect, increased numbers should make it more likely that it survives corrected significance anyway. But if we are limited to the standard number of subjects in the mid teens, using corrected stats is even more vital, given the increased chance of false positives.
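To make the sample-size point concrete, here is a minimal Python sketch of how power grows with the number of subjects. It is only an approximation: it treats the group test at a single voxel as a one-sided z-test, and the effect size (d = 0.5) and the threshold (p < .001) are illustrative assumptions, not values taken from the papers cited above.

```python
from statistics import NormalDist


def approx_power(effect_size_d, n_subjects, alpha):
    """Approximate power of a one-sided, one-sample z-test.

    effect_size_d is the assumed true standardised effect (Cohen's d).
    This ignores the extra uncertainty a t-test carries at small n,
    so it is slightly optimistic for small samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha)
    # Under the alternative, the test statistic is centred on d * sqrt(n).
    return 1.0 - z.cdf(z_crit - effect_size_d * n_subjects ** 0.5)


if __name__ == "__main__":
    # Hypothetical scenario: medium effect, uncorrected voxelwise threshold.
    for n in (15, 20, 30, 60):
        print(n, round(approx_power(0.5, n, 0.001), 3))
```

Changing `alpha` to a stricter, corrected-style value shows how quickly a 15-subject study runs out of power for the same assumed effect.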

I agree with your point about the fluidity of conservatism in a study, depending on its aims and clinical significance, but only up to a quite limited level! Surely we should all be trying to carry out the most robust studies in both design and statistical analysis, particularly in this admittedly soft science with much complexity – and using well-accepted, valid methods of correction for multiple comparisons is one component of that. But maybe a higher bar should be set for clinically relevant studies, perhaps involving replication as a pre-requisite for publication?

As for “no study is right or wrong” one of my points in the blog is that many studies may well be wrong, potentially sending a field towards a cul de sac that it can take years to get out of, but this is hidden behind the surface, care of the many tricks available with such a complex dataset with many analysis options. And there are cultural ways it can be exacerbated, by domineering figures exploiting nepotistic tactics. I totally agree that region of interest choice is one prominent, potentially exploitable issue.

As for signal increasing due to replication and so on, that’s a really interesting point. One colleague in my current department, Zoltan Dienes, for behavioural data at least, is advocating using non-classical, Bayes factor analysis methods (see Dienes (2011)). Here data is viewed in a far more fluid way and, for instance, a previously classically non-significant study can be combined with a current one to increase its predictive power. This approach is in many ways a more intuitive way of doing things (especially for you, I think!). I know there has been some attempt to apply Bayesian statistics to neuroimaging data, but am not sure of the current state of this.

Neuroapocalypse

I also agree with Jon. We may very well be correcting ourselves into oblivion. Besides which, current correction techniques strike me as all unsatisfactory. For example (and please someone correct me if I’m mistaken) in my experience FWE (with random field theory) is often just as conservative as Bonferroni. The current community favorite, FDR, is no magic bullet. Being dependent on the data for determining the threshold, it is very sensitive to the structure of the effect maps. For instance, someone who runs a subtle contrast between two highly controlled conditions that are expected to have effects in a single focal patch of cortex is penalized (higher thresholds required for significance) compared to someone who performs a less constrained experiment between two very dissimilar categories. You might answer, “well don’t compare apples and oranges, stupid” but often it is necessary to do so to answer the question or merely due to tradition. Faces vs. houses anyone? And finally, cluster extent thresholds work well, however at conventional levels of smoothness you end up with thresholds in the ballpark of 35-60 voxels for whole-brain correction. You could easily miss many smaller structures at that threshold as well as get “lucky” by having the cluster in your favorite region touch a cluster in another region, thereby giving you a cluster larger than that required for correction.
Uncorrected may very well be the worst, as contrary to Jon’s comment, it is not comparable across studies. p<0.001 uncorrected on 20 slices of 5mm voxels is much more conservative than p<0.001 uncorrected on 36 slices of 2mm voxels. If I had to pick a standard, it would be FWE with RFT, but whole-brain corrected at p<0.1 or even p<0.2 to counter the over-conservatism.
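The data dependence of FDR can be seen in a toy example. The Python sketch below applies the standard Benjamini-Hochberg step-up rule to two made-up p-value maps of 1,000 voxels each: one with a focal effect (10 strongly significant voxels) and one with a widespread effect (300). All of the numbers, and the helper `toy_map`, are invented for illustration and do not come from any real dataset.

```python
def bh_cutoff(p_values, q):
    """Benjamini-Hochberg step-up rule.

    Returns the largest p-value that is declared significant at FDR
    level q, or 0.0 if nothing survives.
    """
    m = len(p_values)
    cutoff = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank * q / m:
            cutoff = p  # keep the last (largest) p-value meeting the rule
    return cutoff


def toy_map(n_signal, n_total):
    """Hypothetical p-value map: n_signal tiny p-values plus evenly
    spaced 'null' p-values standing in for a uniform distribution."""
    n_null = n_total - n_signal
    nulls = [(i + 0.5) / n_null for i in range(n_null)]
    return [1e-6] * n_signal + nulls


if __name__ == "__main__":
    print("focal effect cutoff:     ", bh_cutoff(toy_map(10, 1000), 0.05))
    print("widespread effect cutoff:", bh_cutoff(toy_map(300, 1000), 0.05))
```

At the same nominal FDR level, the per-voxel cutoff for the widespread map comes out considerably more lenient than for the focal map, which is the penalty on tightly controlled contrasts described above.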
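The dependence on voxel count can be sketched with a little arithmetic. The Python snippet below is a rough illustration, not a model of real data: it assumes a brain volume of about 1,200,000 mm³, isotropic voxels, and fully independent tests (real fMRI data are spatially smooth, so the effective number of independent tests is lower).

```python
def voxels_in_volume(volume_mm3, voxel_size_mm):
    """Number of isotropic voxels needed to tile a volume."""
    return round(volume_mm3 / voxel_size_mm ** 3)


def expected_false_positives(n_voxels, alpha):
    """Expected count of null voxels passing an uncorrected threshold."""
    return n_voxels * alpha


def familywise_error_rate(n_voxels, alpha):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_voxels


if __name__ == "__main__":
    brain_mm3 = 1_200_000  # assumed brain volume, for illustration only
    for size in (5, 2):
        n = voxels_in_volume(brain_mm3, size)
        print(f"{size}mm voxels: {n} tests, "
              f"{expected_false_positives(n, 0.001):.1f} expected false positives")
```

Under these assumptions the same uncorrected p<0.001 threshold lets through many more false-positive voxels at the finer resolution, simply because there are many more tests.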

Of course, the researcher with the question about a specific brain region can do small volume correction or anatomical VOI analyses (assuming the region can be defined as such) or perhaps use a localizer task or unbiased contrast to define a region. However, each of those methods has its detractors. And I've been reviewed by all of them! Even Russ on his blog and Jon Simons above both express concern that people are peeking at their results and then choosing ROIs to give them what they want.

I also agree with Daniel, the issue of power is a real thorny one. Especially in between-group studies, which kill you on both between-subject variance and between-group variance. Add to that the fact that usually we expect differential activation of the same region between groups (i.e. a difference in degree, not a "it's off in schizophrenics and on in controls"). Between-group effects in the behavioral literature, even when in patients, are typically not so large as to survive multiple-comparison correction for 50-90,000 tests, yet somehow, and with fewer subjects, we expect fMRI to be majestically more powerful.

To sum up: I'm not sure that insisting on corrected statistics is really the most important point. Moreover, I don't think uncorrected statistics are quite as tragically flawed as usually maintained. It isn't as though people will jump to report any effect, anywhere in the brain and in any direction, so long as it passes the uncorrected threshold. Usually there has to be some overlap with their priors. Now I'm no Bayesian either, so I have no idea how to correct for that, but I would bet that the combination of uncorrected statistics plus only considering a study successful if the results represent a general pattern that fits the experimenter / literature priors would mean that most studies with uncorrected stats are not simply random noise (as they are often portrayed to be by critics). Bayesians can feel free to school me on my utter misunderstanding of the issue if that indeed is not how it works.

Personally, I would rather less skepticism for VOI or SVC analyses as long as they are reasonably motivated.

One toy example that I think might make the issue more concrete is the following: Let's say you had a hypothesis about a single dependent variable, reaction time for instance, but your measurement device gave you data across thousands of channels (gsr, erp, gaze, what people ate for breakfast, the fuel efficiency of their last automobile, the number of lovers their mother had before meeting their father, etc.). Should you be expected to run your statistics on all these channels and correct for the ensuing massive amount of univariate tests? If not, then why should we correct for the entire brain (which commonly includes white matter, and possibly even ventricles)?

If I had any constructive criticism, I would say we need either a combination of whole-brain exploratory plus VOI analyses (meta-analysis people can see the big picture in your data, you can test your hypothesis with better sensitivity), or to make it common (and support it with grants) for people to keep a hold-out sample, doing discovery at disgustingly liberal thresholds and then confirming in the test dataset.

Although this isn't without its problems. We tried this once to get around non-independence issues and found it to be difficult to implement and computationally expensive and once implemented it proved unwieldy. Using different runs for discovery and test led to subtle changes in the shape and location of the VOIs derived from each discovery set. How do you present that to reviewers? So our region moves around on every fold of the cross-validation, but it's roughly here, just squint your eyes a bit and trust us.

Thanks for the detailed comment, Neuroapocalypse, and all your great observations about the complexities surrounding correction. You illustrate really well how methods are complex and clearly still in flux.

But there are some practices that, by broad consensus, we shouldn’t be using, such as uncorrected statistics and inappropriate controls, to stick with the blog debates of the week from Dorothy Bishop and Russ Poldrack. I think we need to be careful as a community not to use the complexity of neuroimaging as an excuse to make these more obvious errors.

* The issue of correction is tricky. I think there’s a ‘danger zone’ that happens to be around the modal threshold of p < .001 that has the worst of both worlds: high false positive rate, but also insufficient power to detect most effects (see the power curves in Yarkoni (2009), for example). More stringent thresholds tend to produce null results; less stringent thresholds tend to produce too much stuff. People like the danger zone because it’s easier and more compelling to talk about two or three regions that seem to do X than to say “we found nothing” or “it looks like 30% of the brain shows this effect”. But the reality is that, at the resolution fMRI provides, there simply isn’t that much specificity, and most of the contrasts we use to isolate cognitive processes are going to elicit activation changes in fairly distributed swaths of tissue. So there’s a real danger in using the “let’s be as statistically stringent as we can while still getting a couple of regions to show up” approach, which I would say is probably still dominant right now.

* The recommendation to use samples larger than 15 – 20 is important, but there really isn’t any single number that’s “okay”. I’d actually argue that under reasonable assumptions about effect sizes and statistical thresholds, we would typically need much larger samples–often on the order of 60+ subjects–to have conventionally adequate power levels (see Yarkoni (2009) for more discussion). This isn’t going to happen any time soon, unfortunately, because the costs of fMRI are more or less fixed. I completely agree with Russ’s point that the fMRI literature now is exactly where the genetics literature was 10 – 15 years ago. But the big difference is that the cost of scanning genomes has fallen by several orders of magnitude, whereas the cost of fMRI has come down not at all. It’s easy for the geneticists to be rigorous when GWAS studies with 30,000 people now cost the same as a study with 300 people would have cost a decade ago; I suspect if the cost had remained flat, we would still all be complaining about how poor the standards are in genetics as well. There’s enormous pressure to keep costs down in individual studies, even when everyone knows on some level that we end up with shitty science as a result.

* I agree with pretty much all of the points above re: how bad the peer review process is. In my view the right solution is to do away entirely with pre-publication review, and move towards post-publication models as fast as possible. I’ve written about this myself here, and there’s an enormous literature going back a decade or so that basically argues to the same effect. If commercial and social news websites like reddit, Netflix, and Amazon can implement highly effective collaborative filtering and review algorithms–often with billions of dollars at stake–there’s no reason the scientific community can’t also figure this out. We should really be publishing everything at the front end and then letting everyone take a crack at reviewing, while providing appropriate incentives for good reviewing behavior (I discuss this in my paper; my favored solution is a Stack Overflow-like reputation system).

* Perhaps the biggest problem you touch on is that bad science is, to a considerable extent, currently reinforced by the publishing and evaluation system we have. The reality is that you can’t get a paper into Science reporting a 0.1 correlation between some personality measure and activity in 30% of the brain, but you may well be able to get a paper into Science that reports a .8 correlation between the same measure and activity in one or two focal regions. The former result is almost certainly the correct one in the vast majority of cases, but you would only know that if you collected a much larger sample (e.g., a couple of hundred people). In effect, researchers have no real incentive to conduct very large, expensive studies when the main outcome is that no one wants to publish the “boring” results. Personally I’ve found that almost every time I see a poster at a conference reporting the results of an enormous fMRI sample (and there are now some with 500 – 2000 subjects), the presenters seem despondent at the perceived lack of effects (or, rather, ubiquity of tiny effects) and the results never make it into print, or end up in a low-tier journal. The reason studies with enormous samples end up with diffuse but small effects is not that these studies were poorly done and the 15-subject ones were better; it’s that the world is in reality a complex place, and most effects are small. Using grossly underpowered samples is a sure way to end up with inflated, seemingly selective effects. As a community, we’re addicted to interesting, implausible results, and the result is that our science suffers. Unfortunately I don’t see any easy (or, for that matter, difficult) fix for this.
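The inflation of effects in small samples can be illustrated with a small simulation. The Python sketch below is a toy under stated assumptions: the true brain-behaviour correlation is fixed at 0.1, significance is judged with the Fisher z approximation at p < .001 two-sided, and only the "significant" results are kept, as a crude stand-in for what gets reported. The sample sizes and number of simulated studies are arbitrary choices.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_significant_abs_r(true_r, n_subjects, n_sims, alpha, seed):
    """Simulate n_sims studies and average |r| over the significant ones.

    Returns (mean |r| among significant studies, number significant).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    noise_sd = math.sqrt(1.0 - true_r ** 2)
    survivors = []
    for _ in range(n_sims):
        xs = [rng.gauss(0, 1) for _ in range(n_subjects)]
        ys = [true_r * x + noise_sd * rng.gauss(0, 1) for x in xs]
        r = max(-0.999999, min(0.999999, pearson_r(xs, ys)))
        # Fisher z test of r against zero (an approximation).
        if abs(math.atanh(r)) * math.sqrt(n_subjects - 3) > z_crit:
            survivors.append(abs(r))
    if not survivors:
        return float("nan"), 0
    return sum(survivors) / len(survivors), len(survivors)


if __name__ == "__main__":
    print("n=15: ", mean_significant_abs_r(0.1, 15, 20000, 0.001, seed=1))
    print("n=200:", mean_significant_abs_r(0.1, 200, 2000, 0.001, seed=1))
```

With 15 subjects, a correlation can only pass this threshold if it is large, so every surviving estimate is far above the true value of 0.1; with 200 subjects the surviving estimates are much closer to it.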

Thanks so much for your long, fascinating comment, Tal. I think I agree with everything you say.

RE post publication models, Frontiers is heading a little in this direction, isn’t it? They have tiers of journals, with a publication starting in the lowest, most specialist tier. The top 10% of articles here, as based on popularity and post-publication evaluation by the academic community, get elevated to a higher tier, where they are republished in a more general journal (and I think peer reviewed again for it, in fact). I think there are 4 tiers existing, or at least planned. A very interesting model!

But I think your idea of implementing a far more aggressive and dynamic post-publication review system for the entire field is a wonderful, intriguing suggestion. At present, the only index we really have of the quality of a paper is the number of times it’s been cited in the literature, which takes quite a time to build up, can also be partially exploited (with prolific publishers citing themselves obsessively), and biases certain journals (right now penalising someone who may on principle only publish in open access ones). Your method could provide a far better means of assessing a paper’s worth, while also attaching a wealth of discussion to it – for instance allowing others to point out flaws in statistics or controls, as a warning label.

Some years ago, an editor at Nature Neuroscience – the most prominent specialist journal to publish neuroimaging results – came to give a talk at my old Cambridge department, the Medical Research Council Cognition and Brain Sciences Unit. One of the audience members asked her what distinguished those scientists who consistently get papers accepted in her journal. Rather than talking about the rigour of the research or anything like that, she instead said that they were particularly good at choosing their reviewers.

My talk at Cambridge was over ten years ago, so I suppose it’s no surprise that you’re misremembering what I said so badly. (I had to go back to the PowerPoint file for the details myself.) The information about referees was part of the main talk, not in response to a question. I never suggested that scientists could “choose” their reviewers, as indeed they can’t. Instead I said that authors should take care to suggest reviewers who are well known in the field, because editors are unlikely to use the author’s suggestions unless the editor already knows those people to be qualified. Then I added that the author’s influence on the editorial process from suggesting reviewers is relatively weak, and that authors are in a much stronger position to make a difference when they ask to exclude particular reviewers from evaluating the paper.

The last sentence is also wrong. Of the seven habits of highly effective authors that I discuss in that talk, four relate to scientific substance. First, I said that successful authors don’t submit to Nature Neuroscience unless they’re convinced the paper is important to the field, technically convincing, and novel. Second, successful authors take a big-picture view of their science and focus on investigating questions with broad implications. Third, in replying to referees, successful authors respond effectively to the major scientific criticisms instead of trying to sweep the elephant under the rug. Fourth, they tend to address concerns definitively with experiments rather than inconclusively with endless pages of rebuttal. I have no idea how any of my statements turned into what you wrote above.

Very many apologies, Sandra, if I caused you any distress here, and for any misrepresentation.

But I’ve just looked up your face from your blog to see if it was familiar – it wasn’t. I then checked my own notes, and it wasn’t you that gave the talk when I was present at least, but another Nat Neuro editor (in July 2006, not around 2002). Your talk may well have been at the MRC CBU and I just didn’t attend, but I can’t find any record of it, unlike the other editor’s (I was definitely around 10 years ago – I started there in 1998). Perhaps it was another Cambridge cog neuroscience department? There are quite a few.

I do recall that the discussion returned for quite a time to questions of choosing reviewers (or excluding them) as a key successful strategy for authors who repeatedly publish in Nature Neuroscience. However, I guess there is a stock talk to give, and it’s possible I was unfairly capturing the overall spirit of the other editor’s talk, which I am again very sorry for, so I will edit the main article to correct this immediately after posting this comment.

I meant “choosing” reviewers for recommendation, as I know of course that who an author chooses is not necessarily who you end up with (especially given that so many scientists may reject an offer to review at a particular time as they are too busy). However, I can see the text is ambiguous and will correct this point as well.

While you’re here, if you could spare a little time, it would be wonderful to get your input on a few of the issues raised, as you are obviously someone at the very locus of this debate.
1) In terms of reviewing systems, do you think the current model works well? As someone that deals with the process all the time, do you think there are any ways that either the journals, reviewers or authors can move forwards and make it better?
2) More specifically, does Nature Neuroscience have some internal equivalent of a cribsheet that Dorothy Bishop suggested, for instance in the light of articles such as the Kriegeskorte et al 2009 circular analysis paper that Nature Neuroscience admirably published, given that it highlighted the prevalence (including within Nature Neuroscience) of the kind of statistically invalid tricks I alluded to in my main article above? If not, could the imaging community help you and other journals by producing one (if we could ever be found to agree!)?
3) If the suggested reviewers are prominent and relevant, what’s the likelihood that Nature Neuroscience would try them first as candidate reviewers?
4) I think we’re all agreed, aren’t we, that who an author chooses to recommend for (or especially against?) reviewing the paper makes a significant difference in getting that paper published. Under the current model, an author could recommend a set of (scientifically prominent) people s/he knows would give a favourable review (such as former lab members), and suggest to exclude those with competing theories, known stringent reviewers and so on. Is there any strategy in Nature Neuroscience to limit potential exploitation here, or do you think such issues are too difficult to check?
5) To avoid this issue, is it at all practical entirely to remove author recommendations for reviewers (and to exclude reviewers too)? Presumably places like Nature Neuroscience have a database of reviewers and would have a pretty good idea from this, or from the reference list of a submitted paper, who would be the best people to review?
6) Out of interest, what were the other three habits for effective authors you had? And I’m very probably misreading what you wrote, but doesn’t the first habit you mention refer more to personality than scientific substance?

Again, many apologies, Sandra, and also many thanks in advance if you do find the time to respond to the above questions.

Clas Linnman

Very interesting and thoughtful discussion. Another problem worth considering is spatial and behavioral specificity.
Shackman et al (http://www.ncbi.nlm.nih.gov/pubmed/21331082) provide very convincing data that negative affect, pain and cognitive control are all processed in the cingulate cortex, yet studies often only cite papers within their domain. Another example is the periaqueductal gray, which is involved in a whole range of radically different behaviors. Yet the region is discussed as a pain inhibition region by anesthesiologists, as a fight-flight region by psychologists and as a pudendal nerve target in the micturition circuit by urologists.
Moreover, the spatial specificity is sometimes questionable; see, for example, figure 4 in this review (http://www.ncbi.nlm.nih.gov/pubmed/22197740)

I think, in the end, it comes down to when and if we turn the results into applications. For example, once we start to use fMRI or DTI for surgical planning, we had better make sure we have got it right.

Thanks for your reply, Daniel. Sounds like I owe you an apology for jumping to the conclusion that you were referring to my talk – though it did sound very familiar. I left Nature Neuroscience about four years ago, so I can’t speak to the journal’s current policies, but I’m happy to answer the questions that relate to my experience with scientific publishing.

In general, I think most editors are well aware of the games authors sometimes play with the review process. It’s unlikely that a top journal would send a paper exclusively to reviewers that the author recommends. The paper might well go to one or two of the recommended reviewers, but in most cases only if the editor already knows and trusts them from previous reviews. A common outcome when authors suggest a set of close friends and collaborators is that several of them will respond by saying that they have a conflict of interest and can’t review the paper. Once that starts happening, the editor will typically stop taking any suggestions from that author (though the author won’t usually know about that decision).

6) Out of interest, what were the other three habits for effective authors you had? And I’m very probably misreading what you wrote, but doesn’t the first habit you mention refer more to personality than scientific substance?

The point of the first habit was to underline that, as one Nature editor liked to phrase it, “To make a rabbit stew, you must start with a rabbit.” The purpose of the talk wasn’t to explain how to get bad science into NN, but to prevent people from having potentially good (but not quite there yet) papers rejected because they didn’t handle the peer review process well – something I saw happen all the time. In my view, having a solid and interesting finding is the heart of scientific substance.

The last three habits were about interacting effectively with editors. One was the recommendation on using reviewer suggestion and exclusion wisely, which you know about already. The second was to understand the limits of what is possible for editors and frame your requests accordingly. That is, don’t call your editor and ask to have all the reviewers replaced because you don’t like their opinions. It’s much smarter to ask for consultation with one additional reviewer with expertise in a particular technical area that you feel wasn’t covered adequately in the original review, for instance. The third point is obvious, but needs repeating every so often: don’t contact your editor when you’re furious. That’s mostly because angry people don’t listen well, so you’re likely to miss something important.

I think your post highlights the likelihood that the most frequent, damaging examples of poor imaging papers may not be in the very top tier of journals, but a little below. Places like Nature Neuroscience have a dedicated set of professional, entirely independent editors, and reviewers, attending to the very high impact factors, are probably more likely to seek out every wrong detail (although with inflation in supplementary materials, perhaps this is increasingly difficult?). On the tier below, however, where papers are still very visible and potentially influential, editors have far more limited time, since they are also working scientists, reviewers might not be chosen with quite as much care, and are less likely to do as thorough a job. For me this high, but not top level of journals is where I find the most disappointing well cited papers.

In terms of the list of habits, I think non-scientists may be struck by how much personality is vital to make an effective scientist, with good management skills surprisingly important. On a psychological level, though, I find it interesting that there is probably only a minor difference between those clever, bold and ambitious scientists who know just how to present and angle their honest, rigorous paper for maximal effect – and those other equally clever, slightly more bold and ambitious scientists who apply their intelligence to manipulate techniques in an invalid way in order to get a strong publication out of it.

I-han Chou

I’ll chime in on another couple of your questions (I’m one of the neuroscience editors at Nature). Nature does not have an internal checklist for neural imaging papers (I can’t speak for Nature Neuroscience). However, it’s something we’ve been considering recently. We do have technical requirements in other areas of biology (e.g. experiments using RNAi must rule out off-target effects, genetic association studies generally should include replication, etc.) and we’ve been talking about instituting requirements or reviewer guidelines for imaging papers. Of course we would only attempt to create such a list in close collaboration with experts in the field — are you offering?

And just to be clear on how Nature treats reviewer exclusions and suggestions: we honor all named exclusions, within reason (“reasonable” varies by field, but ~3 is the norm). If so many people have been excluded that we feel we can’t get the manuscript reviewed appropriately, we’ll contact authors and ask them to cut down or prioritize the list.

We look at authors’ reviewer recommendations just as suggestions. In general, we actively avoid having the entire reviewer list consist of individuals the authors have suggested. When making up the reviewer list our top considerations are: including people with the right technical expertise, including people who are known to be rigorous and fair reviewers, avoiding people who may have conflicts of interest. So try to stay on top of who trained with whom, who is collaborating or has recently collaborated, who may be in conflict, etc.

As for forming a checklist, there are very many people far more qualified than me. Matthew Brett and Russ Poldrack, who have contributed elsewhere in these comments, immediately come to mind as potential names to ask.

Charvy Narain

The editor whose talk you are referring to was almost certainly me, and while this was several years ago, it is apparent that some of the points I made then have clearly been misunderstood: my apologies if I didn’t make my points clearly enough.

To clarify, I gave a very similar talk to the one Sandra mentions in the comments here (complete with the ‘To make a rabbit stew, you must start with the rabbit’ quote, as well as the habits of highly effective authors section, which, as Sandra has already explained, focus largely on the scientific content of the paper). More specifically, as already mentioned, authors do not choose referees, and this is instead the editors’ responsibility. At Nature Neuroscience, we always honour named referee exclusions (usually up to about 5 names, but we do ask authors to trim these lists if they seem excessive), but as both Sandra and I-han note, we try to ensure that all the referees have the requisite technical and conceptual expertise to judge the paper’s merit. I-han’s final paragraph above is an excellent summary of how the referee selection process works at Nature Neuroscience as well, and we try to avoid having all the referees be from the list that the authors suggest.

Again, it is worth emphasizing that authors choosing the right referees is not ‘a key successful strategy’, since this is not a choice that the authors can make. Instead, the editorial referee selection process is pretty much independent of the authors’ list of suggested referees, and this has always been so.

Regarding the other issue of a checklist, we do not currently have a stock neuroimaging cribsheet that the editors look at (or that is sent out to the referees) but we do try and look out for the sort of statistical issue highlighted in the Kriegeskorte 2009 perspective article, and we do flag these up for the referees if we notice them (or reject the paper without review, if the statistical errors are egregious enough to seriously limit the conclusions drawn). More generally, there are many other issues about the use as well as reporting of statistical analyses, which affect not just neuroimaging, but other kinds of neuroscience research as well. For example, the problems that Sander Nieuwenhuis highlighted in his Nature Neuroscience perspective last year affect behavioural, systems, molecular and cellular neuroscience studies as well, even though some of these fields may pay less attention to statistical issues, compared to the neuroimaging community, which is at least having a discussion about these problems.

We’re trying to really put some thought into what we, as a journal, can do to encourage better statistical reporting and practice (apart from choosing technically stringent referees). We’ve just finished constructing a preliminary, very basic, statistics and methods checklist and the editors now fill this out for all papers sent out for review (an adjustment for multiple comparisons is one of the things on this checklist), and we will probably start including such a checklist in the information we send to the referees, so as to encourage more careful evaluation and reporting of the statistics and methods. We’re hoping to expand this over the next few months to see if we can come up with a more specific list of reviewer guidelines for neuroimaging papers. Obviously, we’ll need to work closely with the community (including hopefully several of you) to come up with such a list, and if any of you are willing to work with us to generate such a list, please do let us know.

Coincidentally, I’ll be giving a talk next week in Cambridge (21st March), so this is good practice for the sort of discussion that may happen!

First off, many apologies for any misunderstanding or misrepresentation on my part.

And thank you so much for contributing to the discussion and these further clarifications.

It’s really encouraging news that you are already constructing a methods checklist. And hopefully some here far more qualified than me can assist, and similar checklists can be adopted in other journals that publish neuroimaging.

I quite enjoyed this post, as I did Dorothy’s earlier blog post. I agree with neuroskeptic that none of these statistical issues are specific to neuroimaging.

Thus far the discussion seems to have focussed on the statistical criteria employed in individual studies within a field that may limit replicability due to type I or II error. I think the issue of replication needs to be considered more broadly. In my view, the cognitive task should have primacy, and this is not currently the case in cognitive neuroscience (despite the word order, the focus seems to be very much on neuroscience rather than cognition). Relatively few cognitive neuroscience studies employ tasks that have demonstrable validity and reliability at the behavioural level. Often new tasks are devised to manipulate a cognitive process of interest just for the purpose of neuroimaging. The concurrent validity of these tasks is unknown (i.e., does the new task correlate with other tasks that the literature agrees involve that cognitive process?), and the test-retest reliability is likewise unknown. So, I’m not quite sure about the utility of expensive, large scale databasing efforts that include neuroimaging results with tasks with poor psychometric characteristics, hoping to develop new ‘cognitive ontologies’ on the basis of the results, particularly when selectivity of function is clearly task dependent. We could cut out quite a bit of the noise in the literature by simply adhering to psychometric requirements before running to the scanner in the hope of publishing the next big thing.

So, a solution may be to require the field to employ cognitive tasks that elicit known effects. This is a bit like the requirement of a control group in treatment studies. Neuroimaging studies would either be limited to tasks already published in the behavioural literature and that have some consensus about the information processing components involved, or become multi-experiment studies in which the first few experiments establish the validity and reliability of the new task at the behavioural level.

And I agree that many published studies seem to reflect a tendency to focus on brain activations at the expense of solid psychological tasks, with the result that those activations are associated with unclear or unhelpfully compound mental processes.

The paper highlights the following 6 factors as particularly dangerous for false findings:

* The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
* The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
* The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
* The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
* The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
* The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.

Sounds like a field I know…
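
To put rough numbers on the first two factors in the list above: as far as I recall, the same paper gives the chance that a claimed finding is true (its positive predictive value, PPV), ignoring bias, as PPV = (1 − β)R / (R − βR + α), where R is the pre-study odds that a tested relationship is real, 1 − β is the power and α is the significance threshold. The following is only a minimal sketch of that formula; the function name and the example odds and power values are my own assumptions, chosen purely for illustration.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Chance that a 'significant' finding is true, ignoring bias.

    prior_odds: ratio of true to false relationships among those tested (R).
    power:      probability of detecting a true relationship (1 - beta).
    alpha:      false positive rate per test.
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical numbers: 1 in 10 tested hypotheses is true (odds 1:9).
well_powered = positive_predictive_value(prior_odds=1 / 9, power=0.8)
under_powered = positive_predictive_value(prior_odds=1 / 9, power=0.2)
print(f"power 0.8: PPV = {well_powered:.2f}")   # 0.64
print(f"power 0.2: PPV = {under_powered:.2f}")  # 0.31
```

With these made-up inputs, dropping power from 0.8 to 0.2 (small studies, small effects) roughly halves the chance that a significant result is real, before any flexibility or bias is taken into account.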

Neuroimaging is a computational science, and others in computational science consider the field to be in crisis:

Thanks for contributing – I was really hoping you’d chime in at some point! And it’s not surprising that we have so many similar views, given that I learnt most of mine from you over the years!

I thought Russ would have mentioned the 2008 Guidelines, but thanks for mentioning this yourself. Do you think an updated version of the appendix of that article might generate a decent “cribsheet”, as Dorothy Bishop was asking for?

But do you have a sense of how much of this advice was ignored in later Neuroimage articles? Also, do you know if these guidelines were taken up in other journals, or do you feel their adoption has been painfully slow?

I didn’t know of the PLOS Medicine article, but thanks for this – wonderfully elegant, clear and pretty devastating. Aside from the methodological guidelines, do you have any ideas how the field can tighten up its act, especially on a cultural level?

Russ Poldrack

I don’t really think that our guidelines paper addresses Dorothy’s concerns, which is why I never brought it up. That paper describes what details about a study need to be reported in the methods section, not how the study should actually be done. A paper can be fully guideline compliant and still report uncontrolled stats; it will just be clearer exactly what they did.

I think the problems are entirely caused by lack of clarity. For example, in our guidelines paper, we see this:

“Be clear about the inferences that can be drawn from your approach. For example, if you have used an uncorrected threshold then state clearly that you have unquantified control of family-wise error”

I find uncorrected thresholds dangerous and misleading, because an inexperienced reader can conclude that the data is sort of corrected, especially when they see the very small p value thresholds. But the problem is solved when the author says “Of course this image is for your viewing pleasure only, and no-one knows how much of this stuff is real or noise”.

There are so many factors that make the guidelines ineffective. These are the same factors that cause us to bend our results and discussion to increase our influence and reputation. This is not at all surprising. What is surprising is that there should be any resistance to the conclusion that this must lead to a dense cloud of false and misleading results that will make us slow to find the right [1] answer from among the plausible falsehoods.

Of course we tailor our results to our own preference, of course we write the paper to seem clever and novel, of course this means that many published findings will be false.

To quote David Donoho [2]:

“In my own experience, error is ubiquitous in scientific computing, and one
needs to work very diligently and energetically to eliminate it. One needs a
very clear idea of what has been done in order to know where to look for likely
sources of error. I often cannot really be sure what a student or colleague has
done from his/her own presentation, and in fact often his/her description does
not agree with my own understanding of what has been done, once I look carefully
at the scripts. Actually, I find that researchers quite generally forget what
they have done and misrepresent their computations.

Computing results are now being presented in a very loose, “breezy” way—in
journal articles, in conferences, and in books. All too often one simply takes
computations at face value. This is spectacularly against the evidence of my own
experience. I would much rather that at talks and in referee reports, the
possibility of such error were seriously examined.”

I’ve spent the last 4 years of my academic life writing computer code. Code gives faster feedback than science, because it is easier to know when you are wrong. When I write code, I make many mistakes. I need to test constantly and ask others to check my work. I know about the ubiquity of error.

If you are convinced of the ubiquity of error, then you will not accept journal articles as they are published now. You will know from experience that you cannot even take your own word for a finding, if you cannot demonstrate that finding to yourself, and you cannot show your suspicious colleagues what you have done, and why you have done it.

So, I think the problem cannot be fixed by guidelines, better reviewers, better editors, or better education. It can only be fixed by full, fierce, embarrassing transparency of data and code.

Satra

+1 on matthew’s comment about transparency and to that effect I would like to put a plug here for Russ’ efforts on sharing data (see: openfmri.org). perhaps 10 years from now another group of people might well indicate that our current methods are inappropriate. but if we have the data available, they can easily recompute things based on their new algorithms.

this discussion stems from an observation about the methods in a paper and the summary result from it. but “embarrassing transparency” as matthew notes requires us to think about it as far back as we can possibly go and that could include even the proprietary sequences that acquire the data, the experimental software that presented the stimuli (at least for task based designs), the stimuli themselves.

is it ever really sufficient to say that our flaws lie only in the analysis?

Parents and indeed entire school districts buy into these things–FFW, CogMed, and so on. I have never met a child who didn’t hate these interventions and think they were boring. (I know that is a testimonial.) One student told me of the fun school activities and even some important test prep that she had to miss because her school bought into CogMed. And it pains me that my colleagues in the field of Speech Language Pathology are responsible for so much wasted time, energy and finances because of the way they push FFW. I would like to see some investigative reporting on some of these companies that overstate their claims. Anyone interested?

One of the most egregious claims I’ve seen lately comes from Lindamood-Bell. Lindamood-Bell has forty nine centers in the US and centers in London and Australia as well. The centers are generally directed and staffed by non-educators who are compensated poorly (I get this information from Monster.Com, Craigslist, etc.) and these non-professional individuals indeed implement the interventions and administer and interpret the standardized testing. Yes, Pre- Post-test is used to determine student progress.

I contacted one of the authors of the study letting her know that the results were being misrepresented. She thanked me for letting her know.

Well, thank you scientists for all you do. It’s not easy earning a living as an honest interventionist when competing with so many quick fixes. And I just want to cry when I see so many parents with false hopes wasting their money not to mention their children’s precious time on sciency-sounding therapies that won’t work.

Hi there Holly and thanks so much for your very thought-provoking contribution.

I personally don’t know much about the fields you describe (it’s some of Dorothy Bishop’s areas of expertise, though, I think), but although I haven’t had a chance to look at it yet, at first blush the Neuroimage study you mention sounds disturbingly bad, especially as it’s so recent. How did that get past review?

I totally understand your frustrations. And the commercial question is particularly worrying. It hasn’t yet been discussed here (though it was mentioned on Dorothy’s blog), but questions of profit add a powerful extra potential motivation for misreporting data, especially in terms of inflating results and conclusions.

I can only reiterate that we neuroscientists need collectively to improve on many levels: in educating ourselves, in experimental design, statistical analysis, honest reporting of results, running replications, and in the review process – the last to me being one of the most critical as the gatekeeper stage for publication.

And clinical studies need to be even more rigorous than the standard experimental paradigms, not less, as often appears the case, given how so much more is at stake, as your examples from school programs clearly show.

Kim Bannon

Thank you for your informative and well written article. I am a non academic and wondered why failing to replicate someone’s results is a waste of time? Surely not replicating them is a result in and of itself? Probably not the hypothesis you were testing but in a field where there is not 100% certainty these seem like an important part of research and a result. And would avoid other people making the same ‘mistake’. Excuse my ignorance, but if you could clarify that would be lovely.

Hi Kim, very good question – and observant points as well. Sorry I wasn’t clearer in the main article about this. In some ways it is quite a complex issue, though.

To pick a toy example, say that a big study comes out, concluding that brain region pseudocortex is responsible for recognising antelopes – but in actual fact pseudocortex performs no such function. If a bunch of other labs try to replicate this result, but fail – because the result was wrong in the first place, then on the one hand you’re right that this is useful, but mainly in correcting the first wrong study and making sure the scientific field gets back on the right track, for instance by working out what else pseudocortex does. BUT if the first wrong study never was published, then all these other labs might have tried out different, far more promising studies, and the whole field would have moved forwards a lot faster.

I think there are two key questions that make the situation more complex:
1) Say everyone is pretty much agreed that pseudocortex only does one of two possible functions, recognising antelopes or koala bears, then a failure to replicate one or other of these ideas is quite useful, as you say. In reality, though, I’d say this situation is rare, the field is not nearly mature enough for such things, and there are usually very many different hypotheses about the functional role of a given brain region. In this way, a bit of evidence ruling out one of many options is next to no use.
2) What does a failure to replicate mean? It commonly means the study didn’t find the given region, pseudocortex or whatever, lighting up for the function it was predicted to. This is termed a null result, and null results are really problematic things, because it’s extremely hard to know whether you didn’t get the activation because pseudocortex has nothing to do with this function in reality, OR because of any of a large number of small, but potentially important differences between your study and the published one – maybe you didn’t test enough subjects, maybe you had a group of subjects that were different somehow from the published group, maybe your scanner is a bit different to the scanner of the published study, or it’s the same, but the technique for scanning in your MRI centre is a bit different, maybe the scanner had a few bad days and wasn’t capturing the brain activation of subjects half as well then, maybe you presented your antelopes in a slightly different colour to the published study, and so on. Scientists can waste a lot of time trying to iron out these differences, and turn a null result back into a positive result, but if the positive published result wasn’t real in the first place, then this is going to be a very depressing wild goose chase. It’s also very generally really hard to publish boring null results in pretty much any scientific field, compared to those exciting positive results. This in itself can prolong the life of a bad positive result, as a bunch of labs then go on to replicate, fail to do this, but don’t tell anyone in a published paper and the wrong positive result’s life is greatly extended in the field as a whole.

I hope the above clarifies things, but let me know if I’ve still not explained things well, or created new confusions!

Dan, thanks! You are so right that more rigor is needed. Neuroimagers need to face up to the fact that perfectly well designed studies sometimes just don’t work. This just reflects our rudimentary (but slowly growing) understanding of cognition and the brain.

However, I think there are two dangers of an excessive focus on “corrected statistics” that adjust for the multiple tests across the brain. It is painful to see how many times lax uncorrected thresholds make it into articles, but at least this is a transparent flaw. A savvy reader can examine the thresholds, and do their own filtering to reject studies where the conclusions aren’t justified. They won’t be misled, and many of the costs associated with a bad publication can be avoided.

What is much more pernicious, however, is the hidden multiple tests that scientists do in the process of exploration. That ambitious young scientist tries different methods (models, software) looking for the “right” way to understand the data. At each stage, they do tests, and only when they come out significant do they decide they’ve found the right method, and report just this one. To a reader, these multiple tests are completely hidden. Worse, it is the beautiful theories that everyone wants to believe in, where the most fishing happens. Perhaps by stamping down too hard on the transparent statistics, the multiple comparisons problem will be pushed to this hidden level.
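
A rough illustration of how quickly this hidden fishing inflates false positives: the sketch below assumes each alternative analysis behaves like an independent test on pure noise, so that its p-value is uniform between 0 and 1. That independence is my simplifying assumption (different pipelines run on the same data are correlated, so the real inflation would be smaller, though still present), and the function names are made up for the example.

```python
import random


def chance_of_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent null tests passes alpha."""
    return 1 - (1 - alpha) ** n_analyses


def simulate_fishing(n_analyses, n_studies=20000, alpha=0.05, seed=0):
    """Fraction of null studies that can report a 'significant' result
    when only the best of n_analyses independent analyses is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under the null, each analysis yields a uniformly distributed p-value.
        best_p = min(rng.random() for _ in range(n_analyses))
        if best_p < alpha:
            hits += 1
    return hits / n_studies


for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3), round(simulate_fishing(n), 3))
```

Under these assumptions, trying five analyses and reporting the best one gives about a 23% chance of a spurious "finding", and twenty gives about 64%, even though the reader only ever sees a single test at p < 0.05.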

A second distinction we should be making is in what statistics are used to test hypotheses, versus what is used to display the data. It is usually very helpful to see an unthresholded map of brain activity, or uncorrected statistics, as it gives the reader an intuitive feeling for the quality of the data. As Jon Simons points out, even Fisher saw thresholds as just arbitrary. If some small patch of brain is activated by some task, it is very useful to know whether the rest of the brain was not at all activated, or whether it was also activated, but not by quite enough to make it to threshold. I believe scientists should be encouraged to report uncorrected or unthresholded maps, in addition to doing corrected statistical tests to support their conclusions. Providing the full data might be useful too, but few people actually have time to go and check it out.

Unfortunately, one of the solutions to correction is opaque and loads far too heavily on the integrity of scientists. In “regions-of-interest” analysis, the scientist states a prior reason why they are interested in a specific brain region. They then test this region, and apply a minimal correction for multiple tests (if any). Unfortunately, this is open to abuse, as the scientist can test many regions behind the scenes and then just justify and report the one that comes out significant. I’ve seen a case where hundreds of regions were tested behind the scenes, rendering the resulting statistics completely meaningless (although they did then get their hallowed high impact publication).

So, we have to be very careful what we ask for. Sometimes, a rawer view of the data, with a bit less in the way of statistical convention, might just make things better.

I totally agree that levels of correction are just one issue of a large set. When I mentioned the manipulative tricks in the main article, I was particularly thinking of choice of regions of interest, as Jon Simons flagged up in his comment, and which you describe the dangers of very well.

Very early on, I remember (please correct me if I’m wrong, Matthew!) Matthew Brett encouraging us to write our (very small number of) a priori regions of interest down in a sealed envelope for an independent methods “auditor” before we test our first subject, and only using those regions in analyses or of course in a published paper.

I know, though, that there are lots of other parameters to explore in order to search for apparently significant, interesting results.

And I totally agree, too, that as long as you are properly correcting for multiple comparisons in tables, text and so on, and only claiming that these activations are genuinely significant, it’s sometimes quite helpful to show uncorrected data on images. As mentioned above, most eloquently by Tal Yarkoni, we’re a little obsessed by finding a couple of peaks of activity, as that’s the easiest story to write about, but it might not be the real story in many cases. If I was thinking (probably very impractical) pie in the sky ideas, I’d love online papers to have a widget for their activation map figures, where it’s presented at the corrected threshold mentioned in the methods, but the reader can manipulate a cluster size and height threshold at will, just to be given their own better feel for the data, and make their own minds up as to what’s important in a more transparent way. If this could be linked to a dynamic table of activations, with t scores and so on, that would be even better.
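
The core of such a widget would be fairly simple. Here is a toy sketch of the two reader-adjustable knobs, a height threshold and a minimum cluster size, applied to a small 2D grid of made-up t scores. Everything in it is an assumption for illustration: real maps are 3D, the choice of 4-connected neighbours is arbitrary, and this only controls what is displayed, it does no statistical correction of its own.

```python
def threshold_map(t_map, height, min_cluster_size):
    """Return clusters of 4-connected cells with t >= height whose size is
    at least min_cluster_size. Each cluster is a dict with its size, peak t
    value and member cells, i.e. one row of a dynamic activation table."""
    n_rows, n_cols = len(t_map), len(t_map[0])
    seen = set()
    clusters = []
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in seen or t_map[r][c] < height:
                continue
            # Flood fill outwards from this supra-threshold cell.
            stack, cells = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < n_rows and 0 <= nx < n_cols
                            and (ny, nx) not in seen
                            and t_map[ny][nx] >= height):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            if len(cells) >= min_cluster_size:
                peak = max(t_map[y][x] for y, x in cells)
                clusters.append({"size": len(cells), "peak_t": peak,
                                 "cells": sorted(cells)})
    return clusters


# Made-up t scores, purely for illustration.
toy_map = [
    [0.2, 3.4, 3.9, 0.1, 0.0],
    [0.5, 4.2, 3.1, 0.3, 2.7],
    [0.1, 0.4, 0.2, 0.6, 2.9],
    [3.3, 0.0, 0.1, 0.2, 0.4],
]

strict = threshold_map(toy_map, height=3.0, min_cluster_size=3)
lenient = threshold_map(toy_map, height=2.5, min_cluster_size=1)
print(len(strict), "cluster(s) at the strict setting")
print(len(lenient), "cluster(s) at the lenient setting")
```

On this toy grid the strict setting leaves one four-cell cluster, while the lenient setting also reveals a two-cell cluster and an isolated cell, which is exactly the kind of "what is lurking just below threshold" view a reader might want.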

In line with this, and following on from Jon’s comment, if there are one or two activations that aren’t quite corrected but follow the same pattern of the data (say they are in exactly the same place as a corrected region, just in the other hemisphere), then there is some use in mentioning them additionally, and letting the reader make their minds up as to their import.

Quite a few times I’ve reviewed papers with no whole brain table or figure in sight, as they’ve only chosen to examine a few small regions of interest, and they do all their stats on these. It infuriates me, and I always insist on seeing whole brain tables. The result is usually a desperate argument from the authors for why they shouldn’t do this, or the tables produced providing evidence against most of their conclusions. I definitely agree that any whole brain picture is better than none, but corrected whole brain data reported seems such an obvious easy step for reviewers to insist on, and it baffles me that this is still ignored so frequently. What do you (and others) think of the practicality of some document approaching Dorothy’s suggestion of a cribsheet/guidelines that the community agrees on for minimum standards, which could be widely distributed (including to journals)?

Russ Poldrack

Rhodri, good point. One thing that we are doing to address this in my lab is to require a written analysis plan that specifies exactly which models and tests will be run before the analysis ever starts. This doesn’t mean that we won’t go and do additional analyses, but at least in that case we will know which were really planned and which were exploratory. I also agree completely with your concern about the use of ROI-based corrections, to the degree that I have little faith in papers that use them unless it’s from a group that repeatedly uses the same ROI.

I’m reading Sam Harris’s “Free Will” and, as a complete layman when it comes to neuroimaging in particular and neuroscience in general, am wondering if any of the questions around neuroimaging studies discussed here warrant increased skepticism of the conclusions Mr. Harris draws from results of fMRIs that indicate a person’s mind is made up about a course of action some number of milliseconds before becoming consciously aware of the choice.

I’m afraid I don’t have the book nearby and am unable to find the specific citation with a quick google search, but what I’m really after is an understanding how much skepticism is reasonable to bring to bear on conclusions about what the conscious mind knows and when it knows it drawn from the results of neuroimaging?

Scanning the comments above I’m afraid I’m more than a little out of my depth and hope I’m at least asking a question that’s related to the dilemma being discussed. If not, please excuse my blundering and … well … carry on.

I’m very happy to answer this here. I haven’t read this book, I’m afraid, but I guess it’s referring to Libet’s work in the early 1980’s, which used EEG (scalp electrodes recording brain activity, millisecond by millisecond) to show that the brain activity for a finger movement starts to ramp up around a third of a second before we consciously decide to move that finger. In terms of the all important question of replication mentioned a few times here, this result has been replicated a few times in independent labs, so I’d say it’s pretty robust and believable.

There is also more recent work, in fMRI this time, by John Dylan Haynes’ lab, which has done a similar study, but shown that you can start to see activity for a given decision ramping up far earlier, up to around 8 seconds before we consciously believe we’ve initiated our decision. The last time I looked, Haynes’ lab replicated their own results in a later experiment, but no independent lab had as yet. So it’s worth being more cautious about this claim, as the methods used are far newer and more complex in fMRI, particularly in the Haynes studies.

As for what it means about free will being doomed, the situation is more complex than it appears, and free will isn’t necessarily doomed, from this body of work. I discuss this at length in my upcoming general audience book on the science of consciousness, if you want another opinion on this field.

Thanks for this link. Yes, it’s definitely a far wider issue than just neuroimaging, but with neuroimaging, with all its complexities and lack of maturity as a field, the potential exploitations may be easier. Also, in some ways there is more at stake in neuroimaging, as it tends to cost more, take more time, and – whether they should or not – the wider public tend to take more stock from some result that relates to the physical brain, compared with a purely behavioural effect. This can be used dangerously by companies, as Dorothy Bishop wrote in her blog and Holly Shapiro mentioned above, to aggressively push the efficacy of their clinical interventions, so in some ways there may be more at stake here.

It seems to me that this extended discussion (and Dorothy’s original blog post) is extremely useful, and the various related cautionary points need to be taken very seriously. Could I add/re-emphasise some related dangers that I fear are too often ignored:

The first has already been covered to some extent in the above discussions and linked blogs, but I think deserves explicit re-mentioning: correction for multiple comparisons should include multiple “contrasts” (meaning multiple different tests carried out, such as speech>music, speech>maths, maths>speech). If you didn’t pre-specify which of (e.g.) 5 possible experimental tests (contrasts) you originally cared about, you should apply a further factor of 5 multiple comparison correction. Note that in this example I included both speech>maths and maths>speech : the majority of tests that people carry out are one-tailed, meaning that if you carry out the test in both directions, then one-tailed derived p-values need that further correction. Of course, if N contrasts are not fully independent of each other, a full correction by a factor of N may be over-conservative, but it is dangerous to apply no correction.

Directly related to this, it is increasingly common in resting-state FMRI to test a number of different resting-state networks for some difference (typically between a control group and a pathology group). The N networks might (e.g.) be estimated from multiple “seed correlation” locations, or found by a group-ICA. As with multiple contrasts, if you didn’t pre-specify which network you were originally interested in, then you need to apply a further correction of N (2N if you tested both directions of difference).
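The arithmetic behind these two points can be sketched in a few lines. This is only an illustration of a plain Bonferroni-style correction; the function name `corrected_alpha` and the example numbers (5 contrasts, 10 networks) are made up for the sketch, and, as noted above, a full factor of N may be over-conservative when the tests are not independent.

```python
def corrected_alpha(alpha, n_tests, both_directions=False):
    """Per-test threshold after a Bonferroni-style correction.

    n_tests is the number of contrasts or networks that were not
    pre-specified; both_directions doubles the count when each one
    was tested with a one-tailed test in both directions.
    """
    n_comparisons = n_tests * (2 if both_directions else 1)
    return alpha / n_comparisons


# 5 candidate contrasts, none pre-specified: divide the threshold by 5.
five_contrasts = corrected_alpha(0.05, 5)

# 10 resting-state networks, each tested in both directions: divide by 2N = 20.
ten_networks_two_sided = corrected_alpha(0.05, 10, both_directions=True)

print(five_contrasts, ten_networks_two_sided)
```

This extra factor is applied on top of whatever correction is already used across voxels within each map.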

Finally, a more general concern relates to new, exciting, but often complicated/obscure, methods for analysis, often multivariate, often data-driven. With such new methods it can be quite hard for authors and reviewers to understand whether rigorous statistical testing is being achieved. The general question that must always be asked is: “what is the relevant null scenario” (typically no result of interest), and then “has this new method achieved rigorous control of the false positive rate for this null scenario”. Too often this question is not asked.
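One generic way to ask that question of any analysis procedure is to feed it data generated under the null scenario many times and count how often it reports a result. The sketch below does this for a deliberately simple toy case (an experiment counts as "positive" if any of 5 pure-noise contrasts passes a one-tailed threshold). All names and numbers are assumptions for illustration, and a real check would simulate null data with realistic noise and run the actual method on it.

```python
import random
from statistics import NormalDist


def any_contrast_significant(n_contrasts, z_threshold, rng):
    """One simulated null experiment: every contrast is pure noise, z ~ N(0, 1)."""
    return any(rng.gauss(0.0, 1.0) > z_threshold for _ in range(n_contrasts))


def empirical_false_positive_rate(n_contrasts, alpha_per_test, n_sims=5000, seed=0):
    """Fraction of null experiments in which at least one contrast passes."""
    rng = random.Random(seed)
    # One-tailed z cutoff corresponding to the per-test alpha.
    z_threshold = NormalDist().inv_cdf(1.0 - alpha_per_test)
    hits = sum(
        any_contrast_significant(n_contrasts, z_threshold, rng)
        for _ in range(n_sims)
    )
    return hits / n_sims


uncorrected = empirical_false_positive_rate(5, 0.05)
corrected = empirical_false_positive_rate(5, 0.05 / 5)
print(f"uncorrected: {uncorrected:.3f}, corrected: {corrected:.3f}")
```

If the rate measured under the null is well above the nominal level, as it is for the uncorrected case here, the procedure is not controlling false positives, however sophisticated it looks.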

Steve, I think resting-state fMRI raises its own peculiar problems that are more serious than knowing when and how to employ multiple corrections. Morcom and Fletcher wrote an excellent article 5 years ago about the pitfalls of not employing explicit tasks. Now I read articles on the need to employ yet more post hoc correction of residual motion (Van Dijk et al.) and the need to employ covariates derived post hoc to characterise what the participants might have been thinking about (Doucet et al, 2012; Preminger et al., 2011) – which is just introspection in modern guise (but then if you’re content with introspection as a method to interrogate your data, why not psychoanalysis?). The test-retest reliability of various graph theory metrics for resting state fMRI is abysmal (Braun et al., 2012) if you use criteria from psychometric theory where an intra-class correlation coefficient (ICC) > .8 is a requirement for a clinically useful test, yet task-based fMRI can provide ICCs of this magnitude (Plichta et al., 2012) even with something as credulous in terms of model selection as DCM (Schuyler et al., 2010 – but at least we’re employing a model testing approach).

NeuroPrefix

It seems to me that the unique publishing structure of PNAS has avoided any criticism here.

I have read a number of papers ‘contributed’ to PNAS that have fundamental flaws in experimental design that completely muddy the interpretation of their findings (though I’m sure there are some that are of higher quality). In contrast, papers submitted directly for review are often of a much higher standard, so one has to be careful in citing PNAS. Why does the journal persist in supporting this kind of bi-polar publishing model?

This is one of the issues that Dorothy Bishop has centred on in her blog article that I mentioned at the start of this article. And she’s now returned to it in a later blog here. So you might want to take a look there for discussions on this particular point.

Brad Buchsbaum

It would be interesting to try and track how and where the 0.001 “convention” came about. My impression is that 0.001 came about as a kind of reaction to the “too conservative” whole-brain corrections. And 0.001 was not coming from novices who didn’t know any better but, on the contrary, from fMRI sophisticates who were in tacit agreement that it was in some sense OK to get a bit lax with thresholds. I think that’s because after about the year 2002, the notion of a “uniform spatial prior” (i.e. a completely agnostic view as to where an activation cluster is likely to appear in a given experiment) was increasingly not the norm. In other words, a person studying social cognition isn’t going to write a paper about a cluster in the cerebellum, so why should he/she be penalized by using a whole-brain correction factor? But then why not use an ROI approach? Well, then reviewers will ask if we’re not missing something elsewhere. One will be accused of ROI myopia.

An example. A researcher is really interested in the hippocampus. And he’s got a cluster, p < 0.001 (uncorrected). But he also wants to show a whole-brain map. But he doesn't want to use the whole-brain corrected threshold (because he's really interested in the hippocampus). And he doesn't want to use an ROI either, because everyone expects a whole-brain map. And so on.

By the way, would the state of affairs be improved if everyone went back to the most conservative conceivable correction method? I don't think so. There is a trade-off in the cost of type I and type II errors, and there is a melancholy medium somewhere between the worst excesses of uncorrected thresholds and Bonferroni. P < 0.001 probably isn't the right answer, but that is the answer that spontaneously emerged from the neuroimaging community for a few years. And I contend it was from the sophisticates, not the novices.
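To put rough numbers on the two extremes, here is a back-of-envelope sketch. The voxel count of 50,000 is an assumed, illustrative figure, not taken from any particular study, and treating voxels as independent ignores spatial smoothness, which is one reason a straight Bonferroni correction is so conservative for imaging data.

```python
def expected_false_positive_voxels(n_voxels, p_threshold):
    """Expected number of voxels passing by chance if no voxel has a true effect."""
    return n_voxels * p_threshold


def bonferroni_voxel_threshold(n_voxels, alpha=0.05):
    """Per-voxel threshold that keeps the whole-brain error rate at alpha."""
    return alpha / n_voxels


N_VOXELS = 50_000  # hypothetical number of voxels in a whole-brain analysis

chance_voxels = expected_false_positive_voxels(N_VOXELS, 0.001)
strict_threshold = bonferroni_voxel_threshold(N_VOXELS)
print(chance_voxels, strict_threshold)
```

Under these assumptions, p < 0.001 uncorrected lets through roughly 50 voxels by chance alone, while Bonferroni demands about p < 0.000001 per voxel; the medium being argued over lies somewhere between those two.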

Russ Poldrack

Brad – I would like to believe that the ROI analyses are truly driven by a priori hypotheses, and indeed I think that’s a legitimate approach. However, I know for a fact that it is regular practice in some groups to identify the ROIs for small volume correction after having seen the group map (i.e. in a completely circular manner). One can always come up with a post hoc justification for any particular ROI.

Russ – that’s a common approach in the social sciences. Daryl Bem advocated it in a textbook that’s still in use: “There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b)”. See the chapter at http://dbem.ws/WritingArticle.pdf

The approach also has a name, see Kerr N.L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217.

Note that I’m not advocating the Bem approach (I agree with Kerr’s views), just noting that it ‘is out there’ and not at all specific to neuroimaging. So why do we see it in neuroimaging? I think Kerr is absolutely right: “…when less agreement exists about what the important questions are (because theories are primitive and unrelated or because the utility for application is less clear), and when there is tremendous competition for journal space and for readers’ attention, there will be a real premium on how well an author can convince the reader that his or hers is one of those important questions; that is, on how good a story an author can tell. I suspect that a greater premium is placed on telling a good story in relatively newer (and hence, immature) sciences (a testable conjecture), precisely because there is much less consensus on what the key questions are”.

A terrific blog that I’ve stumbled on thanks to some links from Rhodri’s students. The idea of mass retractions or somehow ignoring papers with poor statistics is an interesting one. If only one could get back the thousands of hours lost to reading them!

So using the standards being proposed here, I suggest the retraction of the following 3 papers which would not satisfy the statistics mafia

OK, you say these have been replicated tens of thousands of times now, but I seriously doubt anyone here would be doing fMRI if these papers hadn’t been published. So the ability to take a published result and replicate it (and improve on it) also has some merit in the progress of science. This argues against retractions based on incomplete (or in our case, naive) statistics.

Contrast with an imaging paper that appeared on the cover of Science in 1991, using a double dose of Gadolinium to do functional brain imaging.

Belliveau et al., 1991, Science, 254, 716-9

As best I know, this result has never been replicated in a publication by any other lab, and it also suffers from the same naive statistics. Is it wrong? Of course not. It’s just that not many ethics boards allow that kind of dosing, and BOLD burst on the scene the next year, so it fell by the wayside.

Good science is more than statistics. It’s the method to the madness that also counts. Savvy reviewers can figure that out without the crutch of statistics to do triage.

Full disclosure: I know what averages and standard deviations are. And t-tests. But that’s it. That’s what they teach us in physics. And it seems to have been good enough to crank out a few papers in fMRI.

As an historian of neurology and neuroscience observing this discussion from ‘outside’ the world of scientific practice, I am enormously impressed by the rigor and response of all of these exchanges, as well as their timeliness. I can think of few instances in my research on neuroscience and neurology where I have seen such rigorous discussion take place in so public a fashion (although critiques of phrenology and public discussions of cerebral localization studies do come to mind).

It is striking to me that largely missing from this conversation has been discussion of the suitability of neuro studies for explaining questions that have long dominated in the human sciences. Questions such as: What is personality? What is culture? How should we govern and be governed? At the risk of moving this conversation beyond the problem of technical issues (highly germane and important), I think that it is also necessary to recognize that the largest perhaps most persistent tendency in many studies with which neurocritics and scientists alike contend is the tendency to infer and extrapolate greater meanings from any of these studies – even the very best ones – towards conversations it is not clear the technology, science, models, or statistics can suitably address. For some reason those inferences/meanings are nevertheless drawn from this work or built into the experimental design. I consistently observe that there seems to be a great mismatch among the questions posed, the design of the experiments, the conclusions drawn in the literature, the popular science books that explain “what everything means”, and the story told in the popular media. In other words, the media is not solely the responsible party here.

Two points in particular arise from these facts. Much of the work appears built on sand. As Dorothy Bishop pointed out so eloquently – every published paper can only be as good as the work it cites in support. Knock even one of those papers out on grounds ranging from inadequate correction to outright falsity, and questions follow about how many of the papers that came afterwards can plausibly stand. The second point is that despite a great deal of effort – often (as is pointed out above) at great expense and cost of time – it is the inferences that persist and cause the problems. I can believe this trend will become further exaggerated in the turn of management science, economics, psychology, sociology, history, political science and law towards studies of this fashion.

Were it only a matter of cost and the development of technique, then the faddish element would eventually go away and hard science would replace it. But I don’t think that anyone here thinks that is happening. I suspect that many of us are worried that a careful analysis of many published papers as well as their life in public science would demonstrate that the interpretation of these results has a self-confirmatory bias. In other words, these studies tend to be interpreted to accord with the social-political values of the person doing the interpretation. The design of the experiment simply validates what the researcher already believes. I think this would also place limits on the peer review process which would transcend concerns about nepotism.

My two cents – for what they are worth.

Peter Bandettini

What a great set of comments! Ravi, I think you hit an important point on the head. I certainly respect the massive effort to increase the rigor with which we address our data. However, I think that we are still far from fully understanding the nature of the signal and the noise. As an example: fMRI latency. The distribution of the vasculature is not ascertained for each study, yet many have shown significant latency differences between one area and another or between one pulse sequence and another. This may be due to unknown bias in sampling different vascular pools (i.e. downstream veins vs capillaries). Most experienced researchers know this and find better ways of addressing this question (i.e. modulation of task timing, etc.), but, for those who don’t consider the vasculature, statistically significant (yet somewhat meaningless) results have been obtained. Also, motion correction is never perfect, and I have yet to see how uncorrected stimulus-correlated motion is removed statistically as a false positive (other than perhaps using the fact that such changes are rapid while hemodynamic changes are slow – but even that is imperfect). Many just look at the data and easily identify “edge effects” as motion, etc.

One other note. We have a paper coming out in PNAS (Gonzalez-Castillo is first author) that suggests that even with simple tasks, activation is everywhere. Does this imply that the null hypothesis here is just wrong? What then?

As Ravi mentioned, while statistics are fundamentally important, good science is more than statistics. Many “true” results are statistically unverified, yet some meaningless results can be shown to be statistically valid if not all the variables are considered.

Thanks, everyone. It is about time our field had a serious discussion on this topic.

I have occasionally asked respected colleagues what percent of published neuroimaging findings they think would replicate, and the answer is generally very depressing. My own guess is *way* less than 50%.

The simplest solution for scientists is that if you are reporting something important and novel (and why else would you bother to publish a result?) you should replicate the main new effect at least once in your paper (ideally in a second experiment that enables you to test other hypotheses). I am sick of arguments that this is too time consuming or too expensive. It is much more important in my view to replicate your result and publish one solid result, than it is to jump to a second study and publish two dubious results. So, just publish half as much stuff, and make sure what you publish is real. Think what a different field we would have if everyone did this! This has been the policy in my lab for 15 years – every major new result is already replicated at least once within the original paper that result was published in. People may think my work is wrong-headed, or misinterpreted, but I don’t think people often doubt the replicability of the results I publish. I wish this practice was more common in our field.

How can we get scientists to take the replicability of their results more seriously? Here is my fantasy – not completely realistic, I suppose, but I still love the idea:

NIH sets up a web lottery, for real money, in which neuroscientists place bets on the replicability of any published neuroimaging paper. NIH further assembles a consortium of respected neuroimagers to attempt to replicate either a random subset of published studies, or perhaps any studies that a lot of people are betting on. Importantly, the purchased bets are made public immediately (the amount and number of bets, not the names of the bettors), so you get to see the whole neuroimaging community’s collective bet on which results are replicable and which are not. Now of course most studies will never be subjected to the NIH replication test. But because they MIGHT be, the votes of the community are real. Various details would have to be worked out (who gets to vote? what exactly counts as a replication? etc.). But this system would clean up our field in many ways. First and foremost, it would serve as a deterrent against publishing nonreplicable crap: If your colleagues may vote publicly against the replicability of your results, you might think twice before you publish them. Second, because the bets are public, you can get an immediate read of the opinion of the field on whether a given paper will replicate or not. (Like Faculty of 1,000, but with real money riding on the opinions.) Third, once the NIH-approved blue ribbon replication panel starts to deliver its verdict, we would actually learn which published results replicate and which do not. Fourth, the tally of bets from peers could be used to measure the opinion of the field not only on the replicability of a given result, but on the overall judged replicability of all the results from a particular scientist, and/or a particular journal.

A third point: Some of the fault lies in the journals themselves, and the fancier the journal the greater the problem. I once participated in one of those infamous reviews for Science where I wrote a scathing review, and so did the other reviewer, and the paper was published. When I confronted the editor, saying there was a very good chance that the paper would not replicate, he basically said that was fine, they were more interested in novelty than replicability. Wow. That had in fact been exactly my impression of the priorities at Science, but it was shocking to hear it endorsed explicitly. These priorities, which are not unique to Science (but seem particularly bad there), are a menace to our field.

Rahul

Thanks, a very interesting article. Between this and the correlation/causation issue, I get the feeling that many branches of science will have issues. Finance for example, where there can be large numbers of variables.

I wonder if there is a case to be made for a “statistical methods” body to be created as a research institution by the Government.

Also, it would be easy to automate a process of taking papers from journals and using some sort of linguistic analysis to see whether the study might suffer from various methodological problems, and further scrutinising those that seem likely to have issues. Or maybe Google can take it on.

I think we will find that there is a larger issue of the business world and government using inappropriate techniques. The training in statistics is much lower, and there is little peer review. Underlying data and calculations are rarely in the public domain. We rely on many types of this analysis to make decisions and commit money or resources to a variety of issues.

Federico Turkheimer

I was directed to this blog by colleagues and with amusement I find a debate that was already waged 10 years ago on journals* and the SPM mailbase but obviously has had little impact on the field.

May I raise a point for discussion?

The mass univariate testing model is based on the phrenological idea. Does anyone in this blog believe this is a credible integral model of functional brain activity? Years ago with Matthew Brett we built a wavelet SPM (Phiwave) that provided multi-scale activity analysis and as an output an estimate of the effect size. The problem was that Phiwave was always showing whole brain changes for any task so we always had a hell of a time with the reviewers as they preferred to see blobs. This has been vindicated by the recent discovery that brain activity is of fractal nature and any functional signal (EEG, MEG, fMRI) has an unlimited correlation function bounded by the extent of the brain. So, to answer Peter Bandettini’s question, yes of course the whole brain will change at any task. Please check the recent literature on brain and self organized criticality and reviews of Werner on Frontiers.

Failing to correct for multiple comparisons is an elementary statistical mistake. It’s astonishing to me that papers ever got published. That isn’t the only problem though.

As a complete outsider to the world of imaging, I recall getting into trouble when I criticised the original taxi driver paper. It struck me then (and still does) as resembling the lowest grade of observational epidemiology. It wasn’t even case-controlled. The evidence, or lack of it, for causality wasn’t discussed at all. The most elementary principles of randomisation and blinding were ignored. Perhaps someone here can tell me how many fMRI studies are analysed blind, and how control groups are selected now.

I think we agree that there are a lot of problems in our field. Some of the problems are due to ignorance about the analysis or the underlying signal or the nature of statistical inference. There is obviously bias introduced by the need for publication and the desire for recognition.

But I think the underlying theme is one of lack of care. I mean lack of care about whether the result that we publish is right.

This explains why people rarely replicate. It explains why it is difficult to engage people in discussion about whether the inference is correct.

Why does this happen? I mean why is it that we don’t care enough about whether we are right?

I think the answer lies in science as it is practiced. The practice of science can be something like a game played against our reviewers, editors, tenure committee and grant funding agencies. We care about winning that game, so we care less about whether we are right.

This is the classic paradox described in “Punished by rewards” by Alfie Kohn. He reviews many studies showing that if you reward someone for a task they find of intrinsic interest, they often perform worse and they lose interest in the task more quickly. We reward people for publications and they lose interest in science.

I think the problem is fundamentally about what a ‘publication’ is:

“An article about computational science in a scientific publication is *not* the scholarship itself, it is merely *advertising* of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures” Buckheit and Donoho, http://www-stat.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf

A publication is advertising. The product advertised may not be any good. It may not even exist. It is dangerous to reward scientists for success in advertising. I believe this is the root of the culture that allows the abuses we are discussing here.

to David Colquhoun
I think you are rather muddying the waters by bringing up the taxi driver study, which I know is a bete noire of yours. Maybe some time you and I should have a blog-based debate about the extent to which nonexperimental studies have value: i.e. those where you can’t manipulate independent variables of interest, as I know this vexes you! You seem to think that unless you can conclusively demonstrate causality, it’s bad science, but many of us work in areas where it’s not feasible – e.g. can’t randomly assign people to be taxi drivers or not. And astronomers can’t test theories by changing the positions of planets. Of course this makes the research messier and conclusions about causality hard to draw, but it can generate testable predictions for further studies, which is what has happened with the taxi driver work. Often what we do isn’t definitive, but it narrows down the range of possibilities and rules out some lines of explanation.
I’d also add that the multiple comparisons correction issue is well recognised in the fMRI field and some very clever brains have been devoted to developing methods to deal with it. There are also genuine reasons to be concerned about whether the balance between avoiding false positives and committing false negatives may have been got wrong. But the real problems occur when people deliberately bend the rules and when they overhype findings – I agree with all Matthew Brett has said about that!

Of course I agree entirely that it is not always possible to randomise, though it is always possible to analyse blindly. I also agree that when it is genuinely impossible to randomise, that doesn’t mean that the observations are not worth having.

I’m still struck, though, by the contrast between epidemiologists and those imaging papers that I’ve read (not nearly as many as you have read, I’m sure). Epidemiologists take the matter of causality quite seriously, and have invested large amounts of time and money in doing large prospective surveys. Of course even large prospective surveys have, not infrequently, proved to give the wrong answer when tested against RCTs, but they are sometimes the best that can be done. I suppose it’s also true that epidemiologists, having discussed seriously the matter of causality in the Introduction, can be quite cavalier about making recommendations in the Discussion which assume causality has been demonstrated when it hasn’t.

I freely admit to having a bee in my bonnet about causality. I think it is a more serious problem than many people admit. Not least in the imaging community.

The more I think about this, the more I’m starting to think that multiple comparisons in neuroimaging is not as simple as some (including me in the past) have made out because of how unlikely it is (given what we now know about the sensitivity of fMRI) that the brain wouldn’t do anything in response to a task… the meaningful null hypothesis in many studies I think is not “nothing happens” but “everywhere lights up weakly, some areas are a bit stronger than others”. In which case, by being conservative, we might actually end up being liberal about what we count as a “local activation” and given that 99% of us are interested in particular brain regions… that could be a problem…

I’m working on a post about this.

Federico Turkheimer

Yes, of course the global null hypothesis is always false and the debate every 10 years or so catches fire again.

If it helps.

Jernigan in 2003 (http://www.ncbi.nlm.nih.gov/pubmed/12768533) proposed the adoption of effect size maps and highlighted the problem that usual analytical methodologies only determine associations (e.g. this area is associated with…) and never double dissociations (this area is associated with … more than any other one), which is the crux of localization in brain sciences (if one is really interested in it). This makes the point that at present we are not really localizing much but creating the illusion of localized effect by underpowering the studies.

In my previous post I referenced a paper of mine (2004) where I highlighted the logical problems behind multiple comparisons (all the various flavours) generally and in medical imaging in particular. For example, I mentioned the notorious fact (known outside neuroimaging) that FDR should not be used in an experimental setting unless one is in exploratory mode and/or can repeat the experiment with a newer cohort, even with an ancillary methodology, as is done in molecular biology/genetics.

Just want to congratulate Matthew on a magnificent post that will be hammered into the heads of all the PhD students. Of course he is right, the problems in neuroimaging have nothing to do with statistics but, as everywhere, with the incentives (scientists, journal editors etc.).

Laura Germine

This problem is particularly bad in clinical research, I think, as certain populations are very difficult to recruit from and very difficult to test, so the cost of false negatives is high and this reduces methodological rigor. I have seen studies with thresholds of p < 0.05 uncorrected in fMRI with clinical populations. Combine this with populations that are protected by certain research groups, and you get the untenable situation where methodological standards are low and external replications virtually impossible due to the fact that one lab or group of labs "owns" the group of participants.

Whatever the implementation, having a way to track or incentivize replications would likely raise standards and also prevent these monopolies on certain patient groups that make replications so difficult in clinical research.

I think it is worth bearing in mind the following quote from the great statistician John Tukey: “An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.” http://en.wikiquote.org/wiki/John_Tukey

Clearly it is important to do multiple comparisons correction and related stats in the proper manner. However, much work in the 1990s consisted of developing incredibly sophisticated and elegant ways of giving precise answers to (what I believe to be) the wrong question, namely: what is the best way to do mass-univariate fMRI stats? In this, I agree with Federico Turkheimer’s comment above, although perhaps coming from a different direction.

While Gaussian Random Field and AR(1) correction models were getting more and more sophisticated, Jim Haxby asked a very simple but much more powerful question: why are we looking at brain activation just one voxel at a time? His alternative approach simply used high-school math: spatial correlation. But it launched an entirely new way of doing brain-imaging. http://www.sciencemag.org/content/293/5539/2425.short
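For anyone who hasn’t seen the idea, here is a toy sketch loosely in the spirit of that split-half correlation approach, not a reproduction of the published analysis. The voxel values are invented purely for illustration, and `pearson` and `classify` are hypothetical helper names: a pattern from one half of the data is assigned to whichever category’s pattern from the other half it correlates with most strongly.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of voxel responses."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def classify(test_pattern, reference_patterns):
    """Label of the reference pattern that correlates best with test_pattern."""
    return max(
        reference_patterns,
        key=lambda label: pearson(test_pattern, reference_patterns[label]),
    )


# Invented response patterns over 5 voxels, one pattern per category per half.
odd_runs = {
    "faces": [2.0, 0.5, 1.5, 0.2, 1.0],
    "houses": [0.3, 1.8, 0.4, 1.6, 0.9],
}
even_runs = {
    "faces": [1.8, 0.7, 1.4, 0.1, 1.1],
    "houses": [0.5, 1.7, 0.2, 1.9, 0.8],
}

for label, pattern in even_runs.items():
    print(label, "->", classify(pattern, odd_runs))
```

The point of the sketch is that the information is carried by the pattern across voxels, so no single voxel needs to pass a threshold on its own.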

If the question being asked is “Which bit of the brain lights up when subjects do task X?”, then the answer is going to have very limited value (in my personal view), no matter how carefully done the stats are.

I know that not everybody agrees with this. But it’s my view, for what it’s worth. The danger of feeling statistically sophisticated is that you may be giving a precise answer to the wrong question.

A genuinely fascinating discussion, and great to see so many of the interested, and highly expert, parties contributing to a public forum.

A couple of my own thoughts on the topic.

1. As ‘Neuroapocalypse’ mentioned above, there are issues with all three of the widely-used multiple-comparison correction methods (FWE, FDR and Cluster correction). I would probably, if pushed, put forward a vote for cluster correction, as it’s the only one which derives from a plausible model of how the brain actually works – we know the brain activates regionally, and that single voxels or small clusters are likely to be noise – why not exploit that?

2. A few people have mentioned ROI-definition methods above, but no-one has really mentioned the many advantages of such an approach, namely: a) all analyses can be done in individual subject space – no normalisation necessary, b) No, or minimal, smoothing is required which preserves the high spatial resolution of your data, c) smaller numbers of subjects are generally required, d) the multiple comparison problem (at the group level, anyway) disappears, since each subject’s/condition’s data point is simply a mean response amplitude from a ROI.

Of course there are drawbacks, the most important being the necessity of an appropriate and independent method of defining ROIs, either functionally with a localiser, or anatomically. Single-subject ROI methods aren’t appropriate or perhaps even desirable for many investigations, but for those of us who focus on relatively low-level perceptual processes (vision, pain, etc.) they are definitely a viable, and I’d argue, an attractive option. Also, the recent great work by Ev Fedorenko and Nancy Kanwisher shows that it’s also a highly workable approach when looking at more cognitive processes, like language.

I realise this is likely to be a minority view, and (also as mentioned above) there is huge scope for doing ‘bad’ ROI studies (most commonly, by non-independent selection of regions), but done well, it’s a very powerful approach.
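To make point 1 concrete, here is a rough sketch of how many "active" voxels a brain containing no real effect produces under different thresholds. It is only an illustration under stated assumptions: the voxels are treated as independent (real fMRI data are spatially correlated), Bonferroni stands in as the simplest FWE-style correction, Benjamini-Hochberg stands in for FDR, and cluster correction is not modelled at all. The function name and the 50,000-voxel figure are made up for the example.

```python
import random


def count_false_positives(n_voxels=50_000, seed=0):
    """Count 'significant' voxels in a simulated brain with no true effect."""
    rng = random.Random(seed)
    # Under the null hypothesis, p-values are uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_voxels)]

    # Uncorrected threshold often seen in older papers.
    uncorrected = sum(p < 0.001 for p in p_values)

    # Bonferroni: divide the 0.05 family-wise level by the number of tests.
    bonferroni = sum(p < 0.05 / n_voxels for p in p_values)

    # Benjamini-Hochberg FDR at q = 0.05: keep the k smallest p-values,
    # where k is the largest rank with p_(k) <= (k / n) * q.
    q = 0.05
    fdr_bh = 0
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= k / n_voxels * q:
            fdr_bh = k

    return {
        "uncorrected_p001": uncorrected,
        "bonferroni": bonferroni,
        "fdr_bh": fdr_bh,
    }


if __name__ == "__main__":
    print(count_false_positives())
```

With tens of thousands of voxels, p < 0.001 uncorrected is expected to let through dozens of pure-noise voxels, which is the whole argument of the post in miniature.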

Marc Pelletier

1) In clinical fMRI, it seems to me that there is no consensus about the thresholds and the methods to be used to objectify a significant group difference for brain activations. In addition to the multiple thresholds used to show activation in the “control” group (as discussed here, 0.05corr vs 0.001unc vs 0.01 in an a priori ROI), there are multiple thresholds used to show the group difference per se (as Laura discusses). In fact, some labs now have simplified the methodology. They show a group difference in a given ROI, without showing that the locus or the ROI is significantly active for the “control” group for the task under study (but of course the discussion goes on as if the region was assumed to be active for the control group). Actually, some labs don’t even “show” the brain difference (except for “illustrative” reasons). SPSS computations seem to be enough for the editors.

2) In clinical fMRI even more than in cognitive neuroscience, there is no consensus on how to interpret the brain-behaviour connection. In particular, a hypoactivation combined with a normal performance in a clinical group has been discussed very liberally in terms of “window for hidden behaviour” or of potential endophenotype, only very rarely as an experiment that failed.

3) By design, clinical groups ARE different from the so-called “control” group, which is really a sample of the normal population. Of course, this is true for the whole clinical field. What is so disappointing in clinical fMRI is the total absence of will by the community to control for at least the variables known to affect most neuroimaging. To take schizophrenia as an example, we know that patients who suffer from this terrible disease are more often heavy smokers, eat more unhealthy food, exercise less, have more cardiac and vascular problems, consume more caffeine, and are scanned later rather than sooner in the day: all things known to affect brain activations. The only variables that are used to match the groups are the same used in psychosocial studies (most often: gender, parental SES, sometimes IQ), as if neuroimaging didn’t require its own matching variables.

CrewsControl

“But in neuroimaging, an only slightly unscrupulous scientist can learn the many tricks hidden in this huge pile of data”
And
“I wouldn’t call this fraud, as the scientist might well have some skewed self-justification for their invalid steps, ”
And
“I know of a few senior scientists that employ such “techniques”, and any neuroimaging researcher who’s been in the field for some years could probably do the same.”

What a thoroughly depressing insight into the behaviour of some ‘professional’ scientists. I’m sure Peter Medawar in ‘Advice to a Young Scientist’ didn’t cover how to be only slightly unscrupulous in science. I’m sure he would have viewed a ‘trick’ as something to be confined to the activities of the Magic Circle. Where are the whistleblowers when you need them? If you know of the statistical sleight of hand of some of your colleagues, should you not raise this with them or report your concerns to a senior member of your department/university? Failure to stamp on it surely means it propagates from supervisor to student. And when the dirty linen is hung out the whole department is surely tainted.

Furthermore if, post publication, it has been discovered that some researchers have applied inappropriate analysis to their data then surely it is the responsibility of the journal’s editorial board to invite the authors to submit a corrigendum with the results of the correct application of the statistic; after all they have the original data. If applied rigorously this would have a sobering effect on authors, reviewers and editors.

All in all I’m rather shaken by what I’ve read here. fMRI? What’s the ‘f’ for? False, flimsy or fabricated?

Scientific Learning claims the FastForward program will “cross-train” the brain. It will make the brain “stronger” so children can read more successfully and with greater enjoyment. It just came to my attention that even the Chicago Public Schools have bought into it.

A virtual satellite symposium on neuroimaging and the sociology of science!

There is no doubt a vast number of sketchy neuroimaging studies. I think the field faces the Herculean labor of cleaning out the dirty stable.

I agree with the many commentators who have noted that for many studies, the principal problem lies not in the methods themselves, which have been up to the job for quite a while, but in the erratic application of these good practices. Why does this discussion seem to crop up every few years? In addition to the pressures that all scientists face to publish in impact journals, get grants, etc, I think there are some idiosyncrasies to the sociology of neuroimaging. I’ve recently written an essay (http://www.cfn.upenn.edu/aguirre/wiki/_media/public:papers:shackles_and_keys_aguirre.pdf) that touches on this. The short version is that neuroimaging combines a nearly ubiquitous technology (MRI scanners) with freely available software with many knobs. From the essay:

“Consequently, little more than research access to a hospital and an internet connection is needed to perform neuroimaging experiments and produce pictures of the brain. Inevitably, investigators are able to analyze fMRI data with minimal understanding of the many statistical processes and assumptions which lay behind the buttons, and the images that they ultimately produce. A playful analogy to this circumstance considers the difference between long and arduous martial arts training that produces both power and responsibility, and the provision of a gun.

The ready availability of neuroimaging hardware and software has meant that the technique can be readily adopted in new areas of intellectual inquiry, without having to bring along the hard-won cautionary experience that arises from years of practice. While best practices may be understood in areas of inquiry for which neuroimaging is a mature technique, this is often not the case in the first, exciting rush of novel work in a new field. Essentially, the phylogeny of the evolution of neuroimaging technique is recapitulated anew each time fMRI seeds new ground.”

I should add that the democratization of the technique has been a boon, but I think it cuts both ways.

“The ready availability of neuroimaging hardware and software has meant that the technique can be readily adopted in new areas of intellectual inquiry, without having to bring along the hard-won cautionary experience that arises from years of practice.”

First – I’d like to ask your opinion. Is it really true that most of the problems in neuroimaging are due to inexperienced people doing the analysis? Put another way, if the motivation to be transparent is low, the variations in the analysis are high, and the rewards for publication are high, what is going to stop some people, experienced or inexperienced, from publishing results that are likely to be wrong?

Second – do you think we are teaching our people the right thing? Do you think we are teaching them enough? My impression is we often teach people badly. We give them short courses that are hard to understand and that would be difficult for engineers, let alone psychologists, and then we set them loose saying – ‘OK – go and scan’. I have seen new neuroimaging researchers giggle nervously when anyone asks about the analysis. It has been so badly taught that it seems like magic to them. They are thinking “Surely no-one would reasonably expect me to be able to do this with the training I just had”. But then they discover “Oh my God, they _are_ expecting me to do this”. Chaos and cynicism is the only predictable result.

So I believe that the only way out of this is to agree that a neuroimaging paper, as for any computational paper, should be reproducible, from raw data to the figures in the paper. Anyone publishing would then know that everything they did has to bear review, from anyone. That is the democratization we need, not only a democracy of output, but a democracy of true and fundamental review.

I think we would agree that methodological inexperience is but a small part of a big set of issues. More to the point, I was focusing less on the inexperience of individuals than that of fields of inquiry. If you are the first to conduct (e.g.) neuro-politics studies, your papers may be reviewed by those who are not yet wary of uncorrected p-values. A well rendered neuroimaging figure can be a deceptive hallmark of quality science.

The preceding to the contrary, I’m perhaps not as pessimistic as some. As a field, we’ve come so astoundingly far in the sophistication of the questions we can ask and the rigor with which we can answer them. We all have web sites and we can make our stimuli, data, and software available to others. If we train our students to be skeptical of their own results (and everyone else’s!) we’ll keep creaking forward. I am especially encouraged by the rise of blogging as a tool to help keep us all on the straight and narrow (e.g., looking at you, Neuroskeptic and Neurocritic, not to mention those who initiated this monumental discussion itself!).

* Is there a major problem in the quality and reliability of published neuroimaging results?
* If so – what should we do about it?

I believe there is consensus here that there is a problem. For example, Nancy Kanwisher believes that “*way* less than 50%” of neuroimaging results will replicate. My guess would be about 30%. If 70% of neuroimaging papers are wrong I guess we’d agree that’s a very major problem, and we really do have to do something about it.

Well, I cannot say I am surprised, although my reasons for distrusting the bulk of neuroimaging studies are more mundane and logical: having read so many of them, I have seen the amount of mutual contradiction amongst them. It invokes the Duhem-Quine problem, does it not?

The problem again as I see it, is that there are not only the specialties digging ever deeper pits, in ignorance of who is shoveling away in the next plot of land to them, there are the specialties within specialties building their tottering tower of Babel (now there is a paradoxical mixed metaphor, one minute holes and the next towers, but you get my drift I hope) on ever shakier foundations because they can only go forward, never back to check if the foundations are sound.

It is an economic and a social problem more than it is a scientific one, because it is driven by the academic research industry in search of self justification and continuation and the publishing circus about which more later : …..

Quite apart from that, I came here looking to see if I could discover again the Ur Paper on artefacts caused by oversampling of voxels, which was one cause that led me to doubt fMRI’s claims to be the microscope du jour. It has been clear to me for some time that there is a deep malaise in neuroscience, and one that even the big names can’t see as they put their names to countless papers which massively abuse statistics in a way no social scientist would be allowed to get away with (I hope). It’s about time some of them went back to school and learnt the basics.

We live in bad media times, there is so much out there, never mind the peer reviewed standards that are a self replicating virus in themselves, there are the faux peer journals that are set up to give an easier route to publishing for wannabee mad scientists with crackpot theories, the Wakefield caucus for example.

In this world we can add “argumentum ad google” to the list of logical fallacies whereby the public can get multiple hits for almost any position they wish to defend, because there is no distinction of quality.

Trouble is, amongst what is legitimately published there is not that guarantee either; as any master of google fu can tell you, it is just as easy to dig out papers from the legit journals to support wrong ideas. That is because it is so difficult to look at any paper in the context of the others it relates to, its intertextuality (another reason why scientists need to learn some media sociology).

The academic savants d’antan do not exist anymore with a sufficient breadth of application and knowledge and compendious memory of what they have read to see the patterns, and the contradictions, the references, the circularity of references, and all those other interconnections.

NeuroPrefix

I don’t understand a lot of words in this post by L. Rex, but I have to say I’m a bit uncomfortable with the extreme condemnation of neuroimaging (and by attribution the scientists who use this technique) in this post. I know a number of well respected leaders in basic and clinical neuroimaging, and in fact most of them have an excellent grasp on the literature at hand, and a lot of interesting findings tend to dovetail together rather than appear starkly divergent.

Some of these earlier comments seem equally over critical or fantastic in their suggestions.

M. Brett is brilliant and a giant in the field, but does he really believe that 70% of neuroimaging papers are junk (or “wrong”)? This is almost a wholesale indictment of scientists working in this field, and I think it has been grossly overstated.

Does Dr. Kanwisher truly think any young scientist would enter a field where they are required to retain enough funding to run a single study twice before publishing a single paper? Or should they split their 24 subjects into two studies of 12? Isn’t this what RFX modeling is for?

Like all disciplines, functional imaging is evolving and maturing. The fact that earlier papers used techniques or statistical methods that we now deem as wrongheaded or simplistic is a sign of progress, rather than some evidence of a rotten foundation.

I am most familiar with the literature in Alzheimer’s Disease, and while it has its methodological hurdles to consider (behaviour, atrophy, and perfusion confounds) there is still a strong degree of consensus, and a good feeling that the field is moving forward rather than backward (see recent papers from Sheline, Filippini, Sperling etc.). There are challenges, to be sure. But to suggest that the vast majority of published work in this field is bunk is to overstate the point. This kind of breathless condemnation overstates and overshadows the very real challenges that functional neuroimaging faces.

Russ Poldrack

It’s important to be clear about what you mean by “fail to replicate.” If you mean that another study of the same size would fail to find EXACTLY the same regions reported to be active in the original study, then the number is probably very high. However, if you mean that the study would fail to find activation in ANY of the regions reported to be active in the original study (i.e., the entire set of results is a Type I error), I think that the number is much lower. If it were not the case that most studies reported mostly replicable results, then there is no way that the predictions obtained by Yarkoni et al. (2011, Nature Methods) and presented at http://neurosynth.org would work – but they do. Such a high rate of non-replicability would also suggest that one could not accurately predict what cognitive task a person is engaging in based on an image of brain activity (using a classifier trained on other people), but one can (see our 2009 Psych Science paper). I thus agree with NeuroPrefix that while the substance of the critiques is important, it’s important not to get pulled into blanket condemnations of fMRI research.

I must say I don’t feel pulled into blanket condemnation of FMRI research.

The question I have taken to asking my colleagues is the following “Let us say you took a random sample of papers using functional MRI over the last five years. For each study in the sample, you repeated the same experiment. What proportion of your repeat experiments would substantially replicate the main findings of the original paper”.

That’s the question I was trying to answer with 30%.

I hope we could agree that there are some tasks which generate very reliable and replicable signal. If you are doing flashing checkerboards, faces, houses, motor tasks, a task requiring mental effort, you’ll likely get very predictable activation. However, if I took a random sample of FMRI papers, I bet that I would be looking at a large number of subtle subtractions.

So – Russ – if I asked you the question above – what answer would you give?

Russ Poldrack

Again, it depends on what you mean by “substantially replicate”. If by that you mean that the result that supports the main claim of the paper (e.g., activation in an anatomical region, even if not at exactly the same location) is found at the same threshold, I would guess that the number is at least 50%, but this would likely differ substantially across subfields. In areas where the work is more discovery-based/exploratory rather than hypothesis-driven, I think that it could be lower than 50%. In areas like vision or memory research where the research is very strongly hypothesis-driven, I think it would be higher.

Russ Poldrack

Matthew, I don’t quite understand the point of this. My feeling is that if you take a randomly chosen fmri paper published in a solid journal, then all else being equal it’s a coin flip as to whether the main result would replicate or not. Of course, if I know something about the paper (e.g., its authors, the topic, sample size, methods), then my confidence in its replicability would probably change (sometimes going up to 99.9%, and sometimes down to 0.1%), but otherwise I don’t quite understand the utility of throwing around these specific numbers.

Russ Poldrack

It’s definitely worrisome that there are a significant number of unreplicable findings floating around (whether the number is 10% or 90%). However, I’m not sure that it’s any worse for imaging studies in this respect than for other areas of science (cf. Ioannidis). In the end, I think that the bigger problem for imaging studies is that the fundamental approach is broken, such that even if all of the results were replicable, they couldn’t answer the question that we really want to answer, which is how the brain enables mental functions. Instead, what it gives us is a laundry list of functions that each area is supposed to support. This critique is fleshed out in my 2010 Perspectives on Psychological Science paper (http://pps.sagepub.com/content/5/6/753.abstract).

OK – to put it another way – my estimate was 30%, I suppose Nancy’s would be about the same. Yours would be 50%. We can’t be sure what the right answer is. That means that it is within the realms of possibility that 30% is the real figure, or even 10%, as you say.

It strikes me that even 50% is at least hugely wasteful, and must be slowing down the field a great deal.

The reason I pursued you on the figures was because, once we accept that it is likely that we are not doing a very good job, we really must start planning how to fix that.

I thought the question of replicability had been addressed empirically by a number of multisite fMRI studies, beginning with Casey et al. nearly 15 years ago (1998; NeuroImage, 8, 249-261). Over 4 sites with different scanners, they had different groups of participants perform the same tasks, and they concluded: “In sum, even when different image acquisitions and analytic tools were used we observed the same general findings across sites, and in our group analysis. This provides strong evidence for the reproducibility, reliability, and comparability of our fMRI results.”

So, in short, I don’t know where low estimates of replicability across fMRI studies come from, when the empirical data from a number of published multisite studies already indicates it is much higher.

The paper you cite describes a multicenter study of a simple working memory paradigm. If you’d asked “what proportion of multi-center studies of well-studied highly-activating working memory tasks would give substantially similar results” – I’d guess some very high number. That wouldn’t much influence my guess to the answer to this question:

“Let us say you took a random sample of papers using functional MRI over the last five years. For each study in the sample, you repeated the same experiment. What proportion of your repeat experiments would substantially replicate the main findings of the original paper”.

As the papers above demonstrate, Bayesian approaches to answering the question are not easy to devise. There will always be differences between studies in terms of what is known about the effects they investigate, especially in a nascent field such as neuroimaging, so replicability will reflect this. So again, I don’t see the supporting evidence for ‘guesstimates’ of low replicability across the board. Your comment about multisite studies using “well-studied highly-activating” tasks that wouldn’t influence your answer indicates you are modifying your estimate of replicability based on prior knowledge. A random sample should include some of these “well-studied highly-activating” tasks.

Yes, you are right, I am indeed modifying my answer about replicability based on what has to be a rough assessment of the different types of studies published.

So, if a large proportion of neuroimaging studies published in the last 5 years were multi-center replication studies of well-studied highly-activating tasks, then my estimate for replicability would be higher. I’m guessing that that is a tiny proportion of neuroimaging studies published in the last 5 years.

But, just to clarify, what is your estimate? What is that estimate based on?

I think replicability is a question to ask about a given task/effect, not a question to be applied across a field/discipline, so I disagree with the way you pose your question. Psychology is grappling with this issue with initiatives like psychfiledrawer.org. We can always create ‘neurofiledrawer.org’, and nominate the fMRI results we’d really like to see replicated, and then beg for the funding. As I mentioned in an earlier post on this blog, I think fMRI studies should use cognitive tasks with high validity and replicability.

If you conduct a random sample of the fMRI literature, I suspect well-studied, highly-activating tasks with known characteristics like the Stroop paradigm have been employed more frequently by researchers, so will influence the estimate across the board. Studies with unlikely, counter-intuitive results will have been published less frequently, by definition.

NeuroPrefix

I didn’t mean to suggest you were dismissing fMRI out of hand Matthew, but I definitely disagree with your assertion, and think it may be more useful if broken down by research paradigm and population.
I think for the bulk of highly studied visual paradigms (working memory, episodic memory, face v. house or lower level visual attention, parametric difficulty, task-induced deactivation ‘default mode’ patterns) the rate of replication would be on the order of 80%.
If you are talking about novel analysis methods, tricky paradigms that are sensitive to time of day, emotional state etc, or clinical populations, it is probably closer to 60% in my gut. Though again, the literature in some clinical spheres is quite robust in my view (clinically diagnosed Alzheimer’s being a clear example).

While we wring our hands over our overstating of the value of fMRI or its flaws, perhaps the most egregious misuse of functional imaging currently is carried out every day and doesn’t use fMRI at all.

The private clinics of Dr. Daniel Amen use SPECT scans to diagnose everything from AD to ADHD to school problems – perhaps most infamously using the phrase “ring of fire” to refer to an invented abnormality that seems to be present in the bulk of people he scans. Amen is quoted as saying that he has yet to see a normal brain, feeling that over 90% of the population has some abnormality (which is by definition impossible). In my view this is far more dangerous and worthy of widespread discussion and critique in the field, as it is both more insidious (SPECT carries a patina of validity due to its clinical use in other areas), and more dangerous (Amen requires patients pay for 2 SPECT scans one day apart, which is repeated dosing for no valid clinical benefit).

NeuroPrefix

I don’t have a way of objectively justifying my response, it is just my coarse perception of the field based on the papers I read. That said, I mostly read articles using well established paradigms, so this may colour my estimates (for example, I do very little reading of social or affective neuroscience and tend to be skeptical of studies that purport to examine psychological phenomena that are esoteric or hard to define, or invoke reverse inference).

NeuroPrefix

I don’t, again because the bulk of studies will use well established paradigms, most will be ‘mostly’ replicable in my view. But the entire discussion is probably moot. It is down to tools like neurosynth and ALE to help us establish the degree of cohesiveness in a given subfield….

I think then you would differ substantially from all of me, Nancy Kanwisher and Russ Poldrack.

Are you really saying that it is not interesting to estimate what proportion of studies would not replicate? For example, let’s say I am right and you are wrong – would you still be happy to base your hope on long term meta-analyses?

I was referred here after some posts of my own [wary skepticism… and a powerful message… ] about an article in the April American Journal of Psychiatry and its accompanying editorial [Brain Activity in Adolescent Major Depressive Disorder Before and After Fluoxetine Treatment and Imaging Adolescent Depression Treatment]. I can see why someone sent me this way – a nest of experts. From my general psychiatrist perspective, this study is exactly the kind of thing you’re addressing here – an uncorrected study being discussed as if its “uncorrectedness” is immaterial. The conclusion in the editorial, “…this research provides a powerful message that clinicians can give to families: adolescents with depression have abnormal neural circuitry, and treatment with fluoxetine will make the circuitry normal again.” seems scandalous to me. I’d appreciate hearing what people from the neuroimaging community have to say about it…

Federico Turkheimer

1. The first concern on the paper is not about the statistics but about the use of fMRI in pharmacological studies. Now, if I understand correctly, fluoxetine was administered to the depressed adolescents only. This may have implications because fMRI does not measure electric neural activity, nor related metabolic activity, nor ensuing changes in blood flow, but it measures shifts in the oxy/deoxyhemoglobin content due to the hemodynamics following neuronal activation (BOLD effect). I could not find in the Methods any detail but I suspect that fluoxetine was not stopped before the scan at eight weeks nor its plasma concentration measured. This is relevant because fluoxetine in low doses has been shown pre-clinically to relax vasculature (Chen MF, Huang YC, Long C, Yang HI, Lee HC, Chen PY, Hoffer BJ, Lee TJ. Bimodal effects of fluoxetine on cerebral nitrergic neurogenic vasodilation in porcine large cerebral arteries. Neuropharmacology. 2012 Mar;62(4):1651-8) so the effect measured here could simply be the result of the pharmacological relaxation on the vasculature. Studies of this kind (called pharmacological fMRI) need a-priori checks on such effects; even drugs such as minocycline, an antibiotic, have been shown to affect the coupling between neuronal activation and measured BOLD activity.

2. I read the statistical section a few times and I struggled a bit to understand what they did. My major concern is that they defined the regions of interest (regional masks that you superimpose on the data to obtain the average activity for an area of interest) based on the pixel-by-pixel analysis. In particular, they used the voxels where they saw an interaction in the first study to look at activity after 8 weeks. Now this is not correct. These voxels are the maxima of the distribution ensuing from that particular design in the first stage. Because they are maxima, the fact that activity was reduced after 8 weeks could simply be a regression to the mean effect.
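The regression-to-the-mean worry in point 2 can be sketched with a toy simulation. This is a deliberately extreme, hypothetical case rather than a reconstruction of the study's analysis: every voxel is pure Gaussian noise at both time points, so there is no true activation and no true treatment effect. The function name, voxel count and ROI size are invented for the illustration.

```python
import random
import statistics


def selected_voxel_change(n_voxels=10_000, n_top=100, seed=1):
    """Mean signal in 'peak' voxels at baseline and at follow-up, noise only."""
    rng = random.Random(seed)
    # Independent noise at both scans: nothing real changes between them.
    baseline = [rng.gauss(0.0, 1.0) for _ in range(n_voxels)]
    week8 = [rng.gauss(0.0, 1.0) for _ in range(n_voxels)]

    # Non-independent ROI: choose the voxels with the largest baseline values.
    order = sorted(range(n_voxels), key=lambda i: baseline[i], reverse=True)
    roi = order[:n_top]

    mean_before = statistics.fmean(baseline[i] for i in roi)
    mean_after = statistics.fmean(week8[i] for i in roi)
    return mean_before, mean_after


if __name__ == "__main__":
    before, after = selected_voxel_change()
    print(f"ROI mean at baseline: {before:.2f}, at week 8: {after:.2f}")
```

The ROI looks strongly "active" at baseline simply because it was selected for being extreme, and it falls back towards zero at follow-up without any drug being involved. Defining the ROI from independent data (a separate localiser or an anatomical mask) avoids this selection effect.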

Thanks. Super-helpful to this neuroimaging naive psychiatrist [me]. I had another run at it [a young science…]. Your comments may well explain some of the very peculiar things in the results. I’ll have to cogitate a bit. Thanks again…

I beg all you neuroscientists to keep close watch on neuroimaging research in psychiatric journals.

Now that the “chemical imbalance” theory has become a public embarrassment, sloppy neuroimaging methodology is being used to rationalize a new foundation for the biological basis of psychiatric conditions. This is not pure science for lengthy discussion and consideration, it immediately leaps to medication recommendations based on the imaging.

Curiously, even though data significance is hazy, the conclusions of the imaging studies are written to validate the pharmaceutical treatments that have been used (and vigorously questioned) for 20 years. The mystification of high technology cloaks exercises in rhetoric.

Marc Pelletier

In addition to Dr Turkheimer’s excellent points, I noticed the following problems:

1) if the task is known to activate the “normal” amygdala (as is the case with the fear faces), then activation of the amygdala in the control group is expected, and should be considered somehow a priori in the analytical method of comparison with the patients. Failure to activate the amygdala in the controls should suggest the hypothesis (if not the conclusion) that the experiment failed, not that we are in the presence of an “interesting result”.

2) few of the variables with a potential effect on brain vasculature were controlled (level of exercise, caffeine, time of experiment).

3) we are told that 2 subjects were excluded because of excessive movement; however, the threshold for exclusion is not given. Moreover, considering that one group is expected to react more to fear faces, this group is likely to show movement correlated with the task. Hence, movement should be controlled in the design, or at least compared with the control group to ensure that the significant interaction is not simply the result of correlated movement.

4) faces were shown for 2 seconds. Scanning of the brain also took 2 seconds. The two durations should not be the same, because basically the same parts of the brain were scanned during the same timeframe of the stimulus presentation. This may cause substantial methodological problems and limit the generalizability of the results.

Thank you, Dr. Turkheimer and Dr. Pelletier, for looking at this study.

If you have a chance, I’m sure the American Journal of Psychiatry would appreciate an enlightening letter to the editor, as would Kathryn R. Cullen, M.D. (http://www.psychiatry.umn.edu/faculty/Cullen/index.htm), who wrote the enthusiastic editorial suggesting the imaging study found biomarkers for depression in adolescents.

Tomorrow, or even today, a doctor will diagnose a child with “diseased brain circuits” on the basis of this study and prescribe fluoxetine to correct the disordered brain activity perceived in the fMRI.

Thanks to everyone for this insightful discussion of a series of difficult problems in neuroimaging!

Regarding the issue of reliability, it is worth discussing here that there are already a series of statistical efforts over the last 15-20 years focused explicitly on resampling frameworks in neuroimaging. For example, multivariate techniques such as the partial least squares (PLS) and NPAIRS frameworks strictly employ resampling at multiple levels of analysis. In particular, PLS utilizes a two-stage framework, using permutation testing (e.g., 1000 tests) to test the singular value of each latent variable, and using bootstrapping with replacement (e.g., 1000 resamples) to estimate the reliability of each voxel, and of effects of interest. In my view, the use of bootstrapping is particularly powerful. Nancy’s suggestion that we should try to provide at least one replication of each major finding in the same paper is clearly a great suggestion if one has the cash, but of course, this replication establishes the reliability of a result across only two possible samples that anyone could have taken. Bootstrapping iterates through as many possible resamples as you care to examine (in my studies, always 1000, as results tend to stabilize at this point), bounded primarily by the size of, and variability in, your sample.
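For readers who have not met permutation testing, here is a minimal sketch of the idea. It is not the PLS procedure itself (which permutes condition labels and tests the singular value of each latent variable); as a stand-in, it applies a sign-flip permutation test to a list of per-subject condition differences. The function name and the example numbers are hypothetical.

```python
import random
import statistics


def sign_flip_p_value(differences, n_perm=1000, seed=0):
    """Two-sided permutation p-value for 'the mean difference is zero'."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(differences))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each subject's difference is equally likely
        # to have had the opposite sign.
        flipped = [d if rng.random() < 0.5 else -d for d in differences]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # Count the observed data as one permutation so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    consistent = [0.8, 1.1, 0.9, 1.2, 1.0, 0.95, 1.05, 1.15, 0.85, 1.3, 0.7, 1.25]
    inconsistent = [-1.0, 1.0, -0.8, 0.9, -1.1, 1.2, -0.9, 0.8, -1.0, 1.0]
    print(sign_flip_p_value(consistent), sign_flip_p_value(inconsistent))
```

The appeal is that the null distribution is built from the data themselves rather than assumed, at the cost of choosing how many permutations to run.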

Another key point about the PLS framework is that bootstrap testing at the voxel level does not at all rely on multiple comparisons correction. The idea is that each voxel is compared to itself across 1000 resamples of the data; so, if the same voxel does not continue to appear reliably across resamples, then it is deemed “unreliable.” In this way, it does not matter if someone has a single ROI in mind, or is doing whole-brain: the result is always unbiased. If key goals in neuroimaging are to establish what areas reliably activate when we administer a given task, and ultimately to achieve a meta-analytic view of brain function (enabling entirely useful projects such as Tal, Russ et al.’s Neurosynth project to work as well as possible), we have to ensure that unbiased methods of analyzing and reporting data are employed. If multiple comparisons correction is employed in a typical way, and group X conducts an ROI study (less correction) and group Y conducts a whole-brain study (more correction), we have absolutely no guarantee that the same result will be reported, even if the same exact result actually occurred in both studies. To the extent this occurs, we are losing the battle to understand brain function for no good reason whatsoever.
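
To make the ROI versus whole-brain mismatch concrete, here is a small sketch. It assumes a plain Bonferroni correction, which is only one of several corrections used in practice, and the voxel counts and p-value are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-voxel p-value threshold under a plain Bonferroni correction."""
    return alpha / n_tests


def survives(p_value, alpha, n_tests):
    """True if a voxel's p-value passes the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Hypothetical numbers: the same voxel, with the same p-value, in two studies.
p_voxel = 1e-4
roi_voxels = 100             # group X: small region of interest
whole_brain_voxels = 50_000  # group Y: whole-brain analysis

print("ROI study reports the voxel:", survives(p_voxel, 0.05, roi_voxels))
print("Whole-brain study reports it:", survives(p_voxel, 0.05, whole_brain_voxels))
```

With these numbers the ROI threshold is 0.05 / 100 = 0.0005 and the whole-brain threshold is 0.05 / 50,000 = 0.000001, so an identical voxel-level result is reported by one group and not by the other.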

In the end, the great thing is that we don’t have to reinvent the wheel with reliability/correction in neuroimaging…we have already made progress. We do however need to begin a more earnest discussion in the literature about how to best develop and optimize the use and choice of resampling frameworks. These frameworks are not a panacea, and like any technique, require threshold-type choices to be made. However, their employment probably would get us closer to where we want to be.

Andy Yeung

This discussion thread is really informative: it summarizes a wide range of opinions and lists ample examples from the literature. I hope we can have an update of this thread in the near future, as it dates from 3 years ago.