35,000 papers may need to be retracted for image doctoring, says new paper

Yes, you read that headline right.

In a new preprint posted to bioRxiv, image sleuths scanned hundreds of papers published over a seven-year period in Molecular and Cellular Biology (MCB), published by the American Society for Microbiology (ASM). The researchers — Arturo Casadevall of Johns Hopkins University, Elisabeth Bik of uBiome, Ferric Fang of the University of Washington (also on the board of directors of our parent non-profit organization), Roger Davis of the University of Massachusetts (and former MCB editor), and Amy Kullas, ASM’s publication ethics manager — found 59 potentially problematic papers, of which five were retracted. Extrapolating from these findings and those of another paper that scanned duplication rates, the researchers propose that tens of thousands of papers might need to be purged from the literature. That 35,000 figure is double the amount of retractions we’ve tallied so far in our database, which goes back to the 1970s. We spoke with the authors about their findings — and how to prevent bad images from getting published in the first place.

Retraction Watch: You found 59 potential instances of inappropriate duplication — how did you define this, and validate that the images were problematic?

Arturo Casadevall: Images were spotted by Elisabeth, who has a remarkable ability to detect problems based on patterns. She then sent them to Ferric Fang and me, and we all needed to agree for the figure to be classified as problematic.

Elisabeth Bik: I used the exact same criteria as in the mBio study in which I scanned 20,000 papers [described here by RW]. We flagged three types of duplications:

duplication of the exact same panel (e.g. a Western blot strip or a photo of cells) within the same paper, but that represented different experiments.

duplication of a panel with a shift (e.g. 2 photos that show an area of overlap, or a Western blot that was shifted or rotated)

duplication within a photo (e.g. 2 lanes within the same Western blot, the same cell visible multiple times within the same photo)

Note that these were all apparent duplications – instances where I judged that bands or photos looked unexpectedly similar. I am not perfect, but in case of doubt I would not flag it.

I did not flag apparent splices for this set.

Roger Davis: All of the identified image anomalies were then subjected to [Office of Research Integrity] forensic image analysis to formally confirm the presence of image problems.

RW: Among the 59 cases, 42 were corrected, but only five papers have been retracted. Does that surprise and/or disappoint you?

AC: I think we expected that most image problems were the result of errors in assembling figures, so the roughly 10% retraction rate was not surprising.

RD: Authors of papers with image problems were contacted with a request for the original data. The constructed figures and original data were then examined using ORI forensic tools. Decisions to take no action, to correct, or to retract were made by rigorously following COPE guidelines.

EB: We trusted the authors when they said that the duplication was the result of an error. Our goal is to make sure that the science is correct, not to punish. In most cases, the error could be corrected, so that others will be able to use the right datasets for future experiments and citations. The cases that were retracted were the papers where we felt that there were too many errors to be corrected, or where misconduct was suspected.

RW: No action was taken in 12 papers. The reasons you state are: “origin from laboratories that had closed (2 papers), resolution of the issue in correspondence (4 papers), and occurrence of the event more than six years earlier (6 papers).” Are these reasonable explanations, in your opinion?

AC: The ASM had a policy to investigate allegations that were not older than 6 years. I think this is reasonable. The 6 year limit is based on the ORI statute of limitations – this is the justification employed by ASM.

Ferric Fang: The ORI established a 6-year limit in 2005 after learning from experience that it is impractical to pursue allegations of misconduct when more than 6 years have elapsed, and ASM Journals has had a similar experience. The NIH only requires investigators to retain research records for 3 years from the date of submission of the final financial report, and the NSF similarly requires retention of records for 3 years after the submission of “all required reports.” This underscores the importance of trying to address questionable data in a timely manner.

EB: This age limit is used by many publishers, not just ASM, and there are several good reasons to use it. It is very hard to pursue those older cases. Most labs do not save lab notebooks and blots/films that are older than that time frame, and postdocs and graduate students have already moved on. It is almost impossible to track down errors that happened that long ago. Papers older than 5 years also will have a lower chance of being cited, and other studies might already have either confirmed or rejected the findings from older papers with duplicated images.

There are some duplications in older papers (not just MCB’s) that are suggestive of an intention to mislead and that might benefit from being discussed or flagged, but this is my personal opinion, not necessarily that of ASM or my co-authors.

RW: You extrapolate that if 10% of the MCB papers needed to be retracted for image duplication, then 35,000 papers throughout the literature may need the same. How did you perform that calculation, and what assumptions is it based on?

EB: We extrapolated the results from previous studies to the rest of the literature. In our previous study, in which we analyzed 20,000 papers, we found that 3.8% contained duplicated images. We know that the percentage of duplicated images varies per journal, for a wide variety of reasons (different editorial processes, variable levels of peer review, different demographics of the authors). Since this percentage was calculated on papers from 40 different journals with different impact factors, this percentage serves as a reasonable representation of the whole body of biomedical literature. The 10.6% is the percentage of papers that were retracted in the MCB dataset. Granted, this was a much smaller dataset than the one from the mBio paper, but it was a set that was seriously looked at.

If there are 8,778,928 biomedical publications indexed in PubMed from 2009-2016, and 3.8% contain a problematic image, and 10.6% (CI 1.5-19.8%) of that group contain images of sufficient concern to warrant retraction, then we can estimate that approximately 35,000 (CI 6,584-86,911) papers are candidates for retraction due to image duplication.
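The point estimate in that calculation can be checked with a few lines of arithmetic. The sketch below is only an illustration, not the authors' code: the function and constant names are hypothetical, the three input figures are the ones quoted in the interview, and it does not attempt to reproduce the confidence interval, since the interview does not say how that was derived.

```python
def estimate_retraction_candidates(total_papers: int,
                                   duplication_rate: float,
                                   retraction_rate: float) -> float:
    """Point estimate: total papers x share with duplicated images
    x share of those papers judged to warrant retraction."""
    return total_papers * duplication_rate * retraction_rate


# Figures quoted in the interview (names are illustrative only).
PUBMED_PAPERS_2009_2016 = 8_778_928   # biomedical papers indexed in PubMed
DUPLICATION_RATE = 0.038              # 3.8% from the earlier 20,000-paper study
RETRACTION_RATE = 0.106               # 10.6% from the MCB dataset

papers_with_duplications = PUBMED_PAPERS_2009_2016 * DUPLICATION_RATE
candidates = estimate_retraction_candidates(
    PUBMED_PAPERS_2009_2016, DUPLICATION_RATE, RETRACTION_RATE
)
share_of_literature = candidates / PUBMED_PAPERS_2009_2016

print(f"Papers with duplicated images: {papers_with_duplications:,.0f}")
print(f"Candidates for retraction:     {candidates:,.0f}")
print(f"Share of indexed papers:       {share_of_literature:.2%}")
```

With these inputs the estimate comes out a little above 35,000 papers, or roughly 0.4% of the indexed literature, which matches the rounded figure in the headline.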

RW: 35,000 papers sounds like a lot but, as you note, it is a small fraction of the total number of papers published. Should working scientists, who rely on the integrity of the scientific literature, feel concerned about the number of potentially problematic papers that appear, for all intents and purposes, 100% valid?

AC: The number is large in absolute terms, but the papers that may be candidates for retraction are a small fraction of the total literature. I think scientists need to be aware that there are problem papers out there and just be cautious with any published information. To me being cautious is always good scientific practice.

EB: Agreed. Errors can be found anywhere, not just in scientific papers. It is reassuring to know that most are the result of errors, not science misconduct. Studies like ours are also meant to raise awareness among editors and peer reviewers. Catching these errors before publication is a much better strategy than after publication. In this current study we show that investing some additional time during the editorial process to screen for image problems is worth the effort, and can save time down the road, in case duplications are discovered after publication. I hope that our study will result in more journals following in the footsteps of ASM by starting to pay attention to these duplications and other image problems, before they publish their papers.

RW: You note that it takes six hours for editorial staffers to address image issues in a published paper, but 30 minutes to screen images before publication. That’s a powerful demonstration of the benefits of screening. What barriers could prevent that from happening?

AC: The 30 minutes was the time taken by the production department to screen the figures. I think the major impediment to having screening implemented widely is cost and finding the people with the right expertise.

Amy Kullas: The 30 minutes does not refer to editorial time, but the time taken by the ASM image specialists to screen the figures.

RW: Even after screening was introduced at MCB, you still found that 4% of papers included inappropriate manipulations. How should we think about that?

AC: Screening is not perfect.

FF: The MCB pre-publication screening process is not designed to detect the kind of image duplication that Elies Bik is able to detect. The MCB staff screen for obvious instances of splicing, etc. that do not comply with journal guidelines for image presentation. The screening may incidentally deter or detect other types of image problem but it is not designed to do so.

EB: I expect this number to continue to go down. We are just starting to recognize these problems, and both peer reviewers and editors are getting better at recognizing them. I also expect, unfortunately, that people who really want to commit science misconduct will get better at photoshopping and generate images that cannot be recognized as fake using the human eye.

RW: Are there ways to reduce the rate of image duplication, besides pre-publication screening?

AC: Yes, we suggest that one mechanism for reducing these types of problems is to have someone else in the group assemble the figures. At the very least that would mean a second set of eyes looking at the figures.

EB: Other solutions would be to better train peer reviewers to recognize duplications, and to develop software to detect manipulated images. We also need to raise more awareness and point out that these duplications are not allowed, so that authors can recognize these issues before submitting the manuscripts, and even adopt policies to not allow photoshopping or other science misconduct practices in their lab.

22 thoughts on “35,000 papers may need to be retracted for image doctoring, says new paper”

I’m concerned that this only looked for duplications within the same paper, but of course this was still a humongous amount of work. Extending to comparisons across papers from the same authors would be exponentially harder.

However, as the great Sam W. Lee of Harvard (an accomplished scientist) demonstrated several years ago, data re-use across papers is also an issue for papers in MCB….

No correction or retraction yet for this paper (http://mcb.asm.org/content/20/5/1723.long). I wonder if it came up in this screen as problematic for other reasons? What this suggests, is that the 3 types of image duplication screened here are just the tip of the iceberg, and the 35,000 number will likely need to be rounded up, not down, as image analysis methods improve.

While I understand the notion of the extrapolation, it’s a serious matter. It’s an indictment of science in a widespread manner. At the very least, a confirmation study should be made. In addition, an attempt should be made to determine factors associated with fraud – younger researchers, those from specific academic disciplines, etc.

In addition to the comments noted above, they have identified a proportion, and proportions have CIs. The CI captures the variability or uncertainty associated with the estimation. Thus, a RANGE of values of possibly suspect papers is a better indicator of concerns.

There should be no time limit on flagging papers with dodgy blot and gel images and working to get them retracted. Author and institutional stonewalling often push investigations beyond 6 years from publication date for questioned papers. How long did it take to figure out the Piltdown hoax? (41 years according to the illustrious Wikipedia.) The correction of findings in science can not include time limits.

I am glad that this type of study is being performed. In light of the reproducibility crisis in research, it is highly relevant.

I agree with Paul Brookes’ concern.

I am afraid that most published articles are constructed stories, made up of data put together more or less randomly, reflecting not the truth but a made-up idea. Data duplications are spotted because mistakes are sometimes made during the construction of these beautiful stories, which often come with a full mechanistic explanation.

How can I be so sure?

First of all, most of the published research is not possible to reproduce, not even replicate. This is well documented.

Duplication of data is just one of many ways to construct a story, although a stupid one. It can also be challenging to detect.
By cross-checking articles published within research groups that have an unusually high output, one can detect more duplications.

When raw data is provided (e.g. Nature Cell Biology demand whole western immunoblot membranes presented as supplementary files) it is not unusual to see that the data presented in articles does not reflect what is enclosed as raw data.

The existing anarchy in research, where misconduct problems are swept under the carpet, success is measured in impact factors and reproducibility, scientific truth and integrity seem to be forgotten terms, does not help to do better.

I might be playing devil’s advocate here, but the further we go back in time, the less likely it is that authors can present any sort of raw data or explanation: some of the co-authors have moved, retired, passed away, or become unreachable for some other reason. The “fudged image crisis” has to be treated prospectively, placing the emphasis on scrutinising contemporary work. The “in silico” Westerns of the 90s have to be pointed out, the affected papers flagged, but the focus should remain on recent articles and, most importantly, submitted papers.

In reply to BB June 30, 2018 at 4:06 pm.
We will all die. That is not a reason not to correct the literature. We make judgments based on the evidence. Sometimes the evidence is overwhelming that the images could not have represented reality.
Piltdown man was exposed as a hoax, and commonly believed to be a hoax, after its “discoverer” had died.

Correct, but read lines 211-215 in Bik’s paper. The 35,000 estimate is only 0.38% of the indexed papers because Bik et al. fairly assumed that not all papers with duplicated images need to be retracted. That is a pretty small number. “You read that right.”

Correct, but Bik et al.’s calculation was based on their (fair) estimate that only 10% of those papers with duplicated images merited retraction. (See lines 210-215 in their preprint.) 3.8% of the baseline indexed publication number used by Bik et al. would be >>330,000. (Line 211).

In fact, Bik et al.’s estimate is much less than the general level of sociopathy in the US, less than the 1 in 78 false images submitted for insurance claims, less than JCB and other journals’ screening numbers identifying serious problems, and less than the estimate in the Nature-published Gallup study surveying misconduct by my former colleague Sandy Titus et al.*

RW’s headline could just have well reported that, overall, Science is doing its job on the integrity front! Granted, truth is somewhere between the extremes, but certainly not in the hyperventilation.

*(truth in advertising: “Dr. Harvey” is the phantom investigator whose name was on the forensics room at ORI)

Richard, if you start with “purist’s” position that any questioned image is an assault on research integrity through sloppiness or innocent error, I’d agree with your “order of magnitude” assertion. But if you start from an ORI standard of intentional falsification in proposing, reporting, or conducting research, the incidence is quite a bit less. The problem is that journals and institutions, in professing the former, defer to the latter (ORI) to make those decisions. So what numbers are right to use in this discussion?

The incidence is certainly greater, as Bik et al. themselves (and indeed others before them) recognize. (For example, just think about images as “graphical representations of data recorded by an instrument that contain signs of their own inauthenticity,” and it is clear that photos or blots are just part of the larger detectable problem, if one exists.) But in the meantime, IMHO, it would serve the scientific community better if the headlines stressed a measured and balanced perspective, with less sensationalism.

The only novel feature introduced by the focused experimental design of this paper was the opportunity to consider the “risk to benefit” ratio, so to speak, in terms of the cost/motivation ratio, for a specific journal to screen. I wish they had explored that issue more.

Well, while not its main purpose, I was hoping (after all the fray) that PubPeer was essentially doing just that. There, screening has already been done through crowdsourcing.
Can one start with PubPeer to ask questions about the incidence of questioned data?

Good advice for all. However, I “gave” at work (more than a 100 fold), and Bik et al kindly cited use of my methods. My getting into the “discovery” business would impeach the latter. But thanks for the plug.

“Briefly, one person (EMB) scanned published papers by eye for image duplications in any photographic images or FACS plots.”

I hate to bring up Google, but their image matching algorithm is pretty good. It seems to me that given access to the electronic versions of journals, one might easily use such an algorithm (with modifications) to search for matches.

They checked for shifts and rotations, and on examining their prior paper, it seems to me this process could be automated to some extent, and questionable images could be flagged (the originals, including tags for the articles from which they came, and a copy with the questionable areas highlighted for human eyes to examine and use judgment).

I’m not sure if “rotations” includes transposition of axes and relabeling; I don’t think that’s mentioned.

Does anyone know of a project currently working on automatic detection of duplication of images in scientific journals?