A new study in Clinical Chemistry paints an alarming picture of how often scientists fail to deposit data they’re supposed to — and, perhaps not surprisingly, papers whose authors did submit such data scored higher on a quality scale than those whose authors didn’t.

Ken Witwer, a pathobiologist at Johns Hopkins, was concerned that many studies involving microarray-based microRNA (miRNA) profiling weren’t complying with the Minimum Information About a Microarray Experiment (MIAME) standards supposedly required by journals. So he looked at 127 such papers published between July 2011 and April 2012 in journals including PLOS ONE, the Journal of Biological Chemistry, Blood, and Clinical Chemistry, assigning each one a quality score and checking whether the authors had followed the guidelines.

What he uncovered wasn’t pretty — and has already led to a retraction. From the abstract:

Overall, data submission was reported at publication for 40% of all articles, and almost 75% of articles were MIAME noncompliant. On average, articles that included full data submission scored significantly higher on a quality metric than articles with limited or no data submission, and studies with adequate description of methods disproportionately included larger numbers of experimental repeats. Finally, for several articles that were not MIAME compliant, data reanalysis revealed less than complete support for the published conclusions, in 1 case leading to retraction.

Here’s that retraction, for “Host cells respond to exogenous infectious agents such as viruses, including HIV-1,” published in PLOS ONE in 2011:

The authors wish to retract this article for the following reason:

Upon re-evaluation of the analyses performed, we discovered an error in the data fed into the software, which resulted in incorrect results in Table 2 and Figure 2. During the initial analysis, we eliminated miRNAs if they showed an expression CT value of 35 in over 75% of the samples. This decision was based on the instructions from the software during the initial data feed process for the selection of particular miRNAs (rows) for exclusion. Unfortunately, the software included the excluded miRNAs as controls along with the endogenous controls and analyzed the data. As a result, the analyses identified miRNAs that are not statistically significant.
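For readers unfamiliar with this kind of preprocessing, the exclusion rule the notice describes can be sketched in a few lines. This is a hypothetical illustration only — the function name, data layout, and thresholds are assumptions based on the notice’s description, not the authors’ actual pipeline. The point of failure was that rows excluded by this filter were later fed back into the analysis as controls rather than dropped.

```python
# Hypothetical sketch of the filtering step described in the retraction notice:
# drop any miRNA whose CT value sits at the detection ceiling (35) in more
# than 75% of samples. High CT means low/undetectable expression.

CT_CEILING = 35.0
MAX_UNDETECTED_FRACTION = 0.75

def filter_mirnas(ct_table):
    """ct_table: dict mapping miRNA name -> list of CT values (one per sample).
    Returns (kept, excluded) dicts."""
    kept, excluded = {}, {}
    for mirna, cts in ct_table.items():
        undetected = sum(1 for ct in cts if ct >= CT_CEILING)
        if undetected / len(cts) > MAX_UNDETECTED_FRACTION:
            # These rows should leave the analysis entirely; the error the
            # notice describes was treating them as controls downstream.
            excluded[mirna] = cts
        else:
            kept[mirna] = cts
    return kept, excluded
```

The kind of error the notice describes is exactly what a reviewer with access to the raw data could catch by re-running the filter and checking which rows ended up in the control set.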

The multiple corrections on the paper reveal a correspondence between Witwer and the paper’s corresponding author, Velpandi Ayyavoo, dating back to October 2011. The original study has been cited six times, according to Thomson Scientific’s Web of Knowledge, including once by Witwer and colleagues.

As Witwer writes:

Reporting and quality issues were found for articles in journals with impact factors ranging from approximately 1 to 30, with no obvious association between impact factor and quality score, indicating the endemic nature of the problem. However, other associations were clear. MIAME noncompliant studies were twice as likely to arise from array experiments with n of 1. Articles with vague descriptions of experimental design were disproportionately those with few experimental replicates. Studies with fully submitted data received significantly higher mean quality scores than articles with partial submitted data or no data deposition.

Witwer has a number of suggestions, many of which come down to researchers adopting a different ethos. We smiled at this passage:

Unless I have personally and fully funded my laboratory and research out-of-pocket, my data do not belong to me. They belong to my institution and to the taxpayer, and I have no right to withhold them to prevent another laboratory from analyzing my data in a way I did not consider.

Not surprisingly, the study is accompanied by an editorial titled “More Data, Please!” by Keith Baggerly, of MD Anderson. Retraction Watch readers may recall that Baggerly, along with a colleague, was the bioinformatics specialist who uncovered a litany of problems in Anil Potti’s work. Baggerly writes of Witwer’s analysis:

I echo his concerns and agree the problems can and should be addressed. Data reporting problems are affecting a number of areas beyond miRNA studies.

Ideally, reproduction (which should be faster and cheaper) should precede replication as a sanity check. Poor data access hinders both. Even when data are supplied, reproducibility should not be presumed; in their survey of 18 microarray studies, Ioannidis et al. (5) were able to access data for 10 studies but could reproduce quantitative results for just 2.

Given this poor rate of reproduction, poor replication rates for even “landmark” studies, such as the 6 of 53 reported by Begley and Ellis (6), are not a huge surprise.

He says the “implications can be severe,” but that “the problem is fixable.”

I want to emphasize our duty as reviewers. Witwer’s recommendation #2 is right on: “At least 1 scientist with experience with large data set analysis should be involved in the review process for manuscripts reporting miRNA (and other) profiling results. This individual should verify the raw and normalized data or, ideally, perform a rapid analysis check. A review should not be considered complete until this is done.” Get that? Every paper Witwer reviewed would have publicly available data in good order – except that reviewers have failed at their jobs. If you as a reviewer cannot actually obtain the data and do a few checks on it, then please say that it is impossible to perform an adequate review of the paper. Also check that you can tell which sample in the data set is which sample in the paper. If the samples are from patients and some analysis with clinical variables is done, those data should also be available. If the authors cluster or otherwise group samples, you should be able to tell which sample was in which group.

I do have an ax to grind: I could really use those data sets some days, usually to verify a result from a group where I work, but sometimes just to tell whether what the paper is reporting is even remotely true. As Witwer shows, usually you can’t even get the data, or if you can, you can’t reproduce what the authors are saying. And I really hate the cheat where the paper fits models to associate markers with outcomes (e.g., Cox models of survival) but doesn’t cough up the outcome or other clinical data – even in cases where the authors have themselves benefited from someone else having done the right thing by making their data public.

I write to journals to report such problems several times a year, and that sometimes has the effect of the data later appearing. But I only do that in a minority of such situations, and often not for the papers where I need the data most – because, in some narrow fields, the paper’s authors are too likely to know it was me (or my friends) who made the initial complaint. I sometimes plead with the authors personally, usually without success. (I consider an offer of coauthorship as a condition for access to be a failure. It’s like extortion. If I accept such an offer I am rewarding their bad behavior. It would be unethical of me.) -Rork Kuick

PS: What Witwer says about trusting third-party processing is good advice. It often sucks, and it sometimes has frank errors. The little Baggerly piece was also outstanding, as you’d expect.

It’s unfortunate that articles extolling the virtues of open data are themselves behind a paywall. Fortunately, there is a CiteULike group for this purpose, the Open Access Irony Award: http://www.citeulike.org/group/13803.

I disagree. Open data is one thing; open access to the article itself is something else. In our field (evolutionary biology), the subscription journals are pioneering mandatory data archiving, and the OA journals’ policies are lagging behind. See here: http://arxiv.org/abs/1301.3744

I think the primary issue here is that not all journals have declared that they require submissions to adhere to the MIAME guidelines. For many, this is only a recommendation and not a requirement, and the reminder to authors tends to get lost in the middle of the instructions on the first submission page.

So I wonder: whose responsibility is it to ensure that papers adhere to the MIAME guidelines? If a journal has declared that it requires authors to adhere to them, then submissions that fail on this should be rejected outright.

Perhaps the online submission forms should include a line: ‘Does this work report on a microarray?’ Yes/No; ‘If yes, have the array data been made available in accordance with the MIAME guidelines?’

“then submissions that fail on this should be rejected outright.” By the reviewer, who took the time to check, or by the journal, which took the time to check? Right now it’s clear neither wants to do it very often. All papers deserve good statistical review, too. The problem is, it’s expensive for somebody. Who has spare days to review?

PS: I’m not sure what aspects of the submissions that were actually made caused them to be declared noncompliant (about 35% of all papers, against the 40% not even trying). Maybe I need to read the supplements. Maybe it’s not little details that often, but big whoppers. Sometimes it’s as bad as the authors saying the data are in GEO, but you go there and find they aren’t. Witwer gave such an example, I think.

Many thanks to Ivan and Adam for their tireless and important efforts here.

I appreciate the irony of the article being behind a paywall, but Tim Vines is right: article and data access are separate subjects. It’s important to remember that a journal paywall is temporary, and there are always ways around it. For example, anyone who doesn’t have institutional access can contact me for a reprint for educational purposes. I’ve sent quite a few during the last week. A twelve-month wait for public access is perhaps not ideal, but, in my view, it’s an acceptable price to pay for professional publishing. Clinical Chemistry’s handling of the peer review and publication process in my case was exemplary, and everyone involved – from reviewers and staff to the handling editor and senior editors – added important value to my piece. I would not expect this from some OA journals. And I certainly wouldn’t expect it at no cost to my research budget.