Research Reproducibility from MSWord

By Katharine Miller

A particular mashup of data and tools produces the unique results found in each computational biology publication. Now, researchers have developed a model system that gives readers—especially those lacking programming skills—the tools, data, and parameters they’d need to reproduce those results. Dubbed a “reproducible research system” (RRS), it lets the reader replicate original computational research directly from a Microsoft Word document.

“This effort was meant to show that the technology exists to make research reproducible by the non-programming user,” says Jill Mesirov, PhD, director of computational biology and bioinformatics at the Broad Institute of MIT and Harvard. The work was described in a policy forum in Science in January 2010.

Often, to reproduce a computational biology research result, one must contact the original researcher to request the data and tools. Even then, the precise steps taken might be lost or unrecoverable. People have been struggling with this problem for more than twenty years, and several reproducible research systems already exist, but they are not widely used and require the user to do things that are “very much like coding,” Mesirov says.

The RRS concept, as proposed by Mesirov and her colleagues, consists of two parts: an environment for doing the computational work that tracks the data, analyses and results and then packages them for redistribution; and a publisher, such as standard word-processing software.
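To make the first of those two parts concrete, here is a minimal Python sketch of an environment that records each analysis step and packages the record for redistribution. It is an illustration only, not GenePattern's implementation or API: the `Session` class, the `replay` function, and the toy `threshold` and `scale` tools are hypothetical names invented for this example, and JSON stands in for whatever packaging format a real system would use.

```python
import hashlib
import json


def _digest(obj):
    """Checksum of any JSON-serializable value (hypothetical helper)."""
    text = json.dumps(obj, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Session:
    """Hypothetical recorder: logs every step so the analysis can be replayed."""

    def __init__(self):
        self.steps = []

    def run(self, func, data, **params):
        """Run one tool and record its name, parameters and data checksums."""
        result = func(data, **params)
        self.steps.append({
            "tool": func.__name__,
            "params": params,
            "input_sha256": _digest(data),
            "output_sha256": _digest(result),
        })
        return result

    def package(self):
        """Serialize the recorded steps so they can travel with a document."""
        return json.dumps(self.steps, sort_keys=True)


def replay(package, data, tools):
    """Re-run the recorded steps and check each result against its checksum."""
    for step in json.loads(package):
        if _digest(data) != step["input_sha256"]:
            raise ValueError("input differs from the recorded run: " + step["tool"])
        data = tools[step["tool"]](data, **step["params"])
        if _digest(data) != step["output_sha256"]:
            raise ValueError("output differs from the recorded run: " + step["tool"])
    return data


# Two toy "analysis tools" standing in for real ones (assumptions for the example).
def threshold(values, floor):
    return [max(v, floor) for v in values]


def scale(values, factor):
    return [v * factor for v in values]


if __name__ == "__main__":
    tools = {"threshold": threshold, "scale": scale}
    raw = [1, 5, 20]

    session = Session()
    floored = session.run(threshold, raw, floor=10)
    final = session.run(scale, floored, factor=0.5)

    # A reader holding the package and the raw data can repeat the analysis.
    assert replay(session.package(), raw, tools) == final
```

In this sketch the reader never has to know which parameters were used; they travel inside the package, and the checksums flag any run whose data or results differ from the original.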

As an example system, Mesirov and her colleagues used GenePattern, a genomic analysis platform that provides access to more than 100 tools for gene expression analysis, proteomics, SNP analysis and common data processing tasks. In GenePattern, users’ sessions can be captured and replayed. “The idea was to take the captured user session in GenePattern—with all the parameters and datasets—and embed that in Microsoft Word,” Mesirov says. Luckily, from a technical point of view, the web services architecture of GenePattern and the XML capabilities of Word “kind of meshed,” she says.

With funding from Microsoft, the team developed the GenePattern RRS. A user can link text, tables and figures in a Word document to previously executed GenePattern pipelines, and readers can open those pipelines directly from the document.

Mesirov invites people to try the GenePattern RRS (available online at http://genepatternwordaddin.codeplex.com/) and to develop similar systems for other tools. “It’s not one-size-fits-all,” Mesirov says. “This is not about GenePattern or even this instantiation of reproducible research. It’s about the need and the fact that you want reproducible research accessible to people who don’t write code.”

It’s an exciting development, says Kevin Coombes, PhD, associate professor of biostatistics and applied mathematics at the University of Texas M. D. Anderson Cancer Center, where master’s students are routinely trained to use a reproducible research system called Sweave. Microsoft Word is already on people’s computers, so the GenePattern RRS is potentially more useful than systems like Sweave that require some programming. At the same time, he says, there are sociological hurdles to adopting reproducible research systems. “The software to unite these things is necessary to do reproducible research, but not sufficient. You have to get people to buy into it.”