I'm pretty proud of our most recently posted paper, which is on a
sequence analysis concept we call digital normalization. I think the paper is
pretty kick-ass, but so is the way in which we're approaching
replication. This blog post is about the latter.

(Quick note re "replication" vs "reproduction": The distinction
between replication and reproducibility is, from what I understand,
that "replicable" means "other people get exactly the same results
when doing exactly the same thing", while "reproducible" means
"something similar happens in other people's hands". The latter is
far stronger, in general, because it indicates that your results are
not merely some quirk of your setup and may actually be right.)

So what does our approach actually look like? For this paper, we are
providing:

a link to the paper itself, in preprint form, stored at the arXiv
site;

a tutorial for running the software on a Linux machine hosted in
the Amazon cloud;

a git repository for the software itself (hosted on github);

a git repository for the LaTeX paper and analysis scripts (also
hosted on github), including an ipython notebook for generating the
figures (more about that in my next blog post);

instructions on how to start up an EC2 cloud instance, install the
software and paper pipeline, and build most of the analyses and all
of the figures from scratch;

the data necessary to run the pipeline;

some of the output data discussed in the paper.

(Whew, it makes me a little tired just to type all that...)

What this means is that you can regenerate substantial amounts (but
not all) of the data and analyses underlying the paper from scratch,
all on your own, on a machine that you can rent for something like 50
cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to
re-run everything, plus some incidental costs for things like downloads.)
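(If you've never touched EC2 before: the tutorial mentioned above is
the authoritative walkthrough, but here's a rough sketch of what
launching an instance looks like programmatically. It uses boto3, the
current AWS SDK for Python; the AMI ID, key pair name, and instance
type are placeholders, not the exact values from our tutorial.)

    # Rough sketch: launch an EC2 instance to run the pipeline on.
    # Placeholder values -- substitute the AMI/key/type from the tutorial.
    import boto3

    ec2 = boto3.resource('ec2', region_name='us-east-1')

    instances = ec2.create_instances(
        ImageId='ami-XXXXXXXX',      # placeholder Ubuntu AMI
        InstanceType='m1.large',     # placeholder instance type
        KeyName='my-keypair',        # your own SSH key pair
        MinCount=1,
        MaxCount=1,
    )

    instance = instances[0]
    instance.wait_until_running()
    instance.reload()
    print('ssh ubuntu@%s' % instance.public_dns_name)
    # ...then follow the tutorial: install the software, clone the
    # paper repository, and run the pipeline.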

Not only can you do this, but if you try it, it will actually work.
I've done my best to make sure the darn thing works, and this is the
actual pipeline we ourselves ran to produce the figures in the paper.
All the data is there, and all of the code used to process the data,
analyze the results, and produce the figures is also there. In
version control.

When you combine that with the ability to run this on a specific EC2
instance -- a combination of a frozen virtual machine installation and
a specific set of hardware -- I feel pretty confident that at least
this component of our paper is something that can be replicated.

So how much extra work did all this take?

Well, interestingly enough, it wasn't that much work. I already
use version control for everything, including paper text; posting it
all to github was a matter of about three commands.

Writing the code, analysis scripts, and paper was an immense amount of
work. But I had to do that anyway.

The biggest extra effort went into making sure that the big data files
were available. I didn't want to add the 2 GB E. coli resequencing
data set to git, for example, so I ended up tarballing those files
and sticking them on S3.
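(For the record, "tarballing and sticking them on S3" amounts to
something like the sketch below. This is just an illustration using
Python's tarfile module and boto3; the bucket name and file names are
made up.)

    # Illustrative only: bundle the big data files and push the tarball to S3.
    # Bucket name and file list are placeholders.
    import tarfile
    import boto3

    files = ['ecoli_reads_1.fastq.gz', 'ecoli_reads_2.fastq.gz']  # hypothetical names

    with tarfile.open('diginorm-data.tar', 'w') as tar:
        for filename in files:
            tar.add(filename)

    s3 = boto3.client('s3')
    s3.upload_file('diginorm-data.tar', 'my-paper-data-bucket',
                   'diginorm-data.tar')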

The Makefile and analysis scripts are ugly, but suffice to remake
everything from scratch; they were already needed to make the paper,
so in order to post them, all I had to do was put in a teensy bit of
effort to remove some unintentional dependencies.

The ipython notebook used to generate the figures (again -- next blog
post) was probably the most effort, because I had to learn how to use
it, which took about 20 minutes. But it was one of the smoothest
transitions into using a new tool I've ever experienced in my ~25 years
of coding.
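(To give a flavor of what's in the notebook: each figure is basically
a cell that loads an output file from the pipeline and plots it, along
the lines of the sketch below. The file name and column layout here
are invented for illustration; the real cells live in the paper repo.)

    # Hypothetical notebook cell: plot a coverage distribution before/after diginorm.
    # 'coverage.txt' and its two-column layout are stand-ins for real pipeline output.
    import numpy as np
    import matplotlib.pyplot as plt

    data = np.loadtxt('coverage.txt')     # column 0: raw, column 1: normalized
    plt.hist(data[:, 0], bins=100, alpha=0.5, label='raw')
    plt.hist(data[:, 1], bins=100, alpha=0.5, label='normalized')
    plt.xlabel('per-read median k-mer coverage')
    plt.ylabel('number of reads')
    plt.legend()
    plt.savefig('figure1.pdf')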

Overall, it wasn't that much extra effort on my part.

Why bother in the first place??

The first and shortest answer is, because I could, and because I
believe in replication and reproducibility, and wanted to see how
tough it was to actually do something like this. (It's a good deal
above and beyond what most bioinformaticians do.)

Perhaps the strongest reason is that our group has been bitten a lot
in recent months by irreplicable results. I won't name names, but
several Science and PNAS and PLoS One papers of interest to us turned
out to be basically impossible for us to replicate. And, since we are
engaged in developing new computational methods that must be compared
to previous work, an inability to
regenerate exactly the results in those other papers meant we had to
work harder than we should have, simply to reproduce what they'd done.

A number of these problems came from people discarding large data sets
after publishing, under the mistaken belief that their submission to
the Short Read Archive could be used to regenerate their results.
(Often SRA submissions are unfiltered, and no one keeps the filtering
parameters around...right?) In some cases, I got the right data sets
from the authors and could replicate (kudos to Brian Haas of Trinity
for this!), but in most cases, ixnay on the eplicationre.

Then there were the cases where authors clearly were simply being bad
computational scientists. My favorite example is a very high profile
paper (coauthored by someone I admire greatly), in which the script
they sent to us -- a script necessary for the initial analyses -- had
a syntax error in it. In that case, we were fairly sure that the
authors weren't sending us the script they'd actually used... (It was
Perl, so admittedly it's hard to tell a syntax error from legitimate
code, but even the Perl interpreter was choking on this.)

(A few replication problems came from people using closed or
unpublished software, or being hand-wavy about the parameters they
used, or using version X of some Web-hosted pipeline for which only
version Y was now available. Clearly these are long-term issues that
need to be discussed with respect to replication in comp. bio., but
that's another topic.)

Thus, my group has wasted a lot of time replicating other people's
work. I wanted to avoid making other people go through that.

A third reason is that I really, really, really want to make it easy
for people to pick up this tool and use it. Digital normalization
is super ultra awesome and I want as little as possible to stand in
the way of others using it. So there's a strong element of
self-interest in doing things this way, and I hope it makes diginorm
more useful. (I know about a dozen people that have already tried it
out in the week or so since I made the paper available, which is
pretty cool. But citations will tell.)

Way back when, Jim Graham politely schooled me in the true meaning of
reproducibility, as opposed to replication. He was about 2/3 right,
but then he went a bit too far and said

But let's drop the idea that I'm going to take your data and your
code and "reproduce" your result. I'm not. First, I've got my own
work to do. More importantly, the odds are that nobody will be any
wiser when I'm done.

Well, let's take a look at that concern, shall we?

With the benefit of about two years of further practice, I can tell
you this is a dangerously wrong way to think, at least in the field of
bioinformatics. My objections hinge on a few points:

First, based on our experiences so far, I'd be surprised if the
authors themselves could replicate their own computational results --
too many files and parameters are missing. We call that "bad
science".

Second, odds are, the senior professor has little or no detailed
understanding of what bioinformatic steps were taken in processing the
data, and moreover is uninterested in the details; that's why they're
not in the Methods. Why is that a problem? Because the odds are
quite good that many biological analyses hinge critically on such
points. So the peer reviewers and the community at large need to be
able to evaluate them (see this RNA editing kerfuffle for an
excellent example of reviewer fail). Yet most bioinformatic pipelines
are so terribly described that even with some wild-ass guessing I
can't figure out what, roughly speaking, is going on. I certainly
couldn't replicate
it, and generating specific critiques is quite difficult in that kind
of circumstance.

Third, Graham may be guilty of physics chauvinism (just like I'm
almost certainly guilty of bioinformatics chauvinism...). Physics and
biology are quite different: in physics, you often have a theoretical
framework to go by, and results should at least roughly adhere to that
or else they are considered guilty until proven innocent. In biology,
we usually have no good idea of what we're expecting to see, and often
we're looking at a system for the very first time. In that
environment, I think it's important to make the underlying computation
WAY more solid than you would demand in physics (see RNA editing above).

As Narayan Desai pointed out to me (a point I then put in my PyCon
talk, slide 5), physics and biology are quite different in the way
data is generated and analyzed. There are fewer sources of data
generation in physics, more of a computational culture, and more
theory.
Having worked with physicists for much of my scientific life (and
having published a number of papers with physicists) I can tell you
that replication is certainly a big problem over there, but the
consequences don't seem as big -- eventually the differences between
theory and computation will be worked out, because they're far more
noticeable when you have theory, like in physics. Not so in biology.

Fourth, a renewed emphasis on computational methods (and therefore on
replicability of computational results) is a natural part of the
transition to Big Data biology. The quality of
analysis methods matters A LOT when you are dealing with massive
data sets with weak signals and many systematic biases. (I'll write
about this more later.)

Fifth, and probably most significant from a practical perspective,
Graham misses the point of reuse. In bioinformatics, it behooves us
to reuse proven (aka published) tools -- at least we know they worked
for someone, at least once, which is not usually the case for newly
written software. I don't pretend that it's the responsibility of
people to write awesome reusable tools for every paper, but sure as
heck I should expect to be able to run them on some combination of
hardware and software. Often that's not the case, which means I get
to reinvent the wheel (yay...) even when I'm doing the same stupid
thing the last five pubs did.

For our paper, khmer and screed should be quite reusable. The
analysis pipeline for the paper? It's not that great. But at least
you can run it, and potentially steal code from it, too.
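(To make "reusable" a bit more concrete: the sketch below streams
reads with screed and applies the core diginorm idea -- keep a read
only if its median k-mer count, as seen so far, is below a cutoff. It
uses a plain Python dict instead of khmer's memory-efficient counting
structure, and the file names and parameter values are purely
illustrative, so treat it as a toy, not as the paper's pipeline.)

    # Toy digital normalization: single pass over the reads, keeping a read
    # only while its median k-mer abundance (counted so far) is below CUTOFF.
    # A plain dict holds the counts; khmer uses a compact probabilistic
    # counting structure instead.
    import screed
    from collections import defaultdict

    K = 20        # illustrative k-mer size
    CUTOFF = 20   # illustrative coverage cutoff

    counts = defaultdict(int)

    def kmers(seq):
        return [seq[i:i + K] for i in range(len(seq) - K + 1)]

    def median_count(seq):
        observed = sorted(counts[k] for k in kmers(seq))
        return observed[len(observed) // 2] if observed else 0

    with open('kept.fa', 'w') as out:
        for record in screed.open('reads.fa'):    # 'reads.fa' is a stand-in
            if median_count(record.sequence) < CUTOFF:
                for k in kmers(record.sequence):
                    counts[k] += 1
                out.write('>%s\n%s\n' % (record.name, record.sequence))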

When I was talking to a colleague about the diginorm paper, he said
something jokingly: "wow, you're making it way too easy for people!"
-- presumably he meant it would be way too easy for people to criticize
or otherwise complain about the specific way we're doing things.
Then, a day or two later he said, "hmm, but now that I think of it, no
one ever uses the software we publish, and you seem to have had better
luck with that..." -- recognizing that if you are barely able to run
your own software, perhaps others might find it even more difficult.

Heck, the diginorm paper itself would have been far harder to write
without the data sets from the Trinity paper and the
Velvet-SC paper. Having those
nice, fresh, well-analyzed data sets already at hand was fantastic.
Being able to run Trinity and reproduce their results was wonderful.

There's a saying in software engineering: "one of the main people you
should be programming for is yourself, in 6 months." That's also true
in science -- I'm sure I won't remember the finer details of the
diginorm paper analysis in 2 years -- but I can always go look into
version control. More importantly, new graduate students can go look
and really see what's going on. (And I can use it for teaching, too.)
And so can other people working with me. So there's a lot of utility
in simply nailing everything down and making it runnable.

Replication is by no means sufficient for good science. But I'll be
more impressed by the argument that "replication isn't all that
important" when I see lack of replication as the exception rather than
the rule. Replication is essential, and good, and useful. I long for
the day when it's not interesting, because it's so standard. In
the meantime I would argue that it certainly doesn't do any harm to
emphasize it.

(Note that I really appreciate Jim Graham's commentary, as I think he
is at worst usefully wrong on these points, and substantially
correct in many ways. I'm just picking on him because he wrote it all
down in one place for me to link to, and chose to use the word 'sic'
when reproducing my spelling mistake. Low blow ;)

I don't pretend to have all, or even many, of the answers; I just like
to think about what form they might take.

I don't want to argue that this approach is a panacea or a
high-quality template for others to use, inside or out of
bioinformatics. For one thing, I haven't automated some of the
analyses in the paper; it's just too much work for too little benefit
at this point. (Trust me, they're easy to reproduce... :). For
another, our paper used a fairly small amount of data overall; only a
few dozen gigabytes all told. This makes it easy to post the data for
others to use later on. Several of our next few papers will involve
over a half terabyte of raw data, plus several hundred gb of ancillary
and intermediate results; no idea what we'll do for them.

Diginorm is also a somewhat strange bioinformatics paper. We just
analyzed other people's data sets (an approach which for some reason
isn't in favor in high impact bioinformatics, probably because high
impact journal subs are primarily reviewed by biologists who want to
see cool new data that we don't understand, not boring old data that
we don't understand). There's no way we can or should argue that
biological replicates done in a different lab should replicate the
results; that's where reproducibility becomes important.

But I would like it if people considered this approach (or some
other approach) to making their analyses replicable. I don't mind
people rejecting good approaches because they don't fit; to each their
own. But this kind of limited enabling of replication isn't that
difficult, frankly, and even if it were, it has plenty of upsides.
It's definitely not irrelevant to the practice of science -- I would
challenge anyone to try to make that claim in good faith.