At the workshop, people were still confused. So I
tried something new.

I first made a simulated metagenome by taking three genomes worth of
data from the Chitsaz et al. (2011) paper (see
http://bix.ucsd.edu/projects/singlecell/) and shuffling them together.
I combined the sequences in a ratio of 10:25:50 for the E. coli
sequences, the Staph sequences, and the SAR sequences, respectively;
the latter two were single-cell MDA genomic DNA. I took the first 10m
reads of this mix and then estimated the coverage.

You can see the coverage of these genomic data sets estimated by using
the known reference sequences in the first figure. E. coli looks nice
and Gaussian; Staph is smeared from here to heck; and much of the SAR
sequence is low coverage. This reflects the realities of single cell
sequencing: you get really weird copy number biases out of multiple
displacement amplification.

Then I applied three-pass digital normalization (see the paper) and
plotted the new abundances. As a reminder, this operates without
knowing the reference in advance; we're just using the known reference
here to check the effects.

Coverage of genome read mix, calculated by mapping the mixed reads
onto the known reference genomes.

Coverage post-digital-normalization, again calculated by mapping
the mixed reads onto the known reference genomes.

As you can see, digital normalization literally "normalizes" the data
to the best of its ability. That is, it cannot create higher coverage
where high coverage doesn't exist (for the SAR), but it can convert
the existing high coverage into nice, Gaussian distributions centered
around a much lower number. You also discard quite a bit of data (look
at the X axes -- about 85% of the reads were discarded in downsampling
the coverage like this).

When you assemble this, you get as good or better results than
assembling the unnormalized data, despite having discarded so much
data. This is because no low-coverage data is discarded, so you still
retain as much overall covered bases -- just in fewer reads. To boot,
it works pretty generically for single genomes, MDA genomes,
transcriptomes, and metagenomes.

And, as a reminder? Digital normalization does this in fixed, low
memory; in a single pass; and without any reference sequence needed.

I'm pretty proud of our most recently posted paper, which is on a
sequence analysis concept we call digital normalization. I think the paper is
pretty kick-ass, but so is the way in which we're approaching
replication. This blog post is about the latter.

(Quick note re "replication" vs "reproduction": The distinction
between replication and reproducibility is, from what I understand,
that "replicable" means "other people get exactly the same results
when doing exactly the same thing", while "reproducible" means
"something similar happens in other people's hands". The latter is
far stronger, in general, because it indicates that your results are
not merely some quirk of your setup and may actually be right.)

a link to the paper itself, in preprint form, stored at the arXiv
site;

a tutorial for running the software on a Linux machine hosted in
the Amazon cloud;

a git repository for the software itself (hosted on github);

a git repository for the LaTeX paper and analysis scripts (also
hosted on github), including an ipython notebook for generating the
figures (more about that in my next blog post);

instructions on how to start up an EC2 cloud instance, install the
software and paper pipeline, and build most of the analyses and all
of the figures from scratch;

the data necessary to run the pipeline;

some of the output data discussed in the paper.

(Whew, it makes me a little tired just to type all that...)

What this means is that you can regenerate substantial amounts (but
not all) of the data and analyses underlying the paper from scratch,
all on your own, on a machine that you can rent for something like 50
cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to
re-run everything, plus some incidental costs for things like downloads.)

Not only can you do this, but if you try it, it will actually work.
I've done my best to make sure the darn thing works, and this is the
actual pipeline we ourselves ran to produce the figures in the paper.
All the data is there, and all of the code used to process the data,
analyze the results, and produce the figures is also there. In
version control.

When you combine that with the ability to run this on a specific EC2
instance -- a combination of a frozen virtual machine installation and
a specific set of hardware -- I feel pretty confident that at least
this component of our paper is something that can be replicated.

Well, interestingly enough, it wasn't that much work. I already
use version control for everything, including paper text; posting it
all to github was a matter of about three commands.

Writing the code, analysis scripts, and paper was an immense amount of
work. But I had to do that anyway.

The most extra effort I put in was making sure that the big data files
were available. I didn't want to add the the 2gb E. coli resequencing
data set to git, for example. So I ended up tarballing those files
sticking them on S3.

The Makefile and analysis scripts are ugly, but suffice to remake
everything from scratch; they were already needed to make the paper,
so in order to post them all I had to do was put in a teensy bit of
effort to remove some unintentional dependencies.

The ipython notebook used to generate the figures (again -- next blog
post) was probably the most effort, because I had to learn how to use
it, which took about 20 minutes. But it was one of the smoothest
transitions into using a new tool I've ever experienced in my ~25 years
of coding.

Overall, it wasn't that much extra effort on my part.

Why bother in the first place??

The first and shortest answer is, because I could, and because I
believe in replication and reproducibility, and wanted to see how
tough it was to actually do something like this. (It's a good deal
above and beyond what most bioinformaticians do.)

Perhaps the strongest reason is that our group has been bitten a lot
in recent months by irreplicable results. I won't name names, but
several Science and PNAS and PLoS One papers of interest to us turned
out to be basically impossible for us to replicate. And, since we are
engaged in developing new computational methods that must be compared
to previous work, an inability to
regenerate exactly the results in those other papers meant we had to
work harder than we should have, simply to reproduce what they'd done.

A number of these problems came from people discarding large data sets
after publishing, under the mistaken belief that their submission to
the Short Read Archive could be used to regenerate their results.
(Often SRA submissions are unfiltered, and no one keeps the filtering
parameters around...right?) In some cases, I got the right data sets
from the authors and could replicate (kudos to Brian Haas of Trinity
for this!), but in most cases, ixnay on the eplicationre.

Then there were the cases where authors clearly were simply being bad
computational scientists. My favorite example is a very high profile
paper (coauthored by someone I admire greatly), in which the script
they sent to us -- a script necessary for the initial analyses -- had
a syntax error in it. In that case, we were fairly sure that the
authors weren't sending us the script they'd actually used... (It was
Perl, so admittedly it's hard to tell a syntax error from legitimate
code, but even the Perl interpreter was choking on this.)

(A few replication problems came from people using closed or
unpublished software, or being hand-wavy about the parameters they
used, or using version X of some Web-hosted pipeline for which only
version Y was now available. Clearly these are long-term issues that
need to be discussed with respect to replication in comp. bio., but
that's another topic.)

Thus, my group has wasted a lot of time replicating other people's
work. I wanted to avoid making other people go through that.

A third reason is that I really, really, really want to make it easy
for people to pick up this tool and use it. Digital normalization
is super ultra awesome and I want as little as possible to stand in
the way of others using it. So there's a strong element of
self-interest in doing things this way, and I hope it makes diginorm
more useful. (I know about a dozen people that have already tried it
out in the week or so since I made the paper available, which is
pretty cool. But citations will tell.)

Way back when, Jim Graham politely schooled me in the true meaning of
reproducibility, as opposed to replication. He was about 2/3 right,
but then he went a bit too far and said

But let's drop the idea that I'm going to take your data and your
code and "reproduce" your result. I'm not. First, I've got my own
work to do. More importantly, the odds are that nobody will be any
wiser when I'm done."

Well, let's take a look at that concern, shall we?

With the benefit of about two years of further practice, I can tell
you this is a dangerously wrong way to think, at least in the field of
bioinformatics. My objections hinge on a few points:

First, based on our experiences so far, I'd be surprised if the
authors themselves could replicate their own computational results --
too many files and parameters are missing. We call that "bad
science".

Second, odds are, the senior professor has little or no detailed
understanding of what bioinformatic steps were taken in processing the
data, and moreover is uninterested in the details; that's why they're
not in the Methods. Why is that a problem? Because the odds are
quite good that many biological analyses hinge critically on such
points. So the peer reviewers and the community at large need to be
able to evaluate them (see this RNA editing kerfuffle for an
excellent example of reviewer fail). Yet most bioinformatic pipelines
are so terribly described that even with some WAG I can't figure out
what, roughly speaking, is going on. I certainly couldn't replicate
it, and generating specific critiques is quite difficult in that kind
of circumstance.

Third, Graham may be guilty of physics chauvinism (just like I'm
almost certainly guilty of bioinformatics chauvinism...) Physics and
biology are quite different: in physics, you often have a theoretical
framework to go by, and results should at least roughly adhere to that
or else they are considered guilty until proven innocent. In biology,
we usually have no good idea of what we're expecting to see, and often
we're looking at a system for the very first time. In that
environment, I think it's important to make the underlying computation
WAY more solid than you would demand in physics (see RNA editing above).

As Narayan Desai pointed out to me (following which I then put it in
my PyCon talk (slide 5)),
physics and biology are quite different in the way data is generated
and analyzed. There's fewer sources of data generation in physics,
there's more of a computational culture, and there's more theory.
Having worked with physicists for much of my scientific life (and
having published a number of papers with physicists) I can tell you
that replication is certainly a big problem over there, but the
consequences don't seem as big -- eventually the differences between
theory and computation will be worked out, because they're far more
noticeable when you have theory, like in physics. Not so in biology.

Fourth, a renewed emphasis on computational methods (and therefore on
replicability of computational results) is a natural part of the
transition to Big Data biology. The quality of
analysis methods matters A LOT when you are dealing with massive
data sets with weak signals and many systematic biases. (I'll write
about this more later.)

Fifth, and probably most significant from a practical perspective,
Graham misses the point of reuse. In bioinformatics, it behooves us
to reuse proven (aka published) tools -- at least we know they worked
for someone, at least once, which is not usually the case for newly
written software. I don't pretend that it's the responsibility of
people to write awesome reusable tools for every paper, but sure as
heck I should expect to be able to run them on some combination of
hardware and software. Often that's not the case, which means I get
to reinvent the wheel (yay...) even when I'm doing the same stupid
thing the last five pubs did.

For our paper, khmer and screed should be quite reusable. The
analysis pipeline for the paper? It's not that great. But at least
you can run it, and potentially steal code from it, too.

When I was talking to a colleague about the diginorm paper, he said
something jokingly: "wow, you're making it way too easy for people!"
-- presumably he meant it would be way to easy for people to criticize
or otherwise complain about the specific way we're doing things.
Then, a day or two later he said, "hmm, but now that I think of it, no
one ever uses the software we publish, and you seem to have had better
luck with that..." -- recognizing that if you are barely able to run
your own software, perhaps others might find it even more difficult.

Heck, the diginorm paper itself would have been far harder to write
without the data sets from the Trinity paper and the
Velvet-SC paper. Having those
nice, fresh, well-analyzed data sets already at hand was fantastic.
Being able to run Trinity and reproduce their results was wonderful.

There's a saying in software engineering: "one of the main people you
should be programming for is yourself, in 6 months." That's also true
in science -- I'm sure I won't remember the finer details of the
diginorm paper analysis in 2 years -- but I can always go look into
version control. More importantly, new graduate students can go look
and really see what's going on. (And I can use it for teaching, too.)
And so can other people working with me. So there's a lot of utility
in simply nailing everything down and making it runnable.

Replication is by no means sufficient for good science. But I'll be
more impressed by the argument that "replication isn't all that
important" when I see lack of replication as the exception rather than
the rule. Replication is essential, and good, and useful. I long for
the day when it's not interesting, because it's so standard. In
the meantime I would argue that it certainly doesn't do any harm to
emphasize it.

(Note that I really appreciate Jim Graham's commentary, as I think he
is at worst usefully wrong on these points, and substantially
correct in many ways. I'm just picking on him because he wrote it all
down in one place for me to link to, and chose to use the word 'sic'
when reproducing my spelling mistake. Low blow ;)

I don't pretend to have all, or even many, of the answers; I just like
to think about what form they might take.

I don't want to argue that this approach is a panacea or a
high-quality template for others to use, inside or out of
bioinformatics. For one thing, I haven't automated some of the
analyses in the paper; it's just too much work for too little benefit
at this point. (Trust me, they're easy to reproduce... :). For
another, our paper used a fairly small amount of data overall; only a
few dozen gigabytes all told. This makes it easy to post the data for
others to use later on. Several of our next few papers will involve
over a half terabyte of raw data, plus several hundred gb of ancillary
and intermediate results; no idea what we'll do for them.

Diginorm is also a somewhat strange bioinformatics paper. We just
analyzed other people's data sets (an approach which for some reason
isn't in favor in high impact bioinformatics, probably because high
impact journal subs are primarily reviewed by biologists who want to
see cool new data that we don't understand, not boring old data that
we don't understand). There's no way we can or should argue that
biological replicates done in a different lab should replicate the
results; that's where reproducibility becomes important.

But I would like it if people considered this approach (or some
other approach) to making their analyses replicable. I don't mind
people rejecting good approaches because they don't fit; to each their
own. But this kind of limited enabling of replication isn't that
difficult, frankly, and even if it were, it has plenty of upsides.
It's definitely not irrelevant to the practice of science -- I would
challenge anyone to try to make that claim in good faith.

I'm pretty proud of our most recently posted paper, which is on a
sequence analysis concept we call digital normalization. I think the paper is
pretty kick-ass, but so is the way in which we're approaching
replication. This blog post is about the latter.

(Quick note re "replication" vs "reproduction": The distinction
between replication and reproducibility is, from what I understand,
that "replicable" means "other people get exactly the same results
when doing exactly the same thing", while "reproducible" means
"something similar happens in other people's hands". The latter is
far stronger, in general, because it indicates that your results are
not merely some quirk of your setup and may actually be right.)

a link to the paper itself, in preprint form, stored at the arXiv
site;

a tutorial for running the software on a Linux machine hosted in
the Amazon cloud;

a git repository for the software itself (hosted on github);

a git repository for the LaTeX paper and analysis scripts (also
hosted on github), including an ipython notebook for generating the
figures (more about that in my next blog post);

instructions on how to start up an EC2 cloud instance, install the
software and paper pipeline, and build most of the analyses and all
of the figures from scratch;

the data necessary to run the pipeline;

some of the output data discussed in the paper.

(Whew, it makes me a little tired just to type all that...)

What this means is that you can regenerate substantial amounts (but
not all) of the data and analyses underlying the paper from scratch,
all on your own, on a machine that you can rent for something like 50
cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to
re-run everything, plus some incidental costs for things like downloads.)

Not only can you do this, but if you try it, it will actually work.
I've done my best to make sure the darn thing works, and this is the
actual pipeline we ourselves ran to produce the figures in the paper.
All the data is there, and all of the code used to process the data,
analyze the results, and produce the figures is also there. In
version control.

When you combine that with the ability to run this on a specific EC2
instance -- a combination of a frozen virtual machine installation and
a specific set of hardware -- I feel pretty confident that at least
this component of our paper is something that can be replicated.

Well, interestingly enough, it wasn't that much work. I already
use version control for everything, including paper text; posting it
all to github was a matter of about three commands.

Writing the code, analysis scripts, and paper was an immense amount of
work. But I had to do that anyway.

The most extra effort I put in was making sure that the big data files
were available. I didn't want to add the the 2gb E. coli resequencing
data set to git, for example. So I ended up tarballing those files
sticking them on S3.

The Makefile and analysis scripts are ugly, but suffice to remake
everything from scratch; they were already needed to make the paper,
so in order to post them all I had to do was put in a teensy bit of
effort to remove some unintentional dependencies.

The ipython notebook used to generate the figures (again -- next blog
post) was probably the most effort, because I had to learn how to use
it, which took about 20 minutes. But it was one of the smoothest
transitions into using a new tool I've ever experienced in my ~25 years
of coding.

Overall, it wasn't that much extra effort on my part.

Why bother in the first place??

The first and shortest answer is, because I could, and because I
believe in replication and reproducibility, and wanted to see how
tough it was to actually do something like this. (It's a good deal
above and beyond what most bioinformaticians do.)

Perhaps the strongest reason is that our group has been bitten a lot
in recent months by irreplicable results. I won't name names, but
several Science and PNAS and PLoS One papers of interest to us turned
out to be basically impossible for us to replicate. And, since we are
engaged in developing new computational methods that must be compared
to previous work, an inability to
regenerate exactly the results in those other papers meant we had to
work harder than we should have, simply to reproduce what they'd done.

A number of these problems came from people discarding large data sets
after publishing, under the mistaken belief that their submission to
the Short Read Archive could be used to regenerate their results.
(Often SRA submissions are unfiltered, and no one keeps the filtering
parameters around...right?) In some cases, I got the right data sets
from the authors and could replicate (kudos to Brian Haas of Trinity
for this!), but in most cases, ixnay on the eplicationre.

Then there were the cases where authors clearly were simply being bad
computational scientists. My favorite example is a very high profile
paper (coauthored by someone I admire greatly), in which the script
they sent to us -- a script necessary for the initial analyses -- had
a syntax error in it. In that case, we were fairly sure that the
authors weren't sending us the script they'd actually used... (It was
Perl, so admittedly it's hard to tell a syntax error from legitimate
code, but even the Perl interpreter was choking on this.)

(A few replication problems came from people using closed or
unpublished software, or being hand-wavy about the parameters they
used, or using version X of some Web-hosted pipeline for which only
version Y was now available. Clearly these are long-term issues that
need to be discussed with respect to replication in comp. bio., but
that's another topic.)

Thus, my group has wasted a lot of time replicating other people's
work. I wanted to avoid making other people go through that.

A third reason is that I really, really, really want to make it easy
for people to pick up this tool and use it. Digital normalization
is super ultra awesome and I want as little as possible to stand in
the way of others using it. So there's a strong element of
self-interest in doing things this way, and I hope it makes diginorm
more useful. (I know about a dozen people that have already tried it
out in the week or so since I made the paper available, which is
pretty cool. But citations will tell.)

Way back when, Jim Graham politely schooled me in the true meaning of
reproducibility, as opposed to replication. He was about 2/3 right,
but then he went a bit too far and said

But let's drop the idea that I'm going to take your data and your
code and "reproduce" your result. I'm not. First, I've got my own
work to do. More importantly, the odds are that nobody will be any
wiser when I'm done."

Well, let's take a look at that concern, shall we?

With the benefit of about two years of further practice, I can tell
you this is a dangerously wrong way to think, at least in the field of
bioinformatics. My objections hinge on a few points:

First, based on our experiences so far, I'd be surprised if the
authors themselves could replicate their own computational results --
too many files and parameters are missing. We call that "bad
science".

Second, odds are, the senior professor has little or no detailed
understanding of what bioinformatic steps were taken in processing the
data, and moreover is uninterested in the details; that's why they're
not in the Methods. Why is that a problem? Because the odds are
quite good that many biological analyses hinge critically on such
points. So the peer reviewers and the community at large need to be
able to evaluate them (see this RNA editing kerfuffle for an
excellent example of reviewer fail). Yet most bioinformatic pipelines
are so terribly described that even with some WAG I can't figure out
what, roughly speaking, is going on. I certainly couldn't replicate
it, and generating specific critiques is quite difficult in that kind
of circumstance.

Third, Graham may be guilty of physics chauvinism (just like I'm
almost certainly guilty of bioinformatics chauvinism...) Physics and
biology are quite different: in physics, you often have a theoretical
framework to go by, and results should at least roughly adhere to that
or else they are considered guilty until proven innocent. In biology,
we usually have no good idea of what we're expecting to see, and often
we're looking at a system for the very first time. In that
environment, I think it's important to make the underlying computation
WAY more solid than you would demand in physics (see RNA editing above).

As Narayan Desai pointed out to me (following which I then put it in
my PyCon talk (slide 5)),
physics and biology are quite different in the way data is generated
and analyzed. There's fewer sources of data generation in physics,
there's more of a computational culture, and there's more theory.
Having worked with physicists for much of my scientific life (and
having published a number of papers with physicists) I can tell you
that replication is certainly a big problem over there, but the
consequences don't seem as big -- eventually the differences between
theory and computation will be worked out, because they're far more
noticeable when you have theory, like in physics. Not so in biology.

Fourth, a renewed emphasis on computational methods (and therefore on
replicability of computational results) is a natural part of the
transition to Big Data biology. The quality of
analysis methods matters A LOT when you are dealing with massive
data sets with weak signals and many systematic biases. (I'll write
about this more later.)

Fifth, and probably most significant from a practical perspective,
Graham misses the point of reuse. In bioinformatics, it behooves us
to reuse proven (aka published) tools -- at least we know they worked
for someone, at least once, which is not usually the case for newly
written software. I don't pretend that it's the responsibility of
people to write awesome reusable tools for every paper, but sure as
heck I should expect to be able to run them on some combination of
hardware and software. Often that's not the case, which means I get
to reinvent the wheel (yay...) even when I'm doing the same stupid
thing the last five pubs did.

For our paper, khmer and screed should be quite reusable. The
analysis pipeline for the paper? It's not that great. But at least
you can run it, and potentially steal code from it, too.

When I was talking to a colleague about the diginorm paper, he said
something jokingly: "wow, you're making it way too easy for people!"
-- presumably he meant it would be way to easy for people to criticize
or otherwise complain about the specific way we're doing things.
Then, a day or two later he said, "hmm, but now that I think of it, no
one ever uses the software we publish, and you seem to have had better
luck with that..." -- recognizing that if you are barely able to run
your own software, perhaps others might find it even more difficult.

Heck, the diginorm paper itself would have been far harder to write
without the data sets from the Trinity paper and the
Velvet-SC paper. Having those
nice, fresh, well-analyzed data sets already at hand was fantastic.
Being able to run Trinity and reproduce their results was wonderful.

There's a saying in software engineering: "one of the main people you
should be programming for is yourself, in 6 months." That's also true
in science -- I'm sure I won't remember the finer details of the
diginorm paper analysis in 2 years -- but I can always go look into
version control. More importantly, new graduate students can go look
and really see what's going on. (And I can use it for teaching, too.)
And so can other people working with me. So there's a lot of utility
in simply nailing everything down and making it runnable.

Replication is by no means sufficient for good science. But I'll be
more impressed by the argument that "replication isn't all that
important" when I see lack of replication as the exception rather than
the rule. Replication is essential, and good, and useful. I long for
the day when it's not interesting, because it's so standard. In
the meantime I would argue that it certainly doesn't do any harm to
emphasize it.

(Note that I really appreciate Jim Graham's commentary, as I think he
is at worst usefully wrong on these points, and substantially
correct in many ways. I'm just picking on him because he wrote it all
down in one place for me to link to, and chose to use the word 'sic'
when reproducing my spelling mistake. Low blow ;)

I don't pretend to have all, or even many, of the answers; I just like
to think about what form they might take.

I don't want to argue that this approach is a panacea or a
high-quality template for others to use, inside or out of
bioinformatics. For one thing, I haven't automated some of the
analyses in the paper; it's just too much work for too little benefit
at this point. (Trust me, they're easy to reproduce... :). For
another, our paper used a fairly small amount of data overall; only a
few dozen gigabytes all told. This makes it easy to post the data for
others to use later on. Several of our next few papers will involve
over a half terabyte of raw data, plus several hundred gb of ancillary
and intermediate results; no idea what we'll do for them.

Diginorm is also a somewhat strange bioinformatics paper. We just
analyzed other people's data sets (an approach which for some reason
isn't in favor in high impact bioinformatics, probably because high
impact journal subs are primarily reviewed by biologists who want to
see cool new data that we don't understand, not boring old data that
we don't understand). There's no way we can or should argue that
biological replicates done in a different lab should replicate the
results; that's where reproducibility becomes important.

But I would like it if people considered this approach (or some
other approach) to making their analyses replicable. I don't mind
people rejecting good approaches because they don't fit; to each their
own. But this kind of limited enabling of replication isn't that
difficult, frankly, and even if it were, it has plenty of upsides.
It's definitely not irrelevant to the practice of science -- I would
challenge anyone to try to make that claim in good faith.

Deep shotgun sequencing and analysis of genomes, transcriptomes,
amplified single-cell genomes, and metagenomes enable the sensitive
investigation of a wide range of biological phenomena. However, it is
increasingly difficult to deal with the volume of data emerging from
deep short-read sequencers, in part because of random and systematic
sampling variation as well as a high sequencing error rate. These
challenges have led to the development of entire new classes of
short-read mapping tools, as well as new de novo assemblers.
Even newer assembly strategies for dealing with transcriptomes,
single-cell genomes, and metagenomes have also emerged. Despite these
advances, algorithms and compute capacity continue to be challenged by
the continued improvements in sequencing technology throughput. We
here describe an approach we term digital normalization, a single-pass
computational algorithm that discards redundant data and both sampling
variation and the number of errors present in deep sequencing data
sets. Digital normalization substantially reduces the size of data
sets and accordingly decreases the memory and time requirements for
de novo sequence assembly, all without significantly impacting
content of the generated contigs. In doing so, it converts high
random coverage to low systematic coverage. Digital normalization is
an effective and efficient approach to normalizing coverage, removing
errors, and reducing data set size for shotgun sequencing data sets.
It is particularly useful for reducing the compute requirements for
de novo sequence assembly. We demonstrate this for the assembly
of microbial genomes, amplified single-cell genomic data, and
transcriptomic data. The software is freely available for use and
modification.

---

I'll blog more about this stuff over the next few days, but, briefly,
this paper presents a single pass, fixed-memory approach to
downsampling sequencing data (yeah, the stuff I've been talking about
for a while now). This
approach is called "digital normalization", or "diginorm" for sure.
It eliminates lots of data, evens out coverage, and removes errors
from shotgun sequencing data sets. The net effect is an often massive
amount of data reduction combined with significant scaling of de
novo assembly.

Or, to put it another way, it's a streaming lossy compression
algorithm that primarily "loses" errors in sequencing data.

We've implemented it as a preprocessing filter that should work on any
data set you want to assemble, with potentially any assembler. It's
written in C++, with a Python wrapper, as part of khmer. And, of course, it's freely
available for use, re-use, modification, and redistribution under the
BSD license, 'cause why the heck not?

If you want to try it out, we've linked to some tutorials for running
microbial genome assemblies with Velvet, as well as eukaryotic
transcriptome assemblies with Oases and Trinity, on the
paper Web site

We'll submit this paper to PNAS on Friday; I'm still waiting for some
final proofreading.

Together with our other paper,
which is currently under revision for resubmission to PNAS, these two
papers form the theoretical basis for our attack on soil metagenome
assembly. (Our plan is sheer elegance in its simplicity!)

A few people have recently asked me what this "Big Data" thing is in
biology. It's not an easy question to answer, though, because
biology's a bit peculiar, and a lot of Big Data researchers are not
working in bio. While I was thinking about this I kept on coming up
with anecdotes -- and, well, that turned into this: the Top 12 Reasons
You Know You Are a Big Data Biologist.

---

You no longer get enthused by the prospect of more data.

I was at a conference a few months back, and an Brit colleague of mine
rushed up to me and said, "Hey! We just got an Illumina HiSeq! Do you
have anything you want sequenced?" My immediate, visceral response
was "Hell no! We can't even deal with a 10th of the sequence we've
already got! PLEASE don't send me any more! Can I PAY you to not send
me any more?"

I was chatting with a colleague in the hallway here at MSU, pitching
some ideas about microbiome work, and he said, "What a good idea! I
have some samples here that I'd like to sequence that could help with
that." I responded, "How about we stop producing data and think about
the analysis?" He seemed only mildly offended -- I have a rep, you
see -- but biology, as a field, is wholly unprepared to go from
data to an analyses.

My lab has finally turned the corner from "hey, let's generate more
data!" to "cool, data that we can analyze and build hypotheses from!"
-- yeah, we're probably late bloomers, but after about 3 years of tech
development, there's very little we can't do at a basic sequence
level. Analysis at that level is no longer the bottleneck. Next
step? Trying to do some actual biology! But we are working in a few
very specific areas, and I think the whole field needs to undergo this
kind of shift.

You've finally learned that 'for' loops aren't all they're cracked up to be.

This was one of my early lessons. Someone had dumped a few 10s of
millions of reads on me -- a mere gigabyte or so of data -- and I was
looking for patterns in the reads. I sat down to write my usual
opening lines of Python to look at the data: "for sequence in
dataset:" and ... just stopped. The data was just too big to do
iteration in a scripting language with it! Cleverness of some sort
needed to be applied.

Let's face it, if you've just spent a few $k generating dozens of
gigabytes amounts of hypothesis-neutral data on gene expression from
(say) chick, the end goal of generating a list of genes that are
differentially regulated is just not that exciting. (Frankly, you
probably could have done it with qPCR, if you didn't want that cachet
of "mRNAseq!") What more can you do with the data?

(This point comes courtesy of Jim Tiedje, who says (I paraphrase):
"The problem for assistant professors these days is that many of may
not be thinking big enough." Also see: Don't be afraid of failure:
really go for it)

The biggest training challenge (in my opinion) is going to be training
people in how to push past the obvious analysis of the data and go for
deeper integrative insight. This will require training not just in
biology but in data analysis, computational thinking, significant
amounts of informed skepticism, etc. (See our course.) I think about it
like this: generating hypotheses from large amounts of data isn't that
interesting -- I can do that with publicly available data sets without
spending any money! Constraining the space of hypotheses with big
data sets is far more interesting, because it gives you the space of
hypotheses that aren't ruled out; and its putting your data to good
use. I'll, uh, let you know if I ever figure out how to do this
myself...

I think there are plenty of people that can learn to do this, but as
Greg Wilson correctly points out, there has to be a tradeoff: what
do you take out of existing training curricula, to be replaced with
training in data analysis? I wish I knew.

Transferring data around is getting more than a bit tedious.

For some recent papers, I had to copy some big files from EC2 over to
my lab computer, and from there to our HPC system. It was Slow, in
the "time for lunch" sense. And these were small test data sets,
compressed. Transferring our big data sets around is getting tedious.
Luckily we have a lot of them, so I can usually work on analysis of
one while another is transferring.

5. Your sysadmin/HPC administrator yells at you on a regular basis about your
disk usage.

We regularly get nastygrams from our local sysadmins accusing of using
too many terabytes of disk space. This is in contrast to the good ol'
days of physics (which is where I got my sysadmin chops), where your
sysadmin would yell at you for using too much CPU...

You regularly crash big memory machines.

My favorite quote of the year so far is from the GAGE paper), in
which Salzberg et al (2012) say "For larger genomes, the choice of
assemblers is often limited to those that will run without crashing."
Basically, they took a reasonably big computer, threw some big data at
various assembly packages, and watched the computer melt down. Repeatedly.

Someone recently sent me an e-mail saying "hey, we did it! we took 3
Gb of sequence data from soil and assembled it in only 1 week in 3 TB
of RAM!" Pretty cool stuff -- but consider that at least four to six
runs would need to be done (parameter sweep!), and it takes only about
1 week and $10k to generate twice that data. In the long run, this
does not seem cost effective. (It currently takes us 1-2 months to
analyze this data in 300 GB of RAM. I'm not saying we have the
answer. ;)

Throwing away data looks better and better.

I made a kind of offhanded comment to a NY Times reporter once (hint: don't do
this) about how at some point we're going to need to start throwing away data.
He put it as the ultimate quote in his article. People laughed at me for it.
(BUT I WAS RIGHT! HISTORY WILL SHOW!)

But seriously, if someone came up to you and said "we can get rid of
90% of your data for you and give you an answer that's just as good",
many biologists would have an instant negative response. But I think
there's a ground truth in there: a lot of Big Data is noise. If you
can figure out how to get rid of it... why wouldn't you? This is an
interesting shift in thinking from the "every data point is precious
and special" that you adopt when it takes you a !#%!#$ week to
generate each data point.

I attended a talk that David Haussler gave at Caltech recently. He
was talking about how eventually we would need to resequence millions
of individual cancer cells to look for linked sets of mutations. At
50-300 Gb of sequence per cell, that's a lot of data. But most of
that data is going to be uninteresting -- wouldn't it be great if we
could pick out the interesting people and then throw the rest away?
It would certainly help with data archiving and analysis...

True story: a Big Computer Company called up our local HPC to ask why
we were buying so many bigmem machines. They said "It's the damned
biologists -- they keep on wanting more memory. Why? We don't know -
we suspect the software's badly written, but can't tell. Why don't
you talk to Titus? He pretends to understand this stuff." I don't
think it's weird to get calls trying to sell me stuff -- but it is a
bit weird to have our local HPC's buying habits be so out of
character, due to work that I and others are doing, that Big Computer
Companies notice.

(Note: the software's not mine, and it's not badly written, either.)

9. Your choice must increasingly be "improve algorithms" rather than "buy
bigger computers"

I've been banging this drum for a while. Sequencing
capacity is outpacing Moore's Law, and so we need to rethink
algorithms. An algorithm that was nlogn used to be good enough; now,
if analysis requires a supra-linear algorithm, we need to figure out
how to make it linear. (Sublinear would be better.)

Anecdote: we developed a nifty data structure for attacking metagenome assembly
(see: http://arxiv.org/abs/1112.4193). It scaled (scales) assembly by a factor
of about 20x, which got us pretty excited -- that meant we could in theory
assemble things like MetaHIT and rumen on commodity hardware without doing
abundance filtering. Literally the day that we told our collaborators we
had it working, they dumped 10x more data on us and told that they could send
us more any time we wanted. (...and that, boys and girls, was our introduction
to the HiSeq!) 20x wasn't enough. Sigh.

The MG-RAST folk have told me a similar story. They did some
awesomely cool engineering and got their pipeline running about 100x
faster. That'll hold them for a year or so against the tidalwave of
data.

Corollary: don't waste your time with 2% improvements in sensitivity and
specificity unless you also deliver 10x in compute performance.

10. You spend a lot of time worrying about biased noise, cross-validation, and
the incorrect statistical models used.

We were delayed in some of our research by about a year, because of
some systematic biases being placed in our sequencing data by
Illumina. Figuring out that these non-biological features were there
took about two months; figuring out how to remove them robustly took
another 6 months; and then making sure that removing didn't screw up
the actual biological signal took another four months.

This is a fairly typical story from people who do a lot of data analysis.
We developed a variety of cross-validation techniques and ways of intuiting
whether or not something was "real" or "noise", and we spent a certain
amount of time discussing what statistical approaches to use to assess
significance. In the end we more or less gave up and pointed out that
on simulated data what we were doing didn't screw things up.

Silicon Valley wants to hire your students to go work on non-biology problems.

Hey, it's all Big Data, right?

---

So: what is Big Data in biology?

First, I've talked mostly about DNA sequence analysis, because that's
what I work on. But I know that proteomics and image analysis people
are facing similar dilemmas. So it's not just sequence data.

Second, compute technology is constantly improving. So I think we
need moving definitions.

Here are some more serious points that I think bear on what, exactly,
problems for Big Data in biology. (They're not all specific to biology,
but I can defend them on my home ground more easily, you see.)

You have archival issues on a large scale.

You have lots of homogeneously formatted data that probably contains
answers you don't know you're looking for yet, so you need to save it,
metadata it, and catalog it. For a long time.

The rate at which data is arriving is itself increasing.

You aren't just getting one data set. You're getting dozens (or hundreds)
this year. And you'll get more than that next year.

One implication of this is that you'd better have a largely automated
analysis pipeline, or else you will need an increasing number of people
just to work with the data, much less do anything interesting. Another
implication is that software reuse becomes increasingly important: if
you're building custom software for each data set, you will fall behind.
A third implication is that you need a long-term plan for scaling your
compute capacity.

Data structure and algorithm research is increasingly needed.

You cannot rely on many heavyweight iterations over your data, or
simple data structures for lookup: the data is just too big and
existing algorithms are tailored to smaller data. For example, BLAST
works fine for a few gigabytes of data; past that, it becomes
prohibitively slow.

Infrastructure designers are needed.

Issues of straightforward data transfer, network partitioning, and
bus bandwidth start to come to the forefront. Bottleneck analysis
needs to be done regularly. In the past, you could get away with
"good enough", but as throughput continues to increase, bottlenecks
will need to be tackled on a regular basis. For this, you need
a person who is immersed in your problems on a regular basis; they
are hard to find and hard to keep.

One interesting implication here is for cloud computing, where
smart people set up a menu of infrastructure options and you can
tailor your software to those options. So far I like the idea,
but I'm told by aficionados that (for example) Amazon still falls
short.

You have specialized hardware needs.

Sort of a corollary of the above: what kind of analyses do you need to
do? And what's the hardware bottleneck? That's where you'll get the
most benefit from focused hardware investment.

Hardware, infrastructure design, and algorithms all need to work together.

Again, a corollary of the above, but: if your bottleneck is memory, focus
on memory improvements. If your bottleneck is disk I/O, focus on hardware
speed and caching. If your bottleneck is data transfer, try to bring your
compute to your data.

Software needs to change to be reusable and portable.

Robust, reusable software platforms are needed, with good execution
guarantees; that way you have a foundation to build on. This software
needs to be flexible (practically speaking, scriptable in a language
like Python or Ruby or Perl), well developed and tested, and should
fade into the background so that you can focus on more interesting
things like your actual analysis. It should also be portable so that
you can "scale out" -- bring the compute to your data, rather than
vice versa. This is where Hadoop and Pig and other such approaches fit
now, and where we seriously need to build software infrastructure in
biology.

Analysis thinking needs to change.

Comprehensively analyzing your data sets is tough when your data sets
are really big and noisy. Extracting significant signals from them is
potentially much easier, and some approaches and algorithms for doing
this in biology exist or are being developed (see especially
Lior Pachter's eXpress).
But this is a real shift in algorithmic thinking, and it's also a real
shift in scientific thinking, because you're no longer trying do
understand the entire data set -- you're trying to focus on the bits
that might be interesting.

Analyses are increasingly integrative.

It's hard to make sense of lots of data on its own: you need to link it in
to other data sets. Data standards and software interoperability and
"standard" software pipelines are incredibly important for doing this.

The interesting problems are still discipline-specific.

There are many people working on Big Data, and there is big business
in generic solutions. There's lots of Open Source stuff going on,
too. Don't reinvent those wheels; figure out how to connect them to
your biology, and then focus on the bits that are interesting to you
and important for your science.

11. New machine learning, data mining, and statistical models need to be
developed for data-intensive biological science.

As data volume increases, and integrative hypothesis development
proceeds, we need to figure out how to assess the quality and
significance of hypotheses. Right now, a lot of people throw their
data at several programs, pick their favorite answer, and then recite
the result as if it's correct. Since often they will have tried out
many programs, this presents an obvious multiple testing problem.
And, since users are generally putting in data that violates
one or more of the precepts of the program developers, the results
may not be applicable.

A lack of fear of computational approaches is a competitive advantage.

The ability to approach computational analyses as just another black
box upon which controls must be placed is essential. Even if you can
open up the black box and understand what's inside, it needs to be
evaluated not on what you think it's doing but on what it's
actually doing at the black box level. If there's one thing I try
to teach to students, it's to engage with the unknown without fear,
so that they can think critically about new approaches.

---

Well, that's it for now. I'd be interested in hearing about what
other people think I've missed. And, while I'm at it, a hat tip to
Erich Schwarz, Greg Wilson, and Adina Howe for reading and commenting
on a draft.

A few people have recently asked me what this "Big Data" thing is in
biology. It's not an easy question to answer, though, because
biology's a bit peculiar, and a lot of Big Data researchers are not
working in bio. While I was thinking about this I kept on coming up
with anecdotes -- and, well, that turned into this: the Top 12 Reasons
You Know You Are a Big Data Biologist.

---

You no longer get enthused by the prospect of more data.

I was at a conference a few months back, and an Brit colleague of mine
rushed up to me and said, "Hey! We just got an Illumina HiSeq! Do you
have anything you want sequenced?" My immediate, visceral response
was "Hell no! We can't even deal with a 10th of the sequence we've
already got! PLEASE don't send me any more! Can I PAY you to not send
me any more?"

I was chatting with a colleague in the hallway here at MSU, pitching
some ideas about microbiome work, and he said, "What a good idea! I
have some samples here that I'd like to sequence that could help with
that." I responded, "How about we stop producing data and think about
the analysis?" He seemed only mildly offended -- I have a rep, you
see -- but biology, as a field, is wholly unprepared to go from
data to an analyses.

My lab has finally turned the corner from "hey, let's generate more
data!" to "cool, data that we can analyze and build hypotheses from!"
-- yeah, we're probably late bloomers, but after about 3 years of tech
development, there's very little we can't do at a basic sequence
level. Analysis at that level is no longer the bottleneck. Next
step? Trying to do some actual biology! But we are working in a few
very specific areas, and I think the whole field needs to undergo this
kind of shift.

You've finally learned that 'for' loops aren't all they're cracked up to be.

This was one of my early lessons. Someone had dumped a few 10s of
millions of reads on me -- a mere gigabyte or so of data -- and I was
looking for patterns in the reads. I sat down to write my usual
opening lines of Python to look at the data: "for sequence in
dataset:" and ... just stopped. The data was just too big to do
iteration in a scripting language with it! Cleverness of some sort
needed to be applied.

Let's face it, if you've just spent a few $k generating dozens of
gigabytes amounts of hypothesis-neutral data on gene expression from
(say) chick, the end goal of generating a list of genes that are
differentially regulated is just not that exciting. (Frankly, you
probably could have done it with qPCR, if you didn't want that cachet
of "mRNAseq!") What more can you do with the data?

(This point comes courtesy of Jim Tiedje, who says (I paraphrase):
"The problem for assistant professors these days is that many of may
not be thinking big enough." Also see: Don't be afraid of failure:
really go for it)

The biggest training challenge (in my opinion) is going to be training
people in how to push past the obvious analysis of the data and go for
deeper integrative insight. This will require training not just in
biology but in data analysis, computational thinking, significant
amounts of informed skepticism, etc. (See our course.) I think about it
like this: generating hypotheses from large amounts of data isn't that
interesting -- I can do that with publicly available data sets without
spending any money! Constraining the space of hypotheses with big
data sets is far more interesting, because it gives you the space of
hypotheses that aren't ruled out; and its putting your data to good
use. I'll, uh, let you know if I ever figure out how to do this
myself...

I think there are plenty of people that can learn to do this, but as
Greg Wilson correctly points out, there has to be a tradeoff: what
do you take out of existing training curricula, to be replaced with
training in data analysis? I wish I knew.

Transferring data around is getting more than a bit tedious.

For some recent papers, I had to copy some big files from EC2 over to
my lab computer, and from there to our HPC system. It was Slow, in
the "time for lunch" sense. And these were small test data sets,
compressed. Transferring our big data sets around is getting tedious.
Luckily we have a lot of them, so I can usually work on analysis of
one while another is transferring.

5. Your sysadmin/HPC administrator yells at you on a regular basis about your
disk usage.

We regularly get nastygrams from our local sysadmins accusing of using
too many terabytes of disk space. This is in contrast to the good ol'
days of physics (which is where I got my sysadmin chops), where your
sysadmin would yell at you for using too much CPU...

You regularly crash big memory machines.

My favorite quote of the year so far is from the GAGE paper), in
which Salzberg et al (2012) say "For larger genomes, the choice of
assemblers is often limited to those that will run without crashing."
Basically, they took a reasonably big computer, threw some big data at
various assembly packages, and watched the computer melt down. Repeatedly.

Someone recently sent me an e-mail saying "hey, we did it! we took 3
Gb of sequence data from soil and assembled it in only 1 week in 3 TB
of RAM!" Pretty cool stuff -- but consider that at least four to six
runs would need to be done (parameter sweep!), and it takes only about
1 week and $10k to generate twice that data. In the long run, this
does not seem cost effective. (It currently takes us 1-2 months to
analyze this data in 300 GB of RAM. I'm not saying we have the
answer. ;)

Throwing away data looks better and better.

I made a kind of offhanded comment to a NY Times reporter once (hint: don't do
this) about how at some point we're going to need to start throwing away data.
He put it as the ultimate quote in his article. People laughed at me for it.
(BUT I WAS RIGHT! HISTORY WILL SHOW!)

But seriously, if someone came up to you and said "we can get rid of
90% of your data for you and give you an answer that's just as good",
many biologists would have an instant negative response. But I think
there's a ground truth in there: a lot of Big Data is noise. If you
can figure out how to get rid of it... why wouldn't you? This is an
interesting shift in thinking from the "every data point is precious
and special" that you adopt when it takes you a !#%!#$ week to
generate each data point.

I attended a talk that David Haussler gave at Caltech recently. He
was talking about how eventually we would need to resequence millions
of individual cancer cells to look for linked sets of mutations. At
50-300 Gb of sequence per cell, that's a lot of data. But most of
that data is going to be uninteresting -- wouldn't it be great if we
could pick out the interesting people and then throw the rest away?
It would certainly help with data archiving and analysis...

True story: a Big Computer Company called up our local HPC to ask why
we were buying so many bigmem machines. They said "It's the damned
biologists -- they keep on wanting more memory. Why? We don't know -
we suspect the software's badly written, but can't tell. Why don't
you talk to Titus? He pretends to understand this stuff." I don't
think it's weird to get calls trying to sell me stuff -- but it is a
bit weird to have our local HPC's buying habits be so out of
character, due to work that I and others are doing, that Big Computer
Companies notice.

(Note: the software's not mine, and it's not badly written, either.)

9. Your choice must increasingly be "improve algorithms" rather than "buy
bigger computers"

I've been banging this drum for a while. Sequencing
capacity is outpacing Moore's Law, and so we need to rethink
algorithms. An algorithm that was nlogn used to be good enough; now,
if analysis requires a supra-linear algorithm, we need to figure out
how to make it linear. (Sublinear would be better.)

Anecdote: we developed a nifty data structure for attacking metagenome assembly
(see: http://arxiv.org/abs/1112.4193). It scaled (scales) assembly by a factor
of about 20x, which got us pretty excited -- that meant we could in theory
assemble things like MetaHIT and rumen on commodity hardware without doing
abundance filtering. Literally the day that we told our collaborators we
had it working, they dumped 10x more data on us and told that they could send
us more any time we wanted. (...and that, boys and girls, was our introduction
to the HiSeq!) 20x wasn't enough. Sigh.

The MG-RAST folk have told me a similar story. They did some
awesomely cool engineering and got their pipeline running about 100x
faster. That'll hold them for a year or so against the tidalwave of
data.

Corollary: don't waste your time with 2% improvements in sensitivity and
specificity unless you also deliver 10x in compute performance.

10. You spend a lot of time worrying about biased noise, cross-validation, and
the incorrect statistical models used.

We were delayed in some of our research by about a year, because of
some systematic biases being placed in our sequencing data by
Illumina. Figuring out that these non-biological features were there
took about two months; figuring out how to remove them robustly took
another 6 months; and then making sure that removing didn't screw up
the actual biological signal took another four months.

This is a fairly typical story from people who do a lot of data analysis.
We developed a variety of cross-validation techniques and ways of intuiting
whether or not something was "real" or "noise", and we spent a certain
amount of time discussing what statistical approaches to use to assess
significance. In the end we more or less gave up and pointed out that
on simulated data what we were doing didn't screw things up.

Silicon Valley wants to hire your students to go work on non-biology problems.

Hey, it's all Big Data, right?

---

So: what is Big Data in biology?

First, I've talked mostly about DNA sequence analysis, because that's
what I work on. But I know that proteomics and image analysis people
are facing similar dilemmas. So it's not just sequence data.

Second, compute technology is constantly improving. So I think we
need moving definitions.

Here are some more serious points that I think bear on what, exactly,
problems for Big Data in biology. (They're not all specific to biology,
but I can defend them on my home ground more easily, you see.)

You have archival issues on a large scale.

You have lots of homogeneously formatted data that probably contains
answers you don't know you're looking for yet, so you need to save it,
metadata it, and catalog it. For a long time.

The rate at which data is arriving is itself increasing.

You aren't just getting one data set. You're getting dozens (or hundreds)
this year. And you'll get more than that next year.

One implication of this is that you'd better have a largely automated
analysis pipeline, or else you will need an increasing number of people
just to work with the data, much less do anything interesting. Another
implication is that software reuse becomes increasingly important: if
you're building custom software for each data set, you will fall behind.
A third implication is that you need a long-term plan for scaling your
compute capacity.

Data structure and algorithm research is increasingly needed.

You cannot rely on many heavyweight iterations over your data, or
simple data structures for lookup: the data is just too big and
existing algorithms are tailored to smaller data. For example, BLAST
works fine for a few gigabytes of data; past that, it becomes
prohibitively slow.

Infrastructure designers are needed.

Issues of straightforward data transfer, network partitioning, and
bus bandwidth start to come to the forefront. Bottleneck analysis
needs to be done regularly. In the past, you could get away with
"good enough", but as throughput continues to increase, bottlenecks
will need to be tackled on a regular basis. For this, you need
a person who is immersed in your problems on a regular basis; they
are hard to find and hard to keep.

One interesting implication here is for cloud computing, where
smart people set up a menu of infrastructure options and you can
tailor your software to those options. So far I like the idea,
but I'm told by aficionados that (for example) Amazon still falls
short.

You have specialized hardware needs.

Sort of a corollary of the above: what kind of analyses do you need to
do? And what's the hardware bottleneck? That's where you'll get the
most benefit from focused hardware investment.

Hardware, infrastructure design, and algorithms all need to work together.

Again, a corollary of the above, but: if your bottleneck is memory, focus
on memory improvements. If your bottleneck is disk I/O, focus on hardware
speed and caching. If your bottleneck is data transfer, try to bring your
compute to your data.

Software needs to change to be reusable and portable.

Robust, reusable software platforms are needed, with good execution
guarantees; that way you have a foundation to build on. This software
needs to be flexible (practically speaking, scriptable in a language
like Python or Ruby or Perl), well developed and tested, and should
fade into the background so that you can focus on more interesting
things like your actual analysis. It should also be portable so that
you can "scale out" -- bring the compute to your data, rather than
vice versa. This is where Hadoop and Pig and other such approaches fit
now, and where we seriously need to build software infrastructure in
biology.

Analysis thinking needs to change.

Comprehensively analyzing your data sets is tough when your data sets
are really big and noisy. Extracting significant signals from them is
potentially much easier, and some approaches and algorithms for doing
this in biology exist or are being developed (see especially
Lior Pachter's eXpress).
But this is a real shift in algorithmic thinking, and it's also a real
shift in scientific thinking, because you're no longer trying do
understand the entire data set -- you're trying to focus on the bits
that might be interesting.

Analyses are increasingly integrative.

It's hard to make sense of lots of data on its own: you need to link it in
to other data sets. Data standards and software interoperability and
"standard" software pipelines are incredibly important for doing this.

The interesting problems are still discipline-specific.

There are many people working on Big Data, and there is big business
in generic solutions. There's lots of Open Source stuff going on,
too. Don't reinvent those wheels; figure out how to connect them to
your biology, and then focus on the bits that are interesting to you
and important for your science.

11. New machine learning, data mining, and statistical models need to be
developed for data-intensive biological science.

As data volume increases, and integrative hypothesis development
proceeds, we need to figure out how to assess the quality and
significance of hypotheses. Right now, a lot of people throw their
data at several programs, pick their favorite answer, and then recite
the result as if it's correct. Since often they will have tried out
many programs, this presents an obvious multiple testing problem.
And, since users are generally putting in data that violates
one or more of the precepts of the program developers, the results
may not be applicable.

A lack of fear of computational approaches is a competitive advantage.

The ability to approach computational analyses as just another black
box upon which controls must be placed is essential. Even if you can
open up the black box and understand what's inside, it needs to be
evaluated not on what you think it's doing but on what it's
actually doing at the black box level. If there's one thing I try
to teach to students, it's to engage with the unknown without fear,
so that they can think critically about new approaches.

---

Well, that's it for now. I'd be interested in hearing about what
other people think I've missed. And, while I'm at it, a hat tip to
Erich Schwarz, Greg Wilson, and Adina Howe for reading and commenting
on a draft.

I'm writing this on my way back from Stockholm, where I attended a
workshop on the 4th Paradigm. This is the idea (so named by Jim
Gray, I gather?) that data-intensive science is a distinct paradigm
from the first three paradigms of scientific investigation -- theory,
experiment, and simulation. I was invited to attend as a token
biologist -- someone in biology who works with large scale data, and
thinks about large scale data, but isn't necessarily only devoted to
dealing with large scale data.

The workshop was pretty interesting. It was run by Microsoft, who
invited me & paid my way out there. The idea was to identify areas of
opportunity in which Microsoft could make investments to help out
scientists and develop new perspectives on the future of eScience. To
do that, we played a game that I'll call the "anchor game", where we
divvied up into groups to discuss the blocks to our work that stemmed
from algorithms and tools, data, process and workflows,
social/organizational aspects. In each group we put together sticky
notes with our "complaints" and then ranked them by how big of an
anchor they were on us -- "deep" sticky notes held us back more than
shallow sticky notes. We then reorganized by discipline, and put
together an end-to-end workflow that we hoped in 5 years would be
possible, and then finally we looked for short- and medium-term
projects that would help get us there.

The big surprise for me in all of this was that it turns out I'm most
interested in workflows and process! All of my sticky notes had the
same theme: it wasn't tools, or data management, or social aspects
that were causing me problems, but rather the development of
appropriate workflows for scientific investigation. Very weird, and
not what I would have predicted from the outset.

Two questions came up for me during the workshop:

Why don't people use workflows in bioinformatics?

The first question that comes to mind is, why doesn't anyone I know
use a formal workflow engine in bioinformatics? I know they exist
(Taverna, for one, has a bunch of bioinformatics workflows); I'm
reasonably sure they would be useful; but there seems to be some
block against using them!

What would the ideal workflow situation be?

During the anchor game, our group (which consisted of two biologists,
myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few
computer scientists, including Alex Wade) came up with an idea for a
specific tool. The tool would be a bridge between Datascope for
biologists and a workflow engine. The essential idea is to combine
data exploration with audit trail recording, which could then be
hardened into a workflow template and re-used.

---

Thinking about the process I usually use when working on a new
problem, it tends to consist of all these activites mixed together:

evaluation of data quality, and application of "computational" controls

exploration of various data manipulation steps, looking for statistical signal

solidifying of various subcomponents of the data manipulation steps into scripted actions

deployment of the entire thing on multiple data sets

Now, I'm never quite done -- new data types and questions always come
up, and there are always components to tweak. But portions of the
workflow become pretty solid by the time I'm halfway done with the
initial project, and the evaluation of data quality accretes more
steps but rarely loses one. So it could be quite useful to be able to
take a step back and say, "yeah, these steps? wrap 'em up and put 'em
into a workflow, I'm done." Except that I also want to be able to
edit and change them in the future. And I'd like to be able to post
the results along with the precise steps taken to generate them. And
(as long as I'm greedy) I would like to work at the command line,
while I know that others would like to be able to work in a notebook
or graphical format. And I'd like to be able to "scale out" the
computation by bringing the compute to the data.

For all of this I need three things: I need workflow agility, I need
workflow versioning, and I need workflow tracking. And this all
needs to sit on top of a workflow component model that lets me run the
components of the workflow wherever the data is.

I'm guessing no tool out there does this, although I know other people
are thinking this way, so maybe I'm wrong. The Microsoft folk didn't
know of any, though, and they seemed pretty well informed in this area
:).

The devil's choice I personally make in all of this is to go for
workflow agility, and ignore versioning and tracking and the component
model, by scripting the hell out of things. But this is getting old,
and as I get older and have to teach my wicked ways to grad students
and postdocs, the lack of versioning and tracking and easy scaling out
gets more and more obvious. And now that I'm trying to actually teach
computational science to biologists, it's simply appallingly difficult
to convey this stuff in a sensible way. So I'm looking for something
better.

One particularly intriguing type of software I've been looking at
recently is the "interactive Web notebook" --
ipython notebook and
sagenb. These are essentially Mathematica or matlab-style notebook
packages that work with IPython or Sage, scientific computing and
mathematical computing systems in Python. They let you run Python
code interactively, and colocate it with its output (including
graphics); the notebooks can then be saved and loaded and re-run. I'm
thinking about working one or the other into my class, since it would
let me move away from command-line dependence a bit. (Command lines
are great, but throwing them at biologists, along with Python programming,
data analysis, and program execution all together, seems a bit cruel.
And not that productive.)

It would also be great to have cloud-enabled workflow components. As
I embark on more and more sequence analysis, there are only about a
dozen "things" we actually do, but mixed and matched. These things
could be hardened, parameterized into components, and placed behind an
RPC interface that would give us a standard way to execute them.
Combined with a suitable data abstraction layer, I could run the
components from location A on data in location B in a semi-transparent
way, and also record and track their use in a variety of ways. Given
a suitably rich set of components and use cases, I could imagine that
these components and their interactions could be executed from
multiple workflow engines, and with usefully interesting GUIs. I know
Domain Specific Languages are already passe, but a DSL might be a good
way to find the right division between subcomponents.

I'd be interested in hearing about such things that may already exist.
I'm aware of Galaxy, but I think the components in Galaxy are not
quite written at the right level of abstraction for me; Galaxy is also
more focused on the GUI than I want. I don't know anything about
Taverna, so I'm going to look into that a bit more. And, inevitably,
we'll be writing some of our own software in this area, too.

In the area of metagenomics (one of my research focuses at the
moment), it would be great to see img/m, camera, and MG-RAST move towards a "broken out" workflow
that lets semi-sophisticated computational users (hi mom!) run their
stuff on the Amazon Cloud and on private HPCs or clouds. While I
appreciate hosted services, there are many drawbacks to them, and I'd love
to get my hands on the guts of those services. (I'm
sure the MG-RAST folk would like me to note that they are moving
towards making their pipeline more usable outside of Argonne:
so noted.)

I met David De Roure from myExperiment at the MS workshop. To quote,
"myExperiment makes it easy to find, use, and share scientific
workflows and other Research Objects, and to build communities."

David put me in touch with Carole Goble who is involved with
Taverna. Something to look at.

In the cloud computing workshop
I organized at the Planet and Animal Genome conference this January, I
will get a chance to buttonhole one of the Galaxy Cloud developers. I
hope to make the most of this opportunity ;).

It'd be interesting to do some social science research on what
difficulties users encounter when they attempt to use workflow
engines. A lot of money goes into developing them, apparently, but at
least in bioinformatics they are not widely used. Why? This is sort
of in line with Greg Wilson's Software Carpentry and the wonderfully
named blog It will never work in theory: rather than guessing randomly
at what technical directions need to be pursued, why not study it
empirically? It is increasingly obvious to me that improving
computational science productivity has more to do with lowering
learning barriers and changing other societal or cultural issues than
with a simple lack of technology, and figuring out how (and if)
appropriate technology could be integrated with the right incentives
and teaching strategy is pretty important.

--titus

p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the
workshop, and especially for paying my way. 'twas really interesting!

I desperately need something to run and test things at the command
line, both for course documentation (think "doctest" but with shell
prompts) and for script testing (as part of scientific pipelines). At
the 2011 testing-in-python BoF, Augie showed us cram, which is the mercurial project's
internal test code ripped out for the hoi polloi to use.

Step zero: wonder-twin-powers activate a new virtualenv!

% virtualenv e
% . e/bin/activate

Step one: install!

% pip install cram

... that just works -- always a good sign!

OK, let's test the bejeezus out of 'ls'.

% mkdir cramtest
% cd cramtest

Next, I put

$ ls

into a file. Be careful -- you apparently need exactly two spaces before
the $ or it doesn't recognize it like a test.

Now, I run:

% cram ls.t

and I get

.
# Ran 1 tests, 0 skipped, 0 failed.

Awesome! A dot!

The only problem with this is that when I run 'ls' myself, I see:

ls.t ls.t~

Hmm.

As a test of the cram test software, let's modify the file 'ls.t' to contain a
clearly broken test, rather than an empty one:

Hmm, maybe cram is setting up a temp directory? That would explain a lot,
and would also be a very sensible approach. It's not mentioned explicitly
on the front page, but if you read into it a bit, it looks likely. OK.

I love the regexp fix, too; none of this BS that doctest forces upon you.

So, the next question: how do multiple tests work? If you look above,
you can see that it's running all the commands as one test. Logically
you should be able to just separate out the block of text and make it
into multiple tests... let's try adding

and it sees two tests... but, after fixing the expected output using
'cram -i', I only get one test:

.
# Ran 1 tests, 0 skipped, 0 failed.

So it seems like a little internal inconsistency in cram here. Two tests
when something's failing, one test when both are running. No big deal
in the end.

And... I have to admit, that's about all I need for testing/checking
course materials! The cram test format is perfectly compatible with
ReStructuredText, so I can go in and write real documents in it, and
then test them. Command line testing FTW?

And (I just checked) I can even put in Python commands and run doctest
on the same file that cram runs on. Awesome.

Critique:

The requirement for two spaces exactly before the $ was not obvious to
me, nor was the implicit (and silent, even in verbose mode) use of a
temp directory. I wiped out my test file a few times by answering
"yes" to patching, too. What was up with the 'reversed patch' foo??
And of course it'd be nice if the number of dots reflected something
more granular than the number of files run. But heck, it mostly just
works! I didn't even look at the source code at all!

Verdict: a tentative 8/10 on the "Can titus use your testing tool?"
scale.

I'll try using it in anger on a real project next time I need it, and
report back from there.

--titus

p.s. To try out my full cram test from above, grab the file from the
khmer-ngram repo at github; see:

I just parachuted in on (and heli'd out of?) the Beyond the Genome
conference
in Boston. I gave a very brief workshop on using EC2 for sequence
analysis, which seemed well received. (Mind you, virtually everything
possible went wrong, from lack of good network access to lack of
attendee computers to truncated workshop periods due to conference
overrun, but I'm used to the Demo Effect.)

After attending the last bit of the conference, I think that "the
cloud" is actually a really good metaphor for what's happening in
biology these days. We have an immense science-killing asteroid
heading for us (in the form of ridiculously vast amounts of sequence
data, a.k.a. "sequencing bonanza" -- our sequencing capacity is
doubling every 6-10 months), and we're mostly going about our daily
business because we can't see the asteroid -- it's hidden by the
clouds! Hence "cloud computing": computing in the absence of clear
vision.

But no, seriously. Our current software and algorithms simply won't
scale to the data requirements, even on the hardware of the future. A
few years ago I thought that we really just needed better data
management and organization tools. Then Chris Lee pointed out how
much a good algorithm could do -- cnestedlist, in pygr, for doing fast interval queries on extremely
large databases. That solved a lot of problems for me. And then
next-gen sequencing data started hitting me and my lab, and kept on
coming, and coming, and coming... A few fun personal items since my
Death of Sequencing Centers post:

we managed to assemble some 50 Gb of Illumina GA2 metagenomic data
using a novel research approach, and then...

...we received 1.6 Tb of HiSeq data from the Joint Genome
Institute as part of the same Great
Plains soil sequencing project. It's hard not to think that our
collaborators were saying "So, Mr. Smarty Pants -- you can develop
new approaches that work for 100 Gb, eh? Try this on for size!
BWAHAHAHAHAHAHA!"

I've been working for the past two weeks to do a lossy (no, NOT
"lousy") assembly of 1.5 billion reads (100 Gb) of mRNAseq Illumina data
from lamprey, using a derivative research approach to the one above,
and then...

...our collaborators at Mt. Sinai Medical Center told us that that we could expect
200 Gb of lamprey mRNA HiSeq data from their next run.

(See BWAHAHAHAHAHAHAHA, above.)

Honestly, I think most people in genomics, much less biology, don't
appreciate the game-changing nature of the sequencing bonanza. In
particular, I don't think they realize the rate at which it's scaling.
Lincoln Stein had a great slide in his talk at the BTG workshop about
the game-changing nature of next-gen sequencing:

The blue line is hardware capacity, the yellow line is "first-gen"
sequencing (capillary electrophoresis), and the red line is next-gen
sequencing capacity.

It helps if you realize that the Y axis is log scale.

Heh, yeah.

Now, reflect upon this: many sequence analysis algorithms (notably,
assembly, but also multiple sequence alignment and really anything
that doesn't rely on a fixed-size "reference") are supra-linear in
their scaling.

Heh, yeah.

We call this "Big Data", yep yep.

At the cloud computing workshop, I heard someone -- I won't say who,
because even though I picked specifically on them, it's a common
misconception -- compare computational biology to physics. Physicists
and astronomers had to learn how to deal with Big Data, right? So we
can, too! Yeah, but colliders and telescopes are big, and
expensive. Sequencers? Cheap. Almost every research institution I
know has at least one, and often two or three. Every lab I know
either has some Gb-sized data set or is planning to generate 1-40 Gb
within the next year. Take that graph above, and extrapolate to 2013
and beyond. Yeah, that's right -- all your base belong to us,
physical scientists! Current approaches are not going to scale well
for big projects, no matter what custom infrastructure you build or
rent.

The closest thing to this dilemma that I've read about is in climate
modeling (see: Steve Easterbrook's blog, Serendipity). However, I think the
difference with biology is that we're generating new scientific data,
not running modeling programs that generate our data. Having been in
both situations, I can tell you that it's very different when your
data is not something you can decline to measure, or something you can
summarize and digest more intelligently while generating it.

I've also heard people claim that this isn't a big problem compared
to, say, the problems that we face with Big Data on the Internet. I
think the claim at the time was that "in biology, your data is more
structured, and so you haff vays of dealing with it". Poppycock! The
unknown unknowns dominate, everyone: we often don't know what
we're looking for in large-scale biological data. When we do know,
it's a lot easier to devise data analysis strategies; but when we
don't really know, people tend to run a lot of different analyses,
looking looking looking. So in many ways we end up with an added
polynomial-time exploratory computation scaling (trawling through N
databases with M half-arsed algorithms) on top of all the other
"stuff" (Big Data, poorly scaling algorithms).

OK, OK, so the sky is falling. What do we do about it?

I don't see much short-term hope in cross-training more people,
despite my efforts in that direction (see: next-gen course,
and the BEACON course). Training is a
medium-term effort: necessary but not all that helpful in the short
term.

It's not clear that better computational science is a
solution to the sequencing bonanza. Yes, most bioinformatics software
has problems, and I'm sure most analyses are wrong in many ways --
including ours, before you ask. It's a general problem in scientific
computation, and it's aggravated by a lack of training, and we're
working on that, too, with things like Software Carpentry.

I do see two lights at the end of the tunnel, both spurred by our own
research and that of Michael Schatz (and really the whole Salzberg/Pop gang) as well as Narayan Desai's talk at the
BTG workshop.

First, we need to change the way analysis scales -- see esp. Michael
Schatz's work on assembly in the cloud, and (soon, hopefully)
our own work on scaling metagenomic and mRNAseq assembly. Michael's
code isn't available (tsk tsk) and ours is available but isn't
documented, published, or easy to use yet, but we can now do "exact"
assemblies of 100 Gb of metagenomic, and we're moving towards
nearly-exact assemblies of arbitrarily large RNAseq and metagenomic
data sets. (Yes, "arbitrary". Take THAT, JGI.)

We will have to do this kind of algorithmic scaling on a case-by-case
basis, however. I'm focused on certain kinds of sequence analysis,
personally, but there's a huge world of problems out there that will
need constant attention to scale them in the face of the new data.
And right now, I don't see too many CSE people focused on this, because
they don't see the need to scale to Tb.

Second, Big Data and cloud computing are, combined, going to dynamite
the traditional HPC model and make it clear that our only
hope is to work smarter and develop better algorithms, in
combination with scaling compute power. How so?

As Narayan has eloquently argued many times, it no longer makes sense
for most institutions to run their own HPC, if you take into account
the true costs of power, AC, and hardware. The only reason it looks
like HPCs work well is because of the way institutions play games with
funny money (a.k.a. "overhead charges"), channeling
it to HPC behind the scenes - often with much politicking involved.
If, as a scientist, your compute is "free" or even heavily subsidized,
you tend not to think much about it. But now that we have to scale
those clusters 10s or 100s or 1000s of X, to deal with data 100s or
1e6s of times as big, institutions will no longer be able to afford to
build their own clusters with funny money. And they'll have to charge
scientists for the true computational cost of their work -- or
scientists will have to use the cloud. Either way, people will be
exposed to how much it really costs to run, say, BLAST against
100,000,000 short reads. And soon thereafter they'll stop doing such
foolish things.

In turn, this will lead to a significant effort to make better use of
hardware, either by building better algorithms or asking questions
more cleverly. (Goodbye, BLAST!) It hurts me to say that, because
I'm not algorithmically focused by nature; but if you want to know the
answer to a biological question, and you have the data, but existing
approaches can't handle it within your budget... what else are you
going to do but try to ask, or answer, the question more cleverly?
Narayan said something like "we'll have to start asking if $150/BLAST
is a good deal or not" which, properly interpreted, makes the point
well: it's a great deal if you have $1000 and only one BLAST to do,
but what if you have 500 BLASTs? And didn't budget for it?

Fast, cheap, good. Choose two.

Better algorithms and more of a focus on their importance (due to the
exposure of true costs) are two necessary components to solving this
problem, and there are increasingly many people working on them.
So I think there are these two lights at the end of the tunnel for the
Big Data-in-biology challenges. And probably there are some others
that I'm missing. Although, of course, these two lights at the end of
tunnel may be train headlights, but we can hope, right?

--titus

p.s. Chris Dagdigian from BioTeam gave
an absolutely awesome talk on many of these issues, too. Although he
seems more optimistic than I am, possibly because he's paid by the hour
:).