At the workshop, people were still confused. So I
tried something new.

I first made a simulated metagenome by taking three genomes worth of
data from the Chitsaz et al. (2011) paper (see
http://bix.ucsd.edu/projects/singlecell/) and shuffling them together.
I combined the sequences in a ratio of 10:25:50 for the E. coli
sequences, the Staph sequences, and the SAR sequences, respectively;
the latter two were single-cell MDA genomic DNA. I took the first 10m
reads of this mix and then estimated the coverage.

You can see the coverage of these genomic data sets estimated by using
the known reference sequences in the first figure. E. coli looks nice
and Gaussian; Staph is smeared from here to heck; and much of the SAR
sequence is low coverage. This reflects the realities of single cell
sequencing: you get really weird copy number biases out of multiple
displacement amplification.

Then I applied three-pass digital normalization (see the paper) and
plotted the new abundances. As a reminder, this operates without
knowing the reference in advance; we're just using the known reference
here to check the effects.

Coverage of genome read mix, calculated by mapping the mixed reads
onto the known reference genomes.

Coverage post-digital-normalization, again calculated by mapping
the mixed reads onto the known reference genomes.

As you can see, digital normalization literally "normalizes" the data
to the best of its ability. That is, it cannot create higher coverage
where high coverage doesn't exist (for the SAR), but it can convert
the existing high coverage into nice, Gaussian distributions centered
around a much lower number. You also discard quite a bit of data (look
at the X axes -- about 85% of the reads were discarded in downsampling
the coverage like this).

When you assemble this, you get as good or better results than
assembling the unnormalized data, despite having discarded so much
data. This is because no low-coverage data is discarded, so you still
retain as much overall covered bases -- just in fewer reads. To boot,
it works pretty generically for single genomes, MDA genomes,
transcriptomes, and metagenomes.

And, as a reminder? Digital normalization does this in fixed, low
memory; in a single pass; and without any reference sequence needed.