Calculating insert stats from BAM files

by Ketil Malde; September 11, 2012

Given a set of reads, the first step of analysis is almost always to align them to a reference. Today, this means running some specialized aligner - I generally use bwa - which results in a BAM file. The next step is (or should be) to check the result, and I wrote a small tool for this purpose.

Basic operation

A pair of reads is classified in one of four classes. First, “innies” have the reads pointing inwards, i.e. the mate of a read is found downstream and on the reverse-complement strand. This is the normal orientation for paired-end reads. “Outies” point outwards, this is the result of mate-pair protocols that circularize and clip the insert before sequencing. Finally, “lefties” are oriented the same way with the first read upstream of the second, and “righties” point the same way, but with the second read upstream.

We support calculating statistics on the distribution of insert lengths (bam stats), a histogram (bam hist), or quantiles (bam quant). It is also possibly to simply dump the contents of the BAM file in SAM format - sorted by orientation of the pairs (bam dump), this is mostly for debugging purposes..

Examples

It is probably just as easy to document this by example, so here goes:

Here we see that most of the mapped reads are “innies”, with a mean insert length of 37K (it’s a fosmid library, so this is as expected), and a very nice normal distribution as seen from the skew and kurtosis columns.

Quantiles

The mean can be deceptive when dealing with skewed distributions, and often it is more useful to use median - and by extension, quantiles.

This is ostensibly a mate-pair library, but as we can see, it’s mostly regular paired end reads with lenghts between 200 and 400bp.

Note that we used -n to limit the number of reads (for expediency), and -b to limit the number of buckets (for clarity of presentation).

Histogram plots

Histograms are good candidates for plotting, and there’s a --plot option that generates output suitable for piping into gnuplot. This option takes an optional string argument which is passed to gnuplot, so you can use it to inject gnuplot commands (e.g. to plot only innies, do --plot='set xrange [0:]; set yrange [0:]'. Or to output to a file, do --plot='set out "file.jpg"; set term png'.) Output can be piped to gnuplot --persist, or redirected to a file.

An example plot for the mate-pair sequences. Note a large fraction of ‘innies’ remain, and that the distribution of ‘outies’ looks to be bimodal.

Performance

Parsing through a large BAM file takes a few minutes, over five on my PC for 200M reads (14Gb of compressed data). You can use the -n option to limit the number of reads processed, this usually gives you a good approximation quickly.

Notes on implementation

Since we’re dealing with potentially large files, everything is calculated by streaming through the data. This way, the memory footprint is low.

For distribution statistics, I’m using the standard trick from SAS’s UniVar method. We first calculate (simultaneously) the number of data points, their sum of x, sum of squares, sum of cubes, and sum of fourths, and at the end, we derive the various measures.

Histograms are of course trivial, using a set of buckets, we just fill them up as we go along.

Quantiles are a bit tricky, there’s a review on WP of how this can be done, but most methods either require storing the data, require multiple passes, or are complicated and slow. In the end, I settled for an approximate variant, based on the histogram code. We use a set of buckets (this time with slightly increasing sizes), and identify the correct bucket for each quantile. We then interpolate to get the quantile value.

Interestingly, I accidentally forgot to limit the number of buckets generated. To my surprise (and then embarrassement at being surprised), it worked nicely. Laziness, of course, allows an infinite sequence of buckets, and as long as you don’t look for quantiles above 100%, this will work.

Availability

As usual, bamstats is on hackage, or you can get a copy with darcs get http://malde.org/~ketil/biohaskell/bamstats (or just browse the source archive). I’ve also mirrored the repository on hub.darcs.net, at the moment this is an experiment, but if it works out well, this could be the new official home.