Tuesday, November 25, 2008

Calculating an N50 from Velvet output

In sequencing circles the N50 length is a useful heuristic for judging the quality of an assembly. Here is my definition of N50 length, which you may or may not find intuitive:

N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs

For example's sake, imagine an assembler has created contigs of the following lengths (in descending order):

91 77 70 69 62 56 45 29 16 4

The sum of these is 519bp, so the sum of all contigs equal to or greater than the N50 length must be equal to or greater than 519/2, or 259.5. We can see by brute force that:

91 + 77 = 168
91 + 77 + 70 = 238
91 + 77 + 70 + 69 = 307 (that'll do)

so the N50 for this assembly is 69bp.

Another way to look at this:

at least half the nucleotides in this assembly belong to contigs of size 69bp or longer.
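The brute-force calculation above takes only a few lines of code. Here is a Python sketch (the function name is my own) that walks the contigs from longest to shortest until the running sum crosses the halfway mark:

```python
def n50(lengths):
    """Length of the shortest contig such that contigs of that
    length or longer account for at least half the assembly."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

contigs = [91, 77, 70, 69, 62, 56, 45, 29, 16, 4]
print(n50(contigs))  # 69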

N50 vs N50 length

Technically N50, as opposed to N50 length, refers to the ordinal of the last contig that pushes the sum over the brink - in this example 4 (since 69bp is the 4th largest contig). Unfortunately, a higher N50 implies the opposite of a longer N50 length. Some papers refer to N50 length as L50, while most have simply followed the lazy convention of dropping "length" off of "N50 length". I think it is important to include units with your N50 to minimize confusion.

Contig N50 vs Scaffold N50

Another distinction is often made between contig N50 and scaffold N50. Contigs are "contiguous segments", while scaffolds (aka supercontigs) consist of contigs separated by gaps. Scaffolds are constructed using paired-end information at the read level and, in major sequencing projects, paired BAC ends. Because scaffold sequences are padded with varying quantities of gap-filling N's, the scaffold N50 should not be used on its own as a comparative score of assembly quality.

Velvet, when used with sane expCov settings, is very conservative with regard to scaffolding - so much so that the contigs.fa N50 can be virtually considered a contig N50, as opposed to a scaffold N50. More aggressive programs, such as SOAPdenovo, produce separate contig and scaffold files.

Velvet N50 from stats.txt

Velvet is a popular assembler for short sequences that uses de Bruijn graphs and Eulerian graph theory instead of a repetitive align-consensus-align approach. Although it reports an N50 in the course of assembling, I wanted to derive it from the contigs themselves. These contigs are summarized in a table called stats.txt.

Using R and its cumulative summation function we can easily compute N50.

Strategy: Order the contigs by decreasing size and find the first value for which the cumulative summation is at least half the total sum.

Thanks to Barry Rowlingson for providing this solution

contigs = rev(sort(myStatsTable$lgth))
n50 = contigs[cumsum(contigs) >= sum(contigs)/2][1]
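The same cumulative-sum strategy translates directly to other languages. A Python sketch (assuming the lgth column from stats.txt has already been read into a list; the example values are the ones used earlier):

```python
from itertools import accumulate

lgth = [91, 77, 70, 69, 62, 56, 45, 29, 16, 4]  # lgth column from stats.txt

contigs = sorted(lgth, reverse=True)
# pair each contig with the running total, take the first one
# whose cumulative sum reaches half the total assembly length
n50 = next(c for c, s in zip(contigs, accumulate(contigs))
           if s >= sum(contigs) / 2)
print(n50)  # 69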

Beware the kmer

Note: The length in the stats.txt file is given as length=lgth+kmer-1, where kmer is the kmer value chosen for that assembly. The N50 length given in the Log file also appears to be in kmers. You cannot convert an N50 in kmers to bp by adding kmer-1. The math doesn't work like that - you need to convert each contig to bp before recalculating N50.
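To see why the shift fails, here is a toy Python sketch (the kmer value and contig lengths are invented for illustration): adding kmer-1 to the kmer-space N50 gives a different answer than converting each contig to bp first and recomputing.

```python
def n50(lengths):
    """Shortest length l such that contigs of length >= l
    sum to at least half the total."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

k = 21                                    # made-up kmer value
kmer_lgth = [10, 2, 2, 2, 2, 2]           # made-up lgth column (kmer space)
bp_lgth = [l + k - 1 for l in kmer_lgth]  # length = lgth + kmer - 1, per contig

print(n50(kmer_lgth) + k - 1)  # naive shift of the kmer-space N50: 30
print(n50(bp_lgth))            # N50 recomputed in bp: 22 -- not the same
```

The cumulative sums cross the halfway point at a different contig once every contig has gained kmer-1 bases, which is why each contig must be converted before recalculating.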

Finally, you can calculate N50 from sequences in the contigs.fa, but this file only contains contigs longer than 2-kmers by default. The contigs.fa bp-N50 will sometimes approximate the kmer-N50 in the Log file, but that is not a rule to depend on.

Hmm... Steven Salzberg provides a slightly different definition for contig/scaffold N50 in his paper 'GAGE: A critical evaluation of genome assemblies and assembly algorithms'. It reads, "The N50 value is the size of the smallest contig (or scaffold) such that 50% of the genome is contained in contigs of size N50 or larger." Here, you seem to define it such that it depends on the sum of the contigs/scaffolds as opposed to the length of the genome. Granted, this is for de novo assembly and the length of the genome might not be known. However, the Wikipedia article on 'Genome size' references an article claiming that 1 picogram of DNA is equivalent to 978 Mb.

It seems more appropriate/accurate to base this value on the genome size. No?