A collection of thoughts, ideas, and opinions independently written by members of the MSU community and curated by MSU Libraries

Estimating (meta)genome size from shotgun data

This is a recipe that provides a time- and memory- efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads provided by you. It does take into account coverage, so you need to specify a desired assembly coverage – we recommend starting with a coverage of 20.

Uses for this recipe include estimating the amount of memory required to achieve an assembly and providing a lower bound for metagenome assembly size and single-copy genome diversity.

This recipe will provide inaccurate estimates on transcriptomes (where splice variants will end up confusing the issue – this looks at single-copy sequence only) or for metagenomes with high levels of strain variation (where the assembler may collapse strain variants that this estimate will split).

Note: at the moment, the khmer script normalize-by-median.py needs to be updated from branch cleanup/normalize_and_saturate to run this code properly. Once we’ve merged it into the master branch and cut a release, we’ll remove this note and simply specify the khmer release required.

Let’s assume you have a simple genome with some 5x repeats, and you’ve done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa, you can get an estimate of the single-copy genome size (here known to be 5500 bp) by running normalize-by-median.py

This is off by about 50% for reasons that we don’t completely understand. Note that you can get more accurate estimates for this data set by increasing C and decreasing k, but 20/20 should work about this well for most data sets. (For an E. coli data set, it returns 6.5 Mbp, which is only about 25% off.)

C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering and the Department of Microbiology and Molecular Genetics. He earned his PhD ('06) in developmental molecular biology from the California Institute of Technology. Brown is director of the laboratory for Genomics, Evolution, and Development (GED) at Michigan State University. He is a member of the Python Software Foundation and an active contributor to the open source software community. His research interests include computational biology, bioinformatics, open source software development, and software engineering.