Abstract

The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades) and is distributed as open source software.

Standard and multisized de Bruijn graph. A circular Genome CATCAGATAGGA is covered by a set Reads consisting of nine 4-mers, {ACAT, CATC, ATCA, TCAG, CAGA, AGAT, GATA, TAGG, GGAC}. Three out of 12 possible 4-mers from Genome are missing from Reads (namely {ATAG, AGGA, GACA}), but all 3-mers from Genome are present in Reads. (A) The outside circle shows a separate black edge for each 3-mer from Reads. Dotted red lines indicate vertices that will be glued. The inner circle shows the result of applying some of the glues. (B) The graph DB(Reads, 3) resulting from all the glues is tangled. The three h-paths of length 2 in this graph (shown in blue) correspond to h-reads ATAG, AGGA, and GACA. Thus Reads_{3,4} contains all 4-mers from Genome. (C) The outside circle shows a separate edge for each of the nine 4-mer reads. The next inner circle shows the graph DB(Reads, 4), and the innermost circle represents the Genome. The graph DB(Reads, 4) is fragmented into 3 connected components. (D) The multisized de Bruijn graph DB(Reads, 3, 4).
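
The counts in this caption can be checked with a short script. The following is a minimal sketch, not the SPAdes implementation: it assumes DB(Reads, k) has one edge per distinct k-mer occurring in Reads and one vertex per (k-1)-mer, and all function names are hypothetical.

```python
GENOME = "CATCAGATAGGA"
READS = ["ACAT", "CATC", "ATCA", "TCAG", "CAGA", "AGAT", "GATA", "TAGG", "GGAC"]

def circular_kmers(genome, k):
    """All k-mers of a circular genome."""
    doubled = genome + genome[:k - 1]
    return {doubled[i:i + k] for i in range(len(genome))}

def read_kmers(reads, k):
    """All distinct k-mers occurring in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def de_bruijn_edges(reads, k):
    """Edges of DB(reads, k): each k-mer joins its prefix and suffix (k-1)-mers."""
    return sorted((kmer[:-1], kmer[1:]) for kmer in read_kmers(reads, k))

def count_components(edges):
    """Number of connected components, ignoring edge direction (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in list(parent)})

missing_4mers = circular_kmers(GENOME, 4) - read_kmers(READS, 4)
components_k4 = count_components(de_bruijn_edges(READS, 4))
components_k3 = count_components(de_bruijn_edges(READS, 3))
```

Under these assumptions, `missing_4mers` contains exactly ATAG, AGGA, and GACA, DB(Reads, 4) splits into 3 components, and DB(Reads, 3) stays connected, which is the trade-off between fragmentation and tangling that the multisized graph is meant to balance.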

Stage 2 of SPAdes. (A) Bireads are decomposed into pairs of k-mers with estimated genomic distances (B-transformation). These are tabulated into histograms of estimated genomic distances between pairs of h-edges (H-transformation), and peaks in the histograms and paths in the graph are used to reveal the actual genomic distances between h-edges (A-transformation). This may be converted back to genomic distances between k-mers on pairs of h-paths (E-transformation, used for presentation purposes but not needed in the implementation). (B) The h-biedge histogram (α|β,*) corresponding to the exact h-biedge (α|β, 72163) in the assembly graph. path(α) is an h-path (condensed edge representing 72049 edges) in the upper right, and path(β) is an h-path (representing 46097 edges) at the lower left. The histogram collects all distance estimates between α and β derived from bireads. The h-biedge histogram was smoothed using the Fast Fourier Transform (red curve). The peak in the smoothed histogram (marked red) closely approximates the actual distance (marked blue). (C) The h-biedge histogram (α|β,*) estimates the distance between h-edges α and β (|path(α)| = 46054, |path(β)| = 72). Because of the directed cycle formed by the two h-paths of lengths 72 and 13, there may be multiple walks through the graph between α and β. The h-biedge histogram has been divided into clusters with centers at 46060 and 46145. Thus SPAdes transforms the entire histogram into two h-biedges: (α|β, 46054) and (α|β, 46139).
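
The arithmetic of panel (C) can be illustrated as follows. This is a sketch under stated assumptions rather than the method used in SPAdes: the distance estimates are synthetic values invented for illustration, clusters are formed by a simple gap threshold instead of FFT smoothing, and the candidate walk lengths are assumed to be 46054 plus a whole number of laps around the cycle of length 72 + 13 = 85. The names `cluster_centers` and `snap_to_walks` are hypothetical.

```python
# Synthetic distance estimates (illustrative only, not data from the paper).
ESTIMATES = [46055, 46058, 46060, 46060, 46062, 46065,
             46141, 46144, 46145, 46146, 46149]

def cluster_centers(estimates, max_gap):
    """Split sorted estimates wherever consecutive values differ by more
    than max_gap, and return the rounded mean of each cluster."""
    ordered = sorted(estimates)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for d in ordered[1:]:
        if d - clusters[-1][-1] > max_gap:
            clusters.append([d])
        else:
            clusters[-1].append(d)
    return [round(sum(c) / len(c)) for c in clusters]

def snap_to_walks(centers, base, cycle, max_laps=5):
    """Replace each cluster center by the nearest distance realizable as a
    walk in the graph: base + n * cycle for n = 0..max_laps (assumed form)."""
    candidates = [base + n * cycle for n in range(max_laps + 1)]
    return [min(candidates, key=lambda c: abs(c - center)) for center in centers]

centers = cluster_centers(ESTIMATES, max_gap=20)
exact = snap_to_walks(centers, base=46054, cycle=72 + 13)
```

With these inputs the cluster centers are 46060 and 46145, and snapping them to the nearest walk lengths yields 46054 and 46139, the two distances reported in the caption. Both pairs differ by 85, the length of the directed cycle.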

Construction of the paired assembly graph for bireads sampled from a circular 24 bp genome Genome = ACGTCAAGTTCTGACGTGGGTTCT (single reads referred to as Reads). The de Bruijn graph DB(Reads, 4) has four hubs (ACG, CGT, GTT, and TCT) (A) and six h-paths P1, ..., P6, with lengths 1, ..., 6 respectively (B). The h-edge of path Pi, denoted αi, is its first edge. The cycle C in DB(Reads, 4) that spells Genome passes through the h-paths in order P1, P6, P2, P4, P1, P5, P2, P3 (P1 and P2 represent repeats). (B) Reads are paired with separation d = 5, yielding estimated distances D between various h-edges αi and αj, denoted as the h-biedge (αi|αj, D). The 13 h-biedges constructed from all bireads are listed in the figure. (C) The rectangle diagram of h-biedge (α6|α2, 6) is a rectangle (R3) with sides P6 and P2 and 45° line segment y = x + (d − D) = x − 1, from (1, 0) to (3, 2). Point (1, 0) is labeled by bivertex (GTC|GTT) formed by vertex 1 (GTC) in path P6 and vertex 0 (GTT) in path P2. Point (3, 2) is labeled by bivertex (CAA|TCT) formed by vertex 3 (CAA) in path P6 and vertex 2 (TCT) in path P2. (D) Vertices to glue together from different rectangle diagrams are indicated by dotted red lines. (E) Rectangles glued into a 24 × 24 grid, yielding a cycle (blue path) through the genome.
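
The hubs, the h-paths, and the diagonal of the rectangle for (α6|α2, 6) can be recomputed from Genome. The sketch below assumes that Reads covers every 4-mer of the circular genome, treats a hub as any vertex whose indegree or outdegree differs from one, and takes the diagonal as y = x + (d − D) with d = 5 and D = 6. In this example the six h-path lengths are all distinct, so the sketch indexes the paths by length, which agrees with the caption's use of P6 and P2. Function names are hypothetical.

```python
GENOME = "ACGTCAAGTTCTGACGTGGGTTCT"

def circular_kmers(genome, k):
    """All distinct k-mers of a circular genome."""
    doubled = genome + genome[:k - 1]
    return {doubled[i:i + k] for i in range(len(genome))}

def build_graph(kmers):
    """Successor sets and indegrees of the de Bruijn graph on (k-1)-mers."""
    succ, indeg = {}, {}
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)
    return succ, indeg

def find_h_paths(succ, hubs):
    """Maximal paths that start and end at hubs with no hub in between."""
    paths = []
    for hub in sorted(hubs):
        for nxt in sorted(succ[hub]):
            path = [hub, nxt]
            while path[-1] not in hubs:
                path.append(next(iter(succ[path[-1]])))
            paths.append(path)
    return paths

def diagonal(path_a, path_b, d, D):
    """Integer points (x, y) with y = x + (d - D) inside the rectangle
    spanned by the vertex indices of path_a and path_b."""
    return [(x, x + d - D) for x in range(len(path_a))
            if 0 <= x + d - D < len(path_b)]

succ, indeg = build_graph(circular_kmers(GENOME, 4))
hubs = {v for v in succ if indeg[v] != 1 or len(succ[v]) != 1}
P = {len(p) - 1: p for p in find_h_paths(succ, hubs)}  # index by edge count
points = diagonal(P[6], P[2], d=5, D=6)
```

Under these assumptions `hubs` is {ACG, CGT, GTT, TCT}, the path lengths are 1 through 6, and `points` runs from (1, 0) to (3, 2), where `P[6][1]` is GTC, `P[2][0]` is GTT, `P[6][3]` is CAA, and `P[2][2]` is TCT.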

Topology of selected features within a de Bruijn graph. The red h-path, P, is the current h-path under consideration for deletion (tip removal, chimeric h-path removal) or projection to another path (bulge corremoval). The blue path(s), Q, are alternative paths. Note that other factors such as lengths and coverage are considered in addition to topology, and that the graphs continue past the regions shown. (A) A potential bulge. Q may contain hubs within it, though P does not. (B) A potential tip; h-path P starts or ends at a vertex of total degree 1 (represented as solid), and there is an alternative h-path Q. (C) A potential chimeric h-path. There must be alternative h-paths Q1, Q2 both for the entrance and the exit to P. (D) The h-path P is a repeat. Note that P starts with a vertex of outdegree one and ends with a vertex of indegree one and has no alternative h-path. These degree conditions differentiate it from (A, B, C).
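
The four topologies can be told apart by degree and reachability tests alone. The following sketch is one possible reading of the caption, not the SPAdes procedure: it works on a condensed graph in which every h-path is a single edge, it ignores length and coverage entirely, and the function names, labels, and toy graphs are all hypothetical.

```python
from collections import Counter

def reachable(edges, source, target, skip):
    """True if target can be reached from source without using edge `skip`."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        for eid, (u, v) in edges.items():
            if eid == skip or u != node:
                continue
            if v == target:
                return True
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def classify(edges, pid):
    """Topology-only label for h-path `pid` in a dict {id: (start, end)}."""
    start, end = edges[pid]
    outdeg = Counter(u for u, _ in edges.values())
    indeg = Counter(v for _, v in edges.values())

    def total(x):
        return indeg[x] + outdeg[x]

    # (B) one end has total degree 1 and an alternative path exists at the other.
    if (total(end) == 1 and outdeg[start] >= 2) or \
       (total(start) == 1 and indeg[end] >= 2):
        return "tip"
    # (A) an alternative path Q joins the same two end vertices.
    if reachable(edges, start, end, skip=pid):
        return "bulge"
    # (C) alternatives exist at both the entrance and the exit of P.
    if outdeg[start] >= 2 and indeg[end] >= 2:
        return "chimeric"
    # (D) outdegree one at the start, indegree one at the end.
    if outdeg[start] == 1 and indeg[end] == 1:
        return "repeat"
    return "other"

# Toy condensed graphs, one per panel.
bulge = {"in": ("s", "a"), "P": ("a", "b"), "Q": ("a", "b"), "out": ("b", "t")}
tip = {"in": ("s", "a"), "Q": ("a", "b"), "P": ("a", "x"), "out": ("b", "t")}
chimeric = {"in1": ("s1", "a"), "Q1": ("a", "t1"),
            "Q2": ("s2", "b"), "out2": ("b", "t2"), "P": ("a", "b")}
repeat = {"in1": ("s1", "a"), "in2": ("s2", "a"), "P": ("a", "b"),
          "out1": ("b", "t1"), "out2": ("b", "t2")}
labels = [classify(g, "P") for g in (bulge, tip, chimeric, repeat)]
```

The order of the tests matters in this sketch: a bulge also has alternatives at both ends of P, so the reachability test must run before the chimeric test. A label here marks only a candidate, since the caption notes that lengths and coverage are considered as well.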