The generation of short reads by next generation sequencers has led to an increased need to assemble the vast numbers of reads they produce. This is no trivial problem, as the sheer number of reads makes it near impossible to use, for example, the overlap layout consensus (OLC) approach that had been used with longer reads. Therefore, most of the available assemblers that can cope with typical Illumina data use a de Bruijn graph approach based on k-mers.
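As a minimal sketch of the underlying idea (illustrative only, not any particular assembler's implementation), each read is decomposed into k-mers, and the graph connects each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix:

    # Minimal de Bruijn graph sketch: nodes are (k-1)-mers, edges come from k-mers.
    # Illustrative only; real assemblers add error correction, graph simplification, etc.
    from collections import defaultdict

    def de_bruijn_graph(reads, k):
        graph = defaultdict(set)                 # (k-1)-mer -> successor (k-1)-mers
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])   # prefix -> suffix edge
        return graph

    reads = ["ACGTAC", "CGTACG", "GTACGT"]
    for node, succs in sorted(de_bruijn_graph(reads, 4).items()):
        print(node, "->", ", ".join(sorted(succs)))

A contig then corresponds to an unambiguous path through this graph.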

A clear distinction has to be made based on the size of the genome to be assembled.

small (e.g. bacterial genomes: a few megabases)

medium (e.g. lower plant genomes: several hundred megabases)

large (e.g. mammalian and plant genomes: gigabases)

All de novo assemblers will be able to cope with small genomes, and given decent sequencing libraries will produce relatively good results. Even for medium-sized genomes, most de novo assemblers mentioned here, and many others, will likely fare well and produce a decent assembly. That said, OLC-based assemblers might take weeks to assemble a typical genome. Large genomes are still difficult to assemble with only short reads (such as those produced by Illumina instruments). Assembling such a genome with Illumina reads will probably require a machine with about 256 GB, and potentially even 512 GB, of RAM, unless one is willing to use a small cluster (ABySS, Ray, Contrail) or invest in commercial software (CLCbio_Genomics_Workbench).

Like any project, a good de novo assembly starts with proper experimental design. Biological, experimental, technical and computational issues have to be considered:

Biological issues: What is known about the genome?

How big is it? Obviously, bigger genomes will require more material.

How frequent, how long and how conserved are repeat copies? More repetitive genomes will possibly require longer reads or long distance mate-pairs to resolve structure.

How AT rich/poor is it? Genomes which have a strong AT/GC imbalance (either way) are said to have low information content. In other words, spurious sequence similarities will be more frequent.

Is it haploid, diploid, or polyploid? Currently genome assemblers deal best with haploid samples, and some provide a haploid assembly with annotated heterozygous sites. Polyploid genomes (e.g. many plants) are still largely problematic.

Experimental issues: What sample material is available?

Is it possible to extract a lot of DNA? If you have only a small amount of material, you might have to amplify the sample (e.g. using MDA), thus introducing biases.

Does that DNA come from a single cell, a clonal population, or a heterogeneous collection of cells? Diversity in the sample can create more or less noise, which different assemblers handle differently.

Technical issues: What sequencing technologies to use?

How much does each cost?

What is the sequence quality? The greater the noise, the more coverage depth you will need to correct for errors (see the coverage sketch after this list).

How long are the reads? The longer the reads, the more useful they will be to disambiguate repetitive sequence.

Can paired reads be produced cost-effectively and reliably? If so, what is the fragment length? As with long reads, reliable long-distance pairs can help disambiguate repeats and scaffold the assembly.

Can you use a hybrid approach? E.g. short and cheap reads mixed with long expensive ones.
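To put coverage depth in concrete terms, here is a back-of-the-envelope sketch using the classical Lander-Waterman idealization (uniform random sampling; real libraries are biased, so treat the numbers as rough guides):

    # Back-of-the-envelope coverage estimate (Lander-Waterman idealization).
    import math

    def coverage_stats(num_reads, read_len, genome_size):
        depth = num_reads * read_len / genome_size   # mean coverage c = N * L / G
        frac_uncovered = math.exp(-depth)            # P(base uncovered) ~ e^-c
        return depth, frac_uncovered

    # Hypothetical example: 100 million 100 bp reads on a 1 Gb genome.
    c, p = coverage_stats(100e6, 100, 1e9)
    print(f"mean depth {c:.1f}x, ~{p:.2%} of bases expected uncovered")

The noisier the reads, the higher the mean depth needs to be so that sequencing errors can be outvoted at each position.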

Computational issues: What software to run?

How much memory do they require? This criterion can be decisive, because if a computer does not have enough memory, it will either crash or slow down tremendously as it swaps data on and off the hard drive.

How fast are they? This criterion is generally less stringent, since the assembly time is generally a minor part of a complete genome assembly and annotation project. However, some assemblers scale better than others.

Do they require specific hardware? (e.g. large memory machine, or cluster of machines)

How robust are they? Are they prone to crash? Are they well supported?

How easy are they to install and run?

Do they require a special protocol? Can they handle the chosen sequencing technology?

Some steps which are likely common to most assemblies:

If it is within reason and does not tamper with the biology, try to get DNA from haploid or at least mostly homozygous individuals.

Make sure that all libraries are of acceptable quality and that there are no major concerns (e.g. check them with FastQC).

For paired-end data you might also want to estimate the insert size based on draft assemblies or assemblies you have already made (a sketch of one way to do this follows this list).

Before submitting data to a de novo assembler it is often a good idea to clean the data, e.g. to trim away bad bases towards the ends of reads and/or to drop poor reads altogether. As low quality bases are more likely to contain errors, they can complicate the assembly process and lead to higher memory consumption (more is not always better; a toy trimming sketch follows this list). That said, several general purpose short read assemblers such as SOAPdenovo and ALLPATHS-LG can perform read correction prior to assembly.

Before running any large assembly, double and triple check the parameters you feed the assembler.

Post assembly, it is often advisable to check how well your read data really agree with the assembly and whether there are any problematic regions.

If you run de Bruijn graph based assemblers, you will want to try different k-mer sizes. While there is no universal rule of thumb, for error-free reads a smaller k leads to a more tangled graph and a larger k to a less tangled one. However, a smaller k is likely to be more resistant to sequencing errors, whereas a k that is too large may not yield enough edges in the graph and would therefore result in small contigs (a sketch for exploring the k-mer spectrum of your reads follows this list).

For a more detailed discussion, see the chapter dedicated to pre-processing.
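Insert size estimation (referenced above): one simple approach, sketched here under the assumption that the pairs have been mapped back to a draft assembly and that you have a plain-text SAM file, is to collect the template lengths (the TLEN column) of properly paired reads; the file name and cut-offs are hypothetical:

    # Rough insert-size estimate from a SAM file of pairs mapped to a draft assembly.
    # "draft_mapped.sam" and the cut-offs below are hypothetical.
    import statistics

    sizes = []
    with open("draft_mapped.sam") as sam:
        for line in sam:
            if line.startswith("@"):                 # skip header lines
                continue
            fields = line.split("\t")
            flag, tlen = int(fields[1]), abs(int(fields[8]))
            if flag & 0x2 and 0 < tlen < 10000:      # properly paired, sane length
                sizes.append(tlen)

    print("median insert size:", statistics.median(sizes))
    print("stdev:", round(statistics.stdev(sizes), 1))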
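Read trimming (referenced above): a toy illustration of 3' quality trimming; in practice a dedicated trimming tool would normally be used, and the file names and thresholds here are made up:

    # Toy 3'-end quality trimming of FASTQ reads (Phred+33 encoding assumed).
    MIN_QUAL, MIN_LEN = 20, 30                       # illustrative thresholds

    def trim_3prime(seq, qual, min_qual=MIN_QUAL):
        end = len(seq)
        # walk back from the 3' end until a base meets the quality cut-off
        while end > 0 and ord(qual[end - 1]) - 33 < min_qual:
            end -= 1
        return seq[:end], qual[:end]

    with open("reads.fastq") as fin, open("reads.trimmed.fastq", "w") as fout:
        while True:
            header = fin.readline().rstrip()
            if not header:
                break
            seq = fin.readline().rstrip()
            plus = fin.readline().rstrip()
            qual = fin.readline().rstrip()
            seq, qual = trim_3prime(seq, qual)
            if len(seq) >= MIN_LEN:                  # drop reads that became too short
                fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")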
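k-mer exploration (referenced above): a cheap way to get a feel for the k trade-off on your own data is to count distinct k-mers at several values of k; k-mers seen only once are often due to sequencing errors (a diagnostic sketch only, not part of any assembler):

    # Count distinct k-mers at several k to get a feel for the k-mer spectrum.
    from collections import Counter

    def kmer_spectrum(reads, k):
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        singletons = sum(1 for c in counts.values() if c == 1)
        return len(counts), singletons               # distinct k-mers, likely errors

    reads = ["ACGTACGTAC", "CGTACGTACG", "ACGTACGAAC"]   # toy reads
    for k in (4, 6, 8):
        distinct, single = kmer_spectrum(reads, k)
        print(f"k={k}: {distinct} distinct k-mers, {single} seen once")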

Data pre-processing consists of filtering the data to remove errors, thus facilitating the work of the assembler. Although most assemblers have integrated error correction routines, filtering the reads will generally greatly reduce the time and memory required for assembly, and probably improve results too.

Genome assembly consists of taking a collection of sequencing reads, which are much shorter than the actual genome, and creating a genome sequence which is a likely source of all these fragments. What defines a likely genome depends generally on heuristics and the data available. Firstly, by parsimony, the genome must be as short as possible: one could take all the reads and simply produce the concatenation of all their sequences, but this would not be parsimonious. Secondly, the genome must include as much of the input data as possible. Finally, the genome must satisfy as many of the experimental constraints as possible. Typically, paired-end reads are expected to map onto the genome with a given respective orientation and a given distance from each other.
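To make the parsimony point concrete, the toy sketch below contrasts plain concatenation with a greedy merge along the longest suffix/prefix overlaps. This is a didactic caricature, not how production assemblers work:

    # Toy parsimony illustration: greedy merging of overlapping reads
    # versus plain concatenation. Didactic only.
    def overlap(a, b):
        # length of the longest suffix of a that is a prefix of b
        for n in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def greedy_assemble(reads):
        reads = list(reads)
        while len(reads) > 1:
            n, a, b = max(((overlap(a, b), a, b)
                           for a in reads for b in reads if a is not b),
                          key=lambda t: t[0])
            reads.remove(a)
            reads.remove(b)
            reads.append(a + b[n:])                  # merge along the best overlap
        return reads[0]

    reads = ["ACGTTG", "GTTGCA", "TGCAAC"]
    print("concatenation:", "".join(reads), "-", len("".join(reads)), "bp")
    merged = greedy_assemble(reads)
    print("greedy merge: ", merged, "-", len(merged), "bp")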

The output of an assembler is generally decomposed into contigs, or contiguous regions of the genome which are nearly completely resolved, and scaffolds, or sets of contigs which are approximately placed and oriented with respect to each other.
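In FASTA output this distinction typically shows up as runs of N standing in for the estimated gaps between the contigs of a scaffold; a minimal sketch (contig sequences and gap estimates made up for illustration):

    # A scaffold record: contigs joined by runs of N for estimated gaps.
    contigs = ["ACGTACGTGG", "TTGACCA", "GGCATTAC"]  # toy contig sequences
    gap_estimates = [55, 120]                        # estimated gaps between them

    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_estimates):
        parts.append("N" * gap)                      # unresolved gap
        parts.append(contig)

    print(">scaffold_1")
    print("".join(parts))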

There are many assemblers available (See the Wikipedia page on sequence assembly for more details). Tutorials on how to use some of them are below.

ABySS is a de novo assembler which can run on multiple nodes, where it uses the Message Passing Interface (MPI) for communication. As ABySS distributes tasks, the amount of RAM needed per machine is smaller, and thus ABySS is able to cope with large genomes. See here for a tutorial.

Pros

distributed design, so a cluster can be used

a large genome can be assembled with relatively little RAM per compute node; a human genome was assembled on 21 nodes with 16 GB RAM each

ALLPATHS-LG is a novel assembler requiring specially prepared libraries. The authors of the software benchmarked ALLPATHS-LG against SOAPdenovo and reported superior performance. However, it must be noted that they might not have used the SOAPdenovo gap-filling module for one of the data sets due to time constraints; this would probably have improved the contiguous sequence length of the SOAP assembly. In our own hands (usadellab) we have seen similarly good N50 results, and Schneeberger et al. (2011) also reported good N50 values for ALLPATHS-LG Arabidopsis assemblies. Similarly, ALLPATHS-LG was named as performing well in the Assemblathon.

Pros

relatively fast runtime (though slower than SOAPdenovo)

good scaffold length (likely better than SOAPdenovo)

can use long reads (e.g. PacBio), but only for small genomes

Cons

specially tailored libraries are necessary

large genomes (mammalian size) need a lot of RAM; the publication estimates that about 512 GB should be sufficient

Velvet

Pros

can use a reference genome to anchor reads which normally map to repetitive regions (Columbus module)

Cons

Velvet might need large amounts of RAM for large genomes, potentially more than 512 GB for a human genome, if such an assembly is feasible at all. This is based on an approximation formula derived by Simon Gladman from smaller genomes: RAM (in MB) = -109635 + 18977*ReadSize + 86326*GenomeSize (in Mb) + 233353*NumReads (in millions) - 51092*KmerSize.

Newbler is tailored to (mostly) 454 data. Since Ion Torrent PGM data has a similar error profile (predominance of miscalled homopolymer repeats), it may be a good choice there as well. While it can accommodate a limited amount of Illumina data, as has been described here, this is not possible for larger data sets. The fire ant genome project added ~40x Illumina data to ~15x 454 coverage in the form of "fake" 454 reads: the Illumina data were first assembled using SOAPdenovo, the obtained contigs were chopped into overlapping 300 bp reads, and these fake 454 reads were finally input to Newbler alongside the real 454 data.
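A sketch of that contig-chopping trick; the 300 bp window matches the description above, while the step size (and therefore the overlap between pseudo-reads) is a made-up choice:

    # Chop assembled contigs into overlapping 300 bp pseudo-reads, as in the
    # fire ant genome approach described above. The step size is a guess.
    def chop(contig, window=300, step=150):
        if len(contig) <= window:
            return [contig]
        reads = [contig[i:i + window]
                 for i in range(0, len(contig) - window + 1, step)]
        if (len(contig) - window) % step:            # make sure the 3' end is covered
            reads.append(contig[-window:])
        return reads

    contig = "ACGT" * 200                            # 800 bp toy contig
    pseudo_reads = chop(contig)
    print(len(pseudo_reads), "pseudo-reads of", len(pseudo_reads[0]), "bp")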

As Newbler at least partly uses the OLC approach, large assemblies can take time.


Mostly Illumina (or Colorspace)

small genome => MIRA, Velvet

medium genome => no clear recommendation

large genome => assemble Illumina data with ALLPATHS-LG and SOAP, add in other reads or use them for scaffolding

(For large genomes this is based on the fact that not many assemblers can deal with large genomes, and on the Assemblathon outcome. For 454 data this is based on Newbler's good general performance, and on MIRA's different outputs, its versatility, and the theoretical consideration that de Bruijn based approaches might fare worse.)

Post assembly you might want to try the SEQuel software to improve the assembly quality.

I want to start a large genome project for the least cost

Use Illumina reads conforming to the ALLPATHS-LG specification (i.e. overlapping read pairs); the reads will work in e.g. SOAPdenovo as well.

(This recommendation is based on the Assemblathon outcome, the original ALLPATHS publication (Gnerre et al., 2011), as well as a publication that used ALLPATHS for the assembly of Arabidopsis genomes (Schneeberger et al., 2011).)
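As a toy illustration of why overlapping read pairs are so informative (ALLPATHS-LG performs this kind of merging internally; the sketch below is purely didactic and the sequences are made up), two reads from a short fragment can be merged into one longer pseudo-read:

    # Toy merge of an overlapping read pair into one longer fragment.
    def merge_pair(fwd, rev_rc, min_overlap=10):
        # merge fwd with its reverse-complemented mate if their ends overlap
        for n in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
            if fwd[-n:] == rev_rc[:n]:
                return fwd + rev_rc[n:]
        return None                                  # no sufficient overlap found

    fwd = "ACGTACGTTGACCAGGTT"
    rev_rc = "TGACCAGGTTCAATGG"                      # mate, already reverse-complemented
    print(merge_pair(fwd, rev_rc))                   # -> ACGTACGTTGACCAGGTTCAATGG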

Each piece of software has its particular strengths; if you have specific requirements, the results from the Assemblathon will guide you. Another comparison, GAGE, has also released its results (Salzberg et al. 2011). The QUAST tool also exists for assessing genome assembly quality.

Zhang et al., 2011: an in-depth comparison of different genome assemblers on simulated Illumina read data. Unfortunately, only genomes up to medium size were tested. For eukaryotic genomes and short reads SOAPdenovo is suggested; for longer reads, ALLPATHS-LG.

Chapman JA et al. 2011 introduce the new assembler Meraculous and gather literature data on the assembly of E. coli K12 MG1655 for ALLPATHS 2, SOAPdenovo, Velvet, EULER-SR, EULER, Edena, ABySS and SSAKE. ALLPATHS 2 had by far the largest contig and scaffold N50 and, apart from Meraculous, was the only one free of misassemblies. Meraculous was shown to contain no errors at all.

Liu et al., 2011 benchmark their new assembler PASHA against SOAPdenovo (v 1.04), Velvet (1.0.17) and ABySS (1.2.1) using three bacterial data sets. While PASHA usually produced the largest NG50 and NG80 (N50 and N80 calculated with the true genome sizes), SOAPdenovo produced the highest number of contigs and sometimes worse NG50 and NG80 values. However, for one dataset SOAPdenovo showed the best genome coverage.
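Since N50 and NG50 come up repeatedly in these comparisons, here is a small reference implementation of both; the only difference is whether half of the assembly's own size or half of the true genome size is used as the target:

    # N50: length L such that contigs of length >= L cover half the assembly;
    # NG50: the same, but relative to the true genome size.
    def n50(lengths, genome_size=None):
        target = (genome_size or sum(lengths)) / 2
        total = 0
        for length in sorted(lengths, reverse=True):
            total += length
            if total >= target:
                return length

    contig_lengths = [500, 400, 300, 200, 100, 50]        # toy assembly, 1550 bp total
    print("N50 :", n50(contig_lengths))                   # 400 (half of 1550 is 775)
    print("NG50:", n50(contig_lengths, genome_size=2000)) # 300 (half of 2000 is 1000)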

The Assemblathon compares de novo genome assemblies submitted by many different teams, based on a synthetic genome. The Assemblathon 1 competition is now published in Genome Research (Earl et al., 2011).

IGV, the Integrative Genomics Viewer, was developed at the Broad Institute. IGV allows for easy navigation of large-scale genomic datasets, and supports the integration of genomic data types such as aligned sequence reads, mutations, copy number, RNA interference screens, gene expression, methylation, and genomic annotations. Users can zoom into specific areas down to individual base pairs, or more generally scroll through an entire genome.

It can be used to visualize and share whole genomes/reference genomes, alignments, variants, and regions of interest, and to filter, sort and group genomic data.