We are pleased to announce the release of DISCOVAR de novo, our new assembler that is suitable for large genomes up to human size. DISCOVAR de novo uses the same cheap data that the original DISCOVAR release does: 250-base paired-end PCR-free Illumina reads. No other libraries are required.

We’ve prepared a short primer on understanding the kmer spectrum plots produced by ALLPATHS-LG. Such plots can prove invaluable when investigating problems, and we recommend that users look at them as a matter of course.
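For readers who have not met a kmer spectrum before, here is a minimal sketch of how one can be computed from a handful of reads. It is an illustration only, not the code ALLPATHS-LG uses: the function name `kmer_spectrum` and the toy reads are our own assumptions, and unlike most real pipelines it does not merge a kmer with its reverse complement.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct kmers seen that many times.

    Hypothetical helper for illustration; kmers containing N are skipped.
    """
    kmer_counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                kmer_counts[kmer] += 1
    # Histogram of the counts: this is what a spectrum plot draws.
    return Counter(kmer_counts.values())


# Toy reads, invented for this example.
reads = ["ACGTACGT", "CGTACGTA", "ACGTTCGT"]
spectrum = kmer_spectrum(reads, k=4)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

Plotting multiplicity on the x-axis against the number of distinct kmers on the y-axis gives the spectrum. On real data, kmers arising from sequencing errors typically pile up at low multiplicity, while genomic kmers form a peak near the sequencing coverage.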

We are contemplating server purchases and would like to get the most bang for our buck. We imagine that some of you are in the same situation. Therefore, to share intelligence, we are creating a table that shows performance stats, along with server configuration information. We are using our new tool DISCOVAR as the basis for this test, but the results should still be of interest to ALLPATHS-LG users.

Please take a look at the current benchmark table, which we will continue to update as we get more results. Better yet, why not participate by benchmarking your systems and sharing the results with us?

DISCOVAR is both a genome assembler and a variant caller. It requires only a single Illumina fragment library to run, leading to cheaper genome assemblies and low-cost variant calls. Currently it can assemble small genomes, but we are working hard to add support for large genomes too. However, it can be used as a highly accurate variant caller on any size of genome, making it particularly valuable for understanding human Mendelian diseases. Find out more on the DISCOVAR blog.

DISCOVAR does not replace ALLPATHS-LG, and indeed DISCOVAR is presently unable to assemble large genomes.

For purposes of assessing our assembly methods, we generated some NA12878 clone reference sequences. We believe that these data will be of interest to the community and have therefore decided to make them available to all. These clone sequences and the raw data used to generate them can be found on our FTP site.

The sequences were obtained by randomly selecting ~100 clones from an NA12878 Fosmid library. Two pools of ~50 each were created, then sequenced by MiSeq (250 bases) and PacBio (~3000 bases). Some jumping-library reads are also included.

We completely assembled 103 clones, without ambiguity, in some cases with manual intervention. Cloning vector has been removed. There are a small number of additional clones in the pools, not included in the assemblies, including a few that had low coverage, some EBV, and some centromeric sequence.

This is version 1.0 of the set. We believe that the error rate on the clones is very low. However, we are carrying out laboratory validation and will roll out updated versions as the results come back.

As of release 44849, GCC 4.7.0 (or higher) is now required to build ALLPATHS-LG.

We have made this transition in order to benefit from the many exciting new features afforded by the C++11 standard. If you are unable to access the latest versions of the compiler at this time, please continue to use earlier releases of ALLPATHS-LG which still support GCC 4.4.0 or higher.
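If you are unsure which side of the cutoff your compiler falls on, a check along the following lines may help. This is a sketch under our own assumptions: the helper names are hypothetical, and the version string is assumed to come from something like `gcc -dumpversion`, which on newer compilers may print only the major version.

```python
MINIMUM_GCC = (4, 7, 0)  # requirement stated above for release 44849 onward


def parse_version(text):
    """Turn '4.7.2' or '13' into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in text.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def meets_minimum(version_text, minimum=MINIMUM_GCC):
    """Return True if the reported version is at least the minimum."""
    return parse_version(version_text) >= minimum


# Example version strings, e.g. as printed by `gcc -dumpversion`.
for reported in ["4.4.7", "4.7", "4.8.5", "13"]:
    status = "ok" if meets_minimum(reported) else "too old for current releases"
    print(reported, status)
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "4.10" as older than "4.7".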

The FASTG Format Specification Working Group is pleased to announce version 1.0 of the FASTG specification.

FASTG is a format for faithfully representing genome assemblies in the face of allelic polymorphism and assembly uncertainty. Currently genome assemblies are represented linearly, as sequences of bases, recorded in FASTA files. Since chromosomes are in fact linear or circular, this makes sense, so long as one has complete knowledge of the genome. However, many genomes contain polymorphisms that cannot be represented in a simple linear sequence, and almost all assemblies contain errors and omissions, which can result in incorrect biological inferences. The FASTG format aims to address this problem using a flexible graph-based approach to encode any variability in the sequence, along with metadata to score and annotate the source of those variations. Assembly graphs in FASTG can be easily translated into linear FASTA sequences to support current analysis tools for read mapping, annotation, visualization, etc., but our hope is to develop a next generation of assembly and genome analysis algorithms that can work with the graph structure directly. For the complete specification and additional information on FASTG, please visit:

The immediate plans are to enlist help to develop a reference library and command line suite for parsing, transforming, and querying assemblies in FASTG format, similar to the widely used SAM/SAMtools suite.
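To make the idea of encoding variability and then flattening it to FASTA concrete, here is a toy sketch. It does not use FASTG syntax and is not part of any reference library: the data structure (a list of segments, each with one or more alternatives) and all function names are assumptions invented for this illustration.

```python
from itertools import product

# Toy assembly: each segment lists its alternative sequences.
# A segment with two entries models a simple SNP "bubble".
assembly = [
    ["ACGTTGCA"],  # unambiguous sequence
    ["A", "G"],    # two alternative alleles
    ["TTGACC"],    # unambiguous sequence
]


def linearize(segments):
    """Flatten to one linear sequence by taking the first alternative each time."""
    return "".join(alternatives[0] for alternatives in segments)


def all_haplotypes(segments):
    """Enumerate every linear sequence the toy graph can spell out."""
    return ["".join(choice) for choice in product(*segments)]


def to_fasta(name, sequence, width=60):
    """Format a single FASTA record with fixed-width sequence lines."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


print(to_fasta("toy_contig", linearize(assembly)), end="")
print(all_haplotypes(assembly))
```

The linear record keeps only one allele, which is exactly the information loss that a graph representation is meant to avoid; enumerating all paths shows what the flattened FASTA leaves out.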