Abstract

Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best
practices for the community.

Keywords:

DNA sequencing; genome assembly; plant genomics

Review

The plant kingdom is filled with amazing diversity and significance. Plants form the
base of the food chain that provides food for all living organisms, and just 15 crop
plants provide 90% of the world's food intake [1]. Plant species are responsible for maintaining the balance of the carbon cycle [2] and for developing soil and protecting it from erosion [3], and they are promising sources of renewable energy [4]. Plant byproducts are used in many human medicines [5], and plants have been essential model organisms for studying biological systems such
as the role of transposons and epigenetics [6]. For all these reasons and many more, there is great interest in sequencing plant
genomes, but relatively few plant species have been sequenced compared with the hundreds
of thousands of species around the world.

The first free-living organisms were sequenced less than 20 years ago, starting with
simple microbial genomes [7], and increasing in complexity to the first eukaryotic genomes [8], the first multicellular species [9], and then on to plant genomes, including Arabidopsis thaliana (thale cress) [10], Oryza sativa (rice) [11], Carica papaya (papaya) [12] and Zea mays (maize) in 2009 [13], using first-generation capillary sequencing. Since then many others have been sequenced
leveraging second-generation sequencing, including Fragaria vesca (strawberry) [14], Solanum lycopersicum (tomato) [15] and Cajanus cajan (pigeonpea) [16], and dozens more are nearing completion [17]. This increase in sequenced plant genomes has largely been driven by technological
improvements: whereas the first generation of automated DNA sequencing instruments
could sequence thousands of base pairs per day, current state-of-the-art second-generation
sequencing instruments can sequence many billions of bases per day for hundreds or
thousands of dollars per gigabase instead of millions or billions of dollars per gigabase
[18]. These technologies have been applied to study thousands of genomes across the tree
of life, enabling rich annotation of their gene networks [19], the development of comparative genomics approaches to infer evolutionary and domestication
forces [13], the cataloging of genomic markers to optimize plant breeding [20], and numerous other studies that use the genome sequence as the backbone of the analysis
[21].

In contrast to the tremendous advances in throughput, assembling sequencing reads
remains a substantial endeavor, much greater than the sequencing efforts alone would
suggest [22-24]. Large complex plant genomes remain a particularly difficult challenge for de novo assembly for a variety of biological, computational and biomolecular reasons. Plant
genomes can be nearly 100 times larger [25] than the currently sequenced bird [26], fish [27] or mammalian genomes [28]. In addition, they can have much higher ploidy, with polyploidy estimated to occur in up
to 80% of all plant species [29], and higher rates of heterozygosity and repeats [30] than their counterparts in other kingdoms. Furthermore, the gene content in plants
can be very complex, as shown by the presence of large gene families and abundant
pseudogenes with nearly identical sequences derived from recent whole genome duplication
events and transposon activity [13]. Plants tend to have high-copy chloroplast and mitochondrial organelles, which complicate
assembly of their remnants in the nuclear genome and skew coverage levels [12]. Finally, it is often very difficult to extract large quantities of high-quality
DNA from plant material, making it difficult to prepare proper libraries for sequencing.

For all of these reasons, sequencing and de novo assembling a plant genome can create a highly fragmented result. Instead of large
contigs and scaffolds spanning large chromosome regions seen in recent vertebrate
genome assemblies [31], there is a greater chance to assemble the sequencing reads into isolated gene islands
among the background of high copy repeats [13]. Furthermore, the gene sequences may not always be correct, considering that nearly
identical gene families are notoriously difficult to assemble and may collapse into
a mosaic sequence without necessarily representing any member of the family [32]. If the level of fragmentation and mis-assembly is too great, downstream analysis
will be noisy, and could even lead to false conclusions about the biology [33].

Knowing how to assemble these genomes accurately, how to best make use of the potentially
highly fragmented assemblies and how to perform these applications at the lowest cost
are important in today's funding environment. Genome assembly has always been an incremental
process, and there are only a handful of truly finished large genomes today - even
the latest release of the 'finished' human reference genome has millions of unresolved
nucleotides [34]. Therefore, we need to assess when an assembly is good enough to be useful to the
community, and how the agencies can get the most out of the available funding. Finally,
researchers must find ways to stay afloat in this landscape: technology is evolving
so quickly that it is challenging to know what the guidelines for plant assembly
will be in 12 months or beyond. Here we assess the state of the art of de novo assembly, consider what can be expected to develop, and review the best practices for
the plant community.

Assessing the needs

Assembling any genome requires the proper combination of coverage, read length and
read quality [22]. If any of these requirements is not met, then it is a mathematical certainty that the
assembly will be fragmented into many small contigs. The Lander-Waterman model offers
an analytic, if optimistic, prediction on the minimum coverage needed to assemble
large contigs [35]. Using this model, a minimum of 15-fold coverage is required to assemble 100 bp reads
into large contigs. However, once coverage has been adjusted for errors, ploidy,
sequence biases and other complicating factors, the minimum required coverage level
may be much higher and sequencing to at least 100-fold coverage is recommended [31].
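
The Lander-Waterman prediction can be made concrete with a short calculation. The sketch below is illustrative only: it uses the standard closed-form expressions of the model for the expected number of contigs and the expected contig length, and the function names and the 20 bp minimum detectable overlap used in the usage note are assumptions chosen for the example, not values taken from this review. Like the model itself, it ignores repeats, sequencing errors and coverage bias.

```python
import math

def expected_contigs(genome_size, read_length, coverage, min_overlap):
    """Expected number of contigs (islands) under the Lander-Waterman model."""
    n_reads = coverage * genome_size / read_length
    # sigma: fraction of a read left after reserving min_overlap bases
    # for detecting an overlap with the next read (assumed parameter).
    sigma = 1.0 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)

def expected_contig_length(read_length, coverage, min_overlap):
    """Expected contig length in bp under the same idealized model."""
    sigma = 1.0 - min_overlap / read_length
    return read_length * ((math.exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma))

if __name__ == "__main__":
    # Compare several coverage levels for 100 bp reads.
    for c in (5, 15, 30):
        print(c, round(expected_contig_length(100, c, 20)))
```

Under these assumptions, moving from 5-fold to 15-fold coverage with 100 bp reads lengthens the predicted contigs by roughly three orders of magnitude, which is the sense in which 15-fold coverage is a theoretical minimum rather than a practical target.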

This statistical model also does not consider repeat composition, and short reads
alone may never have the information content to resolve complex repetitive sequences.
Resolving large or complex repeats fundamentally requires longer spanning information
to bridge across the repeats back to unique sequence in the form of longer reads,
mate-pairs, long-range mapping information or a method for fragment localization [32]. Read quality is also not directly considered in the Lander-Waterman model, but low-quality
reads will reduce effective coverage and obscure true overlaps between sequencing
reads, thus fragmenting the assembly and risking collapsing more repeats.

Overcoming these challenges depends on advances in both sequencing technology and
assembly technology. Sequencing technology needs: (1) instrumentation improvements,
including improvements in throughput, cost, read lengths and accuracy; and (2) molecular
protocols, including developing new types of libraries and also new techniques for
multiplexing samples to take advantage of the tremendous throughput available per
instrument run. Assembly technology needs: (1) improved algorithms for accurately
assembling complex genomes at scale; and (2) improved analytics to record, manipulate,
analyze and visualize features to translate the salient assembly information to the
broader plant biology community.

Sequence technology

The highest capacity sequencing instruments available today, such as the Illumina
HiSeq 2000, can sequence nearly 100 Gbp per day, and make it possible to sequence
a 3 Gbp genome to high coverage for less than US$10,000 [36]. Using these technologies, it is also possible to sequence paired-end or mate libraries
ranging in size up to a few thousand base pairs. As such, even large plant genome
projects can count on relatively inexpensive, deep coverage with approximately 100
bp reads and 1 to 5 kbp mate libraries. However, these short reads and small libraries
have substantial limitations for large genomes with large repetitive content. Constructing
high-quality draft genome assemblies for the largest plant genomes absolutely requires
enhanced sequencing approaches to generate longer reads and mate-pair libraries, and
protocols for localizing the sequencing and assembly problem.

One of the strongest needs is for protocols for efficiently generating a mix of larger
libraries, such as 10 kbp, 40 kbp or 150 kbp in addition to standard 5 kbp libraries.
Currently available protocols for these larger sizes, such as with fosmids [37], or bacterial artificial chromosome (BAC)-end sequencing [38], are effective but are laborious, costly and time consuming relative to the sequencing
itself. Furthermore, the larger libraries inevitably have increased size variance
and less reliable mate information. The sequencing itself needs to be improved to
reduce the biases from GC composition, chimeric reads and mates, and other effects
so that the coverage along the genome will be uniform and complete [39].

One promising approach for substantially longer reads and unbiased coverage is the
rise of third-generation sequencing technologies such as that from Pacific Biosciences
[40] and the newly announced instruments from Oxford Nanopore [41]. These platforms promise to generate longer reads that can be used for sequencing
through complex repeats, link gene islands and phase haplotypes. However, these technologies
are relatively immature for immediate widespread application to all large genomes
of interest. Sequencers from Roche/454 make it possible to sequence approximately
700 bp reads, but at greater cost than short-read sequencing, and even these reads may not be long enough
to span the largest repeats [42].

Optical mapping technologies are another possibility for generating very long range
linking information between sequence contigs and have a successful history in plant
genomics [43,44], although the current worldwide capacity is also below the demand. New technologies
such as nanocoding [45], and new instruments from commercial vendors, including OpGen [46] and BioNanoGenomics [47], are expected in the next couple of years and they could expand the capacity for
optical mapping similar to that seen in sequencing.

A complementary approach to improved sequencing and mapping is to develop methods
for localizing sequencing and thus simplifying the assembly problem. There is a successful
history of BAC-by-BAC sequencing of plant genomes [10,11], and this is effective in the sense that assembling an isolated BAC is far simpler
than assembling the entire genome. However, this technology is now prohibitively expensive
without significant enhancement. For example, sequencing large genomes such as maize
using a BAC-by-BAC approach costs tens of millions of dollars and requires hundreds of thousands
of BAC clones. While next-generation sequencing would certainly reduce this cost,
it is not readily possible to efficiently use next-generation sequencing on the number
of BAC clones needed. This, coupled with the high cost of making and storing the large
numbers of libraries needed, greatly limits the feasibility of BAC-by-BAC sequencing
in the next-generation world.

Versions of BAC-by-BAC sequencing that use pools of BACs or pools of fosmids are an attractive option
for localizing the problem, assuming such libraries can be efficiently made and barcoding
protocols can be effectively applied to tag the molecules [48]. However, to utilize the capacity of current sequencers fully, so many BACs need
to be pooled in a lane that it would not effectively localize the assembly problem
unless the BACs can be multiplexed and barcoded to a very high degree. Furthermore,
preparing and storing these libraries will still incur a substantial cost unless
they can be made in a fully automated fashion. Alternative molecular isolation technologies
that can be used for localizing individual chromosomes in the sample, such as flow
sorting, are promising alternatives and are starting to become more widely available
[49,50].

Assembly technology

Genome assembly has been metaphorically described as the process of assembling a jigsaw
puzzle from the individual reads [22]. In the case of the largest, most repetitive plant genomes, it could be metaphorically
described as assembling a large jigsaw consisting of blue sky separated by nearly
indistinguishable wisps of white clouds of genes - seemingly an impossible task. Assembly
generally follows a hierarchical approach of comparing the individual reads to form
an assembly graph of the overlapping reads or kmers, then simplifying the graph to
form the initial contigs, and finally using mate-pairs and marker information to order
and orient the initial contigs into scaffolds (Figure 1). Assembling a large genome is operationally complicated in that it demands extensive
error correction and filtering, and large computational resources, and is often highly
sensitive to the parameters used. Even beyond these complications, assembly is fundamentally
complicated because repeats introduce ambiguity in how the reads should be ordered
so that no perfect algorithm exists for reconstructing entire genomes even if every
base of the genome has been sequenced to high depth.

Figure 1. Schematic overview of genome assembly. (a) DNA is collected from the biological sample and sequenced. (b) The output from the sequencer consists of many billions of short, unordered DNA fragments
from random positions in the genome. (c) The short fragments are compared with each other to discover how they overlap. (d) The overlap relationships are captured in a large assembly graph shown as nodes representing
kmers or reads, with edges drawn between overlapping kmers or reads. (e) The assembly graph is refined to correct errors and simplify into the initial set
of contigs, shown as large ovals connected by edges. (f) Finally, mates, markers and other long-range information are used to order and orient
the initial contigs into large scaffolds, shown as thin black lines connecting
the initial contigs.
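
Steps (c) to (e) can be illustrated with a toy k-mer graph. The following is a minimal sketch, not the method of any assembler cited here: it assumes error-free reads from a single strand, and the names `build_de_bruijn` and `extend_contig` are hypothetical. Nodes are (k-1)-mers, edges are k-mers observed in the reads, and a contig is extended only while the path through the graph is unambiguous.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Walk forward from `start` while each step has exactly one way in and out."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for nxt in successors:
            in_degree[nxt] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        # Stop at merge points (possible repeats) and at cycles.
        if in_degree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]  # overlapping toy reads
    graph = build_de_bruijn(reads, k=4)
    print(extend_contig(graph, "ATG"))
```

In a repetitive genome the same (k-1)-mer appears in several places, so the walk stops at the resulting branch or merge point. That is the ambiguity the text describes, and it is why mates and other long-range information are needed in step (f).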

Several short-read assembly packages have been proven for mammalian-sized genomes
up to the 3 Gbp human genome, including ABySS [51], ALLPATHS-LG [31], the Celera Assembler [52,53], Newbler [54], SGA [55] and SOAPdenovo [56]. These assemblers can produce high-quality assemblies from short reads, although
they generally require servers or clusters with 512 gigabytes of RAM and many terabytes
of disk space available for a gigabase-sized genome [31]. However, these servers are decreasing in cost and can be purchased for under US$35,000
from several major computer vendors [57], and supercomputing centers make them available without any cost [58]. This is promising, but assembling the largest plant genomes currently being sequenced,
such as the loblolly pine genome of approximately 21 Gbp [59], will increase the computational demands by nearly an order of magnitude, for which
there is no proven technology. Enhanced algorithms for compression and distributing
the computation are actively being researched [55].

Two major efforts to evaluate the state-of-the-art in assembly technology were published
last year: the Assemblathon [24] and the Genome Assembly Gold-Standard Evaluation (GAGE) [23]. Both projects evaluated the performance of various genome assemblers in a competitive
framework with both simulated and real datasets. They showed there were great differences
in the quality of the results depending on the assembler and pipelines used. Researchers
planning to assemble a genome of any size are encouraged to study their results, such
as the needs for error correction, recommended assemblers and evaluation criteria.
However, the genomes studied in these projects were relatively small and simple compared
with the most complex plant genomes. The plant community would be well served by hosting
regular competitions with plant genomes, especially since all of the major assemblers
have been developed targeting vertebrate genomes, and no assembler has been proven
with higher levels of ploidy or heterozygosity.
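
As a concrete example of an evaluation criterion, the sketch below computes N50, a widely used contiguity statistic. The text above does not name specific metrics, so the choice of N50 here is an assumption made for illustration. N50 is the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly size.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # First contig at which at least half of all bases are covered.
        if running * 2 >= total:
            return length
    return 0

if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

N50 measures contiguity only. A mis-joined assembly can have a high N50, so it is best read alongside correctness measures such as those used in the evaluations cited above.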

Related to the de novo assembly problem, research is greatly needed to help improve the representation of
assembled genomes, including creating graph-centric and population-aware formats that
can represent the complexities of plant genomes, particularly those that are only
partially assembled [60-62]. Incremental algorithms that can update the assembly and annotation as new data become
available would also be extremely useful [33]. Finally, continued research into assembly validation is necessary for determining
when an assembly is correct and conclusions can be trusted [32,63].

Analytics

Sequencing and assembling a genome are often just the first stages of a larger study.
Immediately following the assembly, the genome will need to be annotated to catalog
genes and other features of interest [64], or aligned to other genomes to enable comparative genomics studies [65]. Several sequencing-based assays, such as RNA-seq [66] and Methyl-seq [67], can be used with the assembly to study transcriptionally or epigenetically active
regions of the genome, and population studies will often attempt to build higher-order
relationships, such as gene networks, or relate genotype to phenotype.

Currently, pipelines are available for carrying out these operations and displaying
results in a 'genome browser', but continued research is needed to make the pipelines
and results more accessible to different types of users. Systems such as Galaxy [68], Gramene [69] and Drupal [70] are among the leading graphical systems for executing workflows, visualizing sequencing
assay results, and enabling collaborative discussions, respectively, but they operate
as separate systems. A fully integrated system, such as those proposed by the iPlant
[71] and DOE Systems Biology Knowledgebase [72] initiatives, would lower the barrier to learning to operate these functions. In either
case it is critical that the community enhance these systems and the underlying algorithms
to better support the complexity of plant genomes and their evolving assemblies.

Trends and recommendations

The plant kingdom has incredible variation and diversity, and as a result each plant
sequencing project seems to have its own unique analysis needs. Sequencing and assembly
technologies are evolving so rapidly it is impossible to predict what will be available
even one year in the future. Despite these complexities, certain trends are emerging
as best practices.

Mixed library, high-coverage sequencing

For economic and technological reasons, the majority of sequence produced in
the next 18 months will continue to originate from short reads of approximately 100
to 200 bp. Fortunately, sequences of this length can be assembled into high-quality
draft assemblies for genomes as complex as human when sequenced in a mixture of libraries.
In particular, Gnerre et al. [31] recommend 45× paired-end (2 × 100 bp at 180 bp), 45× short jump (2 × 100 bp at 3
kbp), 5× long jump (2 × 100 bp at 6 kbp) and 1× fosmid (2 × 26 bp at 40 kbp) to generate
high-quality draft assemblies. Since the paired-end reads designed in this way overlap
by approximately 20 bp, they can be preassembled into pseudo-long reads of approximately
twice the original length using the built-in capabilities of ALLPATHS-LG [31] or by a standalone preassembler such as FLASH [73]. Assemblers that do not include built-in error correction greatly benefit from
applying software such as Quake [74] to identify and fix sequencing errors before assembly. The larger libraries are then
needed for ordering the initial contigs into progressively larger scaffolds.
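
The library recipe above translates directly into sequencing volumes. The sketch below is a rough planning aid under stated assumptions: it treats each quoted coverage as sequence (base) coverage, and the names `RECIPE`, `pairs_needed`, `pair_overlap` and `plan` are hypothetical. It estimates how many read pairs each library needs for a given genome size, and how many bases the two reads of a pair share when the fragment is shorter than the two reads combined.

```python
import math

# (library name, sequence coverage, read length in bp, fragment size in bp)
RECIPE = [
    ("paired-end", 45, 100, 180),
    ("short jump", 45, 100, 3_000),
    ("long jump", 5, 100, 6_000),
    ("fosmid", 1, 26, 40_000),
]

def pairs_needed(genome_size, coverage, read_length):
    """Read pairs required to reach `coverage` with two reads per pair."""
    return math.ceil(coverage * genome_size / (2 * read_length))

def pair_overlap(read_length, fragment_size):
    """Bases shared by the two reads of a pair (0 if they do not overlap)."""
    return max(0, 2 * read_length - fragment_size)

def plan(genome_size):
    """Read pairs per library for a genome of `genome_size` bp."""
    return {name: pairs_needed(genome_size, cov, rlen)
            for name, cov, rlen, _ in RECIPE}

if __name__ == "__main__":
    for name, pairs in plan(3_000_000_000).items():
        print(f"{name}: {pairs:,} pairs")
```

For the 180 bp paired-end library, `pair_overlap(100, 180)` gives the approximately 20 bp overlap that allows the two reads to be merged into a single pseudo-long read spanning the fragment.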

For the largest and most complex plant genomes, even these libraries may not be sufficient
to span the largest or more complex repeats, and it may be necessary to employ a hybrid
approach using a combination of short and long reads, and even long-range mapping
technologies or localization methods. Long reads over 800 bp are available today from
Roche/454, albeit at higher cost than short read sequencing, and third-generation
sequencing technologies promise to provide even longer reads. As sequencing costs
and instrument runtimes continue to drop, researchers are also advised to sequence
a low coverage 'genome snapshot' to evaluate the genome and library composition before
attempting to sequence the genome to high coverage.

Bioinformatics partnerships

Assembling and analyzing raw sequence data still require substantial bioinformatics
effort and expertise. Before attempting a complex assembly, plant biologists are strongly
encouraged to develop partnerships with bioinformatics laboratories that have sufficient
skills and resources to handle the onslaught of data and diagnose problems as they
occur. Fortunately, the funding agencies are aware of these challenges, and we
hope they will be responsive to requests for appropriate bioinformatics funding.

Bioinformatics laboratories are encouraged to enhance, expand and refine their algorithms
and analytics specifically for the complexities of plant genomes. In particular, because
of high diversity, heterozygosity and ploidy not found in other kingdoms, there is
a strong need to develop a plant-specific genome assembler that can overcome these
challenges and represent the plant genome assemblies in more versatile graph-based
formats along with the supporting tools for analyzing these graphs (Figure 2). Furthermore, the trend in bioinformatics software development is to develop only
enough of a user interface to support the needs of a particular project. If this trend
continues, many groups will reinvent the same software over and over again, wasting
time and resources. Instead, funding agencies would be better served by requiring
software to be developed with a high-quality user-friendly interface or integrated
into a graphical system such as Galaxy, even if it requires modestly more upfront
funding.

Figure 2. Ploidy, heterozygosity and the assembly graph. (a) Schematic representation of a tetraploid genome, such as apple, cotton or cabbage,
consisting of haploid chromosomes A to D with homozygosity/heterozygosity shown as
different colored blocks. (b) Even without repeats or sequencing error, the assembly graph of the homozygous and
heterozygous segments of the genome branches and intertwines in complex patterns. A plant-specific
assembler would need to recognize these branching patterns and attempt to reconstruct
the individual sequences for chromosomes A to D.
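
The branching described in panel (b) can be reproduced on a very small scale. This is a toy sketch under simplifying assumptions (two error-free haplotypes that differ at a single base, no repeats), and the sequences and function names are invented for illustration. It builds a k-mer graph from both haplotypes and reports where the graph opens into a bubble and where it closes again.

```python
from collections import defaultdict

def kmer_graph(sequences, k):
    """Nodes are (k-1)-mers; an edge joins consecutive (k-1)-mers of a sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branch_and_merge_nodes(graph):
    """Return (nodes with several successors, nodes with several predecessors)."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for nxt in successors:
            in_degree[nxt] += 1
    branches = sorted(n for n, s in graph.items() if len(s) > 1)
    merges = sorted(n for n, d in in_degree.items() if d > 1)
    return branches, merges

if __name__ == "__main__":
    hap_a = "ACGTTGCA"  # hypothetical haplotype
    hap_b = "ACGTAGCA"  # same locus, one base changed
    print(branch_and_merge_nodes(kmer_graph([hap_a, hap_b], k=4)))
```

A single heterozygous site produces one bubble. With four haplotypes and many variant sites, bubbles nest and chain, and an assembler has to decide which branches belong to the same chromosome.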

Awareness, training and education

Principal investigators need to become better informed about the current best practices
for genome assembly and develop a better understanding of the effort involved to sequence,
assemble, annotate and analyze a new genome. More classes and training are needed
for graduate and undergraduate students to learn the fundamentals of sequence analysis
and quantitative techniques. Better training is needed to teach non-experts to use
the software packages, and to educate everyone about the resources that are available.
The plant sequencing community would benefit by forming and hosting plant genome analysis
competitions in the spirit of the Assemblathon or GAGE to evaluate the state-of-the-art
for assembly, annotation and other assays. The best practices of today are certain
to change as new sequencing, mapping and computational technologies are introduced,
and this will be the only way to monitor these developments.

Final thoughts

We are still many years away from push-button sequencing and assembly of complex plant
genomes into completely finished genomes at low cost. Nevertheless, it is now possible
and affordable to sequence and assemble great numbers of interesting plant genomes
into highly useful draft genome assemblies if one is mindful of the biotechnology
and algorithmic challenges involved. The next frontier for plant genomics is to characterize
the diversity of genomic variations across large populations, deeply annotate their
functional elements, and develop predictive quantitative models relating genotype
to phenotype. Improved sequencing technology and sequencing assays are certain to
play a large role in these studies as well, and we envision a tight relationship between
biology, biotechnology and analytics for years to come.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

We thank all of the participants of the meeting on the future of plant genome sequencing
and analysis held at the Banbury Conference Center at Cold Spring Harbor in the summer
of 2010. This work was funded, in part, by NSF award IOS-1135736, the US Department
of Energy, Office of Biological and Environmental Research under Contract DE-AC02-06CH11357,
and NIH R01 HG006677-12.