Eight Papers Evaluating Genome Assembly Software

http://seq.cn/forum.php?mod=viewthread&tid=12235&reltid=13436&pre_thread_id=0&pre_pos=2&ext=

1. Bao, S., et al. (2011). "Evaluation of next-generation sequencing
software in mapping and assembly." J Hum Genet.
Next-generation high-throughput DNA sequencing technologies have advanced progressively
in sequence-based genomic research and novel biological applications
with the promise of sequencing DNA at unprecedented speed. These new
non-Sanger-based technologies feature several advantages when compared
with traditional sequencing methods in terms of higher sequencing speed,
lower per run cost and higher accuracy. However, reads from
next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD
and Illumina/Solexa, are usually short, thereby restricting the
applications of NGS platforms in genome assembly and annotation. We
presented an overview of the challenges that these novel technologies
meet and particularly illustrated various bioinformatics attempts on
mapping and assembly for problem solving. We then compared the
performance of several programs in these two fields, and further
provided advices on selecting suitable tools for specific biological
applications. Journal of Human Genetics advance online publication, 28
April 2011; doi:10.1038/jhg.2011.43.

2. Vezzi, F., et al. (2012). "Reevaluating assembly evaluations with
feature response curves: GAGE and assemblathons." PLoS One 7(12):
e52210.
In just the last decade, a multitude of bio-technologies
and software pipelines have emerged to revolutionize genomics. To
further their central goal, they aim to accelerate and improve the
quality of de novo whole-genome assembly starting from short DNA
sequences/reads. However, the performance of each of these tools is
contingent on the length and quality of the sequencing data, the
structure and complexity of the genome sequence, and the resolution and
quality of long-range information. Furthermore, in the absence of any
metric that captures the most fundamental "features" of a high-quality
assembly, there is no obvious recipe for users to select the most
desirable assembler/assembly. This situation has prompted the scientific
community to rely on crowd-sourcing through international competitions,
such as Assemblathons or GAGE, with the intention of identifying the
best assembler(s) and their features. Somewhat circuitously, the only
available approach to gauge de novo assemblies and assemblers relies
solely on the availability of a high-quality fully assembled reference
genome sequence. Still worse, reference-guided evaluations are often
both difficult to analyze, leading to conclusions that are difficult to
interpret. In this paper, we circumvent many of these issues by relying
upon a tool, dubbed [Formula: see text], which is capable of evaluating
de novo assemblies from the read-layouts even when no reference exists.
We extend the FRCurve approach to cases where lay-out information may
have been obscured, as is true in many deBruijn-graph-based algorithms.
As a by-product, FRCurve now expands its applicability to a much wider
class of assemblers - thus, identifying higher-quality members of this
group, their inter-relations as well as sensitivity to carefully
selected features, with or without the support of a reference sequence
or layout for the reads. The paper concludes by reevaluating several
recently conducted assembly competitions and the datasets that have
resulted from them.

3. Salzberg, S. L., et al. (2012). "GAGE: A critical evaluation of
genome assemblies and assembly algorithms." Genome Res 22(3): 557-567.
New sequencing technology has dramatically altered the landscape
of whole-genome sequencing, allowing scientists to initiate numerous
projects to decode the genomes of previously unsequenced organisms. The
lowest-cost technology can generate deep coverage of most species,
including mammals, in just a few days. The sequence data generated by
one of these projects consist of millions or billions of short DNA
sequences (reads) that range from 50 to 150 nt in length. These
sequences must then be assembled de novo before most genome analyses can
begin. Unfortunately, genome assembly remains a very difficult problem,
made more difficult by shorter reads and unreliable long-range linking
information. In this study, we evaluated several of the leading de novo
assembly algorithms on four different short-read data sets, all
generated by Illumina sequencers. Our results describe the relative
performance of the different assemblers as well as other significant
differences in assembly difficulty that appear to be inherent in the
genomes themselves. Three overarching conclusions are apparent: first,
that data quality, rather than the assembler itself, has a dramatic
effect on the quality of an assembled genome; second, that the degree of
contiguity of an assembly varies enormously among different assemblers
and different genomes; and third, that the correctness of an assembly
also varies widely and is not well correlated with statistics on
contiguity. To enable others to replicate our results, all of our data
and methods are freely available, as are all assemblers used in this
study.

4. Zhang, W., et al. (2011). "A practical comparison of de novo genome
assembly software tools for next-generation sequencing technologies."
PLoS One 6(3): e17915.
The advent of next-generation sequencing technologies is
accompanied with the development of many whole-genome sequence assembly
methods and software, especially for de novo fragment assembly. Due to
the poor knowledge about the applicability and performance of these
software tools, choosing a befitting assembler becomes a tough task.
Here, we provide the information of adaptivity for each program, then
above all, compare the performance of eight distinct tools against eight
groups of simulated datasets from Solexa sequencing platform.
Considering the computational time, maximum random access memory (RAM)
occupancy, assembly accuracy and integrity, our study indicate that
string-based assemblers, overlap-layout-consensus (OLC) assemblers are
well-suited for very short reads and longer reads of small genomes
respectively. For large datasets of more than hundred millions of short
reads, De Bruijn graph-based assemblers would be more appropriate. In
terms of software implementation, string-based assemblers are superior
to graph-based ones, of which SOAPdenovo is complex for the creation of
configuration file. Our comparison study will assist researchers in
selecting a well-suited assembler and offer essential information for
the improvement of existing assemblers or the developing of novel
assemblers.

5. Narzisi, G. and B. Mishra (2011). "Comparing de novo genome assembly: the long and short of it." PLoS One 6(4): e19175.
Recent advances in DNA sequencing technology and their focal
role in Genome Wide Association Studies (GWAS) have rekindled a growing
interest in the whole-genome sequence assembly (WGSA) problem, thereby,
inundating the field with a plethora of new formalizations, algorithms,
heuristics and implementations. And yet, scant attention has been paid
to comparative assessments of these assemblers' quality and accuracy. No
commonly accepted and standardized method for comparison exists yet.
Even worse, widely used metrics to compare the assembled sequences
emphasize only size, poorly capturing the contig quality and accuracy.
This paper addresses these concerns: it highlights common anomalies in
assembly accuracy through a rigorous study of several assemblers,
compared under both standard metrics (N50, coverage, contig sizes, etc.)
as well as a more comprehensive metric (Feature-Response Curves, FRC)
that is introduced here; FRC transparently captures the trade-offs
between contigs' quality against their sizes. For this purpose, most of
the publicly available major sequence assemblers - both for low-coverage
long (Sanger) and high-coverage short (Illumina) reads technologies -
are compared. These assemblers are applied to microbial (Escherichia
coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial
human genome sequences (Chr. Y), using sequence reads of various
read-lengths, coverages, accuracies, and with and without mate-pairs. It
is hoped that, based on these evaluations, computational biologists
will identify innovative sequence assembly paradigms, bioinformaticists
will determine promising approaches for developing "next-generation"
assemblers, and biotechnologists will formulate more meaningful design
desiderata for sequencing technology platforms. A new software tool for
computing the FRC metric has been developed and is available through the
AMOS open-source consortium.

6. Lin, Y., et al. (2011). "Comparative Studies of de novo Assembly
Tools for Next-generation Sequencing Technologies." Bioinformatics.
MOTIVATION: Several new de novo assembly tools have been
developed recently to assemble short sequencing reads generated by
next-generation sequencing platforms. However, the performance of these
tools under various conditions has not been fully investigated, and
sufficient information is not currently available for informed decisions
to be made regarding the tool that would be most likely to produce the
best performance under a specific set of conditions. RESULTS: We studied
and compared the performance of commonly used de novo assembly tools
specifically designed for next-generation sequencing data, including
SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were
compared using several performance criteria, including N50 length,
sequence coverage, and assembly accuracy. Various properties of read
data, including single-end/paired-end, sequence GC content, depth of
coverage and base calling error rates, were investigated for their
effects on the performance of different assembly tools. We also compared
the computation time and memory usage of these seven tools. Based on
the results of our comparison, the relative performance of individual
tools are summarized and tentative guidelines for optimal selection of
different assembly tools, under different conditions, are provided.
CONTACT: hdeng2@tulane.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available.

7. Finotello, F., et al. (2011). "Comparative analysis of algorithms for
whole-genome assembly of pyrosequencing data." Brief Bioinform.
Next-generation sequencing technologies have fostered an
unprecedented proliferation of high-throughput sequencing projects and a
concomitant development of novel algorithms for the assembly of short
reads. In this context, an important issue is the need of a careful
assessment of the accuracy of the assembly process. Here, we review the
efficiency of a panel of assemblers, specifically designed to handle
data from GS FLX 454 platform, on three bacterial data sets with
different characteristics in terms of reads coverage and repeats
content. Our aim is to investigate their strengths and weaknesses in the
reconstruction of the reference genomes. In our benchmarking, we assess
assemblers' performance, quantifying and characterizing assembly gaps
and errors, and evaluating their ability to solve complex genomic
regions containing repeats. The final goal of this analysis is to
highlight pros and cons of each method, in order to provide the final
user with general criteria for the right choice of the appropriate
assembly strategy, depending on the specific needs. A further aspect we
have explored is the relationship between coverage of a sequencing
project and quality of the obtained results. The final outcome suggests
that, for a good tradeoff between costs and results, the planned genome
coverage of an experiment should not exceed 20-30x.

8. Earl, D. A., et al. (2011). "Assemblathon 1: A competitive assessment of de novo short read assembly methods." Genome Res.
Low cost short read sequencing technology has revolutionised
genomics, though it is only just becoming practical for the high quality
de novo assembly of a novel large genome. We describe the Assemblathon 1
competition, which aimed to comprehensively assess the state of the art
in de novo assembly methods when applied to current sequencing
technologies. In a collaborative effort teams were asked to assemble a
simulated Illumina HiSeq dataset of an unknown, simulated diploid
genome. A total of 41 assemblies from 17 different groups were received.
Novel haplotype aware assessments of coverage, contiguity, structure,
base calling and copy number were made. We establish that within this
benchmark (1) it is possible to assemble the genome to a high level of
coverage and accuracy, and that (2) large differences exist between the
assemblies, suggesting room for further improvements in current methods.
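
For readers new to the metrics named in these abstracts, here is a minimal Python sketch of two of them: N50 (mentioned in papers 5 and 6) and depth of coverage (mentioned in papers 6 and 7). It is not taken from any of the papers. The function names `n50` and `coverage_depth` and the example numbers are illustrative assumptions. N50 follows the common definition (the contig length at which the running total of contigs, sorted longest first, reaches half of the assembly size), and coverage uses the usual estimate of reads times read length divided by genome size.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Common definition (assumed here): sort contigs from longest to
    shortest and return the length of the contig at which the running
    total first reaches at least half of the total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def coverage_depth(num_reads, read_length, genome_size):
    """Estimate average depth of coverage as (reads * read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


if __name__ == "__main__":
    # Hypothetical contig lengths, total 300; half of the total is 150.
    example_contigs = [80, 70, 50, 40, 30, 20, 10]
    print("N50:", n50(example_contigs))

    # Hypothetical run: 1,000 reads of 100 bases over a 10,000-base genome.
    print("Coverage:", coverage_depth(1000, 100, 10000))
```

N50 depends only on contig lengths, so a misassembled contig raises it just as much as a correct one. This is the limitation of size-only metrics that the Narzisi and Mishra abstract (paper 5) points out.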