Associate Professor Aaron Darling

Biography

Aaron Darling is an Associate Professor in Computational Genomics and Bioinformatics in the UTS Faculty of Science's ithree institute. He has over a decade of experience developing computational methods for comparative genomics and evolutionary modeling and in 2013 moved from the University of California-Davis to start a computational genomics group at UTS.

Darling embarked on his research career at the University of Wisconsin-Madison. Following a bachelor's degree in Computer Science, he worked with members of the UW-Madison Genome Center to sequence and analyze the first genomes of pathogenic E. coli. During this time Darling led the development of some widely used computational methods for analysing genomic data, including the mpiBLAST open source parallel BLAST software and the Mauve software for comparing multiple genome sequences.

Following the award of a Ph.D. at UW-Madison, Darling received a fellowship from the US National Science Foundation to pursue postdoctoral studies at The University of Queensland. After two years at UQ he then returned to UC Davis to develop a research program in computational metagenomics -- the study of uncultivated microorganisms from the environment using computational methods.

Darling now brings his experience to understand the relationship between humans and microorganisms in collaboration with microbiologists at the ithree institute.

Research Interests

Comparative genomics

Designing and developing scalable computational algorithms to identify the complete set of genetic differences between two or more organisms and relating these differences to aspects of the organism's biology. Associating genomic changes to phenotypic changes.

Computational metagenomics

The vast majority of life on the planet is microbial, and most of it cannot be studied by laboratory cultivation. Metagenomics involves DNA sequencing of microbes taken directly from the environment. Current metagenomic methods require advanced computational, statistical, and machine learning techniques to identify the organisms present in a sample and characterize their potential for encoding functional proteins.

Genome evolution

Life is thought to have existed on earth for at least four billion years. During this time, evolution has shaped the genomes of modern organisms. Using statistical methods such as continuous time Markov chain models we can infer the history of genome evolution that led to modern organisms. I am interested in applying methods from statistical mechanics and financial market modeling to develop scalable computational methods to reconstruct evolutionary histories.
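
As a minimal illustration of a continuous-time Markov chain model of sequence evolution, the sketch below uses the Jukes-Cantor (JC69) substitution model. JC69 is an assumption chosen here for simplicity; it is not named in the text above, and the function names are hypothetical. The first function gives the probability that a site is unchanged or changed after a branch of length d (expected substitutions per site); the second inverts that relationship to estimate d from two aligned sequences.

```python
import math


def jc69_transition_probs(d):
    """Return (p_same, p_diff) under JC69 for branch length d.

    p_same is the probability a site holds the same nucleotide after
    evolving for d expected substitutions per site; p_diff is the
    probability of ending in one specific different nucleotide.
    """
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def jc69_distance(seq_a, seq_b):
    """Estimate the JC69 distance between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    p = mismatches / len(seq_a)
    if p >= 0.75:
        # The correction is undefined once sequences look random.
        raise ValueError("sequences are saturated; distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(jc69_distance("ACGTACGT", "ACGTACGA"))
    print(jc69_transition_probs(0.1))
```

Real analyses of genome evolution typically use richer models and infer whole trees rather than pairwise distances; this sketch only shows the basic Markov-chain calculation.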

Next-generation DNA sequencing

DNA is fundamentally a molecule that encodes digital information. New sequencing technology enables us to read this biological information en masse so that it can be analyzed computationally. I am interested in designing sequencing experiments and protocols in ways that maximize the useful information obtained about a biological system.

Can supervise: Yes

I am actively seeking students with a computational, mathematical, or statistical background to undertake Ph.D. studies and research.

Background: During evolution, large-scale genome rearrangements of chromosomes shuffle the order of homologous genome sequences ('synteny blocks') across species. Some years ago, a controversy erupted in genome rearrangement studies over whether rearrang

We describe an efficient local multiple alignment filtration heuristic for identification of conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simul

We describe a new method for reliably identifying conserved segments among genome sequences that have undergone rearrangement, horizontal transfer, and substantial nucleotide-level divergence. A Gibbs-like sampler explores different combinations of seque

The nature of large-scale evolutionary processes that shape genomes over time fundamentally differs from the forces governing local evolution within individual genes. Large-scale events such as horizontal transfer, genome re-arrangements, gene duplicatio

Chronic rhinosinusitis (CRS) is a common and potentially debilitating disease characterized by inflammation of the sinus mucosa for longer than 12 weeks. Bacterial colonization of the sinuses and its role in the pathogenesis of this disease is an ongoing area of research. Recent advances in culture-independent molecular techniques for bacterial identification have the potential to provide a more accurate and complete assessment of the sinus microbiome; however, there is little concordance in results between studies, possibly due to differences in the sampling location and techniques. This study aimed to determine whether the microbial communities from one sinus could be considered representative of all sinuses, and to examine differences between two commonly used methods for sample collection: swabs and tissue biopsies. High-throughput DNA sequencing of the bacterial 16S rRNA gene was applied to both swab and tissue samples from multiple sinuses of 19 patients undergoing surgery for treatment of CRS. Results from swabs and tissue biopsies showed a high degree of similarity, indicating that swabbing is sufficient to recover the microbial community from the sinuses. Microbial communities from different sinuses within individual patients differed to varying degrees, demonstrating that it is possible for distinct microbiomes to exist simultaneously in different sinuses of the same patient. The sequencing results correlated well with culture-based pathogen identification conducted in parallel, although the culturing missed many species detected by sequencing. This finding has implications for future research into the sinus microbiome, which should take this heterogeneity into account by sampling patients from more than one sinus.

The bacterial 16S rRNA gene has historically been used in defining bacterial taxonomy and phylogeny. However, there are currently no high-throughput methods to sequence full-length 16S rRNA genes present in a sample with precision. We describe a method for sequencing near full-length 16S rRNA gene amplicons using the high throughput Illumina MiSeq platform and test it using DNA from human skin swab samples. Proof of principle of the approach is demonstrated, with the generation of 1,604 sequences greater than 1,300 nt from a single Nano MiSeq run, with accuracy estimated to be 100-fold higher than standard Illumina reads. The reads were chimera filtered using information from a single molecule dual tagging scheme that boosts the signal available for chimera detection. This method could be scaled up to generate many thousands of sequences per MiSeq run and could be applied to other sequencing platforms. This has great potential for populating databases with high quality, near full-length 16S rRNA gene sequences from under-represented taxa and environments and facilitates analyses of microbial communities at higher resolution.

The Metagenomics and Metadesign of the Subways and Urban Biomes (MetaSUB) International Consortium is a novel, interdisciplinary initiative comprised of experts across many fields, including genomics, data analysis, engineering, public health, and architecture. The ultimate goal of the MetaSUB Consortium is to improve city utilization and planning through the detection, measurement, and design of metagenomics within urban environments. Although continual measures occur for temperature, air pressure, weather, and human activity, including longitudinal, cross-kingdom ecosystem dynamics can alter and improve the design of cities. The MetaSUB Consortium is aiding these efforts by developing and testing metagenomic methods and standards, including optimized methods for sample collection, DNA/RNA isolation, taxa characterization, and data visualization. The data produced by the consortium can aid city planners, public health officials, and architectural designers. In addition, the study will continue to lead to the discovery of new species, global maps of antimicrobial resistance (AMR) markers, and novel biosynthetic gene clusters (BGCs). Finally, we note that engineered metagenomic ecosystems can help enable more responsive, safer, and quantified cities.

BACKGROUND: Chromosome conformation capture, coupled with high throughput DNA sequencing in protocols like Hi-C and 3C-seq, has been proposed as a viable means of generating data to resolve the genomes of microorganisms living in naturally occurring environments. Metagenomic Hi-C and 3C-seq datasets have begun to emerge, but the feasibility of resolving genomes when closely related organisms (strain-level diversity) are present in the sample has not yet been systematically characterised. METHODS: We developed a computational simulation pipeline for metagenomic 3C and Hi-C sequencing to evaluate the accuracy of genomic reconstructions at, above, and below an operationally defined species boundary. We simulated datasets and measured accuracy over a wide range of parameters. Five clustering algorithms were evaluated (2 hard, 3 soft) using an adaptation of the extended B-cubed validation measure. RESULTS: When all genomes in a sample are below 95% sequence identity, all of the tested clustering algorithms performed well. When sequence data contains genomes above 95% identity (our operational definition of strain-level diversity), a naive soft-clustering extension of the Louvain method achieves the highest performance. DISCUSSION: Previously, only hard-clustering algorithms have been applied to metagenomic 3C and Hi-C data, yet none of these perform well when strain-level diversity exists in a metagenomic sample. Our simple extension of the Louvain method performed the best in these scenarios, however, accuracy remained well below the levels observed for samples without strain-level diversity. Strain resolution is also highly dependent on the amount of available 3C sequence data, suggesting that depth of sequencing must be carefully considered during experimental design. Finally, there appears to be great scope to improve the accuracy of strain resolution through further algorithm development.

MOTIVATION: Open-source bacterial genome assembly remains inaccessible to many biologists because of its complexity. Few software solutions exist that are capable of automating all steps in the process of de novo genome assembly from Illumina data. RESULTS: A5-miseq can produce high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5-miseq does this by automating the process of adapter trimming, quality filtering, error correction, contig and scaffold generation and detection of misassemblies. Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, use read pairing information during contig generation and includes several improvements to read trimming. Together, these changes result in substantially improved assemblies that recover a more complete set of reference genes than previous methods. AVAILABILITY: A5-miseq is licensed under the GPL open-source license. Source code and precompiled binaries for Mac OS X 10.6+ and Linux 2.6.15+ are available from http://sourceforge.net/projects/ngopt CONTACT: aaron.darling@uts.edu.au SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

BACKGROUND: Enterotoxigenic Escherichia coli (ETEC) are a major economic threat to pig production globally, with serogroups O8, O9, O45, O101, O138, O139, O141, O149 and O157 implicated as the leading diarrhoeal pathogens affecting pigs below four weeks of age. A multiple antimicrobial resistant ETEC O157 (O157 SvETEC) representative of O157 isolates from a pig farm in New South Wales, Australia that experienced repeated bouts of pre- and post-weaning diarrhoea resulting in multiple fatalities was characterized here. Enterohaemorrhagic E. coli (EHEC) O157:H7 cause both sporadic and widespread outbreaks of foodborne disease, predominantly have a ruminant origin and belong to the ST11 clonal complex. Here, for the first time, we conducted comparative genomic analyses of two epidemiologically-unrelated porcine, disease-causing ETEC O157; E. coli O157 SvETEC and E. coli O157:K88 734/3, and examined their phylogenetic relationship with EHEC O157:H7. RESULTS: O157 SvETEC and O157:K88 734/3 belong to a novel sequence type (ST4245) that comprises part of the ST23 complex and are genetically distinct from EHEC O157. Comparative phylogenetic analysis using PhyloSift shows that E. coli O157 SvETEC and E. coli O157:K88 734/3 group into a single clade and are most similar to the extraintestinal avian pathogenic Escherichia coli (APEC) isolate O78 that clusters within the ST23 complex. Genome content was highly similar between E. coli O157 SvETEC, O157:K88 734/3 and APEC O78, with variability predominantly limited to laterally acquired elements, including prophages, plasmids and antimicrobial resistance gene loci. Putative ETEC virulence factors, including the toxins STb and LT and the K88 (F4) adhesin, were conserved between O157 SvETEC and O157:K88 734/3. The O157 SvETEC isolate also encoded the heat stable enterotoxin STa and a second allele of STb, whilst a prophage within O157:K88 734/3 encoded the serum survival gene bor. Both isolates harbor a large repertoire of antibi...

The sequencing, assembly, and basic analysis of microbial genomes, once a painstaking and expensive undertaking, has become much easier for research labs with access to standard molecular biology and computational tools. However, there are a confusing variety of options available for DNA library preparation and sequencing, and inexperience with bioinformatics can pose a significant barrier to entry for many who may be interested in microbial genomics. The objective of the present study was to design, test, troubleshoot, and publish a simple, comprehensive workflow from the collection of an environmental sample (a swab) to a published microbial genome; empowering even a lab or classroom with limited resources and bioinformatics experience to perform it.

We present the draft genome sequences for 26 strains of Porphyromonas (P. canoris, P. gulae, P. cangingivalis, P. macacae, and 7 unidentified) and an unidentified member of the Porphyromonadaceae family. All of these strains were isolated from the canine oral cavity, from dogs with and without early periodontal disease.

Porphyromonads play an important role in human periodontal disease and recently have been shown to be highly prevalent in canine mouths. Porphyromonas cangingivalis is the most prevalent canine oral bacterial species in both plaque from healthy gingiva and plaque from dogs with early periodontitis. The ability of P. cangingivalis to flourish in the different environmental conditions characterized by these two states suggests a degree of metabolic flexibility. To characterize the genes responsible for this, the genomes of 32 isolates (including 18 newly sequenced and assembled) from 18 Porphyromonad species from dogs, humans, and other mammals were compared. Phylogenetic trees inferred using core genes largely matched previous findings; however, comparative genomic analysis identified several genes and pathways relating to heme synthesis that were present in P. cangingivalis but not in other Porphyromonads. Porphyromonas cangingivalis has a complete protoporphyrin IX synthesis pathway potentially allowing it to synthesize its own heme unlike pathogenic Porphyromonads such as Porphyromonas gingivalis that acquire heme predominantly from blood. Other pathway differences such as the ability to synthesize siroheme and vitamin B12 point to enhanced metabolic flexibility for P. cangingivalis, which may underlie its prevalence in the canine oral cavity.

We review currently available technologies for deconvoluting metagenomic data into individual genomes that represent populations, strains, or genotypes present in the community. An evaluation of chromosome conformation capture (3C) and related techniques in the context of metagenomics is presented, using mock microbial communities as a reference. We provide the first independent reproduction of the metagenomic 3C technique described last year, propose some simple improvements to that protocol, and compare the quality of the data with that provided by the more complex Hi-C protocol.

Background Clostridium difficile is the leading cause of infectious diarrhea in humans and responsible for large outbreaks of enteritis in neonatal pigs in both North America and Europe. Disease caused by C. difficile typically occurs during antibiotic therapy and its emergence over the past 40 years is linked with the widespread use of broad-spectrum antibiotics in both human and veterinary medicine. Results We sequenced the genome of Clostridium difficile 5.3 using the Illumina Nextera XT and MiSeq technologies. Assembly of the sequence data reconstructed a 4,009,318 bp genome in 27 scaffolds with an N50 of 786 kbp. The genome has extensive similarity to other sequenced C. difficile genomes, but also has several genes that are potentially related to virulence and pathogenicity that are not present in the reference C. difficile strain. Conclusion Genome sequencing of human and animal isolates is needed to understand the molecular events driving the emergence of C. difficile as a gastrointestinal pathogen of humans and food animals and to better define its zoonotic potential.
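
The N50 statistic quoted above summarizes assembly contiguity: it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly. A minimal sketch of the calculation follows; the function name and the example lengths are hypothetical and are not taken from the C. difficile assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, stopping once the running
    # sum reaches half of the total assembly length.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Invented scaffold lengths, for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misassembled genome can still have a high N50.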

Background Spiders have evolved pharmacologically complex venoms that serve to rapidly subdue prey and deter predators. The major toxic factors in most spider venoms are small, disulfide-rich peptides. While there is abundant evidence that snake venoms evolved by recruitment of genes encoding normal body proteins followed by extensive gene duplication accompanied by explosive structural and functional diversification, the evolutionary trajectory of spider-venom peptides is less clear. Results Here we present evidence of a spider-toxin superfamily encoding a high degree of sequence and functional diversity that has evolved via accelerated duplication and diversification of a single ancestral gene. The peptides within this toxin superfamily are translated as prepropeptides that are posttranslationally processed to yield the mature toxin. The N-terminal signal sequence, as well as the protease recognition site at the junction of the propeptide and mature toxin are conserved, whereas the remainder of the propeptide and mature toxin sequences are variable. All toxin transcripts within this superfamily exhibit a striking cysteine codon bias. We show that different pharmacological classes of toxins within this peptide superfamily evolved under different evolutionary selection pressures. Conclusions Overall, this study reinforces the hypothesis that spiders use a combinatorial peptide library strategy to evolve a complex cocktail of peptide toxins that target neuronal receptors and ion channels in prey and predators. We show that the ω-hexatoxins that target insect voltage-gated calcium channels evolved under the influence of positive Darwinian selection in an episodic fashion, whereas the κ-hexatoxins that target insect calcium-activated potassium channels appear to be under negative selection. A majority of the diversifying sites in the ω-hexatoxins are concentrated on the molecular surface of the toxins, thereby facilitating neofunctionalisation leading to new toxin...

Metagenomics is a valuable tool for the study of microbial communities but has been limited by the difficulty of binning the resulting sequences into groups corresponding to the individual species and strains that constitute the community. Moreover, there are presently no methods to track the flow of mobile DNA elements such as plasmids through communities or to determine which of these are co-localized within the same cell. We address these limitations by applying Hi-C, a technology originally designed for the study of three-dimensional genome structure in eukaryotes, to measure the cellular co-localization of DNA sequences. We leveraged Hi-C data generated from a simple synthetic metagenome sample to accurately cluster metagenome assembly contigs into groups that contain nearly complete genomes of each species. The Hi-C data also reliably associated plasmids with the chromosomes of their host and with each other. We further demonstrated that Hi-C data provides a long-range signal of strain-specific genotypes, indicating such data may be useful for high-resolution genotyping of microbial populations. Our work demonstrates that Hi-C sequencing data provide valuable information for metagenome analyses that are not currently obtainable by other methods. This metagenomic Hi-C method could facilitate future studies of the fine-scale population structure of microbes, as well as studies of how antibiotic resistance plasmids (or other genetic elements) mobilize in microbial communities. The method is not limited to microbiology; the genetic architecture of other heterogeneous populations of cells could also be studied with this technique.

Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection. In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated to sample metadata. These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).

Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: Two were simulated and based on primate and mammalian phylogenies, and one was comprised of 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.

Organisms across the tree of life use a variety of mechanisms to respond to stress-inducing fluctuations in osmotic conditions. Cellular response mechanisms and phenotypes associated with osmoadaptation also play important roles in bacterial virulence, human health, agricultural production and many other biological systems. To improve understanding of osmoadaptive strategies, we have generated 59 high-quality draft genomes for the haloarchaea (a euryarchaeal clade whose members thrive in hypersaline environments and routinely experience drastic changes in environmental salinity) and analyzed these new genomes in combination with those from 21 previously sequenced haloarchaeal isolates. We propose a generalized model for haloarchaeal management of cytoplasmic osmolarity in response to osmotic shifts, where potassium accumulation and sodium expulsion during osmotic upshock are accomplished via secondary transport using the proton gradient as an energy source, and potassium loss during downshock is via a combination of secondary transport and non-specific ion loss through mechanosensitive channels. We also propose new mechanisms for magnesium and chloride accumulation. We describe the expansion and differentiation of haloarchaeal general transcription factor families, including two novel expansions of the TATA-binding protein family, and discuss their potential for enabling rapid adaptation to environmental fluxes. We challenge a recent high-profile proposal regarding the evolutionary origins of the haloarchaea by showing that inclusion of additional genomes significantly reduces support for a proposed large-scale horizontal gene transfer into the ancestral haloarchaeon from the bacterial domain. The combination of broad (17 genera) and deep (5 species in four genera) sampling of a phenotypically unified clade has enabled us to uncover both highly conserved and specialized features of osmoadaptation. Finally, we demonstrate the broad utility of such datasets, for m...

Of the 200+ serogroups of Vibrio cholerae, only O1 or O139 strains are reported to cause cholera, and mostly in endemic regions. Cholera outbreaks elsewhere are considered to be via importation of pathogenic strains. Using established animal models, we show that diverse V. cholerae strains indigenous to a nonendemic environment (Sydney, Australia), including non-O1/O139 serogroup strains, are able to both colonize the intestine and result in fluid accumulation despite lacking virulence factors believed to be important. Most strains lacked the type three secretion system considered a mediator of diarrhoea in non-O1/O139 V. cholerae. Multi-locus sequence typing (MLST) showed that the Sydney isolates did not form a single clade and were distinct from O1/O139 toxigenic strains. There was no correlation between genetic relatedness and the profile of virulence-associated factors. Current analyses of diseases mediated by V. cholerae focus on endemic regions, with only those strains that possess particular virulence factors considered pathogenic. Our data suggest that factors other than those previously well described are of potential importance in influencing disease outbreaks.

Over 3000 microbial (bacterial and archaeal) genomes have been made publicly available to date, providing an unprecedented opportunity to examine evolutionary genomic trends and offering valuable reference data for a variety of other studies such as me

Background: A classical example of repeated speciation coupled with ecological diversification is the evolution of 14 closely related species of Darwin's (Galapagos) finches (Thraupidae, Passeriformes). Their adaptive radiation in the Galapagos archipela

Genome sequencing enhances our understanding of the biological world by providing blueprints for the evolutionary and functional diversity that shapes the biosphere. However, microbial genomes that are currently available are of limited phylogenetic breadth, owing to our historical inability to cultivate most microorganisms in the laboratory. We apply single-cell genomics to target and sequence 201 uncultivated archaeal and bacterial cells from nine diverse habitats belonging to 29 major mostly uncharted branches of the tree of life, so-called 'microbial dark matter'. With this additional genomic information, we are able to resolve many intra- and inter-phylum-level relationships and to propose two new superphyla. We uncover unexpected metabolic features that extend our understanding of biology and challenge established boundaries between the three domains of life. These include a novel amino acid use for the opal stop codon, an archaeal-type purine synthesis in Bacteria and complete sigma factors in Archaea similar to those in Bacteria. The single-cell genomes also served to phylogenetically anchor up to 20% of metagenomic reads in some habitats, facilitating organism-level interpretation of ecosystem function. This study greatly expands the genomic representation of the tree of life and provides a systematic step towards a better understanding of biological evolution on our planet.

Here we present the draft genome of Leucobacter sp. strain UCD-THU. The genome contains 3,317,267 bp in 11 scaffolds. This strain was isolated from a residential toilet as part of an undergraduate project to sequence reference genomes of microbes from the built environment.

Here we present the draft genome of an actinobacterium, Curtobacterium flaccumfaciens strain UCD-AKU, isolated from a residential carpet. The genome assembly contains 3,692,614 bp in 130 contigs. This is the first member of the Curtobacterium genus to be sequenced.

Here, we present the draft genome sequence of an actinobacterium, Dietzia sp. strain UCD-THP, isolated from a residential toilet handle. The assembly contains 3,915,613 bp. The genome sequences of only two other Dietzia species have been published, those of Dietzia alimentaria and Dietzia cinnamea.

Here, we present the draft genome of Kocuria sp. strain UCD-OTCP, a member of the phylum Actinobacteria, isolated from a restaurant chair cushion. The assembly contains 3,791,485 bp (G+C content of 73%) and is contained in 68 scaffolds.

Here, we present the draft genome sequence of Microbacterium sp. strain UCD-TDU, a member of the phylum Actinobacteria. The assembly contains 3,746,321 bp (in 8 scaffolds). This strain was isolated from a residential toilet as part of an undergraduate student research project to sequence reference genomes of microbes from the built environment.

Here we present the draft genome of an actinobacterium, Brachybacterium muris UCD-AY4. The assembly contains 3,257,338 bp and has a GC content of 70%. This strain was isolated from a residential bath towel and has a 16S rRNA gene 99.7% identical to that of the original B. muris strain, C3H-21.

Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest

Despite a growing appreciation of their vast diversity in nature, mechanisms of speciation are poorly understood in Bacteria and Archaea. Here we use high-throughput genome sequencing to identify ongoing speciation in the thermoacidophilic Archaeon Sulfo

We report the sequencing of seven genomes from two haloarchaeal genera, Haloferax and Haloarcula. Ease of cultivation and the existence of well-developed genetic and biochemical tools for several diverse haloarchaeal species make haloarchaea a model grou

Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtai

Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood es

Background: Escherichia coli is an important species of bacteria that can live as a harmless inhabitant of the guts of many animals, as a pathogen causing life-threatening conditions or freely in the non-host environment. This diversity of lifestyles has

Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively ass

High-throughput DNA sequencing technologies have spurred the development of numerous novel methods for genome assembly. With few exceptions, these algorithms are heuristic and require one or more parameters to be manually set by the user. One approach to

Background: Microbial life dominates the earth, but many species are difficult or even impossible to study under laboratory conditions. Sequencing DNA directly from the environment, a technique commonly referred to as metagenomics, is an important tool f

Sponges are an ancient group of animals that diverged from other metazoans over 600 million years ago. Here we present the draft genome sequence of Amphimedon queenslandica, a demosponge from the Great Barrier Reef, and show that it is remarkably similar to other animal genomes in content, structure and organization. Comparative analysis enabled by the sequencing of the sponge genome reveals genomic events linked to the origin and early evolution of animals, including the appearance, expansion and diversification of pan-metazoan transcription factor, signalling pathway and structural genes. This diverse 'toolkit' of genes correlates with critical aspects of all metazoan body plans, and comprises cell cycle control and growth, development, somatic- and germ-cell specification, cell adhesion, innate immunity and allorecognition. Notably, many of the genes associated with the emergence of animals are also implicated in cancer, which arises from defects in basic processes associated with metazoan multicellularity.

Bacteria and archaea reproduce clonally, but sporadically import DNA into their chromosomes from other organisms. In many of these events, the imported DNA replaces an homologous segment in the recipient genome. Here we present a new method to reconstruct the history of recombination events that affected a given sample of bacterial genomes. We introduce a mathematical model that represents both the donor and the recipient of each DNA import as an ancestor of the genomes in the sample. The model represents a simplification of the previously described coalescent with gene conversion. We implement a Monte Carlo Markov chain algorithm to perform inference under this model from sequence data alignments and show that inference is feasible for whole-genome alignments through parallelization. Using simulated data, we demonstrate accurate and reliable identification of individual recombination events and global recombination rate parameters. We applied our approach to an alignment of 13 whole genomes from the Bacillus cereus group. We find, as expected from laboratory experiments, that the recombination rate is higher between closely related organisms and also that the genome contains several broad regions of elevated levels of recombination. Application of the method to the genomic data sets that are becoming available should reveal the evolutionary history and private lives of populations of bacteria and archaea. The methods described in this article have been implemented in a computer software package, ClonalOrigin, which is freely available from http://code.google.com/p/clonalorigin/.

Mauve Contig Mover provides a new method for proposing the relative order of contigs that make up a draft genome based on comparison to a complete or draft reference genome. A novel application of the Mauve aligner and viewer provides an automated reorde

A select set of microalgae are reported to be able to catalyse photobiological H(2) production from water. Based on the model organism Chlamydomonas reinhardtii, a method was developed for the screening of naturally occurring H(2)-producing microalgae. B

Genome evolution underpins all of biology, yet its principles can be difficult to communicate to the non-specialist. To facilitate broader understanding of genome evolution, we have designed an interactive 3D environment that enables visualization of div

Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered fro

Inversions are among the most common mutations acting on the order and orientation of genes in a genome, and polynomial-time algorithms exist to obtain a minimal length series of inversions that transform one genome arrangement to another. However, the m

Acquisition and loss of genetic material are essential forces in bacterial microevolution. They have been repeatedly linked with adaptation of lineages to new lifestyles, and in particular, pathogenicity. Comparative genomics has the potential to elucida

Lateral genetic transfer (LGT) involves the movement of genetic material from one lineage into another and its subsequent incorporation into the new host genome via genetic recombination. Studies in individual taxa have indicated lateral origins for stre

BACKGROUND: In prokaryotes and some eukaryotes, genetic material can be transferred laterally among unrelated lineages and recombined into new host genomes, providing metabolic and physiological novelty. Although the process is usually framed in terms of gene sharing (e.g. lateral gene transfer, LGT), there is little reason to imagine that the units of transfer and recombination correspond to entire, intact genes. Proteins often consist of one or more spatially compact structural regions (domains) which may fold autonomously and which, singly or in combination, confer the protein's specific functions. As LGT is frequent in strongly selective environments and natural selection is based on function, we hypothesized that domains might also serve as modules of genetic transfer, i.e. that regions of DNA that are transferred and recombined between lineages might encode intact structural domains of proteins. METHODOLOGY/PRINCIPAL FINDINGS: We selected 1,462 orthologous gene sets representing 144 prokaryotic genomes, and applied a rigorous two-stage approach to identify recombination breakpoints within these sequences. Recombination breakpoints are very significantly over-represented in gene sets within which protein domain-encoding regions have been annotated. Within these gene sets, breakpoints significantly avoid the domain-encoding regions (domons), except where these regions constitute most of the sequence length. Recombination breakpoints that fall within longer domons are distributed uniformly at random, but those that fall within shorter domons may show a slight tendency to avoid the domon midpoint. As we find no evidence for differential selection against nucleotide substitutions following the recombination event, any bias against disruption of domains must be a consequence of the recombination event per se. CONCLUSIONS/SIGNIFICANCE: This is the first systematic study relating the units of LGT to structural features at the protein level. Many genes have been in...

One of the most satisfying aspects of a genome sequencing project is the identification of the genes contained within it. These are of two types: those which encode tRNAs and those which produce proteins. After a general introduction on the properties of protein-encoding genes and the utility of the Basic Local Alignment Search Tool (BLASTX) to identify genes through homologs, a variety of tools are discussed by their creators. These include, for genome annotation, GeneMark, Artemis, and BASys; and, for genome comparisons, Artemis Comparison Tool (ACT), Mauve, CoreGenes, and GeneOrder.

Genome structure variation has profound impacts on phenotype in organisms ranging from microbes to humans, yet little is known about how natural selection acts on genome arrangement. Pathogenic bacteria such as Yersinia pestis, which causes bubonic and p

The Double Cut and Join is an operation acting locally at four chromosomal positions without regard to chromosomal context. This chapter discusses its application and the resulting menu of operations for genomes consisting of arbitrary numbers of circular chromosomes, as well as for a general mix of linear and circular chromosomes. In the general case the menu includes: inversion, translocation, transposition, formation and absorption of circular intermediates, conversion between linear and circular chromosomes, block interchange, fission, and fusion. This chapter discusses the well-known edge graph and its dual, the adjacency graph, recently introduced by Bergeron et al. Step-by-step procedures are given for constructing and manipulating these graphs. Simple algorithms are given in the adjacency graph for computing the minimal DCJ distance between two genomes and finding a minimal sorting; and use of an online tool (Mauve) to generate synteny blocks and apply DCJ is described.
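The adjacency-graph computation mentioned in the abstract above can be sketched briefly. In the adjacency-graph formulation, for two genomes with the same N genes, the DCJ distance is N - (C + I/2), where C is the number of cycles and I the number of odd-length paths in the graph. The code below is an illustrative sketch under stated assumptions, not the chapter's or Mauve's implementation: a genome is assumed to be a list of `(signed_gene_list, is_circular)` tuples with non-empty chromosomes, and the names `adjacency_map` and `dcj_distance` are hypothetical.

```python
def _ends(gene):
    """Return (left, right) extremities of a signed gene; 't' = tail, 'h' = head."""
    name = abs(gene)
    return ((name, "t"), (name, "h")) if gene > 0 else ((name, "h"), (name, "t"))


def adjacency_map(genome):
    """Map each gene extremity to its neighbour in the genome (None at a telomere)."""
    partner = {}
    for genes, is_circular in genome:
        ends = [_ends(g) for g in genes]
        # Adjacent genes share an adjacency: right end of one, left end of the next.
        for (_, right), (left, _) in zip(ends, ends[1:]):
            partner[right], partner[left] = left, right
        first_left, last_right = ends[0][0], ends[-1][1]
        if is_circular:
            partner[last_right], partner[first_left] = first_left, last_right
        else:
            partner[first_left] = partner[last_right] = None
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance N - (C + I/2) computed by walking the adjacency graph."""
    adj = {"A": adjacency_map(genome_a), "B": adjacency_map(genome_b)}
    if adj["A"].keys() != adj["B"].keys():
        raise ValueError("genomes must contain the same genes, each exactly once")
    n_genes = len(adj["A"]) // 2
    visited, odd_paths, cycles = set(), 0, 0

    # Paths: start at every telomere and walk until another telomere is reached.
    # Each extremity is one edge of the adjacency graph.
    for side, other in (("A", "B"), ("B", "A")):
        for ext, neighbour in adj[side].items():
            if neighbour is not None or ext in visited:
                continue
            cur, look, edges = ext, other, 0
            while True:
                visited.add(cur)
                edges += 1
                cur = adj[look][cur]
                if cur is None:
                    break
                look = "A" if look == "B" else "B"
            odd_paths += edges % 2

    # Cycles: every extremity not on a path lies on exactly one cycle.
    for ext in adj["A"]:
        if ext in visited:
            continue
        cycles += 1
        cur, look = ext, "A"
        while cur not in visited:
            visited.add(cur)
            cur = adj[look][cur]
            look = "A" if look == "B" else "B"

    return n_genes - (cycles + odd_paths // 2)


if __name__ == "__main__":
    original = [([1, 2, 3], False)]
    inverted = [([1, -2, 3], False)]  # gene 2 inverted
    print(dcj_distance(original, inverted))
```

In this toy example a single inversion separates the two linear chromosomes, so one DCJ operation suffices. Representing each extremity as an edge avoids building explicit vertex objects, at the cost of requiring that both genomes contain exactly the same gene set.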

During the course of evolution, genomes can undergo large-scale mutation events such as rearrangement and lateral transfer. Such mutations can result in significant variations in gene order and gene content among otherwise closely related organisms. The Mauve genome alignment system can successfully identify such rearrangement and lateral transfer events in comparisons of multiple microbial genomes even under high levels of recombination. This chapter outlines the main features of Mauve and provides examples that describe how to use Mauve to conduct a rigorous multiple genome comparison and study evolutionary patterns.

BACKGROUND: Comparisons of complete bacterial genomes reveal evidence of lateral transfer of DNA across otherwise clonally diverging lineages. Some lateral transfer events result in acquisition of novel genomic segments and are easily detected through genome comparison. Other more subtle lateral transfers involve homologous recombination events that result in substitution of alleles within conserved genomic regions. This type of event is observed infrequently among distantly related organisms. It is reported to be more common within species, but the frequency has been difficult to quantify since the sequences under comparison tend to have relatively few polymorphic sites. RESULTS: Here we report a genome-wide assessment of homologous recombination among a collection of six complete Escherichia coli and Shigella flexneri genome sequences. We construct a whole-genome multiple alignment and identify clusters of polymorphic sites that exhibit atypical patterns of nucleotide substitution using a random walk-based method. The analysis reveals one large segment (approximately 100 kb) and 186 smaller clusters of single base pair differences that suggest lateral exchange between lineages. These clusters include portions of 10% of the 3,100 genes conserved in six genomes. Statistical analysis of the functional roles of these genes reveals that several classes of genes are over-represented, including those involved in recombination, transport and motility. CONCLUSION: We demonstrate that intraspecific recombination in E. coli is much more common than previously appreciated and may show a bias for certain types of genes. The described method provides high-specificity, conservative inference of past recombination events.

ASAP is a comprehensive web-based system for community genome annotation and analysis. ASAP is being used for a large-scale effort to augment and curate annotations for genomes of enterobacterial pathogens and for additional genome sequences. New tools, such as the genome alignment program Mauve, have been incorporated into ASAP in order to improve display and analysis of related genomes. Recent improvements to the database and challenges for future development of the system are discussed. ASAP is available on the web at https://asap.ahabs.wisc.edu/asap/logon.php.

GRIL is a tool to automatically identify collinear regions in a set of bacterial-size genome sequences. GRIL uses three basic steps. First, regions of high sequence identity are located. Second, some of these regions are filtered based on user-specified

We determined the complete genome sequence of Shigella flexneri serotype 2a strain 2457T (4,599,354 bp). Shigella species cause >1 million deaths per year from dysentery and diarrhea and have a lifestyle that is markedly different from those of closely r

ASAP (a systematic annotation package for community analysis of genomes) is a relational database and web interface developed to store, update and distribute genome sequence data and functional characterization (https://asap.ahabs.wisc.edu/annotation/php

Genomes evolve as modules. In prokaryotes (and some eukaryotes), genetic material can be transferred between species and integrated into the genome via homologous or illegitimate recombination. There is little reason to imagine that the units of transfer correspond to entire genes; however, such units have not been rigorously characterized. We examined fragmentary genetic transfers in single-copy gene families from 144 prokaryotic genomes and found that breakpoints are located significantly closer to the boundaries of genomic regions that encode annotated structural domains of proteins than expected by chance, particularly when recombining sequences are more divergent. This correlation results from recombination events themselves and not from differential nucleotide substitution. We report the first systematic study relating genetic recombination to structural features at the protein level.

Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46 Mbp of genomic content conserved among all taxa and total unique content of 15.2 Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve.