Genomics and evolution of bacterial pathogens

Main menu

Tag Archives: assembly

This is a shameless plug for an article and accompanying tutorial I’ve just published together with David Edwards, my excellent MSc Bioinformatics student from the University of Melbourne. It’s currently available as a PDF pre-pub from BMC Microbial Informatics and Experimentation, but the web version will be available soon. The accompanying tutorial is available here.

The idea for this came from discussions at last year’s ASM (Australian Society of Microbiology) meeting, where it was highlighted that there was a lack of courses and tutorials available for biologists to learn the basics of genomic analysis so that they can make use of next gen sequencing. Michael Wise, a founding editor of BMC Microbial Informatics and Experimentation based at UWA in Perth, suggested the new journal would be an ideal home for such a tutorial… so here we are:

High throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating bacteria, and there are thousands of bacterial genome sequences available for comparison in the public domain. Bacterial genome analysis is increasingly being performed by diverse groups in research, clinical and public health labs alike, who are interested in a wide array of topics related to bacterial genetics and evolution. Examples include outbreak analysis and the study of pathogenicity and antimicrobial resistance. In this beginner’s guide, we aim to provide an entry point for individuals with a biology background who want to perform their own bioinformatics analysis of bacterial genome data, to enable them to answer their own research questions. We assume readers will be familiar with genetics and the basic nature of sequence data, but do not assume any computer programming skills. The main topics covered are assembly, ordering of contigs, annotation, genome comparison and extracting common typing information. Each section includes worked examples using publicly available E. coli data and free software tools, all which can be performed on a desktop computer.

Four great tools

In the paper and tutorial, we introduce the four tools which we rely on most for basic analysis of bacterial genome assemblies: Velvet, ACT, Mauve and BRIG. All except ACT were developed as part of a PhD project, and have endured well beyond the original PhD to become well-known bioinformatics tools. New students take note!

In the paper, each tool is highlighted in its own figure, which includes some basic instructions. This is reproduced below, but is covered in much more detail in the tutorial that comes with the paper (link at the bottom).

1. Velvet for genome assembly

Possibly the most popular and widely used short read assembler, developed by the amazing Dan Zerbino during his PhD at EBI in Cambridge. Quite a PhD project!

Reads are assembled into contigs using Velvet and VelvetOptimiser in two steps, (1) velveth converts reads to k-mers using a hash table, and (2) velvetg assembles overlapping k-mers into contigs via a de Bruijn graph. VelvetOptimiser can be used to automate the optimisation of parameters for velveth and velvetg and generate an optimal assembly. To generate an assembly of E. coli O104:H4 using the command-line tool Velvet:

This will produce a set of contigs in multifasta format for further analysis. See Additional file 1: Tutorial for further details, including help with downloading reads and using VelvetOptimiser.

2. ACT for pairwise genome comparison

Part of the Sanger Institute’s Artemis suite of tools. Also look at Artemis (single genome viewer), DNA Plotter (which can draw circular diagrams of your genomes) and BAMView (which can display mapped reads overlaid on a reference genome), they are all available here.

• Generate a comparison file between the concatenated contigs and 2 alternative reference genomes using the website WebACT.

• Launch ACT and load in the reference sequences, contigs and comparison files, to get a 3-way comparison like the one shown here.

Here, the E. coli O104:H4 contigs are in the middle row, the enteroaggregative E. coli strain Ec55989 is on top and the enterohaemorrhagic E. coli strain EDL933 is below. Details of the comparison can be viewed by zooming in, to the level of genes or DNA bases.

3. Mauve for contig ordering and multiple genome comparison

Developed by the wonderful Aaron Darling during his PhD, he is now Associate Professor at University of Technology Sydney. Also see Mauve Assembly Metrics, an optional plugin for assessing assembly quality which was developed for the Assemblathon.

• When the alignment is finished, a visualization of the genome blocks and their homology will be displayed, as shown here. E. coli O104:H4 is on the top, red lines indicate contig boundaries within the assembly. Sequences outside coloured blocks do not have homologs in the other genomes.

4. BRIG (BLAST Ring Image Generator) for multiple genome comparison

From Nabil-Fareed Alikhan at the University of Queensland, also as part of a graduate project, which I believe is still in progress…

BRIG is a free tool that requires a local installation of BLAST (we used BRIG 0.95 on Mac OS X). The output is a static image.

• Launch BRIG and set the reference sequence (EHEC EDL933 chromosome) and the location of other E. coli sequences for comparison. If you include reference sequences for the Stx2 phage and LEE pathogenicity island, it will be easy to see where these sequences are located.

• Click ‘Next’ and specify the sequence data and colour for each ring to be displayed in comparison to the reference.

• Click ‘Next’ and specify a title for the centre of the image and an output file, then click ‘Submit’ to run BRIG.

• BRIG will create an output file containing a circular image like the one shown here. It is easy to see that the Stx2 phage is present in the EHEC chromosomes (purple) and the outbreak genome (black), but not the EAEC or EPEC chromosomes.

Tutorial

The tutorial accompanying the article is available here. To give you an idea of what’s covered, here is the table of contents:

1. Genome assembly and annotation…………………………………………………………… 2

1.1 Downloading E. coli sequences for assembly…………………………………………….. 2

1.2 Examining quality of reads (FastQC)………………………………………………………… 2

1.3 Velvet – assembling reads into contigs………………………………………………………. 4

1.3.1 Using VelvetOptimiser to optimise de novo assembly with Velvet………….. 6

1.4 Ordering contigs against a reference using Mauve………………………………………. 7

I’ve been a little slow to catch up on the results of the Assemblathon, a competitive assembly event where teams use their best method(s) to generate assemblies from raw read data and the results are compared by a variety of metrics. The results from the first assemblathon, using simulated read sets, are now available pre-publication from Genome Research. The second assemblathon, using real (Illumina and Illumina+454) data from eukaryotic genomes, is happening now.

Firstly I think this is a brilliant idea and there should be far more of it in bioinformatics! So many of us are engaged in the same basic analysis tasks for dealing with short read sequence data – assembly, mapping and variant calling – but there are so many different programs & approaches (see this compilation over at seqanswers.com) out there that it quickly becomes overwhelming.

So, the results are in but, as always in the comparison of methods, there is not really a clear-cut winner. Each assembly was assessed using an enormous set of metrics (>100 apparently), including N50 (at contig and scaffold levels), miscalled bases, depth of coverage, misassemblies, etc… and unsurprisingly there was no single assembly that scored top on all metrics. BGI’s SOAPdenovo, Broad’s ALLPATHS, and Sanger’s SGA were consistently among the best for most metrics… but with clear differences. E.g. for contig N50 SOAPdenovo and ALLPATHS were both superior to SGA, which performed better than the others on scaffolding N50. SGA had the least substitution errors, but SOAPdenovo had fewer copy number errors and ALLPATHS had the best contig-level stats. For all the gory details see the results website or summaries in the paper [free at Genome Res] or this talk [PDF] presented at the Cold Spring Harbour Lab Biology of Genomes meeting.

I am trying to wrap my head around how informative this is for assembling bacterial genomes. I know a lot of people run their own in-house comparisons to determine the best approach for a particular project, but the assemblathon approach is systematic, and manages to be both competitive and collaborative, which is an awesome combination. While bacterial genomes are small and therefore raise fewer computational issues associated with large data & memory requirements, assembling them is still far from trivial and is often a crucial element of the analysis, because gene content is so variable among even very closely-related bacteria. The parameters I usually have in mind when considering bacterial assemblers are:

impact of different sequencing platforms & error profiles

impact of different insert sizes for paired or mate-pair reads

genomes with high or low GC

genomes with excessive IS elements

I guess the only aspect of this that’s missing from the current/planned assemblathon datasets is the effect of high or low G+C content, i.e. low complexity sequence, which isn’t really bacteria-specific anyway (think e.g. P. falciparum, the malaria parasite).

The reference here is the HPA genome, and the near-completeness of the purple ring (BGI assembly) shows that the two assemblies contain nearly all the same sequences. Any differences could be real, but more likely they are assembly issues with both genomes.

It contains a 73 kbp scaffold with matches to the other available EAEC aggregative adhesion plasmids (scaffold 8 in the HPA assembly). Here is a plot of the newly assembled scaffold and its homology to the other plasmids:

Aggregative plasmid from outbreak strain (reference) and its homology to three available EAEC plasmids. The location of the agg operon (encoding aggregative adhesion fimbriae I, or AAF/I) is shown in orange.

The inner ring is the novel plasmid scaffold, and its GC contig is shown in the black squiggly line. Because this is a scaffold, it contains some bits of unknown sequence between contigs (where it is known from mate pair reads that the contigs are ordered in this orientation, but there are small gaps between the contigs where we don’t know what the sequence is). These gaps are indicated with a series of N (as opposed to A, C, G or T) of the estimated length of the gap…so in the black wiggly line, these gaps show up as solid blocks of black, e.g. the one around the 70kbp mark at the top.

The other three rings (blue, red, green) indicate where there are similar sequences in the other three plasmids as shown in the central legend. You can see that pO86A1 is most similar to the novel plasmid, as it has similar sequence in parts of the novel plasmid where the other two plasmids don’t have matches (eg. between the 50-50kbp mark). There are also some parts where none of the three reference plasmids have good matches.

On the left in orange is the location of the agg operon encoding aggregative adhesion fimbriae (AAF/I). This has only very weak similarity to the other plasmids (the rings are very pale here, indicating low % identity), because the others encode type II and type III aggregative adhesion fimbriae instead (AAF/II, AAF/III). The bit upstream (10-15 kbp mark) that also has low homology contains insertion sequence (IS) elements and downstream is a resolvase, so it is possible the operon is mobile…although proper annotation is needed to sort this out.

As with Ion Torrent, the 454 is prone to errors in homopolymeric tracts (ie runs of the same base, difficult to tell between AAAAA and AAAAAA), which introduces frameshifts into the assembly. So it would be great to have an ERA7 error-tolerant annotation to sort this out.

Update: Most of the sequence that is shared with pO86A1 but not the others are transposases, with the exception of a serine protease autotransporter from a family of mucinases including pic, ipd, sepA:

Location of sepA insertion in EAEC plasmid

The gene is mostly similar to sepA genes found in the virulence plasmids of Shigella:

Bacterial chromosomes and plasmids are circular (i.e. have no beginning & no end), with a few exceptions. This can pose problems with read alignment (since reference genome sequences are always cut down into a linear form with a beginning & end, although some reads will bridge this artificial gap and may be thrown away since they partially match the start & stop equally well) and also de novo assembly (since assembler are generally unaware that sequences can be circular).

Sometimes this doesn’t matter, although in my experience with read mapping, it causes an artificial drop in the depth of reads mapping to the artificial “start” and “end” of the sequence which might affect some analyses.

A few suggestions for dealing with circularity in de novo assemblies were recently discussed on BioStar:

The take home message is that you need to be aware of this issue, because the software packages aren’t! If it matters for your analysis then the solutions are quite simple, but you have to look for this yourself as it will not be reported by mapping or assembly software.