Genomics and evolution of bacterial pathogens

Category Archives: Tools

There is a flurry of new papers on reconstructing outbreak transmission trees from genomic data, how exciting! One is through peer review and published in final form at PLoS, and the others are in preprint (arXiv and the new bioRxiv)!!

It will take me a while to read and understand these, especially the subtleties of the different applications they are aiming for…. so in the meantime let’s just compare them the lazy way – is the software available, and who has the best pictures?

1. In PLoS Computational Biology, using SARS to demonstrate utility:

Abstract: Recent years have seen progress in the development of statistically rigorous frameworks to infer outbreak transmission trees (“who infected whom”) from epidemiological and genetic data. Making use of pathogen genome sequences in such analyses remains a challenge, however, with a variety of heuristic approaches having been explored to date. We introduce a statistical method exploiting both pathogen sequences and collection dates to unravel the dynamics of densely sampled outbreaks. Our approach identifies likely transmission events and infers dates of infections, unobserved cases and separate introductions of the disease. It also proves useful for inferring numbers of secondary infections and identifying heterogeneous infectivity and super-spreaders. After testing our approach using simulations, we illustrate the method with the analysis of the beginning of the 2003 Singaporean outbreak of Severe Acute Respiratory Syndrome (SARS), providing new insights into the early stage of this epidemic. Our approach is the first tool for disease outbreak reconstruction from genetic data widely available as free software, the R package outbreaker. It is applicable to various densely sampled epidemics, and improves previous approaches by detecting unobserved and imported cases, as well as allowing multiple introductions of the pathogen. Because of its generality, we believe this method will become a tool of choice for the analysis of densely sampled disease outbreaks, and will form a rigorous framework for subsequent methodological developments.

Software: R package ‘outbreaker’, which includes functions for simulating outbreak data. Very nice.

Pictures: Pretty good!

2. In preprint on bioRxiv, using tuberculosis as an example:

Abstract:Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered — how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via a Monte-Carlo Markov Chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.

Pictures: I probably shouldn’t post these as it’s still preprint, so you’ll have to just look at the PDF. But there are some nice figures and a few different ways of presenting the data. I particularly like Figure 4A, which shows the phylogenetic tree against a binary matrix indicating SNP alleles in each sample, for those that vary within the group. This would be good to have handy for inspecting fine phylogenies… it looks like it was done in R, I think I might add this to my tree visualisation R functions (to be posted soon).

3. In preprint on arXiv, using a Staph. aureus hospital outbreak example:

Software: No mention of any software package or how this was implemented.

Pictures: I probably shouldn’t post these as it’s still preprint, so you’ll have to just look at the PDF. But in my opinion figures are much less clear and harder to follow than the others (and mostly black and white…why???).

This is a shameless plug for an article and accompanying tutorial I’ve just published together with David Edwards, my excellent MSc Bioinformatics student from the University of Melbourne. It’s currently available as a PDF pre-pub from BMC Microbial Informatics and Experimentation, but the web version will be available soon. The accompanying tutorial is available here.

The idea for this came from discussions at last year’s ASM (Australian Society for Microbiology) meeting, where it was highlighted that there was a lack of courses and tutorials available for biologists to learn the basics of genomic analysis so that they can make use of next gen sequencing. Michael Wise, a founding editor of BMC Microbial Informatics and Experimentation based at UWA in Perth, suggested the new journal would be an ideal home for such a tutorial… so here we are:

High throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating bacteria, and there are thousands of bacterial genome sequences available for comparison in the public domain. Bacterial genome analysis is increasingly being performed by diverse groups in research, clinical and public health labs alike, who are interested in a wide array of topics related to bacterial genetics and evolution. Examples include outbreak analysis and the study of pathogenicity and antimicrobial resistance. In this beginner’s guide, we aim to provide an entry point for individuals with a biology background who want to perform their own bioinformatics analysis of bacterial genome data, to enable them to answer their own research questions. We assume readers will be familiar with genetics and the basic nature of sequence data, but do not assume any computer programming skills. The main topics covered are assembly, ordering of contigs, annotation, genome comparison and extracting common typing information. Each section includes worked examples using publicly available E. coli data and free software tools, all which can be performed on a desktop computer.

Four great tools

In the paper and tutorial, we introduce the four tools which we rely on most for basic analysis of bacterial genome assemblies: Velvet, ACT, Mauve and BRIG. All except ACT were developed as part of a PhD project, and have endured well beyond the original PhD to become well-known bioinformatics tools. New students take note!

In the paper, each tool is highlighted in its own figure, which includes some basic instructions. This is reproduced below, but is covered in much more detail in the tutorial that comes with the paper (link at the bottom).

1. Velvet for genome assembly

Possibly the most popular and widely used short read assembler, developed by the amazing Dan Zerbino during his PhD at EBI in Cambridge. Quite a PhD project!

Reads are assembled into contigs using Velvet and VelvetOptimiser in two steps, (1) velveth converts reads to k-mers using a hash table, and (2) velvetg assembles overlapping k-mers into contigs via a de Bruijn graph. VelvetOptimiser can be used to automate the optimisation of parameters for velveth and velvetg and generate an optimal assembly. To generate an assembly of E. coli O104:H4 using the command-line tool Velvet:

This will produce a set of contigs in multifasta format for further analysis. See Additional file 1: Tutorial for further details, including help with downloading reads and using VelvetOptimiser.
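The exact commands are in the tutorial rather than reproduced here, so as a rough sketch only, the two steps described above could be assembled in Python as follows. The read file name, output directory and k-mer length of 31 are illustrative assumptions, not the values used in the paper, and the reads are assumed to be interleaved paired-end FASTQ.

```python
def velvet_commands(reads, out_dir="velvet_out", kmer=31):
    """Build command lines for the two Velvet steps (nothing is executed here).

    reads, out_dir and kmer are placeholders chosen for illustration.
    """
    # Step 1: velveth hashes the reads into k-mers of the chosen length.
    velveth = ["velveth", out_dir, str(kmer), "-fastq", "-shortPaired", reads]
    # Step 2: velvetg builds the de Bruijn graph and writes out the contigs,
    # letting Velvet estimate expected coverage and the coverage cutoff.
    velvetg = ["velvetg", out_dir, "-exp_cov", "auto", "-cov_cutoff", "auto"]
    return velveth, velvetg


for command in velvet_commands("reads_interleaved.fastq"):
    print(" ".join(command))
```

Passing each list to subprocess.run(command, check=True) would run the step if Velvet is installed. In practice the k-mer length needs tuning for each data set, which is the job VelvetOptimiser automates.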

2. ACT for pairwise genome comparison

Part of the Sanger Institute’s Artemis suite of tools. Also look at Artemis (single genome viewer), DNA Plotter (which can draw circular diagrams of your genomes) and BAMView (which can display mapped reads overlaid on a reference genome); they are all available here.

• Generate a comparison file between the concatenated contigs and 2 alternative reference genomes using the website WebACT.

• Launch ACT and load in the reference sequences, contigs and comparison files, to get a 3-way comparison like the one shown here.

Here, the E. coli O104:H4 contigs are in the middle row, the enteroaggregative E. coli strain Ec55989 is on top and the enterohaemorrhagic E. coli strain EDL933 is below. Details of the comparison can be viewed by zooming in, to the level of genes or DNA bases.

3. Mauve for contig ordering and multiple genome comparison

Developed by the wonderful Aaron Darling during his PhD; he is now Associate Professor at the University of Technology Sydney. Also see Mauve Assembly Metrics, an optional plugin for assessing assembly quality which was developed for the Assemblathon.

• When the alignment is finished, a visualization of the genome blocks and their homology will be displayed, as shown here. E. coli O104:H4 is on the top, red lines indicate contig boundaries within the assembly. Sequences outside coloured blocks do not have homologs in the other genomes.

4. BRIG (BLAST Ring Image Generator) for multiple genome comparison

From Nabil-Fareed Alikhan at the University of Queensland, also as part of a graduate project, which I believe is still in progress…

BRIG is a free tool that requires a local installation of BLAST (we used BRIG 0.95 on Mac OS X). The output is a static image.

• Launch BRIG and set the reference sequence (EHEC EDL933 chromosome) and the location of other E. coli sequences for comparison. If you include reference sequences for the Stx2 phage and LEE pathogenicity island, it will be easy to see where these sequences are located.

• Click ‘Next’ and specify the sequence data and colour for each ring to be displayed in comparison to the reference.

• Click ‘Next’ and specify a title for the centre of the image and an output file, then click ‘Submit’ to run BRIG.

• BRIG will create an output file containing a circular image like the one shown here. It is easy to see that the Stx2 phage is present in the EHEC chromosomes (purple) and the outbreak genome (black), but not the EAEC or EPEC chromosomes.

Tutorial

The tutorial accompanying the article is available here. To give you an idea of what’s covered, here is the table of contents:

1. Genome assembly and annotation

1.1 Downloading E. coli sequences for assembly

1.2 Examining quality of reads (FastQC)

1.3 Velvet – assembling reads into contigs

1.3.1 Using VelvetOptimiser to optimise de novo assembly with Velvet

1.4 Ordering contigs against a reference using Mauve

My only gripe is that only 5/11 (45%) articles in the issue are open access 😦

Luckily the best article is among the open access ones – a fantastic review of metagenomic studies, from experimental design and sampling right through to data analysis and submission to public archives, written by Hanno Teeling and Frank Glöckner from the Max Planck Institute for Marine Microbiology. Full text is online here or as a PDF.

Most of the other articles cover new tools for churning through your metagenomic sequence data and figuring out what is in there in terms of function and/or taxonomy. There are many approaches to this and several tools already out there including the very beautiful MG-RAST and Real Time Metagenomics. I have also been tinkering with these to explore the “pan-genomes” of various bacterial species where we have hundreds of genomes available… not quite what they were intended for but it seems to work quite nicely, and gives you some great insights into the spectrum of accessory genes that are flowing through various bacterial populations.

Just a quick post to say how much I love PHAST, the PHAge Search Tool.

It looks for possible prophages in your bacterial genomes, and makes such beautiful pictures of the results, like this summary of the five phage it found in a new Salmonella genome:

It also draws nice circular diagrams to show you where the phage are located, like this:

And it will even show you a nicely annotated figure of each individual phage it found, using an interactive Flash viewer:

My only gripe is that unlike some of the more visualization-challenged phage finders, PHAST doesn’t output actual annotation files, like GenBank or GFF or even a simple text table that would be straightforward to convert into GenBank… the format in which it prints out the actual information on where each phage is located in your sequence seems to be a home-grown text format that is not easy to parse with existing tools.

Oh well, I suppose I will have to write a little script to turn PHAST’s phage hunt results into a proper annotation… unless someone else has already done this?

Our paper on a mapping-based approach to extracting MLST data from Illumina short reads was recently published in BMC Genomics. We used read mapping because this has greater sensitivity than approaches which rely on assembly, especially for low-coverage data sets of genomes with extreme GC content or other sequencing issues. The approach is called SRST (short read sequence typing), and code and usage instructions are available from srst.sourceforge.net.

However, it is obviously useful to be able to extract MLST info from genome assemblies too. For example, many finished or WGS genome sequences in NCBI do not have ST information attached to them, or it is hard to find. Also, for 454 and perhaps Ion Torrent data, it can be easier to deal with homopolymer issues at the assembly level by using newbler/gsAssembler and then working with contigs.

There is a web service available that is designed to do this, i.e. you can upload your genomes and choose a MLST scheme, and it will return the ST. It is described in this paper and available at this URL. However, unfortunately I have never been able to get the website to load in any of my web browsers, so I’ve not been able to try it. Also, it is a pain to have to upload large amounts of data over the web, and this becomes completely infeasible when dealing with lots of genomes, so instead I use a simple script to extract MLST info via blast, which runs locally on my laptop or cluster.

I’m sure many people have written in-house scripts for this same task, but a few people have asked for mine recently and I figure it might save some others reinventing the wheel. The script simply uses BioPython to run a set of nucleotide blast searches in order to assign STs to genome assemblies. The inputs are just the latest set of allele sequences and profiles for the MLST scheme, and whatever genome assemblies you wish to determine STs for. The script will then determine the ST for each input genome, and if an exact match can’t be found, it will try to figure out the closest matching alleles and ST.
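To illustrate the last step only (this is not the script itself), here is a minimal sketch of looking up an ST once the BLAST searches have produced an allele number for each locus. The function name, the data layout and the toy profiles are all assumptions made for the example; a locus with no exact allele match is represented as None.

```python
def assign_st(alleles, profiles):
    """Return (st, mismatches) for the profile that best matches the calls.

    alleles:  dict of locus -> allele number, or None if no exact allele found
    profiles: dict of ST -> dict of locus -> allele number
    A result with mismatches == 0 is an exact ST; otherwise it is the closest
    profile. Ties go to whichever profile comes first in the dict.
    """
    best_st, best_mismatches = None, None
    for st, profile in profiles.items():
        mismatches = sum(
            1 for locus, allele in profile.items() if alleles.get(locus) != allele
        )
        if best_mismatches is None or mismatches < best_mismatches:
            best_st, best_mismatches = st, mismatches
    return best_st, best_mismatches


# Toy three-locus scheme, made up for illustration.
toy_profiles = {
    1: {"locusA": 1, "locusB": 1, "locusC": 1},
    2: {"locusA": 1, "locusB": 2, "locusC": 3},
}
print(assign_st({"locusA": 1, "locusB": 2, "locusC": 3}, toy_profiles))
print(assign_st({"locusA": 1, "locusB": 2, "locusC": None}, toy_profiles))
```

Reporting the number of mismatching loci alongside the closest ST makes it easy to tell an exact assignment from a near match, such as a single-locus variant.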

I’ve been working with large, whole genome phylogenies a lot lately, and wanting to overlay metadata associated with taxa in the tree.

For example, I have a table of resistance genes and mutations detected in each bacterial strain, and a phylogenetic tree showing the relationships between strains. I want to quickly and easily plot the tree and data together, so I can see whether the resistance genes are clustered together in a single clade or lineage, or if they are cropping up in lots of unrelated strains.

There are quite a few tools out there that can do something like this, but they all seem to have their drawbacks and issues, so I ended up hacking up an R script to do what I want. Here is a quick round-up of the tools I found, and the script I came up with.

iTOL, otherwise known as the interactive Tree Of Life. This is probably the one that most people know; it is a good tool and doesn’t require any programming. You just upload a newick tree file and tables of data in various formats (described here), and it can display all kinds of data (see examples on the front page). Figures can be exported in PDF and other formats, which is great for publications.

The data has to be in the correct format for iTOL before uploading. For my purposes (data = gene content or mutations) I usually use the heatmap or color-strip data types. Color-strip types are good for showing categorical variables, e.g. I often want to look at the distribution of different types of gyrA mutations that confer reduced susceptibility to fluoroquinolone-based drugs, and I need to have each mutation in a different colour. It is also good for showing the geographical location of isolates… for this, I like to use the ‘colour branches’ option, so that rather than displaying a box next to each strain indicating its location, the branches are coloured according to the location of the leaves below them… this is a good way to highlight clades within the tree that are geographically clustered (see the example below). To convert a simple table (column 1 = strain names, which match the leaves of the tree, other columns = categorical variables) into one suitable for uploading as a colour strip to iTOL, I wrote this simple Python script.
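The script is linked rather than shown, so here is a rough sketch of what such a conversion can look like. It assumes the tab-separated DATASET_COLORSTRIP annotation layout used by recent versions of iTOL, which may differ from what the site accepted when this post was written; the function name, category labels and colours are made up for the example.

```python
def to_colorstrip(rows, colours, label="category"):
    """Turn (leaf_name, category) pairs into iTOL colour strip annotation text.

    rows:    iterable of (leaf_name, category) tuples
    colours: dict of category -> hex colour, e.g. "#ff0000"
    """
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        "DATASET_LABEL\t" + label,
        "COLOR\t#000000",
        "DATA",
    ]
    for leaf, category in rows:
        # One line per leaf: leaf name, colour, and the category as a label.
        lines.append("\t".join([leaf, colours[category], category]))
    return "\n".join(lines) + "\n"


toy_rows = [("strain1", "typeA"), ("strain2", "typeB"), ("strain3", "typeA")]
toy_colours = {"typeA": "#ff0000", "typeB": "#0000ff"}
print(to_colorstrip(toy_rows, toy_colours, label="mutation"))
```

A category missing from the colour dictionary raises a KeyError, which is a quick way to catch typos in the input table before uploading.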

Here is an example with resistance mutations shown as a colour strip (red/blue/black), plasmid coverage shown as a heatmap, and geographical locations indicated by branch colours (achieved using the ‘colour strip’ data type). Note that this is a screenshot from my browser, as the export option was not functioning correctly today (a drawback of web-based services).

Some drawbacks of iTOL, at least for my purposes, are:

Lack of tools/options for displaying or editing the tree itself. You can have rectangular or circular dendrograms, but you can’t control the look of the tree (e.g. colouring branches or changing branch weights, rotating subtrees, controlling ladderizing, etc). Some of this can be got around by editing the newick file first in some other program, and changing colours and weights later by adjusting the PDF output by iTOL. But this is fiddly and I find the SVG graphics in the iTOL PDFs hard to manipulate in Illustrator.

You can’t display multiple heatmaps on the same figure. With colour strips, you can tick the ‘prevent overlap’ box so that the colour strip is presented adjacent to the previous one rather than over the top of it… so if I have one colour strip representing the location of the isolate, and one representing the presence of different drug resistance mutations, these strips can appear together, side-by-side on the same graph. On the other hand, if I have a heatmap showing the presence of a set of resistance genes, and a second heatmap showing say the MICs of certain drugs, I can’t display these next to each other on the same plot, but only one at a time.

It is web-based. While there are some advantages to this, I find it frustrating to have to upload and download things all the time, especially for a program like this that doesn’t actually need to access any other databases or compute clusters. You can however establish a private account, and keep track of your trees in separate projects and groups, which is nice. There is also an API and batch access, if you want to link it in with other web services.

Another option is the ETE2 module for Python. This is essentially a package for navigating and displaying phylogenetic trees, and it has some really great features. You need to have some familiarity with Python to be able to use this, but if that’s not a problem for you, then it is worth delving into. It allows you to display sequence alignments, images, protein domains, heatmaps, graphs etc overlaid on tree nodes, leaves or next to the tree, and has the option of half-circle tree plots in addition to the usual full circle tree layout. For examples see the web page. ETE comes with its own graphical interface for displaying trees, as well as being able to write them out directly to image files.

ETE2 is incredibly flexible and I can do just about everything I want with it… EXCEPT enforce tree ladderizing when plotting a tree alongside a heatmap. It can ladderize trees, but not when you are using the heatmap display. So unless your tree happens to be naturally ladderized, it can look a bit strange. This is very annoying, as for most of my data, the phylogenetic structure is much clearer when the tree is ladderized, like this:

Ladderized (left) vs unladderized (right)

Anyway, assuming ladderizing isn’t of great concern, it is easy to plot a heatmap next to a tree. Like iTOL, you just need a tree in newick format, and a matrix representing your heatmap. Unlike iTOL, you need to make sure first that the names in your tree and matrix match up 1:1. Also, your matrix should be tab-separated and the first column (which contains the leaf names) should be titled ‘#Names’.
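Since mismatched names are an easy way to end up with a confusing plot, a small check before plotting can help. The sketch below is plain Python and does not use ETE2 itself; it assumes you already have the list of leaf names from your tree and the matrix as tab-separated text, and the function name is made up for the example.

```python
def check_matrix(leaf_names, matrix_text):
    """Check a tab-separated matrix against the tree's leaf names.

    Returns (missing_from_matrix, missing_from_tree) as two sorted lists.
    Raises ValueError if the first column is not titled '#Names'.
    """
    lines = [line for line in matrix_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    if header[0] != "#Names":
        raise ValueError("first column must be titled '#Names'")
    row_names = {line.split("\t")[0] for line in lines[1:]}
    leaves = set(leaf_names)
    return sorted(leaves - row_names), sorted(row_names - leaves)


toy_matrix = "#Names\tgeneA\tgeneB\nstrain1\t0\t1\nstrain2\t1\t1\n"
print(check_matrix(["strain1", "strain2"], toy_matrix))
print(check_matrix(["strain1", "strain3"], toy_matrix))
```

Two empty lists mean the names match up 1:1; anything else lists the offending names on each side.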

You can also colour in the leaf nodes and add other data, akin to iTOL, using ETE2. At the end of the post is some example code for how to colour in the leaf nodes with location data, and add a colour strip indicating our resistance mutations, based on the ETE2 tutorial.

Drawbacks of ETE2, for my purposes:

As with iTOL, there is not adequate control over the layout of the tree, so we can’t get nicely ladderized trees. I also had problems with the branch leading from the root of the tree being drawn as super long by ETE, which makes it awkward to render as a nice image.

For heatmaps, you have virtually no control over the colour scheme. There are 3 options, from the ETE2 reference: “0=green & blue; 1=green & red; 2=red & blue. In all three cases, missing values are rendered in black and transition color (values=center) is white.” However, when I tried colourscheme 2 (red & blue) with binary data (0, 1), I got 0 showing up as black and 1 showing up as red, no matter how I specified the minimum, maximum and center values.

For me, using Python is not a drawback but mostly a strength, as it allows fine control over some of the aspects of the display, and it means you aren’t relying on web services functioning well, or uploading and downloading files. It can also facilitate building the tree drawing into other data analysis pipelines (much like iTOL batch could be, I suppose). But for many people this will be a hindrance.

(For those of you who don’t know R, what follows will not make much sense… maybe look at the pictures to decide if it’s worth learning more about R!)

Heatmaps can be displayed easily in R using the ‘image’ function, and there are loads of heatmap functions that will take a data matrix, cluster the rows and/or columns of the matrix and display the reordered heatmap (plotted using the ‘image’ function on the reordered matrix) alongside a dendrogram representing the clustering. However, I haven’t found a version of this function that allows you to pre-specify the dendrogram as a phylo object, which would allow us to easily display a predefined phylogenetic tree alongside a heatmap representation of associated data.

The R class ‘phylo4d’ does kind of what I want. It can contain a phylogenetic tree and data matrix, mainly for analysis rather than data display, although there are a couple of functions for plotting the data against the tree. The most versatile I could find is the ‘table.phylo4d’ function in the ‘adephylo’ package. However, there is still very little control over the way the data is displayed and it can’t give you an actual heatmap. Instead it represents values in the data matrix by the size of circles or squares laid out in a grid. For example, here is what you get by following the examples in the adephylo manual:

This is OK, but I’d rather have a heatmap than scaled circles.

Although we can’t use the ‘image’ command to get this with adephylo, we can approximate it using ‘symbol="colors"’ in the table.phylo4d function, setting the plotted points to squares using ‘pch=15’ and specifying a colour set using ‘cols=X’. Here it is with my data, using a simple black=present, white=absent colour scheme:

Note that you could do much the same with continuous data, just leave the default ‘cols=heat.colors()’ or set it to another range of colours of your choosing, as you might with the regular ‘image’ or ‘heatmap’ functions in R.

In this example, I’ve switched off printing of the leaf labels, but you could switch this back on using ‘show.tip.label=T’. Also in this example I’ve specified that the tree should take up 25% of the width of the image, but you can change this by changing ‘ratio.tree=0.25’ to something else.

Note I’ve coloured the leaves of the tree according to location, using a file with the leaf names in column 1 and locations (categories) in column 2:

This is not really a heatmap, and unlike the heatmap or image functions you will need to play with the layout – including symbol size (cex.symbol) and the size of the drawing device in order to get something that looks good and renders all the data points visible.

My simple R function for plotting a heatmap against a tree

While the phylo4d option is pretty good, I figure that since I’m in R, I should be able to use the more powerful ‘image’ function to draw a proper heatmap alongside my tree. So I wrote a little R function to achieve this, called plotTreeData. It uses the ‘ape’ library to read and plot the tree, so you’ll need to have this installed. To use the function, just download the text file here and load it into R like this:

source("plotTreeData.R")

The required inputs are:

treeFile (path to tree file in newick format, tip labels must match those in the data file)

matrixFile (path to matrix file in csv format, column 1 must contain strain identifiers that match the tip labels in the tree file; other columns contain the data for plotting as a heatmap; column names should be provided. Note that if you already have the matrix loaded into R (e.g. if you have created or manipulated it in R) you can just provide the R object here.)

e.g. plotTreeData("tree.nwk","matrix.csv") will generate a figure like this, with a greyscale heatmap, in the R graphics device:

To write the image to a PDF or PNG file, provide a file name for the figure via ‘outputPDF=’ or ‘outputPNG=’ (note you can’t do both at once). You can optionally provide width and height using ‘w=’ and ‘h=’ otherwise the defaults for pdf() or png() functions will be used (note for PDF, the units are in inches while for PNG, the units are in pixels).

So to render a figure as a PDF: plotTreeData("tree.nwk","matrix.csv",outputPDF="tree.pdf",w=10,h=20)

Other options include:

matrix.colours: By default, the heatmap is rendered in greyscale (white to black), but this can be changed using ‘matrix.colours=’, e.g. matrix.colours=heat.colors(10).

To get both the heatmap and the node colouring, we need to create a new ProfileFace (rather than using the default “heatmap” one) that includes both the heatmap and a coloured circle at each leaf indicating the location (see figure below):

I’ve been a little slow to catch up on the results of the Assemblathon, a competitive assembly event where teams use their best method(s) to generate assemblies from raw read data and the results are compared by a variety of metrics. The results from the first assemblathon, using simulated read sets, are now available pre-publication from Genome Research. The second assemblathon, using real (Illumina and Illumina+454) data from eukaryotic genomes, is happening now.

Firstly I think this is a brilliant idea and there should be far more of it in bioinformatics! So many of us are engaged in the same basic analysis tasks for dealing with short read sequence data – assembly, mapping and variant calling – but there are so many different programs & approaches (see this compilation over at seqanswers.com) out there that it quickly becomes overwhelming.

So, the results are in but, as always in the comparison of methods, there is not really a clear-cut winner. Each assembly was assessed using an enormous set of metrics (>100 apparently), including N50 (at contig and scaffold levels), miscalled bases, depth of coverage, misassemblies, etc… and unsurprisingly there was no single assembly that scored top on all metrics. BGI’s SOAPdenovo, Broad’s ALLPATHS, and Sanger’s SGA were consistently among the best for most metrics… but with clear differences. E.g. for contig N50 SOAPdenovo and ALLPATHS were both superior to SGA, which performed better than the others on scaffolding N50. SGA had the fewest substitution errors, but SOAPdenovo had fewer copy number errors and ALLPATHS had the best contig-level stats. For all the gory details see the results website or summaries in the paper [free at Genome Res] or this talk [PDF] presented at the Cold Spring Harbor Laboratory Biology of Genomes meeting.
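For anyone unfamiliar with N50, the metric mentioned above, it is the length of the shortest contig (or scaffold) in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of the usual calculation follows; it is not taken from the Assemblathon code, and the example lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

Because N50 depends only on lengths, two assemblies can share the same N50 while differing greatly in accuracy, which is one reason the Assemblathon used so many other metrics alongside it.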

I am trying to wrap my head around how informative this is for assembling bacterial genomes. I know a lot of people run their own in-house comparisons to determine the best approach for a particular project, but the assemblathon approach is systematic, and manages to be both competitive and collaborative, which is an awesome combination. While bacterial genomes are small and therefore raise fewer computational issues associated with large data & memory requirements, assembling them is still far from trivial and is often a crucial element of the analysis, because gene content is so variable among even very closely-related bacteria. The parameters I usually have in mind when considering bacterial assemblers are:

impact of different sequencing platforms & error profiles

impact of different insert sizes for paired or mate-pair reads

genomes with high or low GC

genomes with excessive IS elements

I guess the only aspect of this that’s missing from the current/planned assemblathon datasets is the effect of high or low G+C content, i.e. low complexity sequence, which isn’t really bacteria-specific anyway (think e.g. P. falciparum, the malaria parasite).
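As a quick way to see where a genome sits on the G+C spectrum mentioned above, here is a minimal sketch that computes G+C content from a sequence string. The function name is made up for the example, and ambiguous bases such as N are simply left out of the denominator, which is one convention among several.

```python
def gc_content(seq):
    """Return the G+C fraction of a DNA sequence, ignoring non-ACGT characters."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


print(gc_content("GGCCAATT"))
print(gc_content("ATATATNNGC"))
```

Running this over sliding windows instead of the whole sequence would highlight local extremes, which are often where short read assemblies break down.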