Genomics and evolution of bacterial pathogens

Main menu

Monthly Archives: May 2012

I’ve been working with large, whole genome phylogenies a lot lately, and wanting to overlay metadata associated with taxa in the tree.

For example, I have a table of resistance genes and mutations detected in each bacterial strain, and a phylogenetic tree showing the relationships between strains. I want to quickly and easily plot the tree and data together, so I can see whether the resistance genes are clustered together in a single clade or lineage, or if they are cropping up in lots of unrelated strains.

There are quite a few tools out there that can do something like this, but they all seem to have their drawbacks and issues, so I ended up hacking up an R script to do what I want. Here is a quick round-up of the tools I found, and the script I came up with.

Otherwise known as interactive Tree Of Life. This is probably the one that most people know, it is a good tool and doesn’t require any programming. You just upload a newick tree file and tables of data in various formats (described here), and it can display all kinds of data (see examples on the front page). Figures can be exported in PDF and other formats, which is great for publications.

The data has to be the correct format for iTOL before uploading. For my purposes (data = gene content or mutations) I usually use the heatmap or color-strip data types. Color-strip types are good for showing categorical variables, e.g. I often want to look at the distribution of different types of gyrA mutations that confer reduced resistance to fluoroquinolone-based drugs, and I need to have each mutation in a different colour. It is also good for showing the geographical location of isolates… for this, I like to use the ‘colour branches’ option, so that rather than displaying a box next to each strain indicating its location, the branches are coloured according to the location of the leaves below them…this is a good way to highlight clades within the tree that are geographically clustered (see the example below). To convert a simple table (column 1 = strain names, which match the leaves of the tree, other columns = categorical variables) into one suitable for uploading as a colour strip to iTOL, I wrote this simple Python script.

Here is an example with resistance mutations shown as a colour strip (red/blue/black), plasmid coverage shown as a heatmap, and geographical locations indicated by branch colours (achieved using the ‘colour strip’ data type). Note that this is a screenshot from my browser, as the export option was not functioning correctly today (a drawback of web-based services).

Some drawbacks of iTOL, at least for my purposes, are:

Lack of tools/options for displaying or editing the tree itself. You can have rectangular or circular dendrograms, but you can’t control the look of the tree (e.g. colouring branches or changing branch weights, rotating subtrees, controlling ladderizing, etc). Some of this can be got around by editing the newick file first in some other program, and changing colours and weights later by adjusting the PDF output by iTOL. But this is fiddly and I find the SVG graphics in the iTOL PDFs hard to manipulate in Illustrator.

You can’t display multiple heatmaps on the same figure. With colour strips, you can tick the ‘prevent overlap’ box so that the colour strip is presented adjacent to the previous one rather than over the top of it… so if I have one colour strip representing the location of the isolate, and one representing the presence of different drug resistance mutations, these strips can appear together, side-by-side on the same graph. On the other hand, if I have a heatmap showing the presence of a set of resistance genes, and a second heatmap showing say the MICs of certain drugs, I can’t display these next to each other on the same plot, but only one at a time.

It is web-based. While there are some advantages to this, I find it frustrating to have to upload and download things all the time, especially for a program like this that doesn’t actually need to access any other databases or compute clusters. You can however establish a private account, and keep track of your trees in separate projects and groups, which is nice. There is also an API and batch access, if you want to link it in with other web services.

Another option is the ETE2 module for Python. This is essentially a package for navigating and displaying phylogenetic trees, and it has some really great features. You need to have some familiarity with Python to be able to use this, but if that’s not a problem for you, then it is worth delving into. It allows you to display sequence alignments, images, protein domains, heatmaps, graphs etc overlaid on tree nodes, leaves or next to the tree, and has the option of half-circle tree plots in addition to the usual full circle tree layout. For examples see the web page. ETE comes with its own graphical interface for displaying trees, as well as being able to write them out directly to image files.

ETE2 is incredibly flexible and I can do just about everything I want with it… EXCEPT, to enforce tree ladderizing when plotting a tree alongside a heatmap. It can ladderized trees, but not when you are using the heatmap display. So unless your tree happens to be naturally ladderized, it can look a bit strange. This is very annoying, as for most of my data, the phylogenetic structure is much clearer when the tree is ladderized, like this:

Ladderized (left) vs unladderized (right)

Anyway, assuming ladderizing isn’t of great concern, it is easy to plot a heatmap next to a tree. Like iTOL, you just need a tree in newick format, and a matrix representing your heatmap. Unlike iTOL, you need to make sure first that the names in your tree and matrix match up 1:1. Also, your matrix should be tab-separated and the first column (which contains the leaf names) should be titled ‘#Names’.

You can also colour in the leaf nodes and add other data, akin to iTOL, using ETE2. At the end of the post are some example code for how to colour in the leaf nodes with location data, and add a colour strip indicating our resistance mutations, based on the ETE2 tutorial.

Drawbacks of ETE2, for my purposes:

As with iTOL, there is not adequate control over the layout of the tree, so we can’t get nicely ladderized trees. I also had problems with the branch leading from the root of the tree being drawn as super long by ETE, which makes it awkward to render as a nice image.

For heatmaps, you have virtually no control over the colour scheme. There are 3 options, from the ETE2 reference “0=green & blue; 1=green & red; 2=red & blue. In all three cases, missing values are rendered in black and transition color (values=center) is white.” However when I tried colourscheme 2 (red & blue) with binary data (0, 1), I got 0 showing up as black and 1 showing up as red, however I specified the minimum, maximum and center values.

For me, using Python is not a drawback but mostly a strength, as it allows fine control over some of the aspects of the display, and it means you aren’t relying on web services functioning well, or uploading and downloading files. It can also facilitate building the tree drawing into other data analysis pipelines (much like iTOL batch could be, I suppose). But for many people this will be a hindrance.

(For those of you don’t know R, what follows will not make much sense… maybe look at the pictures to decide if it’s worth learning more about R!)

Heatmaps can be displayed easily in R using the ‘image’ function, and there are loads of heatmap functions that will take a data matrix, cluster the rows and/or columns of the matrix and display the reordered heatmap (plotted using the ‘image’ function on the reordered matrix) alongside a dendrogram representing the clustering. However, I haven’t found a version of this function that allows you to pre-specify the dendrogram as a phylo object, which would allow us to easily display a predefined phylogenetic tree alongside a heatmap representation of associated data.

The R class ‘phylo4d’ does kind of what I want. It can contain a phylogenetic tree and data matrix, mainly for analysis rather than data display, although there are a couple of functions for plotting the data against the tree. The most versatile I could find is ‘table.phylo4d’ function in the ‘adephylo’ package. However, there is still very little control over the way the data is displayed and it can’t give you an actual heatmap. Instead it represents values in the data matrix by the size of circles or squares laid out in a grid. For example, here is what you get by following the examples in the adephylo manual:

This is OK, but I’d rather have a heatmap than scaled circles.

Although we can’t use the ‘image’ command to get this with adephylo, we can approximate it using ‘symbol=”colors”‘ in the table.phylo4d function, setting the plotted points to squares using ‘pch=15’ and specifying a colour set using ‘cols=X”. Here it is with my data, using a simple black=present, white=absent colour scheme:

Note that you could do much the same with continuous data, just leave the default ‘cols=heat.colors()’ or set it to another range of colours of your choosing, as you might with the regular ‘image’ or ‘heatmap’ functions in R.

In this example, I’ve switched off printing of the leaf labels, but you could switch this back on using ‘show.tip.label=T’. Also in this example I’ve specified that the tree should take up 25% of the width of the image, but you can change this by changing ‘ratio.tree=0.25’ to something else.

Note I’ve coloured the leaves of the tree according to location, using a file with the leaf names in column 1 and locations (categories) in column 2:

This is not really a heatmap, and unlike the heatmap or image functions you will need to play with the layout – including symbol size (cex.symbol) and the size of the drawing device in order to get something that looks good and renders all the data points visible.

My simple R function for plotting a heatmap against a tree

While the phylo4d option is pretty good, I figure that since I’m in R, I should be able to use the more powerful ‘image’ function to draw a proper heatmap alongside my tree. So I wrote a little R function to achieve this, called plotTreeData. It uses the ‘ape’ library to read and plot the tree, so you’ll need to have this installed. To use the function, just download the text file here and load it into R like this:

source(“plotTreeData.R”)

The required inputs are:

treeFile (path to tree file in newick format, tip labels must match those in the data file)

matrixFile (path to matrix file in csv format, column 1 must contain strain identifiers that match the tip labels in the tree file; other columns contain the data for plotting as a heatmap; column names should be provided. Note that if you already have the matrix loaded into R (e.g. if you have created or manipulated it in R) you can just provide the R object here.)

e.g. plotTreeData(“tree.nwk”,”matrix.csv”) will generate a figure like this, with a greyscale heatmap, in the R graphics device:

To write the image to a PDF or PNG file, provide a file name for the figure via ‘outputPDF=’ or ‘outputPNG=’ (note you can’t do both at once). You can optionally provide width and height using ‘w=’ and ‘h=’ otherwise the defaults for pdf() or png() functions will be used (note for PDF, the units are in inches while for PNG, the units are in pixels).

So to render a figure as a PDF: plotTreeData(“tree.nwk”,”matrix.csv”,outputPDF=”tree.pdf”,w=10,h=20)

Other options include:

matrix.colours: By default, the heatmap is rendered in greyscale (white to black), but this can be changed using ‘matrix.colours=’, e.g. matrix.colours=heat.colors(10).

To get both the heatmap and the node colouring, we need to create a new ProfileFace (rather than using the default “heatmap” one) that includes both the heatmap and a coloured circle at each leaf indicating the location (see figure below):