Friday, February 4, 2011

For my third post on my R tutorials for beginners and intermediate users, I shall finally touch on the subject matter that prompted me to start these tutorials - plotting with group structures in colour.

If you are familiar with R, then you may have noticed that assigning group structure is not all that straightforward. You can have a dataset that may have a column specifically for group structure such as this:

and you'd hope that there is an intuitive and easy way of specifying colour or grouping structure based on this last column. The short answer is, yes, there is. But when I was a novice, it was not necessarily easy to grasp.

But before we go into that, let's review simple plotting in R.

Let's say we want to plot the first 2 principal components from a principal components analysis (PCA) based on the complete dataset introduced briefly above. I just wanted to show PCA plots instead of the original variables because I wanted to represent all three variables in two dimensions. So assuming that we have the PCA results and we call the scores along the first two PCs, PC1 and PC2 respectively, then we can plot them like this:

plot(PC1, PC2)

and it would give you a default plot that looks like this:

The default plot setting is as shown above, with simple open circles. If you don't like the symbols, you can change them with the pch argument:

plot(PC1, PC2, pch=19)

Here, I use pch=19, which draws a filled circle.
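If you want to follow along without the original dataset, the objects PC1 and PC2 could be produced along the following lines. This is only a sketch: the measurements are random numbers, the column names Var1 to Var3 and the clade labels are made up, and scaling the variables before the PCA is an assumption rather than something stated above.

```r
# Hypothetical stand-in data: 42 taxa, three made-up variables, 12 made-up clades.
set.seed(1)
data <- data.frame(
  Var1   = rnorm(42),
  Var2   = rnorm(42),
  Var3   = rnorm(42),
  Family = rep(sprintf("clade%02d", 1:12), length.out = 42)
)

# PCA on the three numeric columns; scale. = TRUE standardises each variable first
# (an assumption, the original analysis may not have scaled the data).
pca <- prcomp(data[, 1:3], scale. = TRUE)

# Scores along the first two principal components.
PC1 <- pca$x[, 1]
PC2 <- pca$x[, 2]

plot(PC1, PC2, pch = 19)
```

The layout mirrors the table described at the start, with the grouping information in the fourth column.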

Now let's start adding group structure using colour.

First, we must assign the group structure. If the data table above is called data, then the column containing group information is either data[,4] or data$Family. We need to use this information to assign grouping structure to the plot. One way of achieving this is to convert the contents of this column (a vector in R) into a "factor" and call the resulting R object something like family:

family <- as.factor(data[,4])

A factor in R is basically categorical data, so each unique entry, like "Allosauroidea" or "Tyrannosauroidea", is not just a random line of text but is treated as a category within a factor variable. Now family is an R object of class factor with 12 levels, i.e., the individual clades. So the object family acts as our grouping structure. We can use this grouping structure to assign colours with the col argument to our plot like this:

plot(PC1, PC2, pch=19, col=family)

and the resulting plot is as follows.

In this plot, the data points have been separated into 12 levels, each with a corresponding colour. However, if you look closely, there are only eight colours in the above plot: black, red, green, blue, cyan, magenta, yellow and gray. This is because the default colour palette in R contains only these eight colours, and any colour assignment beyond the eighth just recycles them in order. In our case, the first to the eighth clades in family, Allosauroidea to Ornithomimosauria, get assigned the colours from black to gray, and the cycle starts again from the ninth group onwards, i.e., outgroup = black, Oviraptorosauria = red, Therizinosauroidea = green, and Tyrannosauroidea = blue. We can resolve that problem by using a different palette.
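The recycling can be checked without drawing anything. The sketch below uses a made-up 12-level factor in place of family (the clade labels are hypothetical) and works out which palette entry each level would receive. In R 4.0.0 and later the default shades differ slightly from the colours named above, but the default palette still has eight entries.

```r
# Hypothetical factor with 12 levels, standing in for `family`.
family <- factor(rep(sprintf("clade%02d", 1:12), length.out = 42))

palette("default")        # make sure the default palette is active
n <- length(palette())    # number of colours in the default palette

# A factor is stored as integer codes 1..12; col= treats them as palette
# indices and wraps around once a code exceeds the palette length.
codes <- as.integer(family)
wrapped <- (codes - 1) %% n + 1
used.colours <- palette()[wrapped]

# Levels 1 and 9 share a colour, as do 2 and 10, and so on.
table(level = levels(family)[codes], colour = used.colours)
```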

There are several built-in colour palettes that can give your plot quite different looks. You can have a look at some of these by typing help(palette) or help(colors) in R. A few examples are: rainbow, heat.colors, topo.colors, and terrain.colors. To change the palette from the default to one of these, we first have to create a colour palette with one of the following:

col.rainbow <- rainbow(12)
col.topo <- topo.colors(12)
col.terrain <- terrain.colors(12)

where the "12" within the parentheses indicates the number of colours in the palette. To set one of these colour palettes, for instance col.rainbow, as the global colour palette, just type:

palette(col.rainbow)

Once you've done that, you can retype the plotting function:

plot(PC1, PC2, pch=19, col=family)

but this time, the colour scheme would be different:

Next, if we use palette(col.topo) instead:

and palette(col.terrain):
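Because palette() changes a global setting, every later plot in the same R session is affected until the palette is changed again. A minimal sketch of saving the current palette and putting it back afterwards (the object names old.palette and n.topo are just for illustration):

```r
old.palette <- palette()      # remember whatever palette is active now
palette(topo.colors(12))      # switch to a 12-colour topo palette
n.topo <- length(palette())   # 12 colours are now available to col=
palette(old.palette)          # put the earlier palette back
# palette("default") would instead reset to R's built-in default
```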

All these different colour palettes are pretty and useful in different ways, but maybe they're not exactly what you're looking for, or maybe you want to control what colours to use for each group. For that, you'd have to specify your own colours. To specify the colours you want, you can just directly specify colour in the plotting function:

plot(PC1, PC2, pch=19, col="red")

which gives a plot like this:

But this shows just one colour, which is fine if you're going to overlay each group in a different colour, for instance using points(). Maybe I'll expand on points() in a later post. You can also specify a vector of colours if you want to use more than one colour:

plot(PC1, PC2, pch=19, col=c("red","black"))

and the resulting plot would look like this:

It's not readily obvious, but this plot just alternates between red and black throughout the whole data set; Acrocanthosaurus = red, Allosaurus = black, Archaeopteryx = red, Bambiraptor = black, and so on. Here's the same plot with taxa names superimposed using text():

The same thing would happen if you provide a vector of 12 colours to reflect our 12 clades; the 12 colours would alternate through the list of 42 taxa instead of through the 12 clades. There are two ways of resolving this (maybe there are more, but I know of two).
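The alternation described above is just R recycling a short colour vector along the points in row order. A small sketch of what col=c("red","black") amounts to for 42 points (n.taxa and per.point are illustrative names):

```r
n.taxa <- 42                                 # number of points, as in the text
two.colours <- c("red", "black")

# The colour vector is repeated until there is one entry per point,
# regardless of which clade each point belongs to.
per.point <- rep_len(two.colours, n.taxa)
head(per.point, 4)
```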

First, you can prepare a vector of colour names such as "red", "black" or "burlywood" so that members of the same clade are assigned the same colour, for instance Allosauroidea = red and Tyrannosauroidea = blue. Naturally, you'd have a vector of length 42 (or number of taxa) with 12 different colour names (corresponding to the number of groups). You can view all the available colour names in R by typing colors(). Instead of typing all 42 colour assignments straight into the plotting function, we can create an R object in advance. The simplest way to do this is to create an extra column in your dataset in Excel containing the appropriate colour names. If this column is called Colours within your dataset data then you can call it up as data$Colours, or if it was the fifth column, then data[,5]. And because we'd want R to recognise the colour names as characters it's probably best if we use the as.character( ) function:

colour <- as.character(data$Colours)

If you don't use as.character( ), for instance like this:

colour <- data$Colours

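then colour may be a factor rather than a character vector (in R versions before 4.0.0, read.csv() and read.table() convert text columns to factors by default). Plotting with a factor will not use your preferred colours; instead, the factor's integer codes are looked up in whatever palette is currently set. So if the palette is still the default, you'd get just eight colours.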
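The difference is easy to see on a tiny made-up column. Here Colours is a hypothetical four-entry factor, not the real data:

```r
# Hypothetical colour column stored as a factor.
Colours <- factor(c("seagreen", "blue", "seagreen", "red"))

levels(Colours)         # levels are sorted alphabetically
as.integer(Colours)     # the codes that col= would look up in the palette
as.character(Colours)   # the colour names themselves
```

The integer codes only record each entry's position among the sorted levels, so the colour names are lost unless the factor is converted back with as.character( ).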

Using the character vector instead of the factor vector as the colour argument we can plot again:

plot(PC1, PC2, pch=19, col=colour)

and get this plot:

and with text superimposed:

You can now clearly see that each clade has its own colour and every member of a given clade is the same colour. For instance, all tyrannosauroids are in blue, while allosauroids are in red.
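If you would rather not add the colour column in Excel, the same per-taxon character vector can be built inside R with a named lookup vector. This is an alternative sketch, not the approach used above; clade.colours is a hypothetical name and only two clades are shown:

```r
# Hypothetical grouping for four taxa.
family <- factor(c("Allosauroidea", "Tyrannosauroidea",
                   "Allosauroidea", "Tyrannosauroidea"))

# One colour per clade, named by clade.
clade.colours <- c(Allosauroidea = "red", Tyrannosauroidea = "blue")

# Look up each taxon's clade by name to get one colour per taxon.
colour <- unname(clade.colours[as.character(family)])
colour
```

Indexing with as.character(family) matters: it looks colours up by clade name, so the order in which clade.colours is written does not have to match the order of the factor levels.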

The second way of getting this plot is to create a character vector of 12 colour names:

col.list <- c("red","slategray","seagreen",....,"blue")

and set this vector as the colour palette:

palette(col.list)

If you plot using the factor vector family as the grouping structure:

plot(PC1, PC2, pch=19, col=family)

you should get the same plot as the one using the character vector colour for col=colour shown above.
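One thing to watch with this second approach is that the palette is matched to the factor levels by position, so the colours in col.list need to be listed in the same order as levels(family). A two-clade sketch (the short col.list here is a hypothetical stand-in for the full 12-colour vector):

```r
family <- factor(c("Tyrannosauroidea", "Allosauroidea", "Tyrannosauroidea"))
levels(family)                  # alphabetical order; palette entries follow it

col.list <- c("red", "blue")    # 1st level -> red, 2nd level -> blue
palette(col.list)

# The colours col=family would pick for each point.
point.colours <- palette()[as.integer(family)]

palette("default")              # restore the built-in palette afterwards
```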

I think that is about it for colour plotting, and I hope I did not make it too confusing. In summary, you can specify colour schemes either 1) by referring to a character vector (containing the colour names; e.g., colour) in the plotting function, or 2) by setting a character vector (e.g. col.list) as the colour palette and specifying a factor vector (e.g. family) as the col argument in your plotting function.

In a future post, I plan to expand on the points( ) function I briefly mentioned, and also touch on how to superimpose text using the text( ) function.

About Me

I am a postdoctoral research associate in palaeontology at the University of Bristol. My current project involves the phylogeny and evolution of living and fossil cats, but I also have a strong interest in theropod dinosaurs. I use this blog to discuss interesting aspects of palaeontology, whether it be the science, or palaeo art - the latter simply posting my own drawings.