Friday, June 24, 2011

Today, in one of my more productive days, I managed to create a sleek R script that plotted several histograms in a lattice, allowing for easy identification of the underlying trend. Although the majority of the time taken consisted of collecting the data and making various adjustments, it took a not inconsiderable amount of work to write the code.

As I was cursing the apply function – not for the last time I am sure – I suddenly realised the insane level of productivity that I have come to see as "par for the course". The level of computational analysis that can be conducted in a few hours with nothing more than a desktop PC, a broadband connection, and copious amounts of caffeine is phenomenal. No longer can a lack of computing power or software be blamed for a lack of productivity growth: rate-limiting factors are now exclusively human.

Here at least is my contribution to the collective intelligence of biologicals. I have noticed that the most common reason that people avoid R is that they cannot rapidly make graphs that meet the high standards of their clients. In the next few editions of this blog we will build up a basic repertoire of plotting techniques, focusing on graphics that are sickeningly impressive.

Of course, Rome was not built in a day, and a thorough knowledge of R plotting cannot be built in one. Instead we will progress one layer at a time, adding additional levels of complexity and functionality. We start with a simple script that allows us to plot several graphs at the same time, each with a different value of a key variable. Here is the output:

Figure 1: Multiple plots using par

The code

# Clear all objects
rm(list=ls())
# Create a data set using random variables
df<-data.frame(x=rnorm(160),y=runif(160),a=sample(c(1,0,-1,10),
replace=TRUE,160))
df$z<-with(df,{3*a*y+x})
# Create a function that plots the value of "z" against the "y" value
plotM<-function(l){
df.temp<-df[df$a==l,]
plot(df.temp$y,df.temp$z,xlab="Y Value",ylab="Z Value",
main=paste("Value of key variable: ",toString(l)))
abline(lm(df.temp$z~df.temp$y),col="red")}
# Create a grid to plot the different values of "a"
par(mfrow=c(2,2))
# Loop through each value of "a" and call the plotM function
for (i in c(1,0,-1,10)){
plotM(i)}

Walk-through
The first few lines of code create a data frame with 4 variables: x,y,z,a. Three of these variables are randomly generated, with the z variable dependent upon the other 3. Suppose that we are analysing this data set, unaware of the relationship between the x,y and z variables. A preliminary inspection shows that there are only 4 observed values of a. It seems sensible to plot z against y for each of these different a values.

The relevant code to create this plot starts with the function "plotM", standing for plot Multiple. This function takes an argument "l" that determines which value of the variable we are plotting. The line

df.temp<-df[df$a==l,]

filters the data frame to include only those rows where the variable a=l. The next line simply creates a standard R plot of y versus z. Finally, we use the abline function to plot a linear fit to highlight the trend. Too easy.

Now we come to the fun part. Using the par(mfrow=c(2,2)) we create a 2x2 grid in which to place the next four plots. Whatever plots we now create will be placed sequentially into this grid. Hence we can iterate over a vector containing the values of a, calling plotM each time. This gives us our grid.

Comments
Okay, so the graph does not look like it came from NASA – or to be honest with NASA's ailing reputation maybe it does. But note that once we have created the plotM function, we only have to write 3 lines of code to make 4 separate charts. Moreover, the code would not increase even if we were plotting 100 charts.

Of course, we have not yet even drawn upon any of R's custom plotting packages. In the next edition of this blog we will look at how to use the ggplot2 package to add colour and a wide range of other features to our graphs.