A tutorial on organizing and plotting data with R

Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three weeks course, in which we take 25 evolutionary biologists with no prior knowledge of programming, we lock them in a room together with some very good teaching assistants, and we keep explaining them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and about bioinformatics. The nice things is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, which has been called the PEB conference and is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow-up the students, and I have been in charge of organizing a couple of workshops, one about installing software on linux, and one about advanced R programming.

The tutorial for advanced R programming is available on github and below. I think it may be of interest for anybody with some knowledge of R, but wishing to learn some new tricks. In particular the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes about data analysis that I saw in Biostar. It describes the differences between a dataframe in a wide or a long format, how to convert one to the other, and what are the advantages of doing that. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.

DIfference between a dataset in a wide or a long format

PEB advanced R workshop

Giovanni M DallOlio

28/04/2015

PEB advanced R workshop

Welcome to the PEB advanced R workshop!

We will take a “messy” gene expression table, and use the tidyr library to restructure it in a format that is better suited for data analysis. We will also group the data and calculate summaries using the dplyr library, and finally plot it using ggplot2.

For a better experience see the full version of this tutorial on github:

The first two columns contain the probe ID and the name of the gene. The other columns contain the expression levels of each gene in an individual. Notice how the name of the columns also encode the population of each individual, e.g. YRI, EUR, or EAS.

A piping system for R

The dplyr package introduced a piping system for R, using the %>% symbol.

This works similarly to the piping system in bash, but using the %>% symbol instead of the pipe |. We first write the name of the dataframe to use, then all the operation that must be executed on it.

For example, the following is equivalent to head(peb)

1

peb%>%head

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

## ID_REF IDENTIFIER GSM11815.YRI GSM11832.YRI GSM12069.YRI

## 1 200000_s_at PRPF8 4254.0 5298.2 4026.5

## 2 200001_at CAPNS1 17996.2 12010.7 10283.5

## 3 200002_at RPL35 41678.8 39116.9 38758.9

## 4 200003_s_at RPL28 65390.9 34806.2 31257.2

## 5 200004_at EIF4G2 19030.1 15813.6 16355.7

## 6 200005_at EIF3D 8824.5 9706.2 10590.0

## GSM12083.YRI GSM12101.YRI GSM12106.EUR GSM12274.EUR GSM12299.EUR

## 1 3498.4 3566.4 4903.1 6372.6 4829.1

## 2 2534.7 11048.4 13354.0 8563.8 17247.6

## 3 32847.7 39633.9 43511.2 46856.7 47032.4

## 4 28308.5 67447.5 56989.9 57972.5 57570.5

## 5 9579.7 14273.5 17217.0 19116.9 17487.6

## 6 6986.7 9400.4 12835.2 10299.0 12375.2

## GSM12412.EUR GSM11810.EUR GSM11827.EUR GSM12078.EAS GSM12099.EAS

## 1 5205.8 2756.8 3932.0 3729.9 3223.4

## 2 16018.5 6077.0 15703.8 10138.5 11614.4

## 3 22152.2 26660.7 26373.6 23809.6 24749.3

## 4 29062.2 35140.9 23629.3 22100.5 21651.0

## 5 14671.6 17733.1 18022.4 17957.4 15958.0

## 6 7645.4 8661.5 7355.7 6973.4 6855.9

## GSM12269.EAS GSM12287.EAS GSM12301.EAS GSM12448.EAS

## 1 3640.5 4886.3 4070.2 3482.1

## 2 8460.5 10282.6 11844.3 9741.6

## 3 21936.8 31462.8 22733.7 25395.5

## 4 18550.7 23496.5 21315.4 28631.4

## 5 15799.8 16685.8 18817.3 17421.1

## 6 7949.2 9486.5 7494.5 7252.1

We can concantenate any number of operations on the same dataset, just as we would do in bash. We will play with this piping system in a few minutes.

Converting to a long format

A dataset can be encoded in a “wide” format (more columns and less rows), or in a “long” format (minimum number of columns and more rows). While the wide format can be more readable for the human eye, the long format is better suited for data analysis.

Our peb data frame is in a “wide” format, as it contains many columns, one for every individual. However, all the functions used in the rest of the workshop require the dataset to be in a long format. Let’s convert it using the gather() function from tidyr:

1

2

peb.long=gather(peb,sample,expression,-c(ID_REF,IDENTIFIER))

head(peb.long)

1

2

3

4

5

6

7

## ID_REF IDENTIFIER sample expression

## 1 200000_s_at PRPF8 GSM11815.YRI 4254.0

## 2 200001_at CAPNS1 GSM11815.YRI 17996.2

## 3 200002_at RPL35 GSM11815.YRI 41678.8

## 4 200003_s_at RPL28 GSM11815.YRI 65390.9

## 5 200004_at EIF4G2 GSM11815.YRI 19030.1

## 6 200005_at EIF3D GSM11815.YRI 8824.5

Explanation:

sample is the name of the new column containing the key variables

expression is the new column containing the values variable

we use the – operator to define which columns must not be converted to the long format

Note how the new format encodes exactly the same data as before, but has only four columns.

Nowadays many recent R libraries and functions are designed for datasets in the long format. A common beginner mistake is trying to apply these functions to datasets in the wide format, while the trick is to restructure the data first. If you learn how to reorganize your dataset into a long format, then you can solve most data analysis problems in R using always the same approach.

If you couldn’t install the tidyr package, you can achieve the same format using the melt function from reshape2

Tidying-up the peb.long dataframe

In a properly structured table each variable must contain only one type of information. If we look at our peb.long dataframe, the sample column contains both the individual ID and its population. Let’s split this column into two, using separate from tidyr:

To prepare the dataset for our workshop, we need to apply a few more data filtering steps, like removing all the “Control” rows, eliminating all the duplicated genes (keeping only one probe per gene), and dropping the ID_REF column.

1

peb.long%>%nrow

1

## [1] 384965

1

2

3

4

5

6

7

8

peb.long=peb%>%

gather(sample,expression,-ID_REF,-IDENTIFIER)%>%

separate(sample,into=c("individual","population"),sep='\\.')%>%

subset(!grepl("Control",IDENTIFIER))%>%

subset(!duplicated(paste(IDENTIFIER,individual)))%>%

select(-ID_REF)

peb.long%>%nrow

1

## [1] 305184

Group operations

Apart from the %>% operator, the dplyr library introduces three useful functions: group_by, summarise and mutate.

group_by is used to define how the rows of a dataset must be grouped. For example, we can group the rows of peb.long by population.

summarise is used to calculate summaries of a grouped dataset – for example we can use it to calculate the mean and standard deviation of the expression for every population:

1

2

3

4

5

6

7

peb.means=peb.long%>%

group_by(population)%>%

summarise(

mean=mean(expression),

sd=sd(expression))

peb.means%>%print

1

2

3

4

5

6

## Source: local data frame [3 x 3]

##

## population mean sd

## 1 EAS 747.4846 2696.084

## 2 EUR 742.9541 2818.128

## 3 YRI 744.1489 2856.510

mutate is used to add columns to a data frame, or to modify existing columns. For example we can use it to annotate whether each gene is over or under expressed, compared to the expression of other genes in the same individual: