The most useful R function of the week: unnest from tidyr

There are many great functions on CRAN and Bioconductor, and saying that unnest from the tidyr package is the best of them would certainly be a big exaggeration. However, this function solves a data-formatting problem that made me waste a lot of time in the past, and I was surprised that no one had implemented a function for it earlier.

Imagine we have a dataframe like the following:

> mygenes
  Entrez symbols
    7841 MOGS,CDG2B,CWH41,DER7,GCS1
    4248 MGAT3,GNT-III,GNT3
    5728 PTEN,BZS,CWS1,DEC,GLM2,MHAM,MMAC11,TEP1

The first column contains the Entrez ID of each gene. This column is fine: it contains only one value per row, and it is easy to query or join with other dataframes. The second column, however, contains a comma-separated list of gene names, all associated with the same Entrez ID. This column is a mess to deal with, because we need to use grepl to query it, and we can’t join it with other dataframes as long as it is in this form.
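To make the problem concrete, here is a minimal sketch in base R. It rebuilds the example table by hand (the values are copied from the printout above; typing it in with data.frame is my assumption about how one would reproduce it) and shows how a grepl query on the comma-separated column can match more than intended.

```r
# Rebuild the example data frame by hand (values copied from the printout above).
mygenes <- data.frame(
  Entrez = c(7841, 4248, 5728),
  symbols = c("MOGS,CDG2B,CWH41,DER7,GCS1",
              "MGAT3,GNT-III,GNT3",
              "PTEN,BZS,CWS1,DEC,GLM2,MHAM,MMAC11,TEP1"),
  stringsAsFactors = FALSE
)

# Plain substring search: "TEP" is not a symbol in the table, but it still
# matches the third row because it is a prefix of "TEP1".
substring_match <- grepl("TEP", mygenes$symbols)

# Whole-symbol search: the pattern has to spell out the delimiters by hand.
whole_match_tep  <- grepl("(^|,)TEP(,|$)", mygenes$symbols)
whole_match_tep1 <- grepl("(^|,)TEP1(,|$)", mygenes$symbols)

# Entrez ID of the row that really contains the symbol TEP1.
mygenes$Entrez[whole_match_tep1]
```

Every query needs this kind of delimiter-aware pattern to avoid partial matches, which is exactly the bookkeeping a one-symbol-per-row layout removes.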

The unnest function from tidyr lets us convert this data frame into a “tidier” form, containing one row for each combination of Entrez ID and gene symbol (or alias):

> library(tidyr)
> library(dplyr)
> mygenes %>%
    mutate(symbols = strsplit(as.character(symbols), ",")) %>%
    unnest(symbols)
   Entrez symbols
1    7841 MOGS
2    7841 CDG2B
3    7841 CWH41
4    7841 DER7
5    7841 GCS1
6    4248 MGAT3
7    4248 GNT-III
8    4248 GNT3
9    5728 PTEN
10   5728 BZS
11   5728 CWS1
12   5728 DEC
13   5728 GLM2
14   5728 MHAM
15   5728 MMAC11
16   5728 TEP1

This code makes use of the %>% pipe operator and some functions from the dplyr package, but it is still R!
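For readers who want to see what the pipeline does step by step, here is a rough base R equivalent. It is only a sketch of the mechanics, not how tidyr implements unnest internally; the names symbol_list and long_genes are mine, and the data frame is rebuilt by hand from the printout above.

```r
# Rebuild the example data frame by hand (values copied from the printout above).
mygenes <- data.frame(
  Entrez = c(7841, 4248, 5728),
  symbols = c("MOGS,CDG2B,CWH41,DER7,GCS1",
              "MGAT3,GNT-III,GNT3",
              "PTEN,BZS,CWS1,DEC,GLM2,MHAM,MMAC11,TEP1"),
  stringsAsFactors = FALSE
)

# Step 1: strsplit() returns a list with one character vector per row.
symbol_list <- strsplit(as.character(mygenes$symbols), ",")

# Step 2: repeat each Entrez ID once per symbol and flatten the list,
# which gives one row per (Entrez ID, symbol) pair.
long_genes <- data.frame(
  Entrez = rep(mygenes$Entrez, lengths(symbol_list)),
  symbols = unlist(symbol_list),
  stringsAsFactors = FALSE
)

nrow(long_genes)
```

The mutate step in the tidyr version corresponds to step 1 (it creates a list column), and unnest corresponds to step 2.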

Having the dataframe in this long form makes it a lot easier to work with. For example, let’s imagine that somebody asks us for the Entrez IDs of the gene symbols DER7 and DEC. We would just have to do a simple subset on the dataframe:

> mygenes %>%
    mutate(symbols = strsplit(as.character(symbols), ",")) %>%
    unnest(symbols) %>%
    subset(symbols %in% c("DER7", "DEC"))
   Entrez symbols
4    7841 DER7
12   5728 DEC
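The long form also makes joins possible, which was the other complaint about the comma-separated column. The sketch below uses merge from base R so that it runs without extra packages; the scores table and its values are entirely made up for illustration, and long_genes is rebuilt with base R rather than taken from the unnest call above.

```r
# Rebuild the example data frame by hand (values copied from the printout above).
mygenes <- data.frame(
  Entrez = c(7841, 4248, 5728),
  symbols = c("MOGS,CDG2B,CWH41,DER7,GCS1",
              "MGAT3,GNT-III,GNT3",
              "PTEN,BZS,CWS1,DEC,GLM2,MHAM,MMAC11,TEP1"),
  stringsAsFactors = FALSE
)

# Long form: one row per (Entrez ID, symbol) pair.
symbol_list <- strsplit(mygenes$symbols, ",")
long_genes <- data.frame(
  Entrez = rep(mygenes$Entrez, lengths(symbol_list)),
  symbols = unlist(symbol_list),
  stringsAsFactors = FALSE
)

# Hypothetical second table keyed by gene symbol; the scores are invented.
scores <- data.frame(
  symbols = c("DER7", "DEC", "TEP1"),
  score = c(0.8, 0.3, 0.5),
  stringsAsFactors = FALSE
)

# An ordinary equality join on the symbol column now works.
annotated <- merge(long_genes, scores, by = "symbols")
annotated
```

With the original wide table the same lookup would need one pattern match per symbol instead of a single join.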

This is just a silly example, which could have been solved with some application of apply and grepl, but in the real world there are many more complex applications for it. For example, here is some code I used to split BLAT output into one line per exon (or BLAT alignment block):