Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three weeks course, in which we take 25 evolutionary biologists with no prior knowledge of programming, we lock them in a room together with some very good teaching assistants, and we keep explaining them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and about bioinformatics. The nice things is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, which has been called the PEB conference and is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow-up the students, and I have been in charge of organizing a couple of workshops, one about installing software on linux, and one about advanced R programming.

The tutorial for advanced R programming is available on github and below. I think it may be of interest for anybody with some knowledge of R, but wishing to learn some new tricks. In particular the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes about data analysis that I saw in Biostar. It describes the differences between a dataframe in a wide or a long format, how to convert one to the other, and what are the advantages of doing that. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.

I would like to give a big thanks to Guangchuang Yu, the author of many cool R libraries like GoSemSim and ggtree, for implementing a Network of Cancer Genes enrichment function in the DOSE R library.

The new function is called enrichNCG, and can be found in the github version of DOSE. You can use it to analyze a list of genes, and determine if they are enriched in genes known to be mutated in a given cancer type. For example, a random list composed by genes having an Entrez Id between 4000 and 9000 is enriched in genes mutated in sarcoma and leukemia:

<img class="alignnone wp-image-1579 size-large"src="http://bioinfoblog.it/wp-content/uploads/2015/04/random1_cnet-1024x729.png"alt="the cnet visualization of a randomly generated dataset of genes, using enrichNCG from DOSE, which derives data from the Network of Cancer Genes."width="640"height="456"/></a>

It is worth to mention that the DOSE package allows to calculate enrichment in the Disease Ontology database, which associates genes to disease terms. In my experience, for bioinformaticians Disease Ontology is more useful than OMIM, because it provides a clear association between genes and disease terms. If you use the raw OMIM data instead, you will have to text mine the descriptions and that can lead to a lot of noisy data.

I don’t usually post online petitions, but this is a bit personal to me as it regards the closure of a big research institute close to my hometown in Italy.

The Mario Negri Sud was a research institute active for the last 30 years in Abruzzo, a region in the center-south of Italy. During all these years the institute achieved excellence in many fields, from cardiovascular diseases to breast tumors, from diabetes to some rare diseases, and much more. I remember reading about an article on polycythemia vera (a rare cancer) published on NEJM just before financial problems halted most of the research, and more work by a neuroscientist who was really affected by the stress of the situation.

Unfortunately last week, after 4 years of financial struggle, the institute was officially declared bankrupt. Still, this is not the worst part. Given the financial situation it is likely that the workers of the institute, aproximately 160 people between researchers and staff, will not receive their pay from the last 18 months, moreover they will not get their pension funds (the Italian TFR), which for some people amounts to tens of thousands of euros, accumulated in more than 25 years of work.

These are people who dedicated almost all their life to research, and it is very unfair that they are threated this way. It is well known that the life of researchers is full of sacrifices and is never financially stable. To think that after many years they are denied a pension and abandoned to their fate is really inconceivable. Politicians were not able to solve the situation, and they are probably guilty of making it worse. Moreover given the geographical isolation of the institute, this situation hasn’t received much attention from the media outside of Abruzzo.

If you want to sign the petition, just click on this link to change.org. The petition is in Italian, and basically asks to the presidents of the Mario Negri institute, the Abruzzo region and the Chieti province (the three founding entitites of Negri Sud), to at least pay the pension and the salary of these workers. The change.org website will ask for your name, direction, postal code, and email. The website may later send you additional emails regarding other petitions, but you can opt out at any time.

My previous post on docker and bioinformatics received some good attention on Twitter. It’s nice because it means that this technology is getting the right attention in the bioinformatics community.

Here are a few resources and articles I’ve found thanks to the conversations in Twitter.

Performances of Docker on a HPC cluster – a nice article showing that running a NGS pipeline in a docker container costs about 4% of the performances. It’s up to you to decide whether this is a big or a small price to pay.

biodocker is a project by Hexabio aiming at providing many containers with bioinformatics application. For example, you can get a virtual machine with biopython or samtools installed in a few minutes. Update: this may have been merged with bioboxes (see discussion)

oswitch is a nice implementation of docker from the Queen Mary University of London, which allows to quickly switch between docker images. I like the examples in which they run a command from a virtual image and then return directly to another environment.

ngeasy, a Next Generation Sequencing pipeline implemented on Docker, by a group from the King’s College of London (I work in the same institute but I didn’t know them!).

Docker is another innovation for data analysis introduced in 2014. I am surprised by how many good things were released last year, including docker and the whole dplyr/tidyr bundle. Let’s see what 2015 will bring!

In the last year I have been part of the team maintaining and updating the Network of Cancer Genes database, also known as NCG.

The Network of Cancer Genes database

The main focus of NCG is to provide a curated list of genes associated to cancer, obtained after a manual review of the literature, and classified by cancer subtypes. Moreover NCG annotates some system-level properties of genes associated to cancer, from their protein interactions to their evolutionary age, and from the presence of paralogs in the human genome to their function.

NCG is a small database and is not supported by any big consortium, but we do our best to fill our niche :-). The following list will describe you what you can get from NCG and how can it be useful to you.

I have recently come across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll leave you to the article for more details, but basically Docker is an easy way to define a virtual machine, which makes it very straightforward for other people to reproduce the results of an analysis, with little effort from our side.

For example, let’s imagine that we are just about to submit a paper, and that our main results are based on the Tajima’s D index from the data in 1000 Genomes. The journal may ask us to show how to reproduce the analysis: which files did we used as input? Which tool did we use to calculate the Tajima’s D?

The first part of this docker file will set up an ubuntu virtual machine, and install all the software needed to execute the pipeline: tabix, vcftools, snakemake. The second part will clone the latest version of the pipeline in the virtual machine, and then use tabix to download a portion of chromosome 22 from the 1000Genomes ftp. The third part runs the pipeline, by executing a snakemake rule.

This will take quite a while to run, and will build a docker virtual image in your system. Afterwards, you can run the following:

1

docker run-i-tbioinfoblog_test/bin/bash

This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline, and eventually, if this pipeline were more complex than a mock example, run other rules and commands.

This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.

There are many great functions in CRAN and BioConductor, and certainly saying that unnest from the tidyr package is the best is a big exaggeration. However this function solved a big problem in data formatting that made me waste a lot of time in the past, that I was surprised no one had implemented a function for it yet.

Imagine we have a dataframe like the following:

1

2

3

4

5

>mygenes

Entrezsymbols

7841MOGS,CDG2B,CWH41,DER7,GCS1

4248MGAT3,GNT-III,GNT3

5728PTEN,BZS,CWS1,DEC,GLM2,MHAM,MMAC11,TEP1

The first column contains the Entrez of each gene. This columns is fine, as it contains only one value per row, and it is easy to query or join with other dataframes. The second column, however, contains a comma-separated list of gene names, all associated to the same Entrez IDs. This column is a mess to deal with, because we need to use grepl to query it, and we can’t join it with other dataframes as long as it is in this form.

The unnest function from tidyr allows to convert this data frame in a “tidier” form, containing one row for each combination gene symbol and alias:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

>library(tidyr)

>library(dplyr)

>mygenes%>%

mutate(symbols=strsplit(as.character(symbols),","))%>%

unnest(symbols)

Entrez symbols

17841MOGS

27841CDG2B

37841CWH41

47841DER7

57841GCS1

64248MGAT3

74248GNT-III

84248GNT3

95728PTEN

105728BZS

115728CWS1

125728DEC

135728GLM2

145728MHAM

155728MMAC11

165728TEP1

This code makes use of the %>% and some functions from the dplyr package, but it is still R!

Having the dataframe in this long form makes it a lot easier to deal with it. For example, let’s imagine that somebody asks us to get the Entrez IDs for the list of gene symbols DER7 and DEC. We would just have to do a simple subset on the dataframe:

1

2

3

4

5

6

7

8

>unn%>%

mutate(symbols=strsplit(as.character(symbols),","))%>%

unnest(symbols)%>%

subset(symbols%in%c("DER7","DEC"))

Entrez symbols

47841DER7

125728DEC

This is just a silly example, which may have been solved with some application of apply and grepl, but in the real world there are a lot of more complex applications for it. For example, here is some code I used to split Blat output into one line per exon (or blat alignment block):

clusterProfiler is a nice R library for doing GO and KEGG enrichment analysis. It has a simple interface and it can produce some clear plots using the ggplot engine. Today I contributed a formula interface to clusterProfiler, making it easier to do enrichment of multiple groups of genes.

Let’s imagine you have a dataframe in which one column contains a list of Entrez Ids, while the other columns encode some grouping variables:

1

2

3

4

5

6

7

8

>mygenes

Entrez group othergroup

1Agood

100Abad

1000Agood

100101467Bbad

100127206Bgood

100128071Bgood

The new formula interface allows to do a GO analysis on each of the groups. For example, we can group them by the column “group”, and compare the classification of the two groups:

In this case group A is enriched in membrane and extracellular region, while group B is only enriched in membrane genes. The groupGO function used here doesn’t provide p-values – we should have used enrichGO instead. I guess that the 3 Entrez ids in group A correspond to 5 genes in Gene Ontology, so that’s why the plots shows a total of 5 in group A. See clusterProfiler’s documentation for better examples.

Of course this example is not much interesting, since it is only 6 randomly chosen genes. However with bigger datasets the formula interface can be much more powerful.

Now that I think of it, it would be better if compareCluster would return a dataframe with multiple columns, instead of merging them into a single column called Cluster. This would be make it possible to plot the results using facets or something more fancy. It would be something like this:

1

2

3

4

5

6

7

8

9

group othergroupIDDescription Count GeneRatiogeneID

1Abad GO:0016020membrane11/2100

2Abad GO:0005576extracellular region11/2100

3Agood GO:0016020membrane11/31000

4Agood GO:0005576extracellular region22/31/1000

5Bbad GO:0016020membrane00/2

6Bbad GO:0005576extracellular region00/2

7Bgood GO:0016020membrane11/2100127206

8Bgood GO:0005576extracellular region00/2

However this would probably require introduce some retro-incompatibility in the library, and it is not a big deal as the Cluster column can be easily split using the separate function from tidyr.

I think that 2014 has been the year in which my R programming style has changed the most. This is because a lot of innovative and nice libraries have been released, like dplyr, magrittr and tidyr. I started in January as a ddply enthusiast, and now instead my code is full %>% instructions and dplyr functions.

If you missed these libraries, a good starting point is the article “Principles of Tidy Dataset“, in which the author Hadley Wickham suggests some best practices for organising a dataset in a “tidy” format before doing any analysis. These practices will be already familiar to you if you have experience with the reshape/reshape2 packages, and if you used ggplot2 in the past. However, it is good to read a good summary as in the article.

Inspired by this article, I wrote a post on Biostar to discuss how a popular format in bioinformatics – the VCF – in a tidy format. Here is the link to the discussion.

The VCF in a tidy format would like more or less as below. On one hand, it would be a bit too redundant, and many columns would be replicated multiple times, making the file more sensible to typos introduced by the users and occupying more disk space. On the other hand, it would be easier to read, more flexible and able to accommodate other informations, like the population of each individual or more info about the genotype quality.