Author: Giovanni Marco Dall'Olio

The last 18 months have brought quite a radical career change for me, because I made the infamous move: leaving academia and starting to work in industry.

My career from Bologna to Barcelona, and from London to the British Countryside.

To be honest, I am quite happy with the change. I’ve learned many things, discovered another way to do science, and possibly made some contributions. Moving from academia to industry sometimes has a bad reputation, but these months taught me that developing a drug takes many resources: not only a smart idea in the lab, but also a lot of validation, regulation, planning, marketing, budgeting, understanding of the impact on patients, and much more.

Where am I working exactly?

I am in the pre-clinical department of a big pharma company, GSK. More specifically, my department is called Target Sciences, and its main goal is to identify and validate new targets (in layman’s terms: genes or biological entities) to treat indications (in layman’s terms: diseases or phenotypes).

The R&D department of GSK is structured in several Discovery Performance Units (DPUs), which are small independent units working on a specific therapy area. For example, there could be one DPU focused on Oncology and another on Asthma and respiratory diseases. These DPUs are like small start-ups within the company, and each of them carries a few drug targets through the drug discovery process.

Drug discovery process – I am in the first phase. Source: https://www.slidegeeks.com/shapes/product/business-steps-powerpoint-templates-marketing-drug-discovery-process-ppt-slides

My department helps all these DPUs identify and evaluate drug targets, providing expertise in computational biology, genetics, statistics, and experimental validation. It’s like a center of excellence that interacts with all the rest of R&D.

Identifying the correct target is important because it is the first decision in the drug development process, and an error in this step can be quite expensive. Imagine what happens when a clinical trial fails in phase III because the original drug was targeting the wrong gene: it is quite a big waste of resources, not only for the company but also and more importantly for the patients.

What is target identification, and what is my role?

In layman terms, identifying a drug target involves answering the following question: if I want to treat disease X, which would be the best genes to target?

From a computational point of view, there are several ways to answer such a question. You may simply go to the literature (e.g. PubMed) and search for relevant articles. Other approaches involve looking at information from several sources, like gene expression, protein interactions, involvement in pathways, and much more. It is usually a matter of data integration, or data science.

If you want to get a more general idea of the types of sources used for target identification, you can have a look at the Open Targets Platform; this is a pre-competitive effort to curate and integrate data sources, supported by the EBI, GSK and other pharmas.

My role, in particular, is more focused on data integration and management than on pure analysis. It is about making the best use of the datasets we have access to, and understanding the value of acquiring a new dataset. It is also about improving communication about data usage, and discovering new technologies and methods to make use of the data.

What is good about working in a pharma, compared to academia?

Let’s say three things:

Teamwork. This is the answer that hurts the most, especially for me.
If you look at the previous posts in this blog, you can see how much I care about doing science in an agile way, planning properly and sharing information. The problem is that in academia, the pressure of having to publish first-author papers ruins it all.
In the academic world there is a lot of collaboration, especially online, as well as team meetings and journal clubs; but at the end of the day, your long-term prospects all depend on your own reputation in the scientific world. This is fair enough, but difficult to reconcile with real teamwork.

Lots to learn: everybody is usually involved in more diverse projects, and interacts with more people from different backgrounds. Thus, you tend to specialize less in a specific area, and learn a bit of everything. To be honest, I prefer this approach as it keeps my attention higher. I am glad that I did a PhD, during which I spent several years specializing in a single area, human genetics; however, now that I am older I like learning more about different fields.

Possibility to grow. You are generally more pampered and cared for than in academia. You are actively encouraged to take courses and learn new technologies, and my line manager complains if I am still in the office after 6 pm (to be honest, my PhD supervisor did too). There are opportunities to do secondments in other parts of the company and learn about clinical trials, finance, or anything related. Every year you define a list of objectives with your line manager, and you are evaluated on how well you reach them, in a fair process that values your efforts and accomplishments.

What is bad?

Politics. Unfortunately politics is everywhere, especially in a big international company. Luckily I am still unimportant enough that this doesn’t affect me much.

Simplification. Interacting with people from different backgrounds means that you need to simplify and learn to explain complex biological concepts in a way that is easy to understand. This is not easy and sometimes leads to funny effects, e.g. when you start hearing buzzwords and oversimplifications. On the bright side, at least I am improving my communication skills.

What’s next?

For personal reasons I haven’t written much in this blog lately, and I may not be able to write much in the near future. However, hopefully I’ll be able to write more about this new adventure, and describe how science is done from the industry side.

The aim of this initiative is to collect data on child growth and development from several sources, to study which factors influence child growth and how to better intervene when there are risks. Currently the data comes from manual annotation of several publications, but future plans include launching a global effort to collect data systematically, and one of the objectives of the hackathon was actually to guide the planning of this effort.

I had a lot of fun during the hackathon and learned a lot. For me personally it was an opportunity to learn more about the caret R package, which is a must-know library for doing machine learning in R. My plan for the hackathon was actually to do a trajectory clustering to see if there were different trajectories of growth of the baby during pregnancy, but unfortunately the analysis didn’t return very interesting results 🙂

Publons is a social network for peer reviewers, where you can list the papers you reviewed, get credit for them, and even post new reviews on published papers. I personally like the idea of Publons very much, because I think that reviewing papers is an important part of science, which unfortunately doesn’t get the recognition it deserves.

Preparing the materials for a workshop on bash programming is very difficult, because you never know which level of skill to expect from the people attending it.


Most of the time the class will be a mix of absolute beginners and expert Unix users, and it is not easy to prepare a presentation that will interest both. If the materials are too advanced, the beginners will get frustrated and stop paying attention. If the materials are too simple, expert users will soon get bored and distracted, and start working on their own things and checking Facebook.

In an attempt to avoid these issues, I’ve decided to go for a trick that hopefully would get the attention of even the most advanced bash guru, which is: hiding cows in the genome.

More precisely, for a workshop at the Programming for Evolutionary Biology conference held this year in Belgrade, I designed the exercises so that the instructions for the next step can only be retrieved by using the correct bash commands. Students start with a file of randomly generated text, and they have to use grep and other Unix tools to proceed to the next exercise. If the exercise is done correctly, they also see a cow.

I think it worked decently, because the students liked the idea and finding cows in the FASTA and BED files was fun.

The workshop’s materials are below. If you are a teacher and organize workshops on bash programming, I am officially challenging you to include something similar in your next presentation 🙂

Bioconductor does not only contain analysis packages, but also a good suite of data packages, frozen from the most important data sources for bioinformatics (e.g. EBI, NCBI, UCSC, etc.).

These data packages are useful because they give quick access to biologically relevant data without having to download it manually from the web. They are used internally by several analysis packages (e.g. to calculate ontology enrichment, get gene coordinates, etc.), and in a way they improve the reproducibility of your analysis: anybody who installs the same version of a package within R gets the same frozen copy of the data.

This slideshow provides a quick summary of all the data annotation packages available, how to use them and how this part of Bioconductor is evolving.

I prepared the slideshow for the second workshop I presented this year at the Programming for Evolutionary Biology conference in Belgrade. It is probably less glamorous than the Bash slideshow, as there are no hidden cows; however, it may be more useful, especially if you use Bioconductor regularly.

Disclaimer: I am not a Bioconductor developer, just a user. So apologies if I wrote anything wrong 🙂

Our group has been interviewed by LabWorm regarding our recent publication on Network of Cancer Genes 5.0.

I absolutely love the “artist’s impression” they made of our team:

The NCG team sketched by LabWorm. Thanos Mourikis, me, and Omer An.

LabWorm is a collaborative platform for sharing tools and links related to bioinformatics. They have a very modern and interactive user interface, and they are very active in adding new links and involving people in the platform.

Over my too many years of experience in the bioinformatics field, I have seen many attempts at creating collections of bioinformatics tools. Unfortunately many of these failed because of lack of interest or lack of maintenance. However, LabWorm seems to be doing things right for the moment, as they work hard to engage people in their community, and they even publish blog interviews with researchers.

The bioinformatics community really needs an effective way to share tools and links, and I really hope that LabWorm will be successful in their attempt.

A recently published paper by Hart et al. presented a genome-wide CRISPR screen to identify fitness genes (a superset of essential genes) in five cell lines. The paper is quite impressive and shows the potential of CRISPR to generate large-scale knockouts and to characterize the importance and function of genes in different conditions.

In the discussion the authors propose that fitness genes are more likely to be conserved across species. However, they do not follow up on this hypothesis, probably for lack of space. They can’t be blamed, as they already present a lot of results in the paper.

Distribution of conservation scores in the human genome. Are essential genes more conserved than other genes?

This post presents a follow-up analysis on the hypothesis that fitness genes are more conserved than other genes. I’ll take the original data from the paper, get the conservation scores from Bioconductor data packages, and do a Wilcoxon test to compare the two distributions. The full code is available as a GitHub repository, so please feel free to contribute if you want to do some free R/Bioconductor analysis.
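To give an idea of what the comparison looks like in practice, here is a toy sketch on simulated numbers. Everything in it is an assumption made for illustration: the scores are randomly generated, the names `fitness_scores` and `other_scores` are invented, and the actual analysis in the repository may be set up differently.

```r
# Toy illustration of a Wilcoxon rank-sum test between two groups of scores.
# All numbers are simulated; they are NOT the conservation scores from the paper.
set.seed(42)

# Hypothetical conservation scores (0-1 scale) for two gene sets
fitness_scores <- rbeta(200, shape1 = 5, shape2 = 2)   # skewed towards high values
other_scores   <- rbeta(2000, shape1 = 2, shape2 = 2)  # centered around 0.5

# One-sided test: are the fitness genes shifted towards higher scores?
result <- wilcox.test(fitness_scores, other_scores, alternative = "greater")

print(result$p.value)
print(median(fitness_scores) - median(other_scores))
```

The Wilcoxon test is a reasonable choice here because it compares ranks and does not assume that the scores are normally distributed.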

Bioconductor includes many powerful packages for working with genomics data. You can do pretty much everything, from downloading gene coordinates and sequences of any model species, to converting gene ids and symbols, to accessing ENCODE data and anything in UCSC, Ensembl, and other resources. However, these packages are not always well known, and the initial learning curve is steep, especially for R beginners.

This series of tutorials will describe how to get gene coordinates from Bioconductor, intersect these with some interesting dataset from ENCODE, and do an enrichment analysis with DOSE. It will be fun 🙂

Libraries required for this tutorial

For this tutorial we will use only Human data. Most of the packages needed for working with human genomics can be installed with a single command:
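The command itself is not shown in this copy of the post. A sketch with the current Bioconductor installer could look like the following; the use of `BiocManager` (which replaced the older `biocLite()` script) and the list of `library()` calls are my assumptions, pieced together from the packages discussed in the next paragraphs.

```r
# Assumption: BiocManager is the installer (current Bioconductor releases).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Homo.sapiens is a meta-package: it pulls in org.Hs.eg.db, a TxDb package
# with the UCSC known genes, GO.db, and their dependencies.
BiocManager::install("Homo.sapiens")

# Load the packages used in the rest of the tutorial.
# AnnotationHub (Bioconductor) and dplyr (CRAN) must be installed separately.
library(Homo.sapiens)
library(AnnotationHub)
library(dplyr)
```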

org.Hs.eg.db is what you need to convert all gene ids – from entrez to ensembl, to GO, and so on.

The data in these packages is updated periodically (I think every 6 months), and is pretty stable, meaning that anybody using the same packages and versions should be able to reproduce the same results. An alternative to using these data packages is biomaRt, but I prefer the data packages as they can be used without an internet connection.

The package AnnotationHub is used to retrieve data from multiple sources and will be described later. The BSgenome package is for retrieving the human genome sequence: we will not use it in the tutorial but I included it for completeness.

Note that I also loaded the dplyr package for this tutorial. Although dplyr is not needed for working with genomics data, I consider it one of the most useful packages in R, and this tutorial will make heavy use of it. I apologize if this tutorial is not easy to follow for people not familiar with dplyr.

Retrieving gene and transcript coordinates

The TxDb object can be used to retrieve coordinates of genes, transcripts, and exons in the human genome. For example, we can access all human transcripts with the transcripts() function:
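The call is not shown in this copy of the post. A minimal sketch, assuming the hg19 TxDb that comes with Homo.sapiens (`TxDb.Hsapiens.UCSC.hg19.knownGene`) and the variable name `human.transcripts` used later on:

```r
# Assumption: the TxDb object is the UCSC hg19 knownGene annotation
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Retrieve all transcripts as a GRanges object, keeping the Entrez gene id
# next to the transcript id and name
human.transcripts <- transcripts(txdb, columns = c("tx_id", "tx_name", "gene_id"))

human.transcripts
```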

See help(transcripts) for other functions that can be applied to a TxDb object. For example, genes() retrieves the coordinates of genes, while exons() and promoters() work similarly.

In the example above I also specified a “columns” parameter, in order to show the gene id as well. You can use this column to get the coordinates of a specific set of genes. For example, the following will retrieve the coordinates of the genes corresponding to entrez ids 1234, 231, and 421:

    > subset(human.transcripts, gene_id %in% c(1234, 231, 421))
    GRanges object with 5 ranges and 3 metadata columns:
          seqnames                 ranges strand |     tx_id     tx_name         gene_id
             <Rle>              <IRanges>  <Rle> | <integer> <character> <CharacterList>
      [1]     chr3 [ 46411633,  46417697]      + |     13703  uc003cpo.4            1234
      [2]     chr3 [ 46411633,  46417697]      + |     13704  uc010hjd.3            1234
      [3]     chr7 [134127107, 134143888]      - |     31418  uc003vrp.1             231
      [4]    chr22 [ 19957402,  19966734]      - |     74543  uc002zqy.3             421
      [5]    chr22 [ 19957402,  20004309]      - |     74544  uc002zqz.3             421

Converting Entrez IDs to symbols and other IDs

One of the trickiest parts of bioinformatics is converting gene ids to symbols and other ids. Many errors can be made in this process, so it is very important to have a consistent way to convert gene ids.

Luckily, we can use the org.Hs.eg.db package to easily convert many ids. This package should already have been loaded with library(Homo.sapiens). To see all the possible conversion tables (bimaps) available, we can either run library(help=org.Hs.eg.db) or simply write “org.Hs.eg” and then hit tab on the R command line.

One of my favorite bimaps is the one to convert gene symbols to entrez. As you may know, for historical reasons the same gene can have more than one symbol. This usually complicates things a lot, and a safe procedure is to always convert symbols to entrez before starting any analysis. The ALIAS2EG bimap is there for this type of conversion:

    > head(as.data.frame(org.Hs.egALIAS2EG))
      gene_id alias_symbol
    1       1          A1B
    2       1          ABG
    3       1          GAB
    4       1     HYST2477
    5       1         A1BG
    6       2         A2MD

When you convert symbols to ids, it is important to remember that not only can the same gene have more than one symbol, but the same symbol can also match multiple entrez ids. For example, here is the code to count how many entrez ids each symbol matches in the human species, with the most ambiguous symbols first:

    > as.data.frame(org.Hs.egALIAS2EG) %>%
        count(alias_symbol) %>%
        arrange(-n)
    Source: local data frame [118,097 x 2]

       alias_symbol     n
              (chr) (int)
    1            VH    36
    2           MT1    12
    3          GPCR    11
    4          HOX1    10
    5           ACT     9
    6          HOX2     9
    7        PPIase     9
    8         UDPGT     9
    9           ALP     8
    10         NAP1     8

Note: the “%>%” symbol and the count, arrange functions come from the dplyr package.

The only safe thing to do in these cases is to identify the duplicated symbols and either discard them or manually curate them. Let’s imagine that a fellow researcher asked us to retrieve information for the genes ACT, DOLPP1 and MGAT3. We can use the bimap to identify the symbols that map to multiple entrez ids, and then go back to our colleague and ask which are the correct ids.

    > mygenes <- c("ACT", "DOLPP1", "MGAT3")
    > as.data.frame(org.Hs.egALIAS2EG) %>%
        count(alias_symbol) %>%
        filter(alias_symbol %in% mygenes)
    Source: local data frame [3 x 2]

      alias_symbol     n
             (chr) (int)
    1          ACT     9
    2       DOLPP1     1
    3        MGAT3     2
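The "discard or curate" step can also be sketched in base R. The alias table below is made up for illustration (only the column names mirror the bimap, and the gene ids are placeholders); with real data you would start from as.data.frame(org.Hs.egALIAS2EG) instead.

```r
# Toy alias table with the same columns as the ALIAS2EG bimap.
# The gene ids are invented placeholders, not real Entrez ids.
aliases <- data.frame(
    gene_id      = c("101", "102", "103", "201", "301", "302"),
    alias_symbol = c("ACT", "ACT", "ACT", "DOLPP1", "MGAT3", "MGAT3"),
    stringsAsFactors = FALSE
)
mygenes <- c("ACT", "DOLPP1", "MGAT3")

# Number of distinct gene ids behind each requested symbol
hits  <- aliases[aliases$alias_symbol %in% mygenes, ]
n_ids <- tapply(hits$gene_id, hits$alias_symbol, function(x) length(unique(x)))

# Symbols that can be converted safely vs. those needing manual curation
unambiguous <- names(n_ids)[n_ids == 1]
ambiguous   <- names(n_ids)[n_ids > 1]

print(unambiguous)
print(ambiguous)
```

Keeping the two groups in separate vectors makes it easy to convert the safe symbols straight away and send only the ambiguous ones back for curation.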

In these examples I converted the bimap to a dataframe and then did an intersection. However, the “Bioconductor” way to use these bimaps is through the select function:
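The call is not shown in this copy of the post. A sketch of what it could look like, assuming the three symbols from the previous example and the `ALIAS` key type; the original command may have been different.

```r
library(org.Hs.eg.db)

mygenes <- c("ACT", "DOLPP1", "MGAT3")

# The namespace prefix keeps the call working even when dplyr::select masks it
conversion <- AnnotationDbi::select(
    org.Hs.eg.db,
    keys    = mygenes,
    keytype = "ALIAS",
    columns = c("ENTREZID", "ENSEMBL")
)

conversion
```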

One big problem with these Bioconductor packages is that they clash with many dplyr functions. For example, the select function gets masked if you load dplyr after Homo.sapiens, and the only way to avoid headaches is to explicitly refer to the function as AnnotationDbi::select. These conflicts in the namespace can cause a lot of confusion in R, because they lead to weird error messages that are completely unrelated to the real problem.

In any case, the advantage of the select function is that it allows you to retrieve several id types at the same time. For example, here I retrieved both entrez ids and ensembl ids, and if you type columns(org.Hs.eg.db) you will be able to see many other possible output columns.

Next parts of the tutorial

I was originally planning to write one big tutorial in the same post, but now I see that it would be much more readable if I split it into multiple posts.

Please let me know if you have any comments on this first tutorial; I will try to improve it and take the feedback into account for the next parts.

This has been a lovely and sunny weekend in London, but I didn’t see any of it because I spent it all crunching dataframes and calculating numbers at my first Data Dive.

Data Dives are events organized by an international organization called DataKind, in which a bunch of data scientists volunteer their time to solve data analysis problems for non-profit organizations. For example, I have been analysing data for My Help at Home, a company that helps elderly people find local carers, trying to understand which factors influence the demand and costs of private carers.

DataKind UK has a strict no-sharing policy regarding the results of the Data Dive, in order to protect the data made available by the charities. However, in the case of My Help at Home we used only publicly available data, so I guess I can show some of the results, based on the number of Homes, Agencies and Hospitals in the UK:

Here are a few thoughts about the experience:

I’ve decided that I will start introducing myself as a data scientist rather than a bioinformatician. Most people from outside academia do not really understand what a bioinformatician is, and it is easier to explain to them that you are a data analyst or scientist working on genetic and biological data. In the end the definition is correct: bioinformaticians really are a specialized type of data scientist.

This has been an opportunity to get in contact with the “real world” of data science outside academia. Most of the people I met work in the private sector, in areas like finance, consulting, gambling, and journalism. I only met a couple of people from academia, and they were both complaining about the lack of organization and planning at the university.

Thanks to dplyr and related libraries, R has become a really powerful tool for merging and assembling datasets. It helped me a lot during the phase of data cleaning and assembly, and I think that for these tasks it is much better than python or bash. I would recommend that anyone starting to learn R skip all the basic syntax and start directly with dplyr (e.g. see the tutorial I wrote for the PEB workshop).

The majority of people used python, in particular the IPython notebooks, for most of the tasks. Currently I am an R and dplyr person, but for machine learning tasks I am starting to think that python and scikit-learn can actually be more powerful.

People working in consulting, who for their work need to be able to easily create nice and interactive graphs, used visual solutions such as tableau rather than wrangling data in R or other programming tools. For example, the interactive graph above was created in a couple of minutes with noveau.

I’ve recently been a reviewer for the book “Bioinformatics with Python Cookbook” by Tiago Antao, one of the main authors of Biopython. The book is published by Packt Publishing, and it is a collection of recipes for several bioinformatics tasks, from reading large genome files to doing population genetics and other tasks.

Bioinformatics with Python Cookbook on my desktop, together with my zombie mug.

The GitHub account of the author contains links to all the python notebooks illustrated in the book. These notebooks are freely accessible, but they come without any explanation of the code; for that you will need to buy the book. Moreover, the book provides a link to a docker image that can be used to install all the materials and software needed to execute the examples. I think this is a smart way to provide materials for exercises, and I will copy the idea in the future.

Being a reviewer, I was expected to be an expert in all the topics described in the book. However I must admit that I learned a lot from reviewing it, and that some of the recipes presented managed to surprise me. Here is a quick summary of the new things I learned: