Are fitness genes more conserved across mammals?

A recently published paper by Hart et al presented a genome-wide CRISPR screening to identify fitness genes (a superset of essential genes) in five cell lines. The paper is quite impressive and shows the potentiality of CRISPR to generate large scale knockouts and to characterize the importance and function of genes in different conditions.

In the discussion the authors propose that fitness genes are more likely to be more conserved across species. However they do not follow-up on this hypothesis, probably for lack of space. They can’t be blamed as they already present a lot of results in the paper.

Distribution of conservation scores in the phastcons.100way.UCSC.hg19 track. Are essential genes more conserved than other genes?
Distribution of conservation scores in the human genome. Are essential genes more conserved than other genes?
This post presents a follow-up analysis on the hypothesis that fitness genes are more conserved than non-essential genes. I’ll take the original data from the paper, get the conservation scores from bioconductor data packages, and do a Wilcoxon test to compare the two distribution. The full code is available as a github repository, and please feel free to contribute if you want to do some free R/Bioconductor analysis.

Getting the data

Like most bioinformatics pipelines, half of the time is spent in preparing the data and importing other datasets. Luckily with the BioConductor data packages we can access a wide range of datasets without many efforts.

1. loading data & calculating number of rows

The starting point is table S3 from the paper, containing a list of 17,661 gene symbols, along with a bayesian score indicating whether gene is a “fitness gene” (essential) in a given cell line. Unfortunately the table doesn’t contain any gene id, so we will have to convert the gene symbols manually.

The data can be read using the xlsx library. In this analysis I will make an heavy use of dplyr and ggplot2, so let’s load these packages as well:

The left_join command added a new column called gene_id, containing the entrez id of each gene.

At this point we should check whether all the id conversions occurred correctly. In principle if we had a 1:1 correspondence between gene symbol and entrez, we would expect to find 17,649 unique symbols (as we saw previously), and 17,649 entrez. However this doesn’t seem to be the case:

To identify which genes have not been converted correctly, we can just subset the rows where gene_id is NA. To improve readability, we can select only the symbol column, transpose it, and paste it to get a list of symbols that are not converted:

The output shows that there are a lot of Orf genes, that some gene symbols have been converted to dates, and that other symbols are not identified. The latter must be un-official gene symbols used by the authors instead of the official Hugo symbols.

These problems can be solved by manually renaming the “date” genes, and by using the org.Hs.egALIAS table instead of org.Hs.egSYMBOL. This will however generate duplicated ids, which will have to filter out:

The reason why the conservation scores are not in Annotation Hub is that there is a separate package for them. Thanks to Robert Castelo from UPF we can get the scores from the package phastCons100way.UCSC.hg19:

The first instruction gets the coordinates of all human genes from the TxDb object, imported when we loaded Homo.sapiens. We also filter this object to retain only the genes tested in the paper.

The second instruction gets the conservation scores across 100 species, from the phastCons package.

Finally, we print the allgenes object to see if everything went well. Notice that only 17,095 of the 17,352 unique genes have coordinates – the rest are probably withdrawn genes or genes with no annotation.

3. Are core fitness genes more conserved across species?

Now that we have the correct entrez id and the scores, we can join them in a dataframe and start doing some analysis.

4 -10.849 -24.690 1
5 21.261 -33.525 0
6 5.207 -29.754 5
The first command extracted the gene_id and conservation scores from the GenomicRanges object (mcols), transformed it to a data.frame, joined it with the other table to add the bayesian fitness scores, and then added a column to distinguish fitness and non-fitness genes.

Distribution of conservation scores for fitness and non-fitness genes. Unfortunately, there is no much difference there!
Unfortunately, after all these efforts, it seems that there is not much difference!

To verify this further, we can compare the two distributions with a wilcoxon test:

Our p-values here is pretty clear. The two distributions are not significantly different!

Conclusions

This post provided an example on how the Bioconductor data packages allow to easily get the coordinates and ids of a set of genes, and intersect them with other annotation tables from other sources.

Unfortunately my analysis on the conservation of fitness genes proved to be inconclusive, as there is no much difference between the two sets. However there are many other things that can be tried, from using other tests to determine sequence conservation (e.g. dN/dS), to restricting the analysis to fewer species. If you have any other idea or proposal, feel free to drop a comment here or to contribute to the github repository.