Homologene

I am the author of a small package called homologene. The package wraps
the identically named homologene database released by NCBI. This database includes
information about genes that are homologues/orthologues of each other. It is structured
as a simple table. One can simply access it by doing

In general I like using this method because it’s fast, the syntax is simple and it’s easy to extend it at a whim. A single set of gene symbols and IDs are used to identify genes and organism level data is tied to taxonomy identifiers which makes things easier. The obvious alternative to this would be to use biomart1 which to me, is less intuitive and clunky.

What is wrong with homologene?

Not everyone is happy with using homologene for orthology detection. The database has not been updated since 2014. This means it is not created using the latest annotations. The reference genomes have been getting updates and gene symbols change frequently. For example here you can see a list of gene name changes that happened in a 15 day period at February 9th 2019.

That list above is generated by parsing gene_info.gz file provided by NCBI. I use this file to keep an up to date list of gene symbols and their synonyms. It comes in handy when I use data released by other researchers that only include gene symbols as identifiers. Using the data here, we can identify how many genes in homologene have changed their names, assuming the NCBI ids are the same. I will be using the data in the form it is made available in the geneSyonym package.

That is 1418 genes that have a different official symbol than they used to in 2014. And that is only in mouse. Considering this only corresponds to 6.7% of all gene symbols, this number isn’t too much but it does mean that we will be missing a few genes if we were using gene symbols for matching. Alas, since I wasn’t wise enough 4 years ago when I started my projects, there are parts of my pipelines that still rely on gene symbol matching.

Most common transformation I make is translating my lists of brain cell type markers that were created using mouse data2 to their human orthologues. I can see which markers are lost due to differences in gene symbols when making this transformation.

That is 46 genes which corresponds to a 1.8%. This is probably a more realistic image of real world implications since not all genes are equally important. Many genes that are prone to change are pseudogenes, non-coding transcripts etc. Such genes are less likely to come up in an analysis.

Updating homologene gene IDs

One small thing we can do to improve this situation is to manually replace the old gene symbols with new ones. Assuming the gene IDs are the same, we could just replace the old names with the new. However, gene IDs are also prone to changes3 and there is a whole different file that keeps track of these changes. We need to look into that file and update the gene IDs too.

No current ID was ever discontinued. That’s a good sign. I can just trace
the history of discontinued IDs and find the current IDs. Note that when I first tried
this here were 20 IDs here. 20 IDs that were discontinued since I last updated thegeneSynonym package. Since all of these files area updated nightly it is important
to update the databases at the same time.

It’s good to see that only a relatively small number of IDs were discontinued. We
can trace their history to get their current IDs (if it exists). To make this process
faster, we will also take a relevant subset of the gene_history frame.

# get the earliest date where we see one of our ID change events from homologene
# since homologene was updated in 2014, this should be no earlier that that date
earlierst_date = gene_history %>%
filter(Discontinued_GeneID %in% homologeneData$Gene.ID) %$%
Discontinue_Date %>%
min
earlierst_date

## [1] 20100919

This is interesting. Why does homologene includes gene IDs that were discontinued 4 years before it’s creation? How many such IDs are out there

That’s a much smaller search space. Now we can just trace the history of every single
gene with a different ID than before. I will probably end up including a version of this function to thegeneSynonym package as pretty much all gene lists are at least a little out of date.

traceID = function(id){
print(id)
event = relevant_gene_history %>% filter(Discontinued_GeneID == id)
if(nrow(event)>1){
# just in case. if the same ID is discontinued twice, there is a problem...
return("multiple events")
}
while(TRUE){
# see if the new ID is discontinued as well
next_event = relevant_gene_history %>%
filter(Discontinued_GeneID == event$GeneID)
if(nrow(next_event)==0){
# if not, previous ID is the right one
return(event$GeneID)
} else if(nrow(next_event)>1){
# just in case, if the same ID is discontinued twice, there is a problem...
return("multiple events")
} else if(nrow(next_event) == 1){
# if the new IDs is discontinued, continue the loop and check if it has a parent
event = next_event
}
}
}

discontinued_ids$Gene.ID %>%
sapply(traceID) ->
new_ids

Not the most efficient code probably but it’ll be run as a CRON job in the office machine at around 6 am on Sundays
so I don’t really care that much about efficiency.

Lets do a few sanity checks

# are there any ids that had multiple events
any(new_ids == "multiple events")

## [1] FALSE

# are there any new ids that the geneSynonym package doesn't know about?
all(new_ids[new_ids != '-'] %in% modern_IDs)

## [1] TRUE

# how many of the discontinued IDs end up having new IDs instead of getting removed
length(new_ids[new_ids != '-'])

## [1] 1401

# how many IDs are lost forever
length(new_ids[new_ids == '-'])

## [1] 1768

Good.

Updating homologene gene symbols

Since we have the new ids now, we can start building our new homologeneData2 frame
that will be added to the homologene package. I don’t want to overwrite the original data
because people might use it expecting a perfect match with the NCBI database. Also if
I messed up here I don’t want people to suffer without explicitly choosing to trust me
instead of the actual database.

# another sanity check, differences in mouse.
# we did this in the beginningas well
# note that the number is higher than before
# since the new IDs allow more genes to be changed
sum((homologeneData2$Gene.Symbol != new_symbols)[homologeneData2$Taxonomy ==10090])

The new homologene dataset and associated functions to use them should be on Github in a few days.
Then I will set up a CRON job to update it weekly. The CRAN version will have to be perpetually out of date which is somewhat problematic, but it can’t be helped. I may include a function that builds the current version and let users specify the version they want to use.
The new syntax will probably look like this: