Notes from the life of a [data] scientist


What’s in a (gene) name?

I’ve posted before on standard names (or the lack thereof) for genes and proteins and, in particular, the wacky names of which biologists are so fond. Hopefully they now realise that in the age of bioinformatics – where we have to find stuff easily – descriptions such as ken and barbie, scott of the antarctic or glass-bottom boat are, um, unhelpful to say the least.

So hot on the heels of my “man, you can publish anything in bioinformatics these days” post comes:

We take stock of current genetic nomenclature and attempt to organize strange and notable gene names. We categorize, for instance, those that involve a naming system transferred from another context (for example, Pavlov’s dogs). We hope this analysis provides clues to better steer gene naming in the future.

So the world of genetics could probably do without “ken and barbie” and a long list of vegetable names for mutations that make fruit flies catatonic. But as such.ire touches on, you’re overlooking some useful aspects of these names. It is problematic to name genes after predicted functions, as they are sometimes proved to have different or additional activities (what do you do when your dismutase acts as a dehydrogenase?). By contrast, fruit fly gene names (which tend to be the extremes in the silly name race) do follow a strict convention: they reflect something about what goes wrong when the gene is altered (the mutant phenotype). Moreover, the mere fact that these gene names are adopted for use in other species speaks to their utility: they are often memorable, indicative of the biology of the gene, and unique enough to serve as a searchable ID.

I actually find these names quite entertaining and fun. However, I don’t think we should name genes so that we can tell what they do by glancing at the name. We name them for easy database retrieval and parsing. Which makes something like “DM00001” hard to beat, if you ask me.

When you start thinking about multi-gene networks, trying to recall that DM0001 is a positive regulator of DM0003 which is a negative regulator of DM0004 which binds to DM0014 would be incredibly challenging. Replace those even with three- or four-letter gene name abbreviations and your brain (or that of a reader or audience member) can suddenly keep track and understand. It’s easy to associate a systematic ID with a more recognizable ‘name’ and there seem to be very compelling reasons to do so.
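To make the point about pairing a systematic ID with a recognizable name concrete, here is a minimal Python sketch. The IDs are the ones used in the example above; the short names (`actA`, `repB` and so on) and the `describe` helper are hypothetical, invented purely for illustration.

```python
# Hypothetical lookup table: systematic ID -> short, memorable name.
# The short names are made up for this sketch; they are not real gene symbols.
ID_TO_NAME = {
    "DM0001": "actA",
    "DM0003": "repB",
    "DM0004": "bndC",
    "DM0014": "tgtD",
}

# The small network from the paragraph above, keyed by systematic ID.
EDGES = [
    ("DM0001", "positively regulates", "DM0003"),
    ("DM0003", "negatively regulates", "DM0004"),
    ("DM0004", "binds to", "DM0014"),
]


def describe(edges, names):
    """Render each edge with readable names, falling back to the raw ID."""
    return [
        f"{names.get(src, src)} {relation} {names.get(dst, dst)}"
        for src, relation, dst in edges
    ]


for line in describe(EDGES, ID_TO_NAME):
    print(line)
```

The database keeps the unambiguous ID as its key, while people read and talk about the short name; the mapping is a single dictionary lookup in either direction.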

That said, one of the most difficult problems I have encountered in terms of searching is gene names that match common English words–like the human gene “WAS” for Wiskott-Aldrich Syndrome–which can make it next to impossible to find what you’re looking for. For example, a search in NCBI Entrez Gene with “homo sapiens was” brings up more than 8,000 entries with the notation “This record was discontinued”, whereas when this redundancy is not present (for example, “homo sapiens egfr”) you can quickly find what you’re looking for in the same database.
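A rough way to spot this kind of collision ahead of time is to check symbols against a list of everyday words. The sketch below is only an illustration: `COMMON_WORDS` is a tiny hand-picked set and `ambiguous_symbols` is a hypothetical helper, not part of any NCBI tool.

```python
# A tiny, hand-picked set of everyday English words (an assumption for this
# sketch; a real check would use a proper stop-word or dictionary list).
COMMON_WORDS = {"was", "the", "and", "for", "not", "can", "set", "rest"}


def ambiguous_symbols(symbols, common_words=COMMON_WORDS):
    """Return the symbols that, ignoring case, are also common English words."""
    return [s for s in symbols if s.lower() in common_words]


# Gene symbols mentioned in this discussion.
symbols = ["WAS", "EGFR", "SHH", "SMAD2"]
print(ambiguous_symbols(symbols))
```

Symbols flagged this way are the ones where a free-text query is likely to drown in unrelated hits, so they are better searched by a systematic ID or with a query restricted to the symbol field.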

But the trouble comes when science is transmuted into medicine; what works in the lab may be jarring in the clinic.

I find it hard to believe that there’s a real issue here. When was the last time a patient had to be told “There’s a polymorphism in your male chauvinist pigmentation gene”? The article seems to be written from a fantasy world of translational research. Even if a doctor needs to discuss a patient’s SMAD2 or SHH gene, which is highly unlikely, he doesn’t need to get into the history of the name.