at OpenHelix

Tip of the Week: GRAIL for prioritizing SNPs

Flipping through my copy of Nature Genetics last week, I noticed an unusual graphic. I looked at it a little closer and was convinced it was one of the Spirographs that I used to make as a kid. (Remember those? I always liked that….) I looked a little bit closer and realized it was somewhat more informative than the Spirographs I used to draw. It represented the relationships between genes, based on the literature. Hmmm….how did they do this, exactly?

This new GRAIL is all about text mining. It is a tool that relies on statistical text mining of the literature for genes in a region and examines the relationships among those genes in the text. The focus in their case is disease regions, but there’s no reason that you couldn’t use it for a variety of other topics. As the authors state:

Given only a collection of disease regions, GRAIL uses our text-based definition of relatedness (or alternative metrics of relatedness) to identify a subset of genes, more highly related than by chance; it also assigns a select set of keywords that suggest putative biological pathways.

So you pull a set of genes out of the literature based on SNPs or locations of interest, and you can begin to assess what’s interesting in the set. Now, the tool makes a lot of assumptions that you should be aware of if you are going to use it. It assumes each region contains a single pathogenic gene. I’m not sure that’s always going to be the case, but as long as you know about it, that’s a fair assumption for this tool. They suggest this helps to keep multigenic regions from dominating the analysis. Fair enough, but…what if that is the interesting aspect? Still–that’s ok as long as you know.

The authors try GRAIL on three example data sets:

SNPs associated with height; they identify pathways they consider plausible.

Crohn’s disease; they confirm associations that have been seen.

Schizophrenia–and here they used rare deletions as the items of interest; they find related genes, many highly enriched in the CNS. So this suggests the strategy may be useful not only for SNPs but for CNVs as well.
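To make the one-gene-per-region idea mentioned above a bit more concrete, here is a toy sketch in Python. It is not GRAIL's code or its actual statistics: the cosine similarity over plain word counts is a stand-in I chose for illustration, and the region names, gene names, and text snippets are all made up. The point is only to show how each region can be limited to one contribution, so a region packed with genes does not swamp the scoring.

```python
import math
from collections import Counter

def word_vector(text):
    """Bag-of-words counts for the text pooled for one gene."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_gene_per_region(regions, gene_text):
    """Pick one candidate gene per region (the single-gene assumption)."""
    vectors = {gene: word_vector(text) for gene, text in gene_text.items()}

    def score(gene, home_region):
        total = 0.0
        for region, genes in regions.items():
            if region == home_region:
                continue
            # Each other region counts once, through its best-matching gene,
            # so a region with many genes cannot dominate the total.
            total += max((cosine(vectors[gene], vectors[other]) for other in genes), default=0.0)
        return total

    return {region: max(genes, key=lambda g: score(g, region))
            for region, genes in regions.items()}

# Hypothetical regions, genes, and literature snippets, for illustration only.
regions = {
    "region1": ["GENE_A", "GENE_B"],
    "region2": ["GENE_C", "GENE_D"],
    "region3": ["GENE_E"],
}
gene_text = {
    "GENE_A": "bone growth plate cartilage signaling",
    "GENE_B": "olfactory receptor neuron odor",
    "GENE_C": "cartilage growth skeletal development",
    "GENE_D": "liver enzyme metabolism",
    "GENE_E": "skeletal bone growth signaling",
}

picks = best_gene_per_region(regions, gene_text)
print(picks)
```

In this toy set, the genes whose snippets share vocabulary across regions get picked, while the ones with unrelated text are left behind. A real tool would need proper term weighting and a significance test on top of this.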

Their Figure 1 nicely summarizes the strategy:

One curious tweak of the data analysis was that they used the literature prior to December 2006, because right after that there was an onslaught of GWAS papers that would list a whole bunch of genes associated with regions that might be more tenuous still. I understand this in theory, but I imagine it also eliminates more current research on genes of interest from other methods too. I saw in the tool you could choose either pre-Dec 06 or a more up-to-date literature set. It would be useful to try both if you use GRAIL and keep that in mind.
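Here is a small sketch of what that trade-off looks like, assuming you had a list of abstracts tagged with a gene and a publication date. The records and gene symbols are hypothetical, and I am reading "prior to December 2006" as "published before 1 December 2006", which is my assumption rather than something the paper spells out here.

```python
from datetime import date

# Assumption: the cutoff is read as "before 1 December 2006".
CUTOFF = date(2006, 12, 1)

# Hypothetical records: (gene symbol, abstract publication date).
records = [
    ("HYPO1", date(2004, 3, 15)),
    ("HYPO1", date(2008, 6, 1)),
    ("HYPO2", date(2007, 9, 20)),
    ("HYPO3", date(2005, 11, 2)),
]

def genes_covered(records, cutoff=None):
    """Genes with at least one abstract; with a cutoff, only earlier abstracts count."""
    return {gene for gene, published in records
            if cutoff is None or published < cutoff}

early = genes_covered(records, CUTOFF)
current = genes_covered(records)

# Genes that only show up once the newer literature is included.
only_recent = sorted(current - early)
print(only_recent)  # prints ['HYPO2']
```

Genes like the made-up HYPO2 here, first described after the cutoff, simply do not exist as far as the older literature set is concerned, which is one reason to run both sets and compare.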

Another point to keep in mind: some genes are just not found in the abstracts, and they mention that is an issue. So the set you can examine is limited to the genes that were in the abstracts and were identified properly with nomenclature, spelling, etc. Text mining is cool, but it has a lot of limitations around those aspects, and around the use of synonyms in general. It’s not just an issue for GRAIL, but for all text mining tools at this point.
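A quick sketch shows why nomenclature matters so much. This is a generic alias lookup, not how GRAIL identifies genes; the synonym table and the gene symbols HYPO1 and HYPO2 are invented for the example.

```python
import re

# Hypothetical synonym table: lowercase alias -> official symbol.
SYNONYMS = {
    "hypo1": "HYPO1",
    "hypo-1": "HYPO1",
    "hypothetical protein 1": "HYPO1",
    "hypo2": "HYPO2",
}

def genes_mentioned(abstract, synonyms=SYNONYMS):
    """Return official symbols whose known aliases appear in the abstract."""
    text = abstract.lower()
    found = set()
    for alias, symbol in synonyms.items():
        # Require non-alphanumeric boundaries so 'hypo1' does not match inside 'hypo10'.
        pattern = r"(?<![a-z0-9])" + re.escape(alias) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            found.add(symbol)
    return found

hit = genes_mentioned("The HYPO-1 protein regulates growth.")
miss = genes_mentioned("We studied hypo10 and the Hypo-2 locus.")
print(hit, miss)
```

The first abstract is caught because the hyphenated alias is in the table. The second one mentions "Hypo-2", but that spelling is not listed, so the gene is silently missed. That is the kind of gap to keep in mind with any text mining tool.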

They also devise a way to use Gene Ontology (GO) and some expression data in GRAIL as other “relatedness” metrics. You’ll find those available from the GRAIL tool as well.

They don’t show any spirographs in their figures in this first GRAIL paper. The one that drew me in was Figure 2 in the arthritis paper. So I went over to the software to try to generate these myself. The outcome at this point is a web page with text and links to the UCSC Genome Browser and Entrez Gene (from the individual genes and from the keyword list–keywords collect multiple Entrez Genes). I was a little surprised that the keyword link wasn’t to PubMed as well. Currently it doesn’t provide the graphic, but maybe that will come along over time. If it does I’ll be sure to mention it on the blog.

One final note on the paper: in the supplemental section they compare GRAIL to other tools in this arena. If you are interested in tools, as we are here, you may find some of them interesting as well. The tools are listed with URLs in Table S5, and the comparison outcome is in Text S1.

So check out GRAIL and see if you find gene relationships. But don’t forget those caveats about the genes not listed in the abstracts, or the literature coverage dates. The software can be found here: http://www.broad.mit.edu/mpg/grail/

I know it’s a beta. But I think it has a lot of potential to help people sift through the results they are getting from a variety of techniques. Check it out.

NOTE: you may find periods when you can’t run GRAIL because it puts a burden on the servers. Try again during off hours if you are having problems getting it to run. This happened to me during my testing of it last week.

The list of GWAS data I used to test GRAIL came from the NHGRI catalog, which we discussed here: List of GWAS studies. I tried the straight hair SNP list, and got a pretty interesting set of results that certainly included “epidermis” and “skin” as keywords, among other things.

Hello Mary,
I like your Open Helix blog very much. I will follow you more frequently from now on. Is that your voice in the video?

I think GRAIL is relevant to the WikiGenes article: I put it on the wiki, in a table at the end of the document. Please have a look at it and fill in the missing information. It will help to convince the authors to recognize your contribution.

Thanks, I’ll check on the article. And yeah, it is sad that the tools vanish. We come across that a lot. Sometimes they are temporary though–nobody noticed they were offline and someone will reboot the server for us! But frequently the funding is over and they just aren’t maintained any more.