03 February 2008

The friend I crashed with while attending SODA is someone I've known since we were five years old. (Incidentally, there's someone in the NLP world who I've actually known even longer... small world.) Anyway, the friend I stayed with is just finishing med school at UCSF and will soon be staying there for residency. His specialty is neurosurgery, and his interests are in neural pathologies. He spent some time doing research on Alzheimer's disease, effectively by studying mice (there's something I feel sort of bad about finding slightly amusing in the idea of mice with Alzheimer's disease). Needless to say, in the process of doing research, he made nearly daily use of PubMed. (For those of you who don't know, PubMed is like the ACL Anthology, but with hundreds of thousands of papers, with new ones being added by the truckload daily, and with a bunch of additional things, like ontologies and data sets.)

There are two things I want to talk about regarding PubMed. I think both of these admit very interesting problems that we, as NLPers, are qualified to tackle. I think the most important thing, however, is opening and maintaining a wide channel of communication. There seems to be relatively little interaction between people who do (for instance) biomedical informatics (we have a fairly large group here) and what I'll term mainstream NLPers. Sure, there have been BioNLP workshops at ACL, but I really think both communities would be well served by more interaction. And for those of you who don't want to work on BioNLP because it's "just a small domain problem," let me assure you: it is not easy. Don't think of it in the same vein as a true "sublanguage" -- it is quite broad.

I suppose I should give a caveat that my comments below are based on a sample size of one (my friend), so it may not be totally representative. But I think it generalizes.

Search in PubMed, from what I've heard, is good in the same ways that web search is good and bad in the same ways that web search is bad. It is good when you know what you're looking for (i.e., you know the name for it) and bad otherwise. One of the most common sorts of queries that my friend wants to do is something like "show me all the research on proteins that interact in some way with XXX in the context of YYY," where XXX is (e.g.) a gene and YYY is (e.g.) a disease. The key is that we don't know which proteins these are, so it's hard to query for them directly. I know that this is something that the folks at Penn (and probably elsewhere) are working on, and I get the impression that a good solution to this problem would make lots and lots of biologists much happier (and more productive). One thing that was particularly interesting, however, is that he was pretty averse to using structured queries like the one I gave above. He effectively wants to search for "XXX YYY" and have it realize that XXX is a gene, YYY is a disease, and that it's "obvious" that what he wants is proteins that interact with (or even, for instance, pathways that contain) XXX in the context of disease YYY. On the other hand, if YYY were another gene, then probably he'd be looking for diseases or pathways that are regulated by both XXX and YYY. It's a bit complex, but I don't think this is something particularly beyond our means.
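To make the idea concrete, here's a toy sketch of routing a bare two-term query to a structured intent. The two lexicons are made up, standing in for real resources (e.g., Entrez Gene symbols and OMIM disease names); a real system would also need multiword span detection and context-based disambiguation.

```python
# Toy sketch: routing a bare "XXX YYY" query to a structured intent.
# GENE_LEXICON and DISEASE_LEXICON are made-up stand-ins for real
# resources (e.g., Entrez Gene symbols, OMIM disease names).

GENE_LEXICON = {"APP", "EXT1", "BRCA1"}
DISEASE_LEXICON = {"alzheimer's disease", "breast cancer"}

def classify_term(term):
    if term.upper() in GENE_LEXICON:
        return "gene"
    if term.lower() in DISEASE_LEXICON:
        return "disease"
    return "unknown"

def route_query(term1, term2):
    """Guess what a biologist 'obviously' means by two bare terms."""
    types = (classify_term(term1), classify_term(term2))
    if set(types) == {"gene", "disease"}:
        return "proteins/pathways involving the gene in the disease context"
    if types == ("gene", "gene"):
        return "diseases/pathways regulated by both genes"
    return "plain keyword search"
```

The dictionary lookup here is, of course, exactly the part that is hard in practice (see the comments below on ambiguous gene names); the point is only that the query-routing logic itself is simple once entity types are known.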

The other thing I want to talk about is summarization. PubMed actually archives a fairly substantial collection of human-written summaries. These fall into two categories. The first, called "systematic reviews," are more or less what we would think of as summaries. However, they are themselves quite long and complex; they're really not anything like sentence extracts. The second, called "meta-analyses," are really not like summaries at all. In a meta-analysis, an author will consider a handful of previously published papers on, say, the effects of smoking on lifespan. He will take the data and results published in these individual papers, and actually do novel statistical analyses on them to see how well the conclusions hold.

From a computational perspective, the automatic creation of meta-analyses would essentially be impossible, until we have machines that can actually run experiments in the lab. Systematic reviews, on the other hand, while totally outside the scope of our current technology, are things we could hope to do. And they give us lots of training data. There are somewhere around ten to forty thousand systematic reviews on PubMed, each about 20 pages long, and each with references back to papers, almost all of which are themselves in PubMed. Finding systematic reviews older than a few years ago (when they began being tagged explicitly) has actually sprouted a tiny cottage industry. And PubMed nicely makes all of their data available for download, without having to crawl, something that makes life much easier for us.
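As an aside, for anyone who wants to grab this data programmatically: NCBI exposes MEDLINE/PubMed through its E-utilities web API. Here's a minimal sketch of building an ESearch query; the publication-type filter string is an assumption about how systematic reviews get tagged, not something from the post, so check the current field-tag documentation before relying on it.

```python
# Minimal sketch of querying PubMed via NCBI's E-utilities (ESearch).
# The publication-type filter below is an assumption about tagging
# conventions; verify it against current PubMed documentation.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=20):
    """Build an ESearch URL against the pubmed database."""
    params = {"db": "pubmed", "term": term, "retmax": retmax,
              "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

url = esearch_url('"systematic review"[Publication Type] AND alzheimer')
# Fetching url (e.g., with urllib.request.urlopen) returns JSON whose
# "esearchresult" object contains an "idlist" of matching PMIDs.
```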

My friend warns that it might not be a good idea to use all systematic reviews, but only those from top journals. (They tend to be less biased, and better written.) However, insofar as I don't think we'd even have hope of getting something as good as a systematic review from the worst journal in the world, I'm not sure this matters much. Maybe all it says is that for evaluation, we should be careful to evaluate against the top.

Now, I should point out that people in biomedical informatics have actually been working on the summarization problem too. From what I can tell, the majority of effort there is on rule-based systems that build on top of more rule-based systems that extract genes, pathways and relations. People at the National Library of Medicine, Rindflesch and Fiszman, use SemRep to do summarization, and they have tried applying it to some medical texts. Two other people that I know are doing this kind of work are Kathy McKeown and Greg Whalen, both at Columbia. The Columbia group has access to a medically informed NLP concept extractor called MedLEE, which gives them a leg up on the low-level processing details. If you search for 'summarization medical OR biomedical' in Google Scholar, you'll get a raft of hits (~9000).

Now, don't get me wrong -- I'm not saying that this is easy -- but for summarization folks who are constantly looking for "natural artifacts" of their craft, this is an enormous repository.

20 comments:

Sorry to go off-topic a bit but this whole pubmed business always gets me thinking about machine learning repositories. We have the UCI repository for datasets and lately we have the machine learning open source repository but I still have the feeling we are missing out on so much. Wouldn't it make sense for our community to have an arXiv or some other central location with easy access to any paper you could possibly care about. It should be a piece of cake to augment the repository with citation tracking, (blog-comment like) discussions, summaries, meta-analysis, datasets, software and probably many more things?

In the meantime, if I (ever?!?) publish, I'll certainly use a blog to do some of the above.

Re your first comment (the biomed researcher wants to just type "XXX YYY"): I've heard the same from my biomed collaborators, and several of us at Penn have been working on an approach to giving them what they want, which is in review at the moment (cross fingers).

Bio-medical text is a great problem for many reasons. First, the research biologists are desperate for a solution. Second, there's lots of open source text and an unimaginable amount of structured data. Third, there are conference outlets and funding for the work.

In terms of summarization, geneticists often want to find out about 100 genes they found through differential micro-array or landscape experiments. Minimally, a system needs to (1) find all the mentions of gene, and (2) somehow summarize sets of articles. There are existing tools to do this with ontologies, like Cytoscape and GeneSpring, but nothing that ties in with the literature.

MEDLINE is the NLM's repository of citations. Entrez is their top-level search application:

http://www.ncbi.nlm.nih.gov/sites/gquery

PubMed is the piece of this search application that searches MEDLINE citations.

The rest of Entrez links extensively back into MEDLINE. For instance, check out Entrez Gene (from above link), which is a database indexed by gene and species. Each entry contains aliases, text descriptions of the gene, descriptions of the gene's function, links to related genes (either by homology or interaction), and links into GO (the Gene Ontology, which contains an extensive multiple inheritance hierarchy for genetic function, process and location).

Many biologists are tracking specific diseases, in which case you'll want to check out OMIM (Online Mendelian Inheritance in Man), which is a catalogue of diseases and genes, with many gene variants listed. The beauty of OMIM is that it's text with citations back into MEDLINE.

You'll also need KEGG (the Kyoto Encyclopedia of Genes and Genomes, a set of "disease graphs" with known pathways and sub-pathways):

http://www.genome.jp/kegg/

And don't forget PubMed Central -- the repository of full text articles (it's part of Entrez).

The search problem's quite a bit more complex than gene plus disease, but limiting to that, it's still nearly impossible even if you know the curated names of the diseases and genes. Too many of them have common names, like ACT or TO, to use simple techniques for search. The listed sets of aliases are woefully incomplete. Context is really critical.

The other big problem is genericity. Gene names and disease names are more like product names than human names. They refer to families, variations, and so on.

Very often papers are about whole families of genes. For example, "the EXT family" contains EXT1, EXT2, EXTL1, EXTL2 and EXTL3. Then there are subparts (e.g. conserved regions under homologies or protein motifs), homologues in other species (e.g. mouse ext1 [they like lower case]), mutated forms (most often described rather than named). Oh, and then there's the metonymy problem -- genes and the proteins they produce tend to have the same names.
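[To illustrate the family problem above with a deliberately naive sketch: the symbol list here is made up, standing in for Entrez Gene, and prefix matching is exactly the kind of "simple technique" the previous comment warns both over- and under-generates.]

```python
# Naive illustration of gene-family expansion by prefix matching.
# GENE_SYMBOLS is a toy stand-in for Entrez Gene; real systems need
# curated family resources, since prefix matching over-generates
# (unrelated symbols sharing a prefix) and under-generates (renamed
# or historically-named family members).

GENE_SYMBOLS = ["EXT1", "EXT2", "EXTL1", "EXTL2", "EXTL3",
                "BRCA1", "ACT", "TO"]

def expand_family(family_root, symbols=GENE_SYMBOLS):
    """Return symbols that look like members of the named family."""
    root = family_root.upper()
    return sorted(s for s in symbols if s.startswith(root))
```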

To echo Bob's post, there are a truckload of bio text resources of all shapes and kinds. Many learning methods and techniques can find a relevant task in the vast array of these resources. I did a project with bio related search a few years back and was overwhelmed by the size and depth of these materials.

Also, like Hal said, people use this stuff. With a lot of NLP problems, the goal is to provide better tools to the average user. However, in bio these aren't "average" users: these are bio researchers, doctors, academics, etc. They are accustomed to research and use state-of-the-art tools all the time. I think it's easier for this group to adopt, use, and understand newer technologies.

There is a recent Web application, Semantic MEDLINE, which attempts to help the researcher answer the type of question you mention in the post. It is based on SemRep, with automatic abstraction summarization on top of it. It works with canned queries right now, but will eventually handle any query (once we finish processing the entire MEDLINE with SemRep).
