Mining the Web for Medical Hypotheses: A Proof-of-Concept System

Diana Maclean, Margo Seltzer

Abstract

As the prevalence of blogs, discussion forums, and online news services
continues to grow, so too does the portion of this Web content that
relates to health and medicine. We propose that everyday,
medically-oriented Web content is a valuable and viable data source for
medical hypothesis generation and testing, despite its being noisy. In
this paper, we present a proof-of-concept system supporting this notion.
We construct a corpus comprising news articles relating to the drugs
Vioxx, Naproxen and Ibuprofen that were published between 1998 and 2002.
Using this corpus, we show that there was a significant link between
Vioxx and the concept “Myocardial Infarction” well before the drug was
withdrawn from the market in 2004. Indeed, within the Vioxx-related
content, the concept ranks amongst the top 3.3% in terms of importance.
When compared with the Naproxen and Ibuprofen control literatures, the
term occurs significantly more frequently in the Vioxx-related content.