The next time you read a Wikipedia article on flu pandemics, you're also creating data that can help scientists predict the spread of disease.

When you need information, you go to Wikipedia. For years the user-curated site has been the default place to look up facts. But what you look up on Wikipedia also creates information about you: Scientists at Harvard have found that tracking how often people view flu-related articles on the online encyclopedia can gauge how many people have the flu in the United States. These findings could lead to new ways to automatically predict flu levels and direct vaccine campaigns.

Each year 250,000 to 500,000 people die of influenza worldwide, with 3,000 to 50,000 of those deaths happening in the United States. These deaths are largely preventable with flu shots, but the Centers for Disease Control and Prevention needs up-to-date knowledge about where influenza is occurring to make sure the vaccines get to where they're needed. The CDC continuously monitors levels of flu-like illness, but collecting and analyzing all this activity takes time, which means the data is typically up to two weeks out of date once it's made available.

Recently Google revealed a method to predict current or future levels of the flu by analyzing how often people search Google for flu-related terms. However, while Google Flu Trends is promising, epidemiologists David McIver and John Brownstein at Boston Children's Hospital of Harvard Medical School note that it underestimated flu activity in 2009 and overestimated it in the 2012-2013 flu season.

"While Google's Flu Trends has a good track record in estimating influenza-like illness, it has been shown to be less accurate during times of either increased media attention related to illness, such as during the H1N1 swine flu pandemic in 2009, or when there are more flu cases than might normally be observed in a season, like last year's 2012-2013 flu season," McIver says. "When the media brings increased awareness to the amount of influenza, or some other disease, circulating in a population, it can entice people to take to the Internet . . . This can be troublesome because many of the people who are searching for influenza-related keywords may not actually have the flu."

McIver and Brownstein sought a new way to estimate flu activity that is less susceptible to errors from media attention. They focused on Wikipedia, since all its data is freely available to investigate, unlike Google's proprietary system. Previous research suggested Wikipedia could be a useful tool for monitoring the emergence of breaking news stories and for seeing what topics are trending in the public sphere.

The researchers looked at roughly 30 influenza- or health-related Wikipedia articles that were accessed every day from December 2007 to August 2013, including articles such as "common cold," "avian influenza," and "1918 flu pandemic," and analyzed the number of times they were viewed. They compared this data to official flu-activity levels provided by the CDC.
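To make that comparison concrete, here is a minimal sketch of one way to line up page views with official figures: sum daily views into weekly totals, then measure how closely they track a weekly flu-activity series. The article does not say which statistic the researchers used, so the Pearson correlation here is an assumption, the function names are hypothetical, and the numbers are invented for illustration.

```python
from math import sqrt

def weekly_totals(daily_views, days_per_week=7):
    """Sum daily page-view counts into consecutive weekly totals."""
    return [
        sum(daily_views[i:i + days_per_week])
        for i in range(0, len(daily_views) - days_per_week + 1, days_per_week)
    ]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up numbers for illustration only; not real Wikipedia or CDC data.
daily_views = [100] * 7 + [150] * 7 + [300] * 7 + [220] * 7
cdc_ili_percent = [1.0, 1.6, 3.1, 2.4]  # hypothetical weekly flu-like illness rates

weekly_views = weekly_totals(daily_views)
r = pearson(weekly_views, cdc_ili_percent)
print(weekly_views, round(r, 3))
```

A correlation near 1 would suggest that views rise and fall with reported illness, but it says nothing about cause, which is why spikes driven by news coverage remain a concern.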

Wikipedia viewing is also susceptible to spikes because of media attention. However, "by including a broad range of health-related Wikipedia articles in our algorithms, we were able to work around much of this noise in the data, and get an accurate estimate of influenza-like illness activity during those peak media times," McIver says.
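The intuition behind using many articles can be shown with a toy example. The sketch below is not the researchers' algorithm, which the article does not describe; it swaps in a much simpler technique, a per-week median across articles, to show how a news-driven spike in one article can be outvoted by the others. The view counts are hypothetical, as are the function names.

```python
from statistics import median

def scale_to_baseline(series):
    """Divide each value by the series median so articles are comparable."""
    base = median(series)
    if base == 0:
        raise ValueError("cannot scale a series whose median is zero")
    return [value / base for value in series]

def combined_signal(views_by_article):
    """Per-week median of the scaled view counts across all articles."""
    scaled = [scale_to_baseline(series) for series in views_by_article.values()]
    return [median(week) for week in zip(*scaled)]

# Hypothetical weekly view counts, invented for illustration.
views_by_article = {
    "Common cold": [100, 110, 130, 120, 100],
    "Avian influenza": [50, 55, 400, 60, 50],  # week 3: news-driven spike
    "1918 flu pandemic": [200, 220, 260, 240, 200],
}

signal = combined_signal(views_by_article)
print([round(value, 2) for value in signal])
```

In week 3 the "Avian influenza" series jumps to more than seven times its baseline, yet the combined signal follows the two articles that rose only modestly. A real model would need to weigh articles against official data rather than treat them equally.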

The model could estimate flu levels in the U.S. population nearly in real time, up to two weeks sooner than data from the CDC becomes available. Moreover, it accurately estimated the week of peak flu activity 17 percent more often than Google Flu Trends.

"We found it very exciting that we were able to get such accurate estimations of influenza-like illness activity," McIver says. McIver and Brownstein detailed their findings online April 17 in the journal PLOS Computational Biology.

"These results really help confirm that information gleaned from either Internet-related searches or social media can be used to help explore influenza activity," says infectious disease physician Philip Polgreen at the University of Iowa, who did not take part in this research. "What I find interesting about the source of information used here is that it's open and easy to use—anyone can download Wikipedia search logs."

Future research could add location data so that scientists can get flu activity estimates at the state, county, or city level. Another possibility is to explore whether Wikipedia data could be used for surveillance of other diseases as well.

"We haven't explored this yet, but we see potential to monitor for other issues of public health concern, such as diabetes and heart disease," McIver says. "While these conditions don't have the same time-varying, seasonal effect as influenza, we may still be able to estimate overall disease prevalence."