There’s a Fly in My Tweets

By Henry Kautz

June 21, 2013

ROCHESTER — MANY important public health questions are difficult and costly to answer. What kind of risks do highly localized sources of pollution, like dry cleaners that use volatile chemicals, pose to the health of nearby residents? Are people with many friends healthier, or do those friendships increase the likelihood of infectious disease? Do frequent visits to public spaces like bars, gyms and restaurants affect a person’s health?

Researchers have been striving for generations to answer such questions, using health surveys of samples of individuals and computational studies of simulated populations. Now, however, the rise of social media and the burgeoning field of data science provide powerful tools to find high-precision, real-world answers with little cost or effort.

The millions of people posting to sites like Twitter and Facebook can be viewed as a vast organic sensor network, providing a real-time stream of data about the social, biological and physical worlds. While people use social media to build and maintain their social ties, the “data exhaust” of their postings can be analyzed to provide an enormous range of information at a population scale.

For example, my research group at the University of Rochester has analyzed Twitter postings from millions of cellphone users in New York City to develop a system to monitor food-poisoning outbreaks at restaurants.

We began by creating algorithms that can identify tweets about a given topic with near-perfect precision, even if the words and phrases used vary widely. The GPS information embedded in tweets sent from cellphones lets us integrate them with a variety of geographic databases.

We then feed the information into what we call the nEmesis system, whose development was led by our graduate student Adam Sadilek, now a researcher at Google. It begins by finding tweets that are sent from restaurants, which we can locate on Google Maps with 97 percent accuracy, thanks to GPS coordinates.
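As a rough illustration of that first step (not the nEmesis code itself), matching a geotagged tweet to a restaurant can be sketched as a nearest-neighbor lookup on GPS coordinates. The function names, the restaurant table and the 25-meter cutoff are all assumptions made for this example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def nearest_restaurant(lat, lon, restaurants, max_m=25.0):
    """Return the name of the closest restaurant within max_m meters, else None.

    `restaurants` is a hypothetical dict mapping name -> (lat, lon);
    the 25-meter default is an assumed cutoff, not a figure from our study.
    """
    best_name, best_dist = None, max_m
    for name, (rlat, rlon) in restaurants.items():
        dist = haversine_m(lat, lon, rlat, rlon)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name
```

A tweet that falls outside the cutoff for every restaurant is simply left unmatched, which keeps stray tweets from the sidewalk or the apartment upstairs out of the data.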

When a user is identified as having been at a restaurant, all of his or her tweets, from anywhere, are collected for the next 72 hours and analyzed to discover if any appear to report food poisoning symptoms, like vomiting, diarrhea, abdominal pain, fever or chills.
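The 72-hour follow-up can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Tweet` record and function names are hypothetical, and a plain keyword match stands in for the statistical classifier that a real system would need in order to cope with widely varying wording.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)  # follow-up period after a restaurant visit
# Symptoms named in the text; a crude stand-in for a trained classifier.
SYMPTOM_WORDS = ("vomiting", "diarrhea", "abdominal pain", "fever", "chills")


@dataclass
class Tweet:
    user: str
    time: datetime
    text: str


def tweets_in_window(tweets, user, visit_time):
    """Return the user's tweets posted within 72 hours after the visit."""
    end = visit_time + WINDOW
    return [t for t in tweets if t.user == user and visit_time <= t.time <= end]


def mentions_symptom(text):
    """True if the text contains any of the listed symptom words."""
    lowered = text.lower()
    return any(word in lowered for word in SYMPTOM_WORDS)


def possible_reports(tweets, user, visit_time):
    """Tweets in the follow-up window that appear to mention a symptom."""
    return [t for t in tweets_in_window(tweets, user, visit_time)
            if mentions_symptom(t.text)]
```

Keyword matching of this kind would flag "fever pitch" and miss "I can't keep anything down," which is exactly why the wording problem calls for learned models rather than word lists.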

[Illustration credit: Olimpia Zagnoli]

Such reports are rare but significant. Over a four-month period, our system collected 3.8 million tweets, from which we were able to trace 23,000 restaurant visitors and found 480 reports of likely food poisoning. Restaurants were then scored by the number of food poisoning reports from their patrons.
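To put the rarity in numbers: 480 reports among 23,000 traced visitors works out to roughly 2 reports per 100 visitors. The scoring itself is a simple tally, which might look like the sketch below; the function name and the input format are assumptions made for illustration.

```python
from collections import Counter


def score_restaurants(reports):
    """Count likely food-poisoning reports per restaurant.

    `reports` is a hypothetical iterable of (restaurant_id, user) pairs,
    one pair for each report traced back to a restaurant visit.
    """
    return Counter(restaurant_id for restaurant_id, _user in reports)


# Overall report rate from the figures in the text.
report_rate = 480 / 23_000  # about 0.021, i.e. roughly 2 percent
```

Calling `score_restaurants(...).most_common(10)` would then list the ten restaurants with the most reports.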

The Twitter reports are not an exact indicator — any individual case could well be caused by factors unrelated to the restaurant meal. But in aggregate the numbers are revealing. Working with Vincent Silenzio, who teaches in the department of community and preventive medicine at our medical school, we compared the results with the current database of restaurant inspections conducted by New York City’s Department of Health and Mental Hygiene. We found significant correlation between restaurants’ violation scores and the Twitter-based scores.
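One common way to measure such a relationship is the Pearson correlation coefficient, sketched below. This is an illustrative choice only: the text above does not say which statistic was used, and the function name and inputs are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Here xs could hold inspection violation scores and ys the Twitter-based
    scores for the same restaurants, in the same order (hypothetical inputs).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A value near 1 would mean that restaurants with worse inspection scores also tend to draw more illness reports; a value near 0 would mean the two measures move independently.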

Our project isn’t alone. While an army of corporations is busy mining social media data for marketing, a small but growing number of research groups have begun similar efforts to harness the torrent of online information for social good.

Groups at Brigham Young University and the University of Iowa have done extensive work on influenza monitoring via Twitter posts. Researchers at Microsoft are helping to identify women who are at risk of severe postpartum depression by analyzing changes in their online behavior. And researchers at Cornell are mining the social media stream to gather data for urban planning and environmental conservation.

THE most daunting challenges in making sense of social media are data incompleteness and noise: not knowing whether you have all the information, and how to sort out what’s relevant. These problems drive fundamental research on statistical machine learning and data-mining algorithms. nEmesis and its kin provide large-scale test-beds for developing and testing solutions to these challenges.

Further, nEmesis has immediate public-policy applications. While city health inspections capture a wide variety of data that is difficult to obtain from online social media (like the presence of rodents in a restaurant’s storage room), the Twitter signal measures a perhaps more useful quantity: a probability estimate of your becoming ill if you visit a particular restaurant.

Put differently, inspections are thorough but largely sporadic. A cook who occasionally comes to work sick and infects customers for a few days at a time is unlikely to be detected by current methods. Similarly, a batch of potentially dangerous beef delivered by a truck with a faulty refrigeration system could be an outlier, but nonetheless cause loss of life.

Obviously, public-health officials can’t rely solely on tools like nEmesis. But social-media-based systems have the potential to greatly complement traditional data-collection methods, producing a more comprehensive — and timely — model for public health policy.

Henry Kautz is the chairman of the computer science department at the Hajim School of Engineering and Applied Sciences at the University of Rochester.