How to uncover patterns in vast data sets

December 19, 2011

If researchers printed on paper each potential relationship in a recent data set containing abundance levels of bacteria in the human gut, the stack of paper would reach to a height of 1.4 miles, six times the height of the Empire State Building (credit: Sigrid Knemeyer, Broad Communications)

A data-analysis tool that is part of a suite of statistical tools called MINE (Maximal Information-based Nonparametric Exploration) can tease out multiple patterns hidden in data on, for example, global health, gene expression, major-league baseball, and the human gut microbiota.

From Facebook to physics to the global economy, the world is filled with data sets that could take a person hundreds of years to analyze by eye. Sophisticated computer programs can search these data sets with great speed, but fall short when researchers attempt to detect different kinds of patterns in large data collections.

“There are massive data sets that we want to explore, and within them, there may be many relationships that we want to understand,” said Broad Institute associate member Pardis Sabeti, senior author of the paper and an assistant professor at the Center for Systems Biology at Harvard University. “The human eye is the best way to find these relationships, but these data sets are so vast that we can’t do that. This toolkit gives us a way of mining the data to look for relationships.”

This graphic depicts the top 0.25 percent of the relationships that the researchers' techniques found in data on the concentration of microbes in the human gut (credit: David Reshef)

The researchers tested their analytical toolkit on several large data sets, including one dealing with the trillions of microorganisms that live in the gut. They used MINE to make more than 22 million comparisons and narrowed in on a few hundred patterns of interest that had not been observed before.

One of the tool’s greatest strengths is that it can detect a wide range of patterns and characterize them according to a number of different parameters a researcher might be interested in. Other statistical tools work well for searching for a specific pattern in a large data set, but cannot score and compare different kinds of possible relationships. MINE is able to analyze a broad spectrum of patterns.
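To make the contrast concrete, here is a minimal sketch in Python of how a statistic tuned to one pattern can miss another. It is not the MINE algorithm: it compares the Pearson correlation, which targets straight-line relationships, with a simple mutual-information estimate computed on a fixed grid, used here only as a stand-in for an information-based score. The function names, the synthetic data, and the choice of eight bins are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation: sensitive to linear relationships only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def binned_mutual_information(xs, ys, bins=8):
    """Mutual information (in bits) estimated on an equal-width bins x bins grid.

    Illustrative stand-in only; this is NOT how MINE scores relationships.
    """
    def bin_index(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    n = len(xs)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    joint, px, py = Counter(), Counter(), Counter()
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_lo, x_hi), bin_index(y, y_lo, y_hi)
        joint[(i, j)] += 1
        px[i] += 1
        py[j] += 1
    # Sum over occupied cells of p(i,j) * log2(p(i,j) / (p(i) * p(j))).
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


# Synthetic data (hypothetical): 401 evenly spaced x values in [-1, 1].
n_points = 401
xs = [-1 + 2 * i / (n_points - 1) for i in range(n_points)]
rng = random.Random(0)  # fixed seed so the run is repeatable

datasets = {
    "linear": [x for x in xs],
    "parabolic": [x * x for x in xs],
    "unrelated": [rng.uniform(-1, 1) for _ in xs],
}

for name, ys in datasets.items():
    r = pearson(xs, ys)
    mi = binned_mutual_information(xs, ys)
    print(f"{name:10s} pearson={r:+.3f}  binned_MI={mi:.3f} bits")
```

In this sketch the parabolic data yield a Pearson correlation near zero even though y is completely determined by x, while the grid-based score stays well above the value seen for unrelated data. A fixed grid is a crude choice, and the estimate is biased upward for small samples, so the numbers should be read only as a qualitative illustration.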

MINE is especially powerful in exploring data sets with relationships that may harbor more than one important pattern. As a proof of concept, the researchers applied MINE to social, economic, health, and political data from the World Health Organization (WHO) and its partners. When they examined the relationship between household income and female obesity, they found two contrasting trends in the data. Many countries follow a parabolic trend, with obesity rates rising with income but peaking and tapering off after income reaches a certain level. But in the Pacific Islands, where female obesity is a sign of status, countries follow a steep upward trend, with the rate of obesity climbing as income increases.

Researchers can use MINE to generate new ideas and connections that no one has thought to look for before.

“Our tool is a hypothesis generator,” said Yakir Reshef, a co-first author of the paper and a Fulbright scholar at the Weizmann Institute of Science. “The standard paradigm is hypothesis-driven science, where you come up with a hypothesis based on your personal observations. But by exploring the data, you get ideas for hypotheses that would never have occurred to you otherwise.”

Researchers from many different fields, including systems biology, computer science, statistics, and mathematics, contributed to this project. A video about this work is also available.