Researchers awarded funds to work on big data

Two big data research projects here at Carnegie Mellon received a combined total of over $1.7 million from the National Science Foundation (NSF) and the National Institutes of Health (NIH) last week.

Aarti Singh and Christos Faloutsos, the principal investigators of the two big data projects at Carnegie Mellon, received these grants as a result of the NSF-NIH 2012 Big Data Research and Development Initiative Launch. Grants, totaling $15 million, were awarded to eight big data projects throughout the country that focus on unique aspects of the problems present in big data.

But what is big data, and why is it important?

As the relationship between society and technology matures and grows more complex, scientists and engineers find themselves faced with a rapidly growing phenomenon: big data. As the name implies, the phrase describes any data set that is very, very large. But as Suzi Iacono, senior science adviser at the NSF, described in the grants’ press release, “Big data is characterized not only by the enormous volume or the velocity of its generation, but also by the heterogeneity, diversity, and complexity of the data.”

That is to say, all of our Twitter posts, satellite data, purchases, and medical records generate a lot of data: 2.5 quintillion bytes a day, where a quintillion is a 1 followed by 18 zeros. Innovative ways to process it all are needed.

One aspect of big data that makes it so complex is its high-dimensional nature. In a two-variable system, y depends on x, and the relationship can be described by a function. But when more variables are added to the system, there are more data and more data types to handle, and the relationship that describes the data set becomes harder to determine.
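One rough way to see why extra variables make the relationship harder to pin down is to count how thinly a fixed amount of data gets spread. The sketch below is only an illustration, not a method from either research project: it assumes each variable's range is split into equal bins, and the function names are hypothetical.

```python
def grid_cells(bins_per_variable: int, num_variables: int) -> int:
    """Count the cells in a grid that splits every variable into equal bins."""
    return bins_per_variable ** num_variables


def samples_per_cell(num_samples: int, bins_per_variable: int, num_variables: int) -> float:
    """Average number of data points landing in each cell of that grid."""
    return num_samples / grid_cells(bins_per_variable, num_variables)


if __name__ == "__main__":
    # Assumed setup: one million data points, ten bins per variable.
    for num_variables in (1, 2, 5, 10):
        cells = grid_cells(10, num_variables)
        density = samples_per_cell(1_000_000, 10, num_variables)
        print(f"{num_variables} variables: {cells} cells, {density} points per cell")
```

With one or two variables, every cell holds thousands of points. With ten variables, most cells hold none, so there is far less evidence about how the variables relate in any one region.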

One way to sort through this data is by data mining, a technique that combines computer science and statistics. Both Singh’s and Faloutsos’ research projects focus on improving current data mining methods and applications.

Singh, an assistant professor in machine learning, received $820,000. She works with Timothy Verstynen, an assistant professor of psychology, and Barnabás Póczos, a postdoctoral fellow in machine learning. Together, they are the principal investigators for research that runs out of the Auton Lab, part of Carnegie Mellon’s computer science department.

According to its website, the lab wants to study “how to efficiently estimate certain important functionals of high-dimensional distributions ... and use these estimators in machine learning algorithms.” By doing so, the researchers can help address the high-dimensional challenge of mapping the brain’s billions and trillions of neurons and neural networks. If their research is successful, it can be applied not only to neuroimaging data analysis, but also to “other scientific fields where collective behavior of a population is important.”

Faloutsos’ research focuses on the big data associated with language processing. Faloutsos teaches in the machine learning department and works with Tom Mitchell, also in the department, and Nikolaos Sidiropoulos of the University of Minnesota. They work on graph mining groups of words, or “strings,” in a matrix to better understand language processing.

Their data contains millions of triplets, each made up of three strings with a subject-verb-object relationship, such as the relationship between the strings “Obama,” “is the President,” and “of the United States of America.” With enough similar data, Faloutsos explained, the computer can extrapolate from “David Cameron,” “prime minister,” and “U.K.” that Cameron is probably a politician. And when it’s extended to “Ben Roethlisberger,” “quarterback,” and “Steelers,” the computer automatically builds another layer and groups Roethlisberger with other people in leadership roles. Mitchell calls this research project the Never Ending Language Learner (NELL).
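To make the triplet idea concrete, here is a minimal sketch of one way such data could be stored and grouped. It is not NELL's code or the researchers' algorithm. The sparse three-way array, the function names, and the "Person A" and "Player A" rows are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy triplets: the first three come from the examples in the article,
# the last two are made-up placeholders.
TRIPLETS = [
    ("Obama", "is the president of", "United States of America"),
    ("David Cameron", "is the prime minister of", "U.K."),
    ("Ben Roethlisberger", "is the quarterback of", "Steelers"),
    ("Person A", "is the prime minister of", "Country A"),
    ("Player A", "is the quarterback of", "Team A"),
]


def index_strings(triplets):
    """Give each distinct subject, verb, and object string an integer id."""
    subjects, verbs, objects = {}, {}, {}
    for subj, verb, obj in triplets:
        subjects.setdefault(subj, len(subjects))
        verbs.setdefault(verb, len(verbs))
        objects.setdefault(obj, len(objects))
    return subjects, verbs, objects


def to_sparse_array(triplets):
    """Store the triplets as the set of nonzero cells of a three-way array."""
    subjects, verbs, objects = index_strings(triplets)
    return {(subjects[s], verbs[v], objects[o]) for s, v, o in triplets}


def group_subjects_by_verb(triplets):
    """Group subjects that appear with the same verb phrase."""
    groups = defaultdict(set)
    for subj, verb, _ in triplets:
        groups[verb].add(subj)
    return dict(groups)


if __name__ == "__main__":
    print(sorted(to_sparse_array(TRIPLETS)))
    for verb, subjects in group_subjects_by_verb(TRIPLETS).items():
        print(verb, "->", sorted(subjects))
```

Grouping by an exact verb phrase is the crudest form of the idea. The extrapolation the article describes would need a way to notice that different phrases, such as "prime minister" and "president," play similar roles.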

Related to NELL and included in the NSF grant is a project in neurolinguistics that attempts to understand how human brains process language. The researchers use functional MRI to view how the brain responds to a word, then try to predict how the brain will process other words, studying how those words relate to one another. The data associated with this research “will force us to develop new algorithms or to rethink algorithms and new theories that help us analyze new data sets that were almost impossible to analyze before,” Faloutsos said.