Room on Campus for Big Data

By Aaron Krol

November 26, 2013 | Since joining the Gordon and Betty Moore Foundation in 2008, Chris Mentzel has wanted to spur investment in the cause of data science in the American university system. The Moore Foundation is committed to fostering the earliest, most speculative stages of scientific research, the sort of pushing at the barriers of knowledge that is more at home in the academy than in risk-averse private enterprise. Lately, the navigation of increasingly complex and heterogeneous datasets has come to define the boundaries of fields as diverse as genetics, astronomy, political science, and economics. To push the Foundation into this expanding big data landscape, Mentzel participated in the creation of the Data-Driven Discovery Initiative, a Moore Foundation program of which he is now the governing Program Officer.

“The common wisdom,” says Mentzel, “back in 2010 when we really began our investigation in earnest, was that fields like astrophysics pretty much have this figured out. And that really what was needed was for those techniques to be deployed in disciplines that weren’t as, quote, ‘advanced’… We found that that was a false sense that folks had. Essentially every discipline, including astrophysics, has major challenges. They’re different – sometimes related, and sometimes at different scales – but we started seeing that what folks were calling ‘big data’ was really just big in the eyes of the beholder.” Mentzel saw a need for major investments in the academic data science environment even in those fields where the use of big data was already well-established.

The Data-Driven Discovery Initiative was approved in August of 2012, and now, just over a year later, its major pilot project has taken off. At a meeting sponsored by the White House Office of Science and Technology Policy on November 12, the Moore Foundation and the Alfred P. Sloan Foundation announced a $37.8 million grant to a trio of universities, to develop dedicated academic spaces for the practice and future advancement of data science. Invitations to apply were extended to about a dozen universities, but the three chosen – New York University, the University of Washington, and the University of California, Berkeley – stood out for their existing commitments to interdisciplinary work with complex data. These universities will be responsible for hiring new faculty in relevant fields, and for building facilities to house data science efforts, while the Moore and Sloan Foundation funds will be used to bring on new graduate students, post-docs and engineers. “Primarily these funds will go toward hiring folks that exist at this intersection of domain science and methodological sciences,” says Mentzel. “We’ll try to help define what career paths look like for these people.”

A Home for Big Data

Equally important is the establishment of physical facilities for data science. While data experts from domains like computer science, statistics, and applied mathematics have grown increasingly involved in the natural and social sciences, an absence of defined spaces and incentives for these collaborations has held back their impact. “One thing that became very clear,” Mentzel says, “is that there is a need for recreating the water cooler effect – this notion that a lot gets done as you casually talk with people around the water cooler. And it’s this serendipitous connection that was seen as essential.”

One professor who knows firsthand the power of the water cooler effect is David Hogg, an astrophysicist at NYU and the executive director of that university’s Data Science Environment, the institution officially accepting the new grant. Hogg has been involved in numerous research projects that take a comprehensive view of the datasets gathered by astronomical projects like the Sloan Digital Sky Survey, including multiple collaborations with computer scientist Rob Fergus to use computer vision methods in the analysis of massive telescopic photos. Hogg credits the cross-cutting data science environment at NYU for making this research possible. “Just by running into each other at NYU,” he told Bio-IT World, “[Fergus and I] realized we were both working on very similar things. We had large collections of images and were trying to measure things about the world in those images.”

Hogg and Fergus are both associated with the NYU Center for Data Science, a space that could serve as a model for future big data hubs within academia. A core mission of the Center, whose inaugural class of graduate students pursuing the country’s first MS in Data Science arrived in the fall of 2013, is to train a new generation of academics who view big data as intrinsically connected to the natural and social sciences. By nurturing efforts like this, the Moore and Sloan Foundations hope to break down barriers between the “methods” side of data science, where software engineers and mathematicians learn to mine big datasets for valuable information, and the “domains” side, departments like genomics, systems biology, cosmology and particle physics that benefit from complex data analytics but don’t traditionally practice it internally.

Hogg believes that in the best data science applications, learning is not divided between domain experts and methods experts but shared by both. In his own research, he says, “the details of how an astronomical observation is made, and how a project is designed, and the sources of noise – those things all enter into the methods we use and the questions we’re asking.” Computer scientists can’t design useful data mining tools without learning about the underlying scientific problems, and domain scientists can’t take advantage of their data without understanding the analytical tools available. “For me,” Hogg continues, “the methods like applied mathematics and computer science are not black boxes. For me they are intellectual partners, and we’re all learning together.”

As part of the terms of the grant, NYU will invest its own resources in nine new faculty positions for data experts and domain scientists who have used big data techniques in their research. The university is also looking to expand the Center for Data Science to encompass a PhD program and a variety of undergraduate courses. UC Berkeley and the University of Washington will be making similar investments, with each university’s specific projects tailored to its existing strengths and the student community it attracts. The three institutions will also have the opportunity to share resources and discoveries, both through formal annual retreats and through informal hack-a-thons or subject-specific boot camps. An idea that has garnered particular enthusiasm from the universities’ representatives is the creation of a shared physical space that is interconnected virtually across the country. “We plan to build what we call ‘wormholes’ between the institutions,” says Hogg, “where there will be always video conferencing between the three centers. So there’ll be a kind of formal interaction space at the three universities,” where students and faculty can collaborate or share lessons in real time even in the absence of organized events.

Mentzel hopes to see the benefits of these investments spill over past the nominal bounds of this initial grant. “We have to focus within this partnership,” he says, citing the Moore Foundation’s existing commitment to the non-biomedical natural sciences. “But there’s really nothing special about the fields that we’ll be focusing on… the idea is that additional disciplines can benefit from the learning or even the structures that are put in place.”

“This is the base of the Data-Driven Discovery Initiative,” he adds. “So the [lessons] that come out of this program… will be used and deployed in our other grant-making activities.” As the Data-Driven Discovery Initiative expands, its ambitions should come to match those of the data scientists it serves. After witnessing how big data methodologies have transformed the discovery process in astronomy, Hogg is eager for these skills to be seen as foundational to 21st-century research in any field. “The hope,” he says, “is over the next ten or twenty years, our conception as scientists of what constitutes science, or being a scientist, might evolve… We’ve tended to think of people who move the data, and take the data, and manage the data, as being engineers who are sort of on the side. What we’d like to do in this project is bring those people into the scientific endeavor, and treat them as part of the core scientific enterprise.”