Harvard Dives Into Data Science

David Parkes and Francesca Dominici, co-directors of Harvard's new data-science initiative

Photograph by Kris Snibbe/Harvard Public Affairs and Communications

The two faculty leaders of a new data-science initiative announced today that Harvard aims to build a significant data-science institute in Allston to support research, education, and entrepreneurship in a rapidly growing field that University leaders say is clearly “a new discipline.” In the context of the University’s aspirations for Allston development, creating such an institute would provide a new intellectual and physical commons for collaboration among almost every school at Harvard, and particularly among those physically located there, including the Harvard Paulson School of Engineering and Applied Sciences (SEAS, which will have a new home on Western Avenue by 2020[4]), Harvard Business School, the i-lab incubator and its affiliates, and research-intensive businesses that the University expects to attract to its Allston “enterprise research campus.”[5] (For an overview, see Harvard Magazine’s “Why ‘Big Data’ Is a Big Deal.”[6])

The institute and the planning leading to it represent an opportunity to bring faculty and students together in new interdisciplinary collaborations, harnessing the methodological and computing expertise that resides in departments such as statistics, biostatistics, and computer science to create new knowledge.

Colony professor of computer science David C. Parkes[7], area dean for computer science in SEAS, and professor of biostatistics Francesca Dominici, senior associate dean for research at the Harvard Chan School of Public Health (HSPH), have been appointed co-directors of the data-science initiative. They will lead the effort to launch the data-science collaborations, work that, Parkes explained in an interview, will be “the path to get us to the institute. That is what we’re excited about. What we are doing now will enable us to explore what we should be doing when the institute launches.”

With Harvard’s array of leading professional schools and affiliated hospitals and institutions, each with its own data-science expertise, the University is well positioned to pursue data science, and has many of the resources already in hand. “We need to somehow bring them together,” Parkes said, “and go from this distributed excellence” to creating a place where the people and their skills “sing together.”

The initiative proceeds, he said, with the understanding that data science is a demanding discipline that “can be done badly.” To be done right, data must be handled correctly, experiments properly designed, and the data presented in ways that are comprehensible and compelling. Harvard has the ingredients needed to do “world-leading data science: methodologists; compelling applications questions; and contextualization, the context of society,” Parkes said. “That is the magic triangle.”

A Trio of Master’s Degrees

in data science, in the Faculty of Arts and Sciences (FAS). (The latter program will also be open to Ph.D. candidates to pursue as a secondary field.)

Professor of economics and of statistics Neil Shephard, chair of the department of statistics (FAS), and Walcott professor of biostatistics Xihong Lin, chair of the department of biostatistics (HSPH), co-chaired an education committee that facilitated the coordinated design and launch of these three-semester master’s degree programs. The first two will be offered in the fall of 2017, while the FAS program will enroll students a year later, in the fall of 2018.

The FAS degree, approved unanimously by a faculty vote in March, illustrates the kinds of skills and competencies that will be taught in a typical data-science curriculum, and that will be of interest to business leaders. The program is designed to

roundly educate the next generation of leaders to be knowledgeable not just in the basics of data handling and calculation. Students will learn the principles of experimental design, visualization, massive and distributed computing, and software engineering. They will fully comprehend the implications of dependencies within databases, appreciating the difference between cause and correlation while gaining an understanding of the legal, ethical, and computational issues associated with data privacy and security[14].

Built around a series of core courses that range from the technical to the highly technical, the program will welcome students in degree programs at the schools of government, business, or education, enabling them to take the courses most appropriate to their interests and prior training.

“But education,” Parkes emphasized, “is only one part” of the data-science initiative (which has so far not committed to fund new faculty appointments). The cross-University planning committee for the initiative also recommended development of “a career track for professional data scientists who can provide the lubrication, the acceleration, the catalyst to do the things that make the difference between a successful project or an unsuccessful project.” For example, if a researcher gets a “messy dataset that would take…a long time to restructure, to move onto the right hardware, or to be processed by the right algorithms,” he explained, “these are the people who can help to accelerate that part of the process.”

Francesca Dominici elaborated on the role of these professional research data scientists: “There is an enormous amount of expertise and human resources needed to get a useful signal from messy data. We need professionals who can work closely with the faculty and with the students to unlock bottlenecks”; even Ph.D. students and postdoctoral fellows could get bogged down by such specialized work, and they are constantly turning over—creating what she terms a “sustainability problem.”

“There are a lot of these people actually…at the University already,” she continued. “But the problem right now is that they tend to be locked into individual labs, they are a little bit isolated, and there is not a way of connecting with each other. Step number one for the initiative is to start identifying who all these people are around the University, connecting them with each other, and working with administrative offices…to create a career path, so that at least we can cultivate what we already have—and then to bring in additional individuals” who are at the forefront of the field.

Pointing to her own research, in which she analyzed billions of Medicare claims from the past 20 years and linked increased daily hospital admissions to higher levels of air pollution, she said, “I was able, for example, to hire one of these people, who has been the most valuable member of my team. If they can connect to each other, they will know where the pockets of data are, how to access new data, new software, how to access very focused expertise, whether it is in data linkage or in accessing new data-privacy software.”

The goal is to embed these professionals in the as-yet-unfunded data-science institute. For now, two spaces are being outfitted to serve as homes for the data-science initiative: one in the Science Center in Cambridge, the other in the Countway Library in the Longwood Medical Area. “These are spaces to help us bootstrap what we want to be,” Parkes said, “until we get to a permanent space.” The institute is expected to eventually occupy a building near the new SEAS facility, though the exact location has yet to be determined.

An obvious proponent of an Allston location is Harvard Business School (HBS) dean Nitin Nohria. In his annual January letter[20], he noted the synergies that having SEAS nearby in Allston will enable, and described two joint faculty research symposiums that took place at HBS in 2016[21]. “This is but one step toward finding areas of mutual benefit—ultimately, perhaps, culminating in a new degree proposal, but leveraging ways to get to know and learn about one another in the interim,” Nohria wrote. And he pointed out, “An Allston enterprise zone that realizes the potential of the interrelationships between HBS, SEAS, a University-wide Data Science Initiative, and the i-lab ecosystem will be transformative not just for us and Harvard, but for all of Boston and Cambridge.” It is perhaps notable that HBS has exceeded and formally increased its $1-billion capital-campaign goal, and so may be able to help realize University data-science aims, one of the priorities bruited about throughout the fund drive.

A second strategy for uniting Harvard’s far-flung data expertise is the data-science postdoctoral fellows program, “one of the largest in the world, to our knowledge,” said Dominici. Offers have been made to a class of eight inaugural fellows, all of whom were asked to identify between two and four faculty members with whom they would like to work, from any combination of schools or departments. “Applicants identified people from the departments of psychology, social sciences, biomedical sciences, computer science, statistics, and so on,” she said. “So, already this encompasses 25 to 30 faculty members that will be mentoring these people” and coordinating work with each other through these fellows, “on top of the 55” professors who are already working on the data-science initiative[22].

Dominici likens the fellows to rocks that stretch across a river. “I was in Hiroshima running beside a river and wanted to get to the other side. But when the nearest bridge is far,” she points out, sometimes the decision is not to bother crossing. But then, “I arrived at a place where there were little rocks…beautifully positioned so that you could jump from one rock to another all the way across the river. These rocks,” she said, referring to the unifying impact of postdoctoral fellows, “are a way to start building the network and the group. When they start connecting with each other, that is the seamless, baby step that we want to get.”

Dominici and Parkes declined to compare Harvard’s data-science plans to those elsewhere, or to put a timeline on the goal of a physical data-science institute in Allston. But, she said (and Parkes agreed), “I think we are going to end at an unprecedented scale and a most important impact.”