The Obama Campaign’s Chief Data Scientist on the Future of Civic Data

Rayid Ghani left Accenture to lead Obama for America’s data analytics team. Now he’s going to the University of Chicago to change how cities, nonprofits, and governments use information.

By Whet Moser

Published May 20, 2013

Photo: Courtesy of the University of Chicago

Arguably Chicago’s most important and most successful startup in recent years was the Obama campaign and its celebrated data, analytics, and technical staff, which attracted some of the city’s best brains and re-elected a president.

Basically every politico I’ve ever talked to describes campaigns as start-ups: a small group of smart people working impossible hours and trying to remake the industry on the fly. But unlike start-up companies, campaigns have a strict end date, disappearing as quickly as they’re begun, and leaving a group of top talent on the job market.

What the Obama campaign’s talent will do has been a topic of speculation. And the chief scientist of the campaign’s data analytics team has chosen his path. Rayid Ghani, who worked as a director of analytics research at Accenture before joining Obama for America in 2011, will be working on a host of projects at the University of Chicago. He’s taking a role as chief data scientist for the new Computation Institute’s Urban Center for Computation and Data (a multidisciplinary organization focusing on the use of data at the civic level), directing a data science fellowship for social good, and working on collaborations with the Harris School of Public Policy and the city.

I had the opportunity to ask Ghani some questions about the future of civic data and the open government movement, the use of machine learning and public data to improve civic services, the campaign’s use of data, and more (our conversation has been edited and condensed).

Where can non-profit and social organizations improve in their data use?

The reason this area is exciting is that non-profits and other social organizations have spent the past few years collecting data, because there’s an understanding that it’s important to collect, and cheap to collect. But what they haven’t been able to do is three things:

One is to really figure out how to ask the questions they really want to answer. They might have a question about, “how can we be more effective in this area?” That’s not a data question directly. You have to first translate that into a question: how do you frame that problem?

Second: they haven’t been able to hire the people to solve those analytics and data problems. There aren’t enough of them, and they’re in great demand. And they don’t necessarily think that their skills are useful in the nonprofit and social world. So they go off mostly into the corporate world.

I think the third thing is that the ones that have these resources and the right data haven’t really been able to figure out how to turn that into things they can act on. A lot of the people in the nonprofit world haven’t really been in an environment where decisions are made based on data.

So the primary goal for the fellowship is training the fellows in solving real problems that have a social impact and getting them excited about this area, but also helping nonprofit and social organizations learn what it means to do an analytics project: how do you work with these people, how do you talk to them, how do you get them interested in these projects, and how do you take action based on the outcomes of these types of analyses?

One thing you’ve talked about is that one of the great successes of the Obama campaign is not only that you had a strong analytics team, but that you were able to get that data throughout the organization. What’s the civic equivalent of that?

You don’t have the same deadline-focused enthusiasm and passion. If you look at an education nonprofit, people are passionate about education, but because there’s no deadline—you can’t get millions of people to go and do these things right at the same time—you have to get the information out at multiple levels.

One is your core group of people: teachers and guidance counselors. You can have all the sophisticated analytics in the back, but you want to make it very easy for them to get this information so they can act on it.

The second is more on the grassroots side: people who care about an issue. They might be really good advocates for it. For teachers and guidance counselors, it’s their job to do this. But for other people in the neighborhood and communities, friends and relatives, just getting that information out into the communities is a channel we haven’t really looked at enough. People who care about an issue or the community may be interested in finding out how they can help.

If you can get the right information to them, the biggest value of analytics is that you can scale very easily. The same way that in the campaign we were able to get information to millions of people so they could then have in-person conversations at home, the only way you can do that is with the kind of analytics we were working on.

What kind of data, hypothetically, would you present to the community?

Let’s go back to the hypothetical high-school dropout example, where we’ve figured out the individuals or kinds of people who are at risk, and the neighborhoods that are at risk, and a kind of intervention that actually helps. It could be that getting people more involved in park district programs helps. It could be as simple as basketball.

Or connecting them with other people like them: if you’re a high-potential student, and you live in an area where there aren’t students like you, but there’s somebody a couple miles away, we could have kids mentoring other kids.

You can use the community to hold events that improve those kinds of outcomes. If we have data around what improves certain outcomes, then these can be passed on to all these different organizations, and influence their programs.

We’ve got a pretty robust open-data community in Chicago—what recommendations do you have for the people in it?

I’ve had a lot of conversations with people in the open-data community. What they’ve been really good at is exposing a lot of data sets—the data has been made accessible by the city and the government, and they’ve been good at presenting it in easy-to-understand ways, like a map or a web app that shows where the buses are.

That’s been good because the data is now in front of people—the normal consumer isn’t going to go into a portal, download a file, and open it in Excel. So what they’ve been really good at is bringing that data to the people.

I think the next thing is using that to make inferences, to make predictions, to improve certain outcomes. So it’s great that you can look at this data and see where the buses are, but the next step is to ask “can I improve the bus routes? Can I work with CTA to find better scheduling?” What I’m pushing them towards is taking the same data they’ve been having people look at, and asking “how can I improve the process that’s generating this data?”

[For example], the 311 calls that are coming in: does that tell me about the state of different neighborhoods in the city? How often the calls are coming, how often the calls are being taken care of, what kinds of calls in what areas…and really help the people who own the process that generates the data.

The next step is not just looking at it, but using it to make predictions about the future and improving the outcomes for the people who are consuming those services.
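The neighborhood-level reading of 311 data that Ghani sketches—call volume, how often calls are resolved, how long resolution takes—can be illustrated with a toy aggregation. The neighborhoods and call records below are invented for illustration, not drawn from the actual Chicago 311 feed:

```python
from collections import defaultdict
from datetime import date

# Hypothetical 311 records: (neighborhood, date opened, date closed or None).
calls = [
    ("Austin",    date(2013, 5, 1), date(2013, 5, 3)),
    ("Austin",    date(2013, 5, 2), None),
    ("Lake View", date(2013, 5, 1), date(2013, 5, 2)),
]

# Aggregate per neighborhood: total volume, how many were resolved,
# and total days-to-close for the resolved ones.
stats = defaultdict(lambda: {"total": 0, "closed": 0, "days": 0})
for hood, opened, closed in calls:
    s = stats[hood]
    s["total"] += 1
    if closed is not None:
        s["closed"] += 1
        s["days"] += (closed - opened).days

# Summaries like these are the raw signal for comparing neighborhoods.
for hood, s in sorted(stats.items()):
    rate = s["closed"] / s["total"]
    print(hood, s["total"], f"{rate:.0%}")
```

Differences in these summaries across areas are what would feed the kind of inference Ghani describes: flagging neighborhoods where service requests pile up or go unresolved.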

What differences do you see this data making?

I think it will make the people living in the city more informed; it will make people question things more, but also appreciate when they see services happen. It makes better, more informed consumers—they trust the city more because the data is more transparent, and you can’t easily hide certain things that have been hidden in the past.

That’s one dimension of how things will change. The other is that, hopefully, it’s going to result in the city providing more data-driven services that will be more efficient and effective at the things they provide.

The kind of things that will be changing in five years are places where you have the ability to take somewhat more real-time action, where the data is most detailed. What’s going to be harder to do is more long-term changes. It’s easy to do quick fixes, harder to change how things are done at the very core.

One of the challenges is that the city doesn’t really have data about people. The city has data about buildings, or buses, or cars, and that makes things challenging, because a lot of the interesting data is about people. So there’s going to be a lot of collaboration between the city, the county, and the state. And I think that’s where things hopefully will move to—that collaboration will lead to the real breakthroughs in services.

You’ve said before that the Obama campaign’s use of commercial, proprietary data has been overrated in the press. Could that kind of data be used at the civic level?

Most of the data we had wasn’t really from commercial databases. Our basic data was voter registration data—all those things about name, address, phone number, all the things you fill out on the registration form. The other set of data was whether you voted and in which elections, which is also public data.

Pretty much everything else we got was through our own data collection efforts, so it wasn’t from proprietary commercial databases. If you gave money to us, we knew how much money; if you volunteered with us, if a volunteer called you and asked you who you support and what you care about, we collected that data. So most of the data we used was data we collected on our own, or that was public through voter registration.

There really wasn’t that much commercial data. People have that assumption, because a lot of marketing companies tend to use commercial data; we didn’t really use it because it wasn’t that useful to us, because when you’re trying to predict people’s voting behavior, their past voting behavior is the best predictor, not what magazine they read, and what pets they have, and what car they drive.

Do you think it might be possible for cities to design the equivalent of that for civic use?

I think so. People are not inherently against giving data if you are able to explain to them what you are doing with that data, and if you are able to give them something in return.

If you wanted to give them a particular service, [you ask] “where do I send you that service?” What you’re asking for is information that allows you to give me that service. And that happens all the time, right? If you’re looking at students, they fill out the FAFSA to get financial aid. You fill out your income information to get a credit card.

And the city is not any different as a service organization. As long as the city uses that data to improve the service it provides… it’s an empirical question, but other organizations have shown that if you’re improving the service you’re giving people, they’re willing to give you more, because they’re getting a benefit in return.

One thing you’ll be working on is the use of machine learning to improve city services. What does that mean and how do you see it working?

In my mind machine learning is sort of a synonym for data and analytics and statistics… if I have data for a particular process and I’m using computer algorithms to improve it, that’s machine learning. So from the public policy side, public policy for me is very much things that concern individuals in a society: social problems. And then you aggregate them up to define policy.

Let’s say you’re looking at health care problems, like childhood obesity. Right now the U.S. has these growth curves, where you have height and weight for babies. So you look at your baby and you say, “ah, it’s in the 80th percentile.” Today those curves are completely independent of background, gender, ethnicity, parents’ height and weight, when clearly there’s no single generic growth curve.

What you do have is individual data on all the kids in this country about how they’ve grown, and you use that to build more personalized growth curves.

That’s the first step. The way the machine learning comes in is you look at the data and detect, for this particular baby, for your background and your genetic history: relative to the national growth curve you’re fine, you’re about average, but you have a very high risk of obesity. You would predict who is at risk for those kinds of outcomes, and then you would come up with interventions to change that outcome. And then you would weigh them again in two weeks, or a month, and see how that’s changed—whether the intervention is working or not.

The machine learning is to predict the risk of certain kinds of outcomes, come up with an intervention that’s designed to change and improve that outcome, and then get feedback as you do that over time, to then change your predictions and interventions and keep going… pretty much forever.