PovcalNet Unchained – Justin Sandefur and Sarah Dykstra

If data wants to be free, then PovcalNet, the world’s leading dataset on global poverty, is happier today: it was recently made available for bulk download by my guests on this week’s Wonkcast, CGD research fellow Justin Sandefur and research assistant Sarah Dykstra. Scraping the data was no easy task. It required devising code that queried the database for one answer at a time, 23 million times over nine weeks, and then reassembling the 8 million resulting data points into a single dataset. They then posted the dataset and a related paper online for researchers around the world to use.
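For readers curious what "one answer at a time" looks like in practice, here is a minimal sketch of the general pattern, not the code Benjamin Dykstra actually wrote. Everything in it is an assumption made for illustration: the function names, the idea that each query is a combination of country, year, and poverty line, and the convention that a query with no answer returns `None`. The stand-in `fetch_headcount` makes no network request, so the sketch runs offline.

```python
from itertools import product


def fetch_headcount(country, year, poverty_line):
    """Hypothetical stand-in for a single query against the database.

    A real scraper would issue one web request here; this placeholder
    returns None so the sketch runs without any network access.
    """
    return None


def scrape(countries, years, poverty_lines, fetch=fetch_headcount):
    """Ask for every combination one at a time and collect the answers."""
    rows = []
    for country, year, line in product(countries, years, poverty_lines):
        value = fetch(country, year, line)
        if value is not None:  # assumed convention: None means no answer
            rows.append({
                "country": country,
                "year": year,
                "poverty_line": line,
                "headcount": value,
            })
    return rows


def implied_rate(n_queries, weeks):
    """Average queries per second needed to finish within the given weeks."""
    seconds = weeks * 7 * 24 * 60 * 60
    return n_queries / seconds
```

If the 23 million queries were spread evenly over nine weeks of continuous running, `implied_rate(23_000_000, 9)` works out to a little over four queries per second, which gives a sense of why this was not a job for pointing and clicking.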

Justin and Sarah tell me that they were motivated to scrape the PovcalNet website in part because they needed the full dataset for their own research, and in part because they knew other researchers had a similar need. Lacking the full dataset, they and others previously had no option but to spend hours pointing and clicking, one number at a time, to get the specific information they needed. (The code needed to run the queries was beyond what we could manage here at CGD, so the pair turned to Sarah’s brother, independent programmer Benjamin Dykstra.)

Since the individual data points were already online, albeit not in a readily accessible format, the project involved no “hacking.” I ask whether they first tried simply asking the World Bank for the dataset. Justin explains that “...the underlying raw data isn’t even available to many researchers within the Bank.”

“There’s a lively internal debate in the World Bank about whether or not this data should be public,” Justin tells me. “But not all data that the World Bank has are covered by the open data policy…it was pointed out to us that PovcalNet is not.”

Justin says that the entire process illustrates the importance of making research data publicly available.

“We’re living in a new era where there are a lot of people participating in this analysis and this conversation, and a million eyeballs can find lots of mistakes,” Justin says. “So let’s put all the data and the code in the public domain and open up that conversation.”

So, what exactly was the World Bank’s response to their efforts and the resulting new poverty estimates?

“Annoyance is probably the right word,” Justin says. “The stance of the research department now seems to be, reading between the lines, that ‘we don’t really trust these [new PPP] numbers, and we’ll reserve judgment on whether we should use them yet.’”

It’s an exciting story, with some unexpected twists and turns. To hear it, and learn what Justin and Sarah have planned next, tune in to the full Wonkcast.