Suggested Readings List

A short list of online articles and references on data journalism

This is a list of both useful and eclectic articles and guides to data journalism, in no particular order, though I’ve sorted them into rough categories for now. This list is auto-generated from a Google Spreadsheet.

Daniel Gilbert had spreadsheets with thousands of rows of data. "There was a story there, he was certain. But control-f would not find it." After he took a class on how to use Microsoft Access, he could finally ask and answer the questions he knew were in the data. Gilbert's meticulously detailed series on landowners being cheated by oil and gas companies ended up winning a Pulitzer.

WNYC's political team wanted to know: how much money was being raised by candidates for a state legislative district from within the district itself? John Keefe walks through his use of Fusion Tables and additional database techniques for the analysis and visualization.

One day John Keefe was just browsing the official NYC data site and tucked away some code that involved the city's flood zones. When Hurricane Irene struck, NYC.gov was flooded (with web traffic) and WNYC became an invaluable resource.

This 2013 Pulitzer Public Service-winning story epitomizes the best of watchdog journalism: its exposé of a deadly public injustice sparked swift reform. It also deserves an award for its ingenious do-it-yourself way of collecting the kind of data that, in theory, did not exist.

OK, this is not a traditional data journalism book, though it's one of the best true crime and/or journalism books ever written. Simon does keep and review a tally of the bodies and where they matriculate in the Baltimore justice system throughout the year. Also, there's a memorable passage about the "murder board", a simple visualization that has unintended consequences on the efficiency of the homicide squad.

This landmark study of whether financial disclosure laws were actually being followed was the result of old-fashioned data collection: travelling to Minnesota and manually copying documents before analyzing them in Access.

"That intellectual curiosity. That bullshit detector for lack of a better term, where you see a data set and you have at least a first approach on how much signal there is there...That stuff is kind of hard to teach through book learning. So it’s by experience. I would be an advocate if you’re going to have an education, then have it be a pretty diverse education so you’re flexing lots of different muscles...You can learn the technical skills later on"

When there is no data, build your own. After discovering that no one was keeping track of Washington D.C.'s numerous homicide victims, Laura and Chris Amico simply started counting. The hand-built data source became a community resource as well as a source for data analysis.

"But when you open up the Excel spreadsheets from the VBA, you’ll find lots of figures with little explanation. So we’ve put together this guide to get you started in your pursuit of good local, state, regional and national stories that are to be had in the data. A grounding in Excel basics is needed to follow along."

"Most of what we can do to make the world a better place involves, not doing the unprecedented, but doing what matters and what works, whether unprecedented or not. This might not be as exciting as the unprecedented, but it’s desperately needed."

The importance of limiting colors in a visualization is one of the least obvious principles to grasp. And yet, once it's explained to you, you'll immediately look down on everyone else who fails to see the tackiness.

Google Refine is now OpenRefine, and its interface and feature set have improved quite a bit since I wrote this essay. But it's still a good overview of how a single digital tool can make a data-intensive investigation possible.

This says everything you need to know about the complexity and annoyance of real-life data: "For example, in the Senate it is possible for the Majority Leader and Minority Leader to alter the rules of math when it comes to how many senators constitute a three-fifths majority."

Citizen data journalism at its best, from John Krauss: "I already knew that they were finally releasing -- after the Council forced them to -- crash data as idiotically obfuscated PDFs, but reading that they justified this out of concern for 'the integrity of the data,' was so galling that it goaded me into action. I would make the data accessible as friendly, parseable CSVs."

When it comes to large data projects involving messy data, don't always get hung up on which technology to use. Coming up with an efficient way to divide and distribute the data collection and cleaning, even within your own newsroom, can often do the job.

This document is aimed at game developers, but it is the most thorough explanation of how to best incorporate creative writing into a dynamic code-generated experience, while giving writers the greatest degree of freedom possible. It's a great examination of the core facets of good writing and goes deep into the technical implementation.

Simulations don't yet have much place in journalism, but I like this analysis of laundry room behavior because it acknowledges the limitation of analysis and the impact of assumptions, as well as intuitively walking through the design of the simulation (with Julia code snippets). Also, it tackles a question near to all of us.

Saunders's blog is fascinating to me, as he describes his struggles (and triumphs) as a hobbyist programmer in a field - bioinformatics - that you'd think would embrace his type of coding skills. Here, he shows the code needed to analyze the writing tics found in 47 gigabytes of PubMed abstracts.

Because even his elite engineering and mathematics friends didn't understand how easy it is to make a spell-checker, Norvig explains the theory and shows the less-than-a-page-worth of code needed to achieve 80 to 90% accuracy.
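
To give a feel for how little code the idea needs, here is a minimal sketch in the spirit of that approach. It is not Norvig's code. The tiny `WORD_COUNTS` table is a made-up stand-in for word frequencies counted from a large text corpus, the function names are my own, and the sketch only considers candidates one edit away from the input, so it will not reach the accuracy quoted above.

```python
from collections import Counter

# Hypothetical toy frequencies; a real corrector would count words in a big corpus.
WORD_COUNTS = Counter({"the": 50, "data": 20, "spelling": 10, "correct": 8, "corrected": 3})

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string that is one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself if known, else the most frequent known word one edit away."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

Calling `correct("speling")` with this toy table picks "spelling", and a word with no known neighbor is returned unchanged. The design choice worth noticing is that the "model" is nothing more than word counts plus a generator of near-miss strings.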

"This textbook is intended to provide a comprehensive introduction to forecasting methods and present enough information about each method for readers to use them sensibly. We don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details."

"Rap’s history has been traced many ways -- through books, documentaries, official compilations, DJ mixes, university archives, even parties. But until now you haven’t been able to look at the development of the genre through its building blocks: the actual words used by emcees."

"The states where a single vote was most likely to matter are New Mexico, Virginia, New Hampshire, and Colorado, where your vote had an approximate 1 in 10 million chance of determining the national election outcome. On average, a voter in America had a 1 in 60 million chance of being decisive in the presidential election."

"As far as I'm concerned, Gould's The Median Isn't the Message is the wisest, most humane thing ever written about cancer and statistics. It is the antidote both to those who say that, 'the statistics don't matter,' and to those who have the unfortunate habit of pronouncing death sentences on patients who face a difficult prognosis."

In a tournament of freestyle chess, the winners are neither a human grandmaster, nor the best supercomputer, nor even a grandmaster with a supercomputer, but amateurs who best know how to use their mediocre computers.

I don't know what's more astounding: that 30 years ago, before computers were in our homes, Arthur Luehrmann could so clearly describe the difference between being taught by a computer and learning to teach the computer, or that 30 years later, so few people even realize there's a difference.