Using Data in Investigative Reporting

Data are typically a bunch of numbers. But the best examples of data journalism sometimes contain very few numbers – or none at all.

How can that be?

Ron Nixon, a Washington correspondent for The New York Times, talked through the hows and whys of data journalism in a session with Paul Miller fellows at the National Press Foundation. He said some data-focused stories are so packed with numbers that they’re indecipherable – akin to the sci-fi movie “The Matrix.”

Instead, he points to a definition by a former Times colleague who is now at Arizona State University, Sarah Cohen: “Data reporting is deeply rooted in investigative journalism and isn’t just about statistics; it’s investigating how a system works compared to how it’s supposed to work.”

As Nixon added: “Everything for me is people-centered. Data for me is just another source. I’ll have a graf or two about the data, but most of the story is people.”

Nixon (bio, Twitter), whose work has focused on a range of investigative topics with a recent emphasis on homeland security, led fellows through a description of what tools to use, and why to use them. He also has experience as a data expert at the Minneapolis Star Tribune and Investigative Reporters and Editors, which maintains a library of federal databases and trains journalists in the practical skills of getting and analyzing electronic information. He described the methods he uses to decide when to dive into a data story – as well as what to avoid.

Over his career, he has used statistical software such as SAS and R; spreadsheet programs such as Excel; ArcGIS for mapping; SQLite for simple database work; and Python or Ruby for heavy data analysis.
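To illustrate the “simple database work” that SQLite handles, here is a minimal sketch using Python’s built-in sqlite3 module. The table and figures are hypothetical, not from any story mentioned above; the point is the group-and-sum query pattern that underlies much basic data reporting.

```python
import sqlite3

# Hypothetical example: a small table of contract records, the kind of
# dataset SQLite is well suited to exploring.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (agency TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?)",
    [("DHS", 120000.0), ("DHS", 80000.0), ("DOT", 50000.0)],
)

# Group and sum: which agency spent the most?
rows = conn.execute(
    "SELECT agency, SUM(amount) FROM contracts "
    "GROUP BY agency ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('DHS', 200000.0), ('DOT', 50000.0)]
```

The same query would work unchanged against a SQLite file holding millions of rows, which is why reporters often reach for it before heavier tools.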

He counsels journalists not to be afraid of the math inherent in data-focused stories – despite many reporters’ and writers’ well-known phobia of numbers and stats. For starters, he said, it’s vital these days to know math and statistics, since they help run the world and are the language of policymakers.

But beyond that, it doesn’t have to be that complicated. “A lot of this is the same math you would have done in the eighth grade,” he said.
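The kind of eighth-grade math he means is mostly rates and percent change. A quick sketch, with made-up budget figures for illustration:

```python
# Hypothetical figures: year-over-year change in a city budget line.
old_value = 4_200_000
new_value = 4_830_000

# Percent change = (new - old) / old * 100 -- eighth-grade arithmetic,
# and the workhorse calculation of most data-driven stories.
pct_change = (new_value - old_value) / old_value * 100
print(round(pct_change, 1))  # 15.0
```

The same arithmetic works in Excel (`=(B2-A2)/A2`) or any of the other tools Nixon mentions.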

Beyond simple data tools such as Excel and SQLite, reporters sometimes need to up their game and use more sophisticated and powerful tools. For those cases, Derek Willis of ProPublica learned how to build his own databases, often by teaching himself the programming skills necessary to do so.

Of course, that doesn’t mean he’s savvy enough in programming that a tech start-up is in his future.

“My programming talents are in the service of journalism, not the service of programming,” Willis (bio, Twitter) said.

What he does is learn by doing: “The way you use tools such as programming languages is you use it badly, and then you use it less badly. Eventually, you hope to be adequate.”

Much of programming deals with repetitive tasks, such as retrieving the same kind of data from multiple websites. Willis explained “web scraping,” the process of extracting data from websites in an efficient, automated way.

“If there are parts of your reporting data that involve repetitive tasks, consider automating part of it,” he said.

He uses the programming languages Perl, Ruby and Python; any of them will work – just pick the one you’re most comfortable with (a wiki comparing them is here).
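The scraping pattern Willis describes – applying the same parse to page after page – can be sketched in a few lines of Python using only the standard library. The table here is an inline sample, not a real page; in practice you would fetch each URL with something like urllib.request.urlopen and feed the response into the same parser.

```python
from html.parser import HTMLParser

# A stand-in for one fetched page; real scraping would loop over URLs.
SAMPLE_PAGE = """
<table>
  <tr><td>Smith</td><td>$1,200</td></tr>
  <tr><td>Jones</td><td>$950</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of each <td>, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows, self.current = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current:
            self.rows.append(self.current)
            self.current = []

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.current.append(data.strip())

parser = CellCollector()
parser.feed(SAMPLE_PAGE)
print(parser.rows)  # [['Smith', '$1,200'], ['Jones', '$950']]
```

Run the same parser over a hundred pages and the repetitive task is automated – which is exactly the payoff Willis describes.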