Data journalism, data tools, and the newsroom stack

MIT’s recent Civic Media Conference and the latest batch of Knight News Challenge winners made one reality crystal clear: as a new era of technology-fueled transparency, innovation and open government dawns, it won’t depend on any single CIO or federal program. It will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, whatever form it is delivered in.

The themes that unite this class of Knight News Challenge winners are data journalism and platforms for civic connections. Each theme draws from central realities of today's information ecosystems. Newsrooms and citizens are confronted by unprecedented amounts of data and an expanded number of news sources, including a social web populated by our friends, family and colleagues. Newsrooms, the traditional hosts for information gathering and dissemination, are now part of a flattened environment for news, where news breaks first on social networks, is curated by a combination of professionals and amateurs, and is then analyzed and synthesized into contextualized journalism.

Data journalism and data tools

In an age of information abundance, journalists and citizens alike need better tools, whether we’re curating the samizdat of the 21st century in the Middle East, like Andy Carvin, processing a late-night data dump, or looking for the best way to visualize water quality for a nation of consumers. As we grapple with the consumption challenges presented by this deluge of data, new publishing platforms are also empowering us to gather, refine, analyze and share data ourselves, turning it into information.

In this future of media, as Mathew Ingram wrote at GigaOm, big data meets journalism, in the same way that startups see data as an innovation engine, or civic developers see data as the fuel for applications. “The media industry is (hopefully) starting to understand that data can be useful for its purposes as well,” Ingram wrote. He continued:

… data and the tools to manipulate it are the modern equivalent of the microfiche libraries and envelopes full of newspaper clippings that used to make up the research arm of most media outlets. They are just tools, but as some of the winners of the Knight News Challenge have already shown, these new tools can produce information that might never have been found before through traditional means.


The Poynter Institute took note of the attention paid to data by the Knight Foundation as well. As Steve Myers reported, the Knight News Challenge gave $1.5 million to projects that filter and examine data. The winners that relate to data journalism include:

Overview is a tool to help journalists find stories in large amounts of data by cleaning, visualizing and interactively exploring large document and data sets. Associated Press data journalist Jonathan Stray called Overview the “Drupal of data visualization.”

I talked more with the AP’s Jonathan Stray about data journalism and Overview in a video interview at the MIT Civic Media Conference. For an even deeper dive into his thinking on what journalists need in the age of big data, read his thoughts on “the editorial search engine.”

The newsroom stack

With these investments in the future of journalism, more seeds have been planted to add to a “newsroom stack,” to borrow a technical term familiar to Radar readers, combining a series of technologies for use in a given enterprise.

“I like the thought of it,” said Brian Boyer, the project manager for PANDA, in an interview at the MIT Media Lab. “The newsroom stack could add up to the kit of tools that you ought to be using in your day to day reporting.”

Boyer described how the flow of data might move from a spreadsheet (as a .CSV file) to Google Refine (for tidying, clustering, adding columns) to PANDA and then on to Overview or Fusion Tables or Many Eyes, for visualization. This is about “small pieces, loosely joined,” he said. “I would rather build one really good small piece than one big project that does everything.”
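To make the first hop of that flow concrete, here is a minimal sketch of the tidy-and-cluster step in Python. It is not the code behind Google Refine or PANDA. It only approximates a key-collision "fingerprint" approach to clustering (trim, lowercase, strip punctuation, sort the unique words), and the sample data, the column names and the helper functions are all hypothetical.

```python
import csv
import io
import string
from collections import Counter, defaultdict

# Hypothetical sample data: one agency spelled three different ways.
SAMPLE_CSV = """agency,amount
Dept. of Water,100
Dept. of Water,30
dept of water,250
"Water, Dept. of",75
Parks District,40
"""


def load(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def fingerprint(value):
    """Build a clustering key: trim, lowercase, drop punctuation, sort unique words."""
    no_punct = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(no_punct.split())))


def tidy(rows, column):
    """Add a '<column>_clean' column holding each cluster's most common spelling."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[fingerprint(row[column])].append(row[column])

    # Pick the most frequent spelling; break ties alphabetically so the
    # result is deterministic.
    canonical = {
        key: min(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for key, values in clusters.items()
    }
    return [
        {**row, column + "_clean": canonical[fingerprint(row[column])]}
        for row in rows
    ]


def to_csv(rows):
    """Serialize rows back to CSV text, ready to hand to the next tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(tidy(load(SAMPLE_CSV), "agency")))
```

The cleaned spelling goes into a new column instead of overwriting the original, so a reporter can always check what the source actually said. Because the output is plain CSV again, the next small piece in the chain can pick it up without knowing anything about this one.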

PANDA and Overview are squarely aimed at bread-and-butter issues for newsrooms in the age of big data. “It’s a pain to search across datasets, but we also have this general newsroom content management issue,” said Boyer. “The data stuck on your hard drive is sad data. Knowledge management isn’t a sexy problem to solve, but it’s a real business problem. People could be doing better reporting if they knew what was available. Data should be visible internally.”

Boyer thinks the trends toward big data in media are pretty clear, and that he and other hacker journalists can help their colleagues to not only understand it but to thrive. “There’s a lot more of it, with government releasing its stuff more rapidly,” he said. “The city of Chicago is dropping two datasets a week right now. We’re going for increased efficiency, to help people work faster and write better stories. Every major news org in the country is hiring a news app developer right now. Or two. For smaller news organizations, it really works for them. Their data apps account for the majority of their traffic.”

Bridging the data divide

Some caution is merited here. Big data is not a panacea, in media or otherwise. Greg Borenstein explored some of these issues in his post on big data and cybernetics earlier this month. Short version: humans still matter in building human relationships and making sense of what matters, however good our personalized relevance engines for news become. Proponents of open data have to consider a complementary concern: digital literacy.

To make open government data sing, infomediaries need to have time and resources. If we’re going to hope that citizens will draw their own conclusions from public data shown in real time, we’ll need to educate them to be critical thinkers. As Andy Carvin tweeted during the MIT Civic Media conference, “you need to be sure those people have high levels of digital literacy and media literacy.” There’s a data divide that has to be considered here, as Nick Clark Judd pointed out over at techPresident.

It looks like those concerns were at least partially factored into the judges’ decisions on other Knight News Challenge winners. Spending Stories, from the Open Knowledge Foundation, is designed to add context to news stories based upon government data by connecting stories to the data used. Poderapedia will try to bring more transparency to Chile using data visualizations that draw upon a database of editorial and crowdsourced data. The State Decoded will try to make the law more user-friendly. The project has notable open government DNA: Waldo Jaquith’s work on OpenVirginia was aimed at providing an API for the Commonwealth.