Open source spotlight: How DocumentCloud adds depth to digital journalism

The corporate media is facing a deep, digitally driven crisis: falling print advertising revenue, growth in online ad spend that hasn't plugged the revenue hole, and a general unwillingness among readers to pay for online news. But while the internet has played a big role in the financial haemorrhaging of the media (in Australia, swathes of editorial staff have been cut over the last few years, particularly at some of the biggest publishers), it has also made possible new ways of doing journalism.

Although plenty of parallels with the Pentagon Papers have been drawn, an organisation like Wikileaks would not be able to function without the internet. The Guardian's much-publicised investigation into British MPs' expense claims used online crowdsourcing to trawl through masses of documents. And interactive elements are a staple of every major metropolitan newspaper's website.

DocumentCloud is an open-source-based project designed to be part of the digital revolution in news reporting.

"It is a site that is designed specifically to help journalists work with source documents, with the ultimate goal of encouraging news organisations to be more transparent in their reporting by ultimately making documents available to the public via the Web," explains Aron Pilhofer, who, along with Eric Umansky and Scott Klein from investigative journalism non-profit ProPublica, founded DocumentCloud. Pilhofer is also editor of interactive news at The New York Times.

DocumentCloud lets journalists organise and annotate collections of documents, such as those received as a result of a freedom of information request, and offers reporting teams a secure environment to collaborate on them, Pilhofer says. "We make publishing those documents as simple as embedding a YouTube video," he explains.

In the wake of the US Supreme Court's June ruling on US President Barack Obama's healthcare legislation, nine of the top 11 news websites used DocumentCloud to annotate and post the ruling online. "That was amazing to see," Pilhofer says.

"DocumentCloud was designed by journalists for journalists, so we have a number of features — a redaction tool, for example — that we knew journalists would need and want," Pilhofer explains.

Behind the scenes the project is driven by software including Apache's Solr/Lucene search platform. DocumentCloud also uses the Tesseract OCR engine developed by HP and open sourced in 2005. "We, in turn, have been giving back to the open source community as well," Pilhofer says.

"Every line of code in DocumentCloud has been released under an open-source licence, and some of our libraries — most notably Backbone.js and Underscore.js — have become wildly popular in their own right.

"In short, we are huge believers in the value of open source software. Without it, DocumentCloud wouldn't have been possible."

The project is funded by the Knight Foundation and Open Society Foundations. Initial funding came through the Knight News Challenge, which offers grants to further media innovation. "DocumentCloud was nothing but an idea when we initially applied, and we really didn't have any idea whether it would work or not," Pilhofer says.

"Knight gave us the chance to find out. Without the News Challenge, I'm afraid to say I don't think DocumentCloud ever would have been possible."

More than 700 newsrooms have signed up to use DocumentCloud, uploading 6 million pages spread across some 420,000 documents. "On a weekly basis, we are seeing about a million document hits per week," Pilhofer says.

The DocumentCloud team is eyeing international expansion, adding support for non-English documents. "I think DocumentCloud could have enormous potential in cross-border investigations," Pilhofer says.

"Imagine how useful it could be for collaborations that involve journalists in many countries. We've seen a bit of that at The New York Times, where we have had collaborations on document sets that involve dozens of journalists, multiple news organisations and cross several continents. I don't think we've begun to scratch the surface of what DocumentCloud can do just yet."

The gap that once existed between journalism and technology, with news gathering and reporting still shaped by a dying medium — print — is narrowing. "You're seeing many newsrooms in the States starting to bring technology and technologists in, including my own," Pilhofer says. "My 'day job' is running a team of technologists in the newsroom of The New York Times as well as the social media and community teams. You are seeing that more and more."

"You are starting to see many good examples of journalists bringing technology to the pursuit of journalism," Pilhofer adds. Another digital tool for journalism he cites is Overview, an open-source data mining and visualisation tool.

"It's designed to be a way for journalists to use complex clustering algorithms as a way to discover interesting tidbits within large collections of documents," Pilhofer says.

"[Overview project lead] Jonathan Stray, who is both a journalist and a very good developer, is leading the way here using essentially the DocumentCloud model: That is, to find ways to bring technology to the journalist, and not the other way around."

"There's lots of amazing technologies out there that computer scientists understand perfectly, and journalists, for the most part, understand not at all," Pilhofer says.

"I think I am most interested in ways to use basic filters and text analysis to create monitoring systems that can flag anomalies when they occur. An example might be a system that constantly looks at campaign donations, and could look for donations that seem out of the ordinary in some ways. This isn't really pie-in-the-sky. It's what forensic accounting is all about: looking at a body of data and applying various smell tests to it."
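To make the idea concrete, one very simple "smell test" of the kind Pilhofer describes might look like the Python sketch below. It is an illustration only, not anything DocumentCloud or The New York Times is known to run: the function name `flag_unusual_donations`, the record fields `donor` and `amount`, the sample figures and the 3.5 cut-off are all assumptions made for the example. The technique is a median-based outlier check (a modified z-score), chosen here because a single huge donation barely shifts the median.

```python
from statistics import median


def flag_unusual_donations(donations, threshold=3.5):
    """Return donations whose amount sits far from the median.

    Hypothetical helper: each donation is assumed to be a dict with
    "donor" and "amount" keys. Uses a modified z-score built on the
    median absolute deviation (MAD).
    """
    amounts = [d["amount"] for d in donations]
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])

    if mad == 0:
        # At least half the amounts equal the median, so there is no
        # spread to scale by; flag anything that differs from it.
        return [d for d in donations if d["amount"] != med]

    flagged = []
    for d in donations:
        # 0.6745 rescales the MAD so the score is comparable to a
        # standard z-score when the data is roughly normal.
        score = 0.6745 * abs(d["amount"] - med) / mad
        if score > threshold:
            flagged.append(d)
    return flagged


# Made-up sample data, for illustration only.
sample = [
    {"donor": "A", "amount": 50},
    {"donor": "B", "amount": 75},
    {"donor": "C", "amount": 100},
    {"donor": "D", "amount": 60},
    {"donor": "E", "amount": 80},
    {"donor": "F", "amount": 25000},
]

unusual = flag_unusual_donations(sample)
print([d["donor"] for d in unusual])
```

A real monitoring system would presumably run a check like this on a schedule against fresh filings and apply several different tests, but the shape is the same: take a body of data, apply a rule, and surface whatever fails it for a reporter to look at.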

Copyright 2015 IDG Communications. ABN 14 001 592 650. All rights reserved.
Reproduction in whole or in part in any form or medium without express written permission of IDG Communications is prohibited.