DocumentCloud: The innovation $1m in Knight money could buy

Here’s some more information about the Knight News Challenge application by ProPublica and The New York Times that generated some buzz and criticism earlier this month. They’re seeking a $1 million grant to develop an online repository of primary-source documents that anyone could contribute to or take from. I spoke at length with developers at both organizations, and they discussed the technology behind their effort, how it could benefit investigative journalism, and why they’re seeking seven figures to launch the project.

The venture, which is called DocumentCloud, seems like it could vastly improve document-based journalism. (That’s separate from the issue of whether they’re deserving of a News Challenge grant.) At the moment, when a reporter gets her hands on paper documents, the best she can typically do is post them online as scanned PDFs, where they often can’t be searched and will likely be forgotten by the end of the day. Worst of all, it’s a one-sided experience: The reporter drops a dead tree in a forest and has no idea if it ever makes a sound.

DocViewer, which is the technology behind DocumentCloud, promises several features that would address the current failings of the PDF model. It would allow users to run their documents through an OCR (optical character recognition) service that would enable full-text searches of otherwise impenetrable material. DocViewer would then rely on OpenCalais, a web service developed by Thomson Reuters, to tag documents with the names of known people and places found within the text. Any reporter who has ever attempted to wade through a thick stack of paper on deadline will immediately realize how helpful this would be.
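For readers who want to see the idea in miniature, here is a rough sketch of what happens once a scan has been turned into text: each page is indexed word by word so it can be searched, and known names are tagged by page. None of this is DocViewer's actual code. The sample pages are invented, the OCR step is assumed to have already run, and the `KNOWN_ENTITIES` list is a hypothetical stand-in for a tagging service like OpenCalais rather than a call to its real API.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for an entity-tagging service such as OpenCalais:
# a tiny hand-made list of names, not a real API call.
KNOWN_ENTITIES = {
    "Guantánamo Bay": "place",
    "Defense Department": "organization",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Map each word to the set of page numbers on which it appears."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_number)
    return index


def search(index, query):
    """Return the sorted page numbers that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


def tag_entities(pages):
    """Map each known entity found in the text to the pages that mention it."""
    tags = defaultdict(set)
    for page_number, text in pages.items():
        for name in KNOWN_ENTITIES:
            if name in text:
                tags[name].add(page_number)
    return tags


# Invented sample text, standing in for what an OCR service might return,
# keyed by page number.
pages = {
    1: "Memo from the Defense Department about detainee records.",
    2: "Hearing records for detainees held at Guantánamo Bay.",
    3: "Index of records released by the Defense Department.",
}

index = build_index(pages)
tags = tag_entities(pages)
```

With the index built, `search(index, "detainee records")` looks up pages containing both words, and `tags["Defense Department"]` lists the pages that mention that name. A real system would need fuzzier matching (this sketch treats "detainee" and "detainees" as different words), but the shape of the payoff is the same: a scanned stack of paper becomes something you can query.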

“The problem we’re trying to solve here is the problem that TPM Muckraker had when they got thousands of pages of attorney general documents, and then just sort of threw it up online and said, ‘Take a look through this,’” said Aron Pilhofer, editor of interactive news technology at the Times. That effort, which won a Polk Award, broke new ground in crowdsourced journalism, a topic we happen to be discussing in this month’s Lab Book Club. (And the TPM Muckraker blogger who posted those docs, Paul Kiel, now works for ProPublica.)

But the process wasn’t perfect. TPM readers had to navigate large PDF files and post their observations in the comments section of a blog post, which was helpful in the moment but limited in its long-term usefulness. “Those comments become more than just comments,” Pilhofer said. “They become actual data.”

DocumentCloud seeks to make the most of such data by allowing journalists and readers to annotate documents for all to see and benefit from. Think of it as highlighting for the crowd. Pilhofer said the current proof of concept for DocViewer includes an annotation feature that’s similar to the notes users can leave on photographs in Flickr. Users will also be able to link directly to specific pages or even phrases in a document.
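To make "comments as data" a little more concrete, here is one way an annotation tied to a specific phrase might be represented. This is only a sketch under my own assumptions: the `Annotation` fields, the character-offset approach, and the link format are all hypothetical, not anything Pilhofer or Klein described, and the page text and URL are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A reader's note attached to a phrase on one page (hypothetical schema)."""
    page: int    # 1-based page number within the document
    start: int   # character offset where the highlighted phrase begins
    end: int     # character offset just past the highlighted phrase
    author: str
    note: str


def highlighted_text(page_text, annotation):
    """Return the phrase on the page that the annotation points at."""
    return page_text[annotation.start:annotation.end]


def deep_link(base_url, annotation):
    """Build a link to the annotated phrase (the URL format is an assumption)."""
    return (
        f"{base_url}#page={annotation.page}"
        f"&start={annotation.start}&end={annotation.end}"
    )


# Invented page text and a reader's note on one phrase within it.
page_text = "The committee met on Tuesday to review the budget."
phrase = "review the budget"
start = page_text.index(phrase)

note = Annotation(
    page=4,
    start=start,
    end=start + len(phrase),
    author="reader1",
    note="First mention of the budget review.",
)

link = deep_link("https://example.org/docs/memo", note)
```

Because each note carries a page number and a position, it can be sorted, searched and linked to like any other record, which is the difference between an annotation and a comment buried under a blog post.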

To get a sense of DocumentCloud’s potential, take a look at the database of Guantánamo Bay detainees that the Times made public on Nov. 3, when it was accompanied by a 1,500-word story. Each record is linked to relevant government documents that have been made public since “enemy combatants” were first held there in 2002. Pilhofer said the database isn’t using a full-featured version of DocViewer, but it certainly demonstrates the benefit of browsing documents grouped by subject rather than, say, the order in which the Defense Department happened to release them. What’s remarkable about the Gitmo collection, aside from its massive scope, is that the Times has offered up this information at all. As Pilhofer said, “It’s not usually in a newsroom’s DNA to release something like that to the public — and not just the public, the competition, too.”

Scott Klein, the director of online development for ProPublica, said that sharing — a maxim of the Internet, if not of newsrooms — would be the real power of DocumentCloud. The objective, he said, is to maximize the work of collecting documents that’s already been done on a particular topic and allow other journalists to build from there. “How can we collect those documents so the next reporter doing a story on this subject can find this information and use it and display it in a much more satisfying way?” he said.

ProPublica and the Times are asking the Knight Foundation for $1 million over three years to cover their anticipated costs. Klein said expenses would include staff to facilitate the program as well as hosting and bandwidth costs. I asked Pilhofer to respond to criticism of their application leveled by NYU’s Jay Rosen, who suggested that the for-profit Times Company shouldn’t be seeking foundation grants for its journalism. Here’s what Pilhofer said:

I can understand why some would feel that way, but I think it’s more a misunderstanding of what the project is and who it’s intended for…This is a grant submitted by us, but it’s not for us…The project is to create what we’re calling a consortium, some sort of entity that is not The New York Times, that is not ProPublica. Ideally, this will incorporate all sorts of media organizations and bloggers and watchdog groups and universities…If anything, Professor Rosen has it kind of backwards: We’re contributing to this effort. We’re contributing development resources, we’re contributing our time.

Obviously, I’m a fan of DocumentCloud and hope it sees the light of day. But whether they should receive a Knight grant is another question and depends, as my boss Josh asked, on whom the News Challenge is for. Based on the comments at my original post and around the web, it seems like DocumentCloud has generated some resentment among other News Challenge applicants more desperate for funding. One commenter also questioned whether ProPublica’s editor-in-chief, Paul Steiger, has an unfair advantage because he sits on the board of Knight, whose CEO, Alberto Ibargüen, is on the board of ProPublica. That web of ties could certainly help DocumentCloud’s application.

But what will help the project most is that it’s a good idea. And having waded through many News Challenge applications this month, I’ve seen that there’s truly a shortage of good ideas, or at least of ones with clear potential to immediately improve journalism on a broad level. Kristen Taylor, Knight’s online community manager, said as much to me when she visited Cambridge in October. So while $1 million is a lot of money (a fifth of what Knight has committed to spend on News Challenge projects this year), I’d bet that much cash that DocumentCloud will be one of the winners when they’re announced next fall.