Sunlight, an organisation which complains about this often enough, has much better tools at its disposal than complaint. As people using computers in 2010, we all have better tools for PDFs than the ones we currently use. We often complain about how inaccessible PDFs are without doing the basic, simple, automatable tasks that would make them readable.

Open the PDF in Acrobat, press the “Recognise text using OCR” button, and it becomes searchable; Sunlight could republish the result for everyone to use (or put up a webservice which adds the OCR text in such a way that when you search, what gets highlighted is the relevant bit of the page where the OCRed text matches). That is possible now.
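The search-and-highlight part of that webservice idea is straightforward once the OCR text exists. A minimal sketch, assuming we already have one OCRed text string per page (the page texts, the function name, and the `**…**` highlight markers here are all invented for illustration):

```python
def search_ocr_pages(pages, query, context=30):
    """Return (page_number, snippet) pairs for pages whose OCR text
    matches the query, with the match wrapped in **...** markers."""
    query_lower = query.lower()
    results = []
    for number, text in enumerate(pages, start=1):
        position = text.lower().find(query_lower)
        if position == -1:
            continue  # no match on this page
        # Build a short snippet around the match for display.
        start = max(0, position - context)
        end = min(len(text), position + len(query) + context)
        matched = text[position:position + len(query)]
        snippet = (text[start:position] + "**" + matched + "**"
                   + text[position + len(query):end])
        results.append((number, snippet.strip()))
    return results

pages = [
    "Meeting with staff re: budget outlook.",
    "9:00 am Call with Chairman Bernanke re: markets.",
]
print(search_ocr_pages(pages, "bernanke"))
```

A real service would keep word coordinates from the OCR engine so it could highlight the matching region of the scanned page image, not just the text — but the principle is the same.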

But, as a community, we prefer to stick to the notion that anything in PDF is utterly locked up in a way which no one can get at.

It’s not (really).

It is far from ideal, it’s a bugger to use, and it is not the best format for most things, but it’s what we’ve got. And showing how valuable this data is will get us far further than complaining that we can’t read a file that most people plainly can read in the tools they use. It’s the tools we choose to use that are letting us down. As a movement, open data has to get better at using them; then this will be less of a problem, and we can spend more time doing what we claim to want to do.

I appreciate the response, but I disagree. Nothing Sam says about what technology makes possible is wrong, per se. And better tools are of course useful and desirable. But the last thing I want is for government to begin thinking that OCR can make up for bad document workflows. It simply can’t: even though it happens to work well on the Geithner schedule, OCR remains a fundamentally lossy technology.

If there were reasons to expect that to change, the situation might be different. But it’s been 80 years now, and this still isn’t a solved problem — and it’s not as if the quality of OCR results is bounded by computational power. To me this sounds like a recipe for continued arduous, incremental improvement. So there probably aren’t any magic bullets forthcoming. And as long as that’s the case, I think it’s smarter to try to encourage better practices than it is to try to brute force our way through the mess government produces.

This is especially true for a document like the Geithner schedule which, frankly, isn’t all that valuable. I’m very pleased that Pro Publica put it in a better format — people were clearly interested in the schedule, and the effort is a nice showcase for what DocumentCloud can do. But making this document easier to read only conveys limited benefits — reading it was never the problem. The hard part was knowing that we wanted to read it — and, just as important, knowing that we didn’t want to read his other schedules instead. It’s not sufficient to OCR documents we already know we’re interested in: we would need to OCR every document, because identifying interesting documents is where the real costs lie.

Right now we mostly rely on reporters to comb through this stuff, looking for morsels that might be of interest. This works pretty well, but some things will inevitably be missed. Technology can help with this problem, though — it’ll never be able to perfectly predict a document’s importance, but for certain classes of this problem, like looking for notable names in officials’ schedules, it can work very well. From Sunlight’s perspective, this means faster, cheaper and more complete monitoring of our government.
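The “notable names” case is a good example of how simple that kind of monitoring can be. A toy sketch, assuming we maintain a watchlist of names and have text for each schedule line (the names and schedule entries below are invented for the example):

```python
import re

def flag_notable_lines(schedule_lines, watchlist):
    """Return (line_number, name) pairs for every watchlist name
    that appears in a schedule line."""
    # Precompile a whole-word, case-insensitive pattern per name.
    patterns = {name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
                for name in watchlist}
    hits = []
    for number, line in enumerate(schedule_lines, start=1):
        for name, pattern in patterns.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits

schedule = [
    "8:30 am Senior staff meeting",
    "10:00 am Meeting with Lloyd Blankfein",
    "2:00 pm Call with Jamie Dimon re: regulation",
]
print(flag_notable_lines(schedule, ["Blankfein", "Dimon"]))
```

Real documents would need fuzzier matching (OCR misreads, name variants), which is exactly where the lossiness of the input stream starts to cost you.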

That’s the dream, anyway. I think it’s attainable. But as we collaboratively build this system, we ought to start from a solid foundation. Brute-forcing our way to a lossy input stream is clearly a kludge; it’s a bad design pattern, and we should avoid it if we can. So sure, let’s figure out smarter ways to use OCR; but more important is to push for the ability to use it less.