On Monday the House of Representatives delivered, as promised, an electronic dump of House Expense Reports. We at Sunlight Labs had a plan. We knew it was going to be a huge PDF, but we had all the infrastructure in place. We had plenty of bandwidth, and we knew when the data was coming out, roughly how it was going to look, and that we likely wouldn’t be able to parse it all with computers. “We’ll use TransparencyCorps,” we thought, to get that last mile out of the data, so that eventually we’d end up with a parseable database.

Then it dropped. All 3,000 pages. And we started working on our plan. Two of our engineers started taking a look at the data, and by the end of the evening they’d given up. They were chunking up the data and trying to parse the PDF files, to no avail. Weird columns inside the PDF prevented us from parsing it. I even contacted some folks at Adobe, and they responded with “wow, yeah, wish they attached the source document.” I sent a desperate plea to our Google Group, which got a few responses.

Early the next morning, one of our newer hires, without the “seasoned open government chops” of some of our other tech staff, claimed to have a solution. Luke opened the file, hit select-all, hit CTRL+C, opened up a text editor, and hit paste. By golly, what came out was parseable text. The other two didn’t believe it. Immediately they went to work. Shortly thereafter, Josh Tauberer chimed in on the labs list with a tip on using a special option in pdftotext.

A better, more accurate solution than copying and pasting had come about. And presto, a day later, we had our House Disbursements online.
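For readers who want to try the pdftotext route themselves, here is a minimal sketch in Python. It rests on assumptions: the post doesn’t say which option Josh suggested, so the `-layout` flag (a real pdftotext option that keeps the physical layout of the page) is a guess on our part. The function names, the file paths, and the rule of splitting columns on runs of two or more spaces are all hypothetical illustrations, not the code we actually ran.

```python
import re
import subprocess


def pdftotext_command(pdf_path, txt_path):
    # ASSUMPTION: -layout is a real pdftotext flag, but the post does not
    # say it is the option that was actually used.
    return ["pdftotext", "-layout", pdf_path, txt_path]


def split_columns(line):
    # Treat any run of two or more whitespace characters as a column gap.
    # This is a hypothetical heuristic; real report pages may need more care.
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def extract_rows(pdf_path, txt_path):
    # Requires the pdftotext binary (from Xpdf or Poppler) on the PATH.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        # Skip blank lines and split the rest into column cells.
        return [split_columns(line) for line in handle if line.strip()]
```

Calling `extract_rows("disbursements.pdf", "disbursements.txt")` (both file names made up here) would give back one list of cells per non-blank line, which is a reasonable starting point before any hand cleanup.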

It’s a lesson in humility: there’s no way we can know it all, and sometimes the simplest solutions work. So here’s to Luke, who managed to help us screw our robot heads back on straight, and to Josh for reminding us to check our man pages.