PDF data woes

We do not provide these tables in Excel or CSV format. You will have to cut and paste from the pdf.

— A government group that provides a lot of data

If you’re going to provide a dataset to the public, or anyone for that matter, please don’t use PDF as your one and only format. At the very least, provide it in Excel. You can easily export spreadsheets to PDF. I don’t hold anything against the person who sent me this message. She was just doing her job. But organizations need to get with the times and provide data in a way that is actually usable.

Comments

I actually had this issue with a past employer. We had to take about 600 pages of a survey, spread across maybe a dozen waves, and copy and paste each line individually into Excel using the copy-table function. There is a way around this now, and we were able to export most of the PDFs almost instantaneously, with two or three exceptions. But it took us about 3-4 years to figure out that you can export a PDF into Microsoft Word. For the other surveys, we had to do a heck of a lot of copying and pasting.

I’ve dealt with this frustration so often lately, trying to get FOI data on behalf of journalists, that we’ve batted around the idea of making some kind of “How to get your data when stumped with a PDF” flowchart, one that would be partly for humour (xkcd-style: http://xkcd.com/844/ ) and partly for education …

Earlier this year I saw a tweet from a city agency boasting about the “staggering” amount of data they produced. But the document the tweet referenced was a PDF, and one that had obviously been produced from an Excel file in the first place. They were keen on sending around the PDF, and only after I prodded them a bit did they think to share the raw data (to my knowledge they still haven’t shared the underlying data files). Here’s my writeup of my PDF woes: http://wp.me/pBfcP-9M

The embarrassing thing is that the agency is in New York City, the self-avowed leading digital city in the US. I think New York still has quite a way to go before legitimately achieving that title.

I get this all the time working with state transportation agencies. It’s incredibly frustrating. The lack of knowledge individuals have about their own agencies’ data is mind-boggling. People actually tell me that they maintain the data in no other format! … Really, you only maintain your databases in PDF, really???

Do you know a program that works well every time? I’ve tried a few, and they always get tripped up on the headers and footers of a PDF document. Not to mention if the PDF was scanned; in that case, forget about it!

Having dealt with a lot of government agencies and other sources that send data only as PDF, I’ve tried a bunch of different methods and programs to convert them into some sort of actual data. I’ve had the best luck with Able2Extract, though depending on the format of the PDF, the layout of the table, and the structure of the data, it may or may not do the trick.

Nathan,
There are a few reasons for this: government-types convert data to .pdf files in order to create an electronic equivalent of a paper file (which is what most bureaucrats would rather give you), in a commonly accessible format that can be emailed or posted on a web 1.0 site. Government doesn’t trust *anybody* not to alter the original file and claim it to be the original data. It is waaaaaaay beyond the technical capabilities of most people in the government workforce to tag a file with appropriate metadata or sign a file (using their government-issued key).

About three years ago, I did a side job with a TV station that wanted to catalog information on daycare facilities around their city, which spanned three counties. Two counties provided CSV files; one provided PDF. I got hired to extract and tabulate 900 PDF reports, which all had pretty much the same layout, by converting the PDFs into text files and then using a mess of regular expressions to extract the data into tabular form. It was a fun exercise for me that got me some coffee money (I don’t do this regularly; I was a friend of someone there who knew I had done something similar before for my own personal stuff) and some more experience in how to handle problems like this that shouldn’t exist but do.
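The approach the commenter describes can be sketched roughly like this. The field labels and layout below are invented for illustration (the actual daycare reports aren’t shown), but the pattern is the same: convert each PDF to text, then pull labeled values out with anchored regular expressions, one record per report.

```python
import re

# Hypothetical layout: after a pdftotext-style conversion, each report
# has labeled lines like "Facility Name: ...". These labels are
# assumptions for illustration, not the real reports' wording.
SAMPLE = """\
Facility Name: Little Sprouts Daycare
License #: 04-1234
Capacity: 45
Address: 12 Oak St, Springfield
"""

FIELDS = {
    "name": re.compile(r"^Facility Name:\s*(.+)$", re.M),
    "license": re.compile(r"^License #:\s*(\S+)$", re.M),
    "capacity": re.compile(r"^Capacity:\s*(\d+)$", re.M),
    "address": re.compile(r"^Address:\s*(.+)$", re.M),
}

def parse_report(text):
    """Pull one tabular record out of a plain-text report dump."""
    row = {}
    for key, pattern in FIELDS.items():
        m = pattern.search(text)
        row[key] = m.group(1).strip() if m else None
    return row

row = parse_report(SAMPLE)
print(row)
```

With 900 near-identical reports, you loop `parse_report` over the converted files and write the rows out as CSV; the regexes only need adjusting for the handful of reports whose layout differs.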

I second Stephen’s comment. I went through a six month FOIA battle over aid data, and the resulting document was five pages, which was formatted in such a manner that it had *clearly* just been exported from .xls, and I’d put 2:1 odds on a wager that they were trying to make it more difficult to analyze.

First convert the PDF to XML with http://poppler.freedesktop.org/ . Then parse the XML and load up a table data structure. Sort the text boxes by position on the page, because otherwise they are likely to be out of order. Don’t get fooled by the fonts. For example, a math font has the Greek letter mu as an m. In my data, this meant that the numeric prefix for micro was read as milli, causing an error of a factor of 1000! Keep track of the font in order to watch out for this.
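A minimal sketch of the parse-and-sort step, assuming the XML shape that poppler’s `pdftohtml -xml` produces (one `<text>` element per text box, with `top`/`left` pixel coordinates and a `font` id). The sample XML below is a stub standing in for real converter output; note that the boxes arrive out of reading order, and that the font id is carried along so you can check for the math-font mu-as-m problem the commenter warns about.

```python
import xml.etree.ElementTree as ET

# Stub of pdftohtml -xml output: text boxes deliberately out of order.
XML = """\
<pdf2xml>
  <page number="1" height="792" width="612">
    <text top="200" left="50" font="0">micro</text>
    <text top="100" left="50" font="0">Value:</text>
    <text top="100" left="120" font="1">3.2</text>
  </page>
</pdf2xml>
"""

root = ET.fromstring(XML)
boxes = []
for page in root.iter("page"):
    pageno = int(page.get("number"))
    for t in page.iter("text"):
        # Keep the font id: a math font may render the Greek mu as "m",
        # so units like "micro" need the font checked before trusting them.
        boxes.append((pageno, int(t.get("top")), int(t.get("left")),
                      t.get("font"), "".join(t.itertext())))

# Restore reading order: by page, then vertical, then horizontal position.
boxes.sort(key=lambda b: (b[0], b[1], b[2]))
ordered = [b[4] for b in boxes]
print(ordered)  # ['Value:', '3.2', 'micro']
```

From the position-sorted boxes you can then group rows by shared `top` values and columns by `left` values to rebuild the table.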

PDF appears to be the native file format for Adobe Illustrator.

As a last resort you can print and OCR. Accuracy from a fresh printout on decent paper is not too bad. Much better than you get from an old book.

If there is much interest from this, I can write up more about it and put it on tomacorp.

Henceforth, every FOIA-type boilerplate request should include the phrase “…in the original format used to create the file(s)” and “Please tell us the program or application — and the version of those programs — used to create the original file(s).”