How I parse PDF files

Much of the world’s data are stored in Portable Document Format (PDF) files.
This is not my preferred storage or presentation format, so I often convert
such files into databases, graphs, or spreadsheets.
I sort of follow this decision process.

Do we need to read the file contents at all?

Do we only need to extract the text and/or images?

Do we care about the layout of the file?

Example PDFs

I’ll show a few different approaches to parsing and analyzing
these PDF files
(also available here).
Different approaches make sense depending on the question you ask.

These files are public notices of applications for permits to dredge or fill
wetlands. The Army Corps of Engineers posts these notices so that the public
may comment on the notices before the Corps approves them; people are thus
able to voice concerns about whether these permits would fall within the rules
about what sorts of construction are permissible.

These files were
downloaded daily
from a no-longer-available version of the
New Orleans Army Corps of Engineers website
and renamed according to the permit application and the date of download.
They fed into this program, which was primarily
used by the Gulf Restoration Network in their efforts to protect the wetlands
(until the Army site changed and we never updated the system).

If I don’t need the file contents

Basic things like file size, file name and modification date might be useful
in some contexts. In the case of PDFs, file size will give you an idea of how
many of the PDFs are mostly text and how many are mostly images.

Let’s plot a histogram
of the file sizes. I’m running this from the root of the documents repository,
and I cleaned up the output a tiny bit.
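Something in this spirit does it, binning the sizes into 20 kB buckets and
counting the files in each bin (a sketch; it assumes the PDFs sit in the
current directory):

ls -l *.pdf | awk '{ print int($5 / 20000) * 20 " kB" }' | sort -n | uniq -c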

The histogram shows us two modes. The smaller mode, around 20 kB, corresponds to
files with no images (PDF export from Microsoft Word), and the larger mode
corresponds to files with images (scans of print-outs of the Microsoft Word
documents). It looks like about 80 are just text and the other 170 are scans.

This isn’t a real histogram, but if we’d used a real one with an interval scale,
the outliers would be more obvious. Let’s cut off the distribution at 400 kB
and look more closely at the unusually large documents that are above that
cutoff.
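find can list those outliers directly:

find . -name '*.pdf' -size +400k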

It might actually be fun to relate these variables to each other. For
example, when did the Corps upgrade from PDFMaker 9.1 to PDFMaker 10.1?
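pdfinfo, also part of poppler-utils, reads the document-information fields
without touching the page contents; Creator and Producer are where strings
like “PDFMaker 9.1” usually land, and the file names encode the download
dates, so a loop like this (a sketch) would get us most of the way:

for file in *.pdf; do
    echo "$file $(pdfinfo "$file" | grep -E '^(Creator|Producer):')"
done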

Anyway, we got somewhere interesting without looking at the files. Now let’s
look at them.

If messy, raw file contents are fine

The main automatic processing that I run on the PDFs is a search for a few
identification numbers. The Army Corps of Engineers uses a number that starts
with “MVN”, but other agencies use different numbers. I also search for two
key paragraphs.

My approach
is pretty crude. For the PDFs that aren’t scans, I just use pdftotext,
which is part of poppler-utils.

# translate
pdftotext "$FILE" "$FILE"

Then I just use regular expressions to search the resulting text file.
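For example, a search in this spirit pulls out candidate permit numbers
(I’m improvising the pattern; the real ones live in the translate script):

grep -o 'MVN[-0-9A-Z]*' "$FILE.txt"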

pdftotext normally screws up the layout of PDF files, especially when they
have multiple columns, but it’s fine for what I’m doing because I only need to
find small chunks of text rather than a whole table or a specific line on
multiple pages. You can try pdftotext -layout if you need to preserve more
of the layout.

pdftotext -layout file.pdf

As we saw earlier, most of the files contain images, so I need to run OCR.
Like pdftotext, OCR programs often mess up the page layout, but I don’t
care because I’m using regular expressions to look for small chunks.

I don’t even care whether the images are in order; I just use pdfimages
to pull out the images and then tesseract to OCR each image and add that
to the text file. (This is all in the
translate
script that I linked above.)
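The relevant part of that script looks roughly like this (a sketch rather
than the script itself; the -png flag needs a reasonably recent poppler):

pdfimages -png "$FILE" "$FILE-image"
for image in "$FILE-image"*.png; do
    tesseract "$image" stdout >> "$FILE.txt"
done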

If I care about the layout of the page

If I care about the layout of the page, pdftotext probably won’t work.
Instead, I use pdftohtml or inkscape. I’ve never needed to go deeper,
but if I did, I’d use something like
PDFMiner.

pdftohtml

pdftohtml, also part of poppler-utils,
is particularly useful because of its -xml flag.
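For example, this writes file.xml next to the input, tagging each chunk of
text with its position on the page:

pdftohtml -xml file.pdf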

One of the things that I try to extract is the “CHARACTER OF WORK” section.
I do this with regular expressions, but we could also do this with the XML.
Here are some XPath selectors that get us somewhere.

# This is Python
from lxml import etree
pdf2xml = etree.parse('file.xml')  # the XML that pdftohtml -xml wrote
print(pdf2xml.xpath('//text/b[text()="CHARACTER OF WORK"]/../text()'))
print(pdf2xml.xpath('//text/b[text()="CHARACTER OF WORK"]/../following-sibling::text/text()'))

Inkscape

Inkscape can convert a PDF page to an SVG file. I have a
little script that runs this across
all pages within a PDF file. You can also install it from NPM.

npm install -g pdf2svg

Once you’ve converted the PDF file to a bunch of SVG files, you can open them
with an XML parser just like you could with the pdftohtml output, except
this time much more of the layout is preserved, including the groupings of
elements on the page.

Here’s a snippet from one project where I used Inkscape to parse PDF files.
I created a crazy system for receiving a very messy PDF table over email and
converting it into a spreadsheet that is hosted on a website.

This function contains all of the parsing functions for a specific page of
the PDF file once it has been converted to SVG. It takes an
lxml.etree._ElementTree object like the one we get from lxml.etree.parse,
along with some metadata. It runs a crazy XPath selector (determined only after
much test-driven development) to pick out the table rows, and then runs a bunch
of functions (not included) to pick out the cells within the rows.
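In shape it was something like this (heavily simplified; the real row
selector was much hairier, and parse_cells stands in for the cell-parsing
functions that I’m leaving out):

def parse_page(svg, metadata):
    # svg is an lxml.etree._ElementTree parsed from one page's SVG file.
    ns = {'svg': 'http://www.w3.org/2000/svg'}
    # Stand-in for the real, much more specific row selector.
    rows = svg.xpath('//svg:g[svg:text]', namespaces=ns)
    table = []
    for row in rows:
        # string() flattens the row and all of its descendants to plain text.
        text = row.xpath('string()')
        table.append(parse_cells(text, metadata))
    return table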

I’d like to point out the string() XPath function. It converts the current
node and its descendants into plain text; it’s particularly nice for
inconsistently structured files like this one.

Optical character recognition

People often think that optical character recognition (OCR) is going to be
the hard part. It might be, but it doesn’t really change this decision process.
If I care about where the images are positioned on the page, I’d probably
use Inkscape. If I don’t, I’d probably use pdfimages, as I did here.

I prefer the approaches earlier in this article when the parsing is less
involved, because the tools do more of the work for me. I prefer the ones
towards the end as the job gets more complex, because those tools give me
more control.

If I need OCR, I use pdfimages to extract the images and tesseract to run
OCR. If I needed to run OCR and know more about the layout, I might convert
the PDFs to SVG with Inkscape and then take the images out of the SVG in
order to know more precisely where they sit in the page’s structure.
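To sketch that last idea: Inkscape writes each embedded image as an
svg:image element, usually with the bitmap base64-encoded in an xlink:href
attribute, so the position falls out of the element’s attributes (or its
transform). The file name below is a placeholder.

from lxml import etree

svg = etree.parse('page-1.svg')  # one page of the PDF, converted by Inkscape
ns = {'svg': 'http://www.w3.org/2000/svg'}
for image in svg.xpath('//svg:image', namespaces=ns):
    # The position may sit in x/y or hide in a transform attribute.
    print(image.get('x'), image.get('y'), image.get('width'),
          image.get('height'), image.get('transform'))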

Broader ideas

PDF is a weird semi-proprietary binary format that can be annoying to read,
so my first step in reading PDF files is converting them to something that is
better supported and that I’m more used to. Once I’ve done that, I can mostly
ignore that the file was originally a PDF file.

There are all sorts of ways of encoding data in PDF files, so it’s not like
there’s a straightforward PDF-to-spreadsheet conversion. (This is just like
any other file format.) Figure out what data you want to extract from the
files, and select your parsing strategy accordingly.

And if it seems like too much work to get exactly what you want, try to come
up with something else that is easier to extract but still tells you most of
what you wanted to know.

Finally, it’s quite hard to convert data formats without losing information,
so don’t worry too much about it. You can always write a new parser
for any other information that you want.