For my final project, we are considering posting court cases on our
site, so I did some work today analyzing how best to convert the PDF
files the courts give us into HTML that people can actually use. I
looked briefly at Google Docs, since it has an amazing tool that
converts PDF files to something resembling text, but short of spending a
few days hacking the site, I couldn’t figure out an easy way to use
their technology in an automated way.

The other two tools I looked at today are pdftotext and pdftohtml,
which, not surprisingly, do what their names claim. Since we’re going to
be pulling cases
from the 13 federal circuit courts, I wanted to figure out which method
works best for which court, and which method will provide us with the
most generalizable solution across whatever PDF a court may crank out.

The short version is that the best option seems to be:

pdftotext -htmlmeta -layout -enc 'UTF-8' yourfile.pdf

This creates an HTML file with the text of the case laid out as well as
possible, some basic HTML metadata added, and UTF-8 as the output
encoding.
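
Since we will need to run this over a lot of files, here is a minimal
sketch of how the command above could be wrapped in Python. The flags
are exactly the ones shown above; the function names, the directory
layout, and the dry_run switch are my own assumptions for illustration,
not part of pdftotext.

```python
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, html_path):
    """Return the argument list for the pdftotext command from the post."""
    return [
        "pdftotext",
        "-htmlmeta",       # wrap the text in a simple HTML page with meta tags
        "-layout",         # keep the physical layout of the page
        "-enc", "UTF-8",   # output encoding
        str(pdf_path),
        str(html_path),    # explicit output name (hypothetical naming scheme)
    ]


def convert_directory(pdf_dir, out_dir, dry_run=False):
    """Convert every *.pdf in pdf_dir and return the commands that were built.

    With dry_run=True nothing is executed, which is handy for checking
    the commands before pdftotext is installed.
    """
    out_dir = Path(out_dir)
    commands = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        html_path = out_dir / (pdf_path.stem + ".html")
        cmd = build_pdftotext_cmd(pdf_path, html_path)
        commands.append(cmd)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            # check=True raises CalledProcessError if pdftotext fails
            subprocess.run(cmd, check=True)
    return commands
```

Calling convert_directory("pdfs", "html") would then write one HTML file
per PDF, assuming pdftotext is on the PATH.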

Before coming to this conclusion, though, I looked at two modes that
pdftohtml has. With the -c argument, it generates a ‘complex’ HTML
document that closely resembles the original. Without the -c argument,
it creates a simpler document. Although the complex documents are rather
impressive in appearance, they’re abysmal when it comes to the quality
of the HTML that is generated. For an example, look at the source code
for this file. If, on the other hand, the -c argument is omitted and the
simple documents are generated, the appearance of the final product is
worse than the plain text documents that pdftotext creates. Check out
this one for example.
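
To repeat this comparison on other PDFs, something like the following
sketch could generate all three versions side by side. The labels and
output names are hypothetical, and the files pdftohtml actually writes
for a given output name vary by version, so treat the paths as
assumptions and check the man page on your system.

```python
import subprocess


def comparison_commands(pdf_path, out_prefix):
    """Map a label for each approach to the command that produces it."""
    pdf = str(pdf_path)
    prefix = str(out_prefix)
    return {
        # Text laid out in a minimal HTML wrapper
        "pdftotext": ["pdftotext", "-htmlmeta", "-layout", "-enc", "UTF-8",
                      pdf, prefix + "-text.html"],
        # Simple HTML output (no -c)
        "pdftohtml-simple": ["pdftohtml", pdf, prefix + "-simple"],
        # 'Complex' HTML output that mimics the original layout
        "pdftohtml-complex": ["pdftohtml", "-c", pdf, prefix + "-complex"],
    }


def run_comparison(pdf_path, out_prefix):
    """Run each command and return its exit status (0 means success)."""
    results = {}
    for label, cmd in comparison_commands(pdf_path, out_prefix).items():
        results[label] = subprocess.run(cmd).returncode
    return results
```

Running run_comparison("case.pdf", "out/case") for one sample PDF per
court would give a set of files to eyeball, assuming both tools are
installed and the out directory already exists.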

For thoroughness, here is a table containing the results from this test.

A caveat regarding pdftotext: this tool is developed by a company called
Glyph & Cog. Although the code is open source, I couldn’t for the life
of me figure out how to file a bug against it. That doesn’t bode well
for relying on it as a dependency. On the flip side, Glyph & Cog is
happy to provide support for the product.