In the previous guide, we describe several methods for turning PDFs into data usable for spreadsheets. However, those only handle PDFs that have actual text embedded within them. When a PDF contains just images of text, as they do in scanned documents, then the problem isn't just how to convert them into neat tabular data, but how to extract any text, period.

In this tutorial, we'll explain how to write a program to extract the data into tabular format. Here's an overview of the basic steps:

1. Determine the positions of the lines that divide the rows and columns on a page.

2. Break the image apart along those lines to create (hundreds of) individual image files, one for each cell.

4. Reassemble these (hundreds of) text files in the same order that you divided the main image, creating a (text) spreadsheet of the data.

A caveat: The code examples provided here are specific to the FlashPaper version of Eli Lilly's doctor payment disclosures, such as the black outlines of its table cells.

Background

Eli Lilly became the first major drug company to post its physician payments online in July 2009. However, Lilly was criticized in an April 2010 New York Times article for using a proprietary and discontinued format -- Adobe's "FlashPaper" -- which made the data virtually impossible to download or copy. In fact,
PharmaShine, a company that maintains a commercial database of physician payments, said it had to manually retype the entire list.

Eli Lilly disagrees with the Times' story's characterization that it had "purposely made its report impossible to download." In an e-mail to ProPublica, Lilly spokesman J. Scott MacGregor called the characterization "misleading, as it was never our intention to make it difficult for people to access information" and said that preserving integrity of the data was the reason for not making it downloadable initially.

In any event, Lilly now provides their data as a PDF, with copyable text. However, when we began work on Dollars for Docs, Lilly was only providing data in the FlashPaper format. We came up a way to download the file and programmatically to extract tabular data from it.

Downloading the FlashPaper Document

The Safari browser's Activity window also will reveal the files downloaded to your browser; data.swf is what we want.

Download the file and open it in your web browser. There's a printer icon in the top-right. Click on it and choose PDF as the output. You'll end up with a PDF weighing over 200MB. When you open it up, you'll see that it appears to be a normal table of text. But you won't be able to highlight-and-copy (which is also the case for secure PDFs), and saving as text will create an empty file.

So let's convert it to the TIFF image format, the only format Tesseract can read. You can do this either with Adobe Acrobat Pro's Export function, or my colleague Jeff Larson's pdf-splitter:

Click the printer icon, circled in red, and print the file as a PDF.

pdf-splitter data.pdf.pdf '%d.tif' 2500

This will result in .tif files for every page in the original PDF.

Step 1: Reading Lines With RMagick

RMagick is a library for Ruby that allows us do a variety of graphics operations in our program, such as changing an image's color. In fact, that's the first thing we want to do because Tesseract works best with black-and-white images. This is done using RMagick's quantize method. We also can reduce the gray boosting the image's contrast.

We had limited success doing this programmatically and so just used a Photoshop batch operation to get the desired black-and-white contrast. For the purposes of this tutorial, you can use this sample black-and-white excerpt: table-to-ocr.tif

require 'rubygems'
require 'rmagick'
img = Magick::Image::read('1.tif').first
# Your mileage will vary with this following method; we ended up just using a Photoshop batch job to correctly produce a
# black-and-white version
img = img.sigmoidal_contrast_channel(5,40,true).quantize(16, Magick::GRAYColorspace).posterize(5)
# write out the result just to see what it looks like
img.write('bw-1.tif'){|f| f.depth = 8}
# the block sets the saved image to a depth of 8-bits

Detecting the lines in the .tif file simply involves finding the lines in which all the pixels are non-white (some may be gray, which you'll see if you zoom in at the pixel level). This can be done with using RMagick's get_pixels method on every horizontal and vertical line; get_pixels returns an array of pixels within the boundaries we specify.

RMagick's Pixel class has red, blue, and green attributes. Examining the red, blue, and green values of white pixel should give you 65535 for each; a black pixel will return the value 0. Gray pixels are anywhere in between.

So first we crop the image to remove the white space surrounding the table (using bounding_box). Then we examine each pixel of a line and record the positions where every pixel in that line had color values less than a dark gray (63000 seems to be enough tolerance):

Step 2: Breaking It Into Pieces

Now that we have two arrays defining rows and columns, we iterate through each one and call RMagick's constitute method, which creates new images based on the dimensions we provided it. We then write each image to a file named column_numberxrow_number.tif:

You should end up with a directory called cell-files with nearly 500 TIFF files in it.

Step 3: Tesseract Each Image

Tesseract is a free optical character recognition program, first developed by HP and now maintained as open-source software by Google. Its operation is simple: point it to an image file, and it produces a text file with what it interprets as text from that image. So, in the above code, we simply run Tesseract on each TIFF as it is created. Add this to the above code, after the constitute call:

`tesseract /cell-files/#{j}x#{i}.tif /cell-files/#{j}x#{i}.txt `

Now you should have nearly 500 text files in cell-files.

Step 4: All Together Now

In the previous code, we're essentially stepping through the image column by column, line by line. While we're in this loop, we might as well record each of the text files' content into one master text file. If we add a delimiting character each time, we end up with tabular data.

So, in the previous block of code, open a text file called "1-table.txt" and after the tesseract call, write the contents of that tiny text file into "1-table.txt." The combined code for Steps 2-4 is:

Cleanup Time

If that seemed too easy, it was. Open the resulting text file in a spreadsheet:

Tesseract's imperfect translation of the images we sent it.

Tesseract isn't perfect, and on the first pass it may mistranslate many characters, especially ones that look similar to another, such as 'O' and '0' (zero). You can train the Tesseract engine, though it's tedious and involves giving it a character-by-character correction of a sample test.

Even then it won't be perfect. Your next step will be to determine the best way to clean this data. You could start by validating each column against what you know it should contain (such as a currency format). Or you could design a Mechanical Turk task in which you send an individual text file and TIFF for each cell and ask workers to perform a simple verification. You could even write your own Rails application to display the images and text-values side by side, so that your co-workers can collaboratively do the verification (this is what we did until Lilly released their data as PDFs).

Again, the code here is specific to Lilly's format and may not be as successful on a scanned document where, for example, the lines aren't as easy to determine.

We hope to craft a more generalized version of this guide in the near future. We'll call the project "Tableract" for now.

5 comments

Nice guide. All of what you describe can be done with GNU/Linux (e.g., Debian, Ubuntu, Fedora, etc.), oftentimes using the same or similar tools. You might want to add Unpaper, which is a nice command line tool for cleaning up images of scanned text before running Tesseract.

In splitting apart the pdf, Harris Corner recognition is great as well. Open CV can perform this. Box corners usually are easy to segment by finding the corners, and following the ‘line’ from them or finding basic stats. Clustering works well with the stats.

I’m wondering, really wondering: your organisation seems to be extremely tech-savvy in doing what it does. I applaud all your efforts!

However, you do not seem to ever have heard of *Ghostscript*?! Ghostscript does not only run on ‘Mac OS X Snow Leopard’, but on all OSX versions. And on all Windows versions. And on all Linux versions. And on all Unix versions….

It is easy to convert any PDF into a TIFF (G4, or colored TIFFs if you want). For the case of TIFF G4, just run:

gs -o myfile-page-d.tif -sDEVICE=tiffg4 myfile.pdf

Each page of the PDF will be a separate TIFF, named `myfile-page-001.tif`, `myfile-page-002.tif`, etc.

Should you want a multi-page TIFF instead, just use as output name `myfile.tif` ...