Breaking out of the Digital Graveyard

In 1973, a fire broke out at the St. Louis National Personnel Records Center, destroying 16 to 18 million military service records from 1912 to 1964. If these records had been digitized, they would have been safe, but not necessarily any more accessible.

Scanned PDF images, the low-cost, high-speed method for digitizing paper records, can be duplicated and stored in many places. But nothing in them can be found unless a human being searches through the handwritten text by eye. And the 1940 U.S. Census, for example, consists of 3.6 million PDF images.

Commercial services like Ancestry.com employ thousands of human workers who manually extract the meaning of a small, profitable subset of these images so they can be searched by computer, says Kenton McHenry of the National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign. "But government agencies just don't have the resources" necessary to make most of them accessible in this way. The danger is that scanned images may become a digital graveyard for many historically, culturally and scientifically valuable old documents: we know where they are, but they're impossible to resuscitate.

Liana Diesendruck, an NCSA research programmer, McHenry and colleagues in his lab have used a number of XSEDE resources to begin cracking this formidable problem. Using the 1940 Census as a test case, they have created a framework for automatically extracting the meaning from these images—in essence, teaching machines to read cursive script.

The team used the Steele supercomputer at Purdue University and Ember at NCSA for much of its initial data processing. NCSA's Mass Storage System (MSS) held the initial data set, and the group moved the data to the Illinois Campus Cluster when the MSS was retired. In their most recent work, the researchers used Pittsburgh Supercomputing Center's (PSC's) Blacklight supercomputer, whose large shared memory enabled their search system to enter alpha testing. The group has now also begun storing data on PSC's Data Supercell.

Throughout, the researchers made use of XSEDE's Expanded Collaborative Support Service (ECSS) to optimize the performance of these resources.

Teaching Machines to Read

"Before we could even think about extracting information, we had to do a lot of image processing," says Diesendruck. Misalignments, smudges and tears in the paper records had to be cleaned up first. But the difficulty of that process paled compared with the task of getting the computer to understand the handwritten text.

It's relatively simple to get a computer to understand text that is typed in electronically. It knows that an "a" is an "a," and that the word "address" means what it does. But early Census entries were made by many human workers with different handwriting and styles of cursive script. These entries can be difficult for humans to read, let alone machines.

Having the computer deconstruct each handwritten word, letter by letter, is impossible with today's technology. Instead, the investigators had the computer analyze the words statistically rather than trying to read them. Factors such as the height of a capital "I," the width of the loop in a cursive "d" and the number of degrees by which the letters slant from the vertical all go into a 30-dimensional vector, a mathematical description consisting of 30 measurements. These measurements constitute a kind of address that the computer can use to match words it knows with ones it doesn't.
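As a rough illustration of the matching idea described above, the sketch below treats each word image as a fixed-length list of measurements and compares two words by the distance between their vectors. The three-measurement vectors, the function name and the choice of Euclidean distance are assumptions made for illustration only; the article does not say which distance measure the NCSA system uses, and the real vectors hold 30 measurements.

```python
import math


def vector_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3-measurement vectors (the system described uses 30):
# [capital height, loop width, slant in degrees]
known_word = [12.0, 4.0, 15.0]
similar_word = [12.0, 4.0, 18.0]    # same shape, slightly more slant
different_word = [8.0, 4.0, 12.0]   # shorter capitals, less slant

# A smaller distance means the two word images look more alike.
print(vector_distance(known_word, similar_word))    # 3.0
print(vector_distance(known_word, different_word))  # 5.0
```

In a scheme like this, the computer never needs to know what a word says; it only needs to find stored vectors that sit close to the vector of the query.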

PSC's Blacklight proved ideal for the task, McHenry says. Part of the computational problem consists of crunching data from different, largely independent entries as quickly as possible. Blacklight, while not as massively parallel as some supercomputers, has thousands of processors to do that job. More importantly, Blacklight's best-in-class shared memory let the team easily store the relatively massive amount of data their system had extracted from the Census collection: a 30-dimensional vector for each word in each entry. This allowed the calculations to proceed without many return trips to the disk. Eliminating the lag of retrieving data from disk made the calculations run far faster than would have been possible on other supercomputers.

"Good enough" accuracy: But scalable!

The system can retrieve word matches despite the idiosyncrasies of the handwriting. Of course, as in any computer-vision-based system, it also returns incorrect results. The idea is to produce quickly a "good enough" list of 10 or 20 entries that may match a person's query, rather than taking far longer to try to make it exact.
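The "good enough" short list described above can be sketched as a top-k search: score every stored vector against the query and keep only the closest few. This is a minimal sketch under stated assumptions, not the group's implementation. The entry labels, the two-measurement vectors and the function name are hypothetical, and Euclidean distance is assumed because the article does not name the measure used.

```python
import heapq
import math


def top_matches(query, index, k=10):
    """Return the k closest entries as (distance, label) pairs, nearest first.

    `index` is a list of (label, vector) pairs held in memory.
    """
    scored = ((math.dist(query, vec), label) for label, vec in index)
    return heapq.nsmallest(k, scored)


# Hypothetical stored entries with 2-measurement vectors (the real ones have 30).
index = [
    ("entry-a", [1.0, 1.0]),
    ("entry-b", [4.0, 5.0]),
    ("entry-c", [1.0, 2.0]),
    ("entry-d", [10.0, 10.0]),
]

results = top_matches([1.0, 1.0], index, k=2)
for distance, label in results:
    print(label, distance)
```

The list is ranked, not verified: some of the returned entries may be wrong, and it is left to the person searching to pick out the right ones.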

"We get some results that aren't very good," Diesendruck says. "But the user clicks on the ones he or she is looking for. It isn't perfect, but instead of looking through thousands of entries you're looking at 10 or 20 results."

Search engines like Google have made Web users very demanding in terms of how much time a search takes. But while users expect speed, they don't expect extreme precision: they don't tend to mind scanning short lists of possible answers to their query. So the script search technology is similar to what people are used to seeing on the Web, making it more likely to be accepted by end users.

There's another virtue to how the system works, McHenry points out. "We store what they said was correct," using the human searcher's choice to identify the right answers and further improve the system. Such "crowd sourcing" allows the investigators to combine the best features of machine and human intelligence to improve the output of the system. "It's a hybrid approach that tries to keep the human in the loop as much as possible."
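One way to picture this human-in-the-loop step is a small store that remembers which results searchers confirmed and moves those entries to the front of later result lists. The article does not describe how the stored choices are actually fed back into the system, so the class, its method names and the re-ranking rule below are all assumptions made for illustration.

```python
from collections import Counter


class FeedbackStore:
    """Remembers which entries searchers confirmed for each query (hypothetical)."""

    def __init__(self):
        self._confirmed = Counter()

    def record_click(self, query, entry_id):
        """Store one searcher's confirmation that entry_id answers the query."""
        self._confirmed[(query, entry_id)] += 1

    def rerank(self, query, candidates):
        """Put the most-confirmed entries first.

        The sort is stable, so entries nobody has confirmed keep the order
        the machine matcher gave them.
        """
        return sorted(candidates, key=lambda e: -self._confirmed[(query, e)])


store = FeedbackStore()
store.record_click("smith", "entry-c")

machine_order = ["entry-a", "entry-b", "entry-c"]
print(store.rerank("smith", machine_order))
```

Here the machine supplies the candidate list and the human choices only reorder it, which keeps the two sources of judgment separate.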

Today the group is using Blacklight to carry out test searches of the 1940 Census, refining the system and preparing it for searching all sorts of handwritten records. Their work will help to keep those records alive and relevant. It will also give scholars studying those records—not just in the "hard" and social sciences, but also in the humanities—the ability to use and analyze thousands of documents rather than just a few.