Tracking Human Migration Through Archives and Digital Curation

Battle of the T’s: Transcription and Technology by Scott Harkless

A few months ago, when I approached Dr. Heger about getting involved with the Overseas pensions project, my heart was full of ambition and my head was full of all manner of strange and altogether wrong ideas about what actually goes into making material available and useful in digital form. I had been fooled by the idea that digitization and simple technology would make most of my job easy.

So to cure me of this notion I began a transcription of the document “List of Pensioners on the roll January 1, 1883; giving the name of each Pensioner, the cause for which Pensioned, The Post Office Address, The Rate of Pension per Month, and The Date of Original Allowance, as called for by Senate Resolution of December 8, 1882. Volume V”. Yes, that huge thing is the name of a single document. It contained the name and location of every single American, living domestically or overseas, who was receiving a pension from the United States government. The document was hundreds of pages long; luckily, I only had to transcribe 26 pages or so.

Those 26 pages were enough to list all the overseas pensioners; by this I mean the soldiers, widows, wives, dependents, and others receiving a pension outside of the United States in that year. The pages also listed their locations, their wounds, their pay, their serial numbers, their dates of allowance, and quite a bit of other information. In total, that one section of the document held records for 1582 people.

Transcribing the document, first and foremost, takes endurance. I found that I could transcribe a page in anywhere from half an hour to an hour, depending on the complexity of the page. For 26 pages, therefore, I would need between 13 and 26 hours of effort. The longer I worked on a section, the more likely I was to make a mistake: to overlook a name or duplicate a number from a previous entry. When this happened I would have to spend additional time going painstakingly through the document until I found and corrected my error.
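For anyone planning a similar project, the estimate above is simple enough to sketch in a few lines of Python. This is only a rough illustration: the function name and its parameters are my own invention, and the per-page rates are just the half hour to an hour I experienced, which may not hold for other documents.

```python
def transcription_hours(pages, min_hours_per_page=0.5, max_hours_per_page=1.0):
    """Return a (low, high) estimate of the hours needed to transcribe `pages` pages.

    The default rates are assumptions taken from one person's experience
    with one printed pension roll; adjust them for your own material.
    """
    return pages * min_hours_per_page, pages * max_hours_per_page


low, high = transcription_hours(26)
print(f"Estimated effort: {low:g} to {high:g} hours")
```

Note that this counts only the first pass. It leaves out the time spent hunting down and correcting mistakes, which in my experience is not trivial.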

I was not the only one making errors, though. The original printers of this document sometimes misspelled words, misprinted a letter, or used unusual truncations of words, and not always the same truncation within a single page (the number of ways to indicate the word “left”, for example, is astounding). The difficulty of telling apart 6, 9, and 0, for example, meant that in some cases I had to give my best guess because I could not actually read the digit.

Some sections were easier because I was familiar with the names and the content. If, for example, I was working in the Germany section of the pensioners list and I saw the partially faded word “Sch—swig”, I would know it was Schleswig. I was not as familiar with Welsh place names and personal names, for example, and might need to spend additional time looking them up if I couldn’t read them from the document.

Some archives and cultural heritage organizations (such as our own DCIC) have access to OCR (Optical Character Recognition) software, which automates the transcription process by digitally analyzing the image and identifying the letters. Some software, such as ABBYY FineReader, is quite sophisticated. Yet such software can be too expensive for many archives, and it has its limitations (although that’s a topic for another blog post). For example, hard-to-read or handwritten notes are notorious for being impervious to OCR. With that in mind, remember that even with the best technology available, some of the metadata we will be searching will only be available if a person is willing to brew a pot of coffee, set up the document on a separate screen, and risk a little arthritis to type it all up.