Automatic Transcription in Colonial Contexts: OCR for the Primeros Libros

Abstract

The PDF images in the Primeros Libros digital collection, an effort to produce digital facsimiles of all books printed in the Americas before 1601, pose several challenges for Optical Character Recognition (OCR) systems. The Ocular system, designed by Taylor Berg-Kirkpatrick et al., jointly models the physical operation of hand-press printing and the language of the written document, allowing it to ‘learn’ to read early printed books. Ocular cannot, however, handle the orthographic variation and code-switching prevalent in the American context. Working with PDF images of trilingual texts in Spanish, Latin, and Nahuatl, we set out to modify Ocular for use on the Primeros Libros collection.
In this paper, we present our OCR tool for the Primeros Libros collection, an extension of Ocular that handles multilingual documents and includes an interface for incorporating orthographic idiosyncrasies. At the same time, we argue for a situated analysis of digitization tools, one that considers Ocular's statistical models within the context of the Primeros Libros collection. As Walter Mignolo has shown, books from early colonial Mexico embody a larger project of language codification that was deeply embedded in the colonization and religious conversion of New Spain. The mathematical simplicity of Ocular's statistical models suggests a neutral engagement with the text, but this apparent neutrality disguises a deep engagement with these colonial processes. Automatic transcription in this context therefore has significant implications for the ideological positioning of digitization projects.