Character (not) Recognition

On October 3, 2015 by Michelle Warren

The great thing about experiments is that you never know how they’re going to turn out. Remix the Manuscript was sparked by a data visualization question: how should we display transcriptions of marginal notes? At some point someone suggested that it would be good to have a transcription of the text itself. The PDF file of the edition seemed like a good shortcut to jump-start the process. Just one problem: the Middle English characters thorn (þ) and yogh (ȝ) don’t convert smoothly into recognizable characters.
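For readers curious what that problem looks like in practice, here is a minimal Python sketch of one way to check a piece of extracted text for thorn and yogh. It is an illustration, not the process the project used: the function names and the sample line are hypothetical, and it assumes the extraction tool marks unreadable characters with the Unicode replacement character (U+FFFD), which not every tool does.

```python
import unicodedata

# Thorn and yogh, lower and upper case (U+00FE, U+00DE, U+021D, U+021C).
SPECIAL_LETTERS = "þÞȝȜ"


def count_special_letters(text):
    """Count how often each thorn/yogh character appears in text."""
    return {unicodedata.name(ch): text.count(ch) for ch in SPECIAL_LETTERS}


def count_replacement_chars(text):
    """Count U+FFFD, which some tools emit for characters they cannot read."""
    return text.count("\ufffd")


# Hypothetical extracted line: one thorn and one yogh survived, one was lost.
sample = "þe knyȝt rode \ufffde way"

for name, count in count_special_letters(sample).items():
    print(name, count)
print("unreadable:", count_replacement_chars(sample))
```

A low count of thorns and yoghs next to a high count of unreadable characters would suggest that the conversion, rather than the edition, is dropping the letters.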

People who study born-analog documents have to grapple with any number of technical translations before they have any digital data to work with. The printed book might be digitized, but the page images have to be processed with optical character recognition (OCR) before the text can be searched. If the letters are fuzzy or imprecise for any reason, that conversion results in lots of unreadable characters. In the case of manuscripts from any time period, the conversion of idiosyncratic handwriting into searchable characters is still largely out of technical reach. In short, in the humanities the data are not there to be found: they have to be made.