Digital Textbases and Optical Character Recognition (OCR)

Experienced users of ECCO know about the limits of its full-text capability. The long s in eighteenth-century fonts is one of many peculiarities that can wreck an automated effort at optical character recognition (OCR). Though I’m grateful that I can search ECCO and other databases using full text, I often wonder how complete my search is. I usually get a sense of how many false hits I find, but how many true hits am I missing? How accurate are the full-text capabilities of these resources?

A recent article presents a method for assessing the accuracy of OCR, using the British Library’s 19th Century Newspaper Project as a case study.

The article briefly mentions Gale’s Burney newspapers project. One of the good points in this article concerns how we should measure accuracy:

Given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of “significant words” rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.

The term “significant words” refers to words that users are likely to search for, in contrast to function words (pronouns, prepositions, etc.). A textbase’s accuracy in terms of “significant words” is an appropriate yardstick for how useful its full-text search is.
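The arithmetic in the quoted passage can be made concrete. The sketch below is a hypothetical illustration (not from the Tanner article itself) of how character accuracy bounds word accuracy: in the worst case every character error lands in a different word, and in the best case the errors cluster into as few words as possible.

```python
import math

def word_accuracy_bounds(n_words, n_chars, char_accuracy):
    """Bound word-level accuracy given character-level accuracy.

    Worst case: each character error falls in a distinct word.
    Best case: errors cluster, spoiling as few words as possible.
    """
    errors = round(n_chars * (1 - char_accuracy))
    avg_word_len = n_chars / n_words
    worst = max(n_words - errors, 0) / n_words
    best = max(n_words - math.ceil(errors / avg_word_len), 0) / n_words
    return worst, best

# The article's example: 1,000 words, 5,000 characters, 90% character accuracy
lo, hi = word_accuracy_bounds(1000, 5000, 0.90)
# lo = 0.5 (500 correct words), hi = 0.9 (900 correct words)
```

As the article notes, the true figure lies somewhere between these bounds, and neither bound says anything about which words survive; if the errors happen to concentrate in significant words, search usefulness collapses even at 90% character accuracy.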

The full article merits reading. The authors found that significant word accuracy was 68.4% for the 19th Century Newspaper Project and 48.4% for the Burney Newspapers. Eighteenth-century newspapers can be astonishingly difficult to read even in the originals, so the lower percentage is not that surprising. I suspect that ECCO falls somewhere between these two figures.

I had thought Ian was referring to the Coda to his “Use and Misuse of EEBO” piece, but please do let us know if there’s another source, Ian.

The Tanner et al. article is quite interesting, and I was struck by its reminder that “proper nouns, names and place names are harder for OCR engines to cope with” (9). In terms of using the Burney collection online for research, I would imagine that this trio of significant words represents a large majority of words searched. I have not used Burney extensively, but I did find that if I searched common nouns associated with the proper names and places that were the focus of my search, I received far better results. I wonder if others have had the same experience?

One of the many interesting points in the Tanner article is the attempt to arrive at a more statistically precise assessment of accuracy in automated recognition technologies.

Eleanor’s question sounds like a good one: perhaps certain methodologies, such as searching for associated terms, yield a slightly more complete search. I would be interested in hearing more about how scholars use the Burney Collection.

What I appreciated about the accuracy assessment is its identification of specific problems that can then be addressed. This information also offers evidence to support and explain anecdotal experiences.

I use Burney to search for titles of works, how works are being advertised, business notices, bookseller/publisher activity, geographic information about commercial entities, partnerships among commercial entities, product placement, cultural announcements, and more.

I often start with proper names (of people, addresses, titles), and then I use the results of those searches to create new search strings. When I browse nearby dates within a title’s hits, I not infrequently find results that did not come up as hits in the actual search. This is a rather reductive description of my search strategies, but it offers the general idea of how I approach this tool.

That’s helpful, Eleanor. Thanks. It’s useful to see how scholars use these text-bases and to hear about their methods. I would also love to hear a library cataloguer talk about searching these text-bases.