Wednesday, July 4, 2012

OCRing typescript: A benchmarking test with PRImA

During our pilot phase, we plan to add over 1 million images of archives to the Wellcome Library website, all created during the 20th century. Much of this is comprised of typescript: letters, drafts of papers and articles, research notes, memos, and so on. Key to improving discoverability of these papers is OCR (optical character recognition), which allows us to encode the words as text and include them in a full-text index. We set out to test whether typescript material would provide accurate OCR results that could be included in our index.

OCR technology

OCR works by segmenting a block of text to the individual character level and then comparing the patterns to a known set of characters in a wide range of typefaces. The accuracy of character recognition relies on the source information having clear-cut, and common, letter forms. The accuracy of word recognition relies on both character recognition, and the availability of comprehensive dictionaries for comparison. OCR software can compare these words to dictionaries to enhance word recognition, and also to estimate accuracy rates (or levels of confidence).

When it comes to good quality, clearly printed text, OCR can be extremely accurate even without any human intervention - with rates of 99% or higher for modern printed content (less than one word out of one hundred words having at least one inaccurate character), reducing as you recede in time to 95% for 1900 - 1950 printed material, and lower for 19th century material (see Tanner, Munoz and Ros, 2009). For some formats, OCR is worse still - as the document mentioned above shows, 19th century newspapers may only reach 70% significant word accuracy (words not including "stop" words such as definite/indefinite articles, and other non-search terms).

Typescript testing

Regarding our archival collections, there is a wide range of content that is theoretically OCR'able. Some will OCR very well, such as professionally printed matter. But much non-handwritten content is by the nature of the age of this material in a typescript form. We had no idea how well this type of content would OCR. To find out we commissioned Apostolos Antonacopoulos and Stefan Pletschacher based at the University of Salford and members of PRImA (Pattern Recognition and Image Analysis Research) to do a benchmarking exercise from which we could determine whether we could rely on raw OCR outputs, should not OCR this type of material at all, or to test various methods to improve OCR'ability (such as post-processing of particular images).

Apostolos and Stefan chose a selection of 20 documents from a larger sample we provided originating from our Mourant and Crick digitised collections. These 20 documents where manually transcribed using the Aletheia groundtruthing tool for comparison to the output of three OCR engines, Abbyy FineReader Engines 9 and 10 and Tesseract, open source OCR software.

The results of the OCR benchmarking test show that original, good quality typescript content can reach up to 97% significant word accuracy with Abbyy Fine Reader Engine 10 (such as this example below):

At the bottom end, carbon copies with fuzzy ink can result in virtually 0% accuracy in any OCR software:

What was pleasantly surprising was how the average-quality and poorer content fared. On the better end of the scale we have 93% accuracy despite some broken characters:

And here we have poorer quality typescript producing 72% accuracy with many faint and broken characters:

The average rate for 16 images of good to poorer quality typescript is 83% significant word accuracy (excluding the carbon copies).

Accuracy levels are reported here according to the results from Abbyy FineReader Engine 10 on the "typescript" setting. The reported accuracy rates covers all the visible text on the page including letterheads, pre-printed text such as contact details, text overwritten by manual annotations and so on. Naturally, errors are more likely to occur in these areas, which (except in the case of text overwritten by manual annotations) are not of much significance in terms of indexing and discoverability. Further tests would be required to determine what the accuracy rate is with these areas excluded. For example, it may be possible to digitally remove the annotations in this draft version of Francis Crick and James Watson's "A Structure for DNA" to raise the overall accuracy rate (currently only 45%):

There is some variation between FineReader 9 and 10 where one or the other may have a small advantage with a few cases showing as much as a 30% difference. Overall, there is only 1% difference when looking at averages between 9 and 10. Tesseract, on the other hand, was far less accurate especially for the poorer quality typescript (roughly half as accurate overall).

Develop a workflow that would divert images down different paths depending on content ("typescript" path, FineReader 9 or 10, enhancements to be applied or not applied, etc.)

We may find that an average of 83% word accuracy overall is perfectly adequate for our needs in terms of indexing terms and allowing people to discover content efficiently. Further investigation is required, but this report has given us a good foundation from which to press on with our OCR'ing plans.

These digital collections are not yet available online, but will be accessible from autumn 2012.

Robkirk, FineReader was selected as the typical OCR engine used by most suppliers; it and Tesseract are also the two the PRImA team have most experience with. Having the groundtruth means that we could of course compare other OCR technologies to these in future if required.

John, I'm glad you found the report interesting! We are not planning to transcribe the handwritten content - we are looking at hundreds of thousands of handwritten pages, and our budget just does't stretch to that! However we are keeping our eye on crowdsourcing developments, in case this becomes a feasible option in future.

Would it be an idea to step right back and compare the time and costings for manual transcription with those of staffing and running an OCR unit? The Thesaurus Linguae Graecae project, that has been creating machine-readable texts since 1972, has regularly re-evaluated OCR software throughout its history. I believe that they still out-source text conversion to double-key data entry companies because it is cheaper, faster, and most important of all, more accurate.

OCR'ing will always cost less than re-keying, especially if you can do it in bulk, and as we are getting better-than expected results, we are likely to dispense with manual correction too (which increases the cost of OCR considerably of course).

Double-rekeying can be more accurate, but I have never seen a scenario where it is cheaper.

Two pass verification, also called double data entry, it is nothing but data entry quality control method that was originally employed when data records were entered onto sequential 80 column Hollerith cards with a keypunch. In the first pass through a set of records, the data keystrokes were entered onto each card as the data entry operator typed them. On the second pass through the batch, an operator at a separate machine, called a verifier, entered the same data. The verifier compared the second operator's keystrokes with the contents of the original card. If there were no differences, a verification notch was punched on the right edge of the card.

Informatics Outsourcing is an Offshore Data Management service company. Data Management Service includes all types of Data Entry Services like Text Data Entry, Numerical Data Entry, Double Key Data Entry, Image Data Entry with affordable price. Our team to give the solution quickly and given requirements.