Optical character recognition

The Project

Optical character recognition (OCR) for the Bdinski sbornik project was
performed with ABBYY FineReader, version 11.
The input source was Jan L. Scharpe and F. Vyncke’s 1973 Bdinski Zbornik: An
Old-Slavonic Menologium of Women Saints (Ghent University Library Ms. 408, A.D.
1360) (Brugge: De Tempel), where the original typeset text looks like:

Scanning

ABBYY FineReader can perform OCR on either pregenerated PDF image files or input fed
directly from the scanner into the software through the TWAIN interface. The latter
approach, scanning directly into the program from the flatbed scanner (Scan to
Microsoft Word in the image below), yielded much better results:

We scanned at 300 dpi. We also tried 400 dpi in the hope of resolving some
recognition problems, but the recognition rate at the two resolutions was comparable,
and the higher resolution only produced larger files, with no processing
advantage.

Language settings and the pattern editor

The OCR process required us to select a “language” (the term is in quotation marks
because it is not necessarily the same as an actual human language), train the system to
map individual glyph images to individual characters (code points)
in the specified language, and then convert text areas from bitmapped images to
character data streams. To choose a language, first click on Tools to drop down
the menu below:

From that menu, click on Language Editor … to select and edit the language or
languages that the system will be asked to recognize. You can either select from a list
of existing languages or create a new one (by clicking on New … at the
bottom):

We initially set the recognition language to Russian (OCS was not an option) in the hope
that the pretrained knowledge of modern Cyrillic that shipped with the product would
improve the recognition rate. We found that the difference between the black-letter
typeface in the input source (see the image above) and modern
Russian typography was such that we obtained much better results starting with a
completely clean slate. That is, we created a separate, custom language, which we called
BdinskiSbornik. All in all, we found this more reliable (quicker to train,
more accurate results) than using Russian or even Russian (Old Spelling) as a
preset language.

Once we had decided to create a new language we had to configure it by specifying
the available inventory of characters, which would then be used as mapping targets
during OCR training. To edit a selected or custom language in order to specify the
characters that should be available, hit Properties …, which opens the Language
Properties dialog:

We selected Russian (Old Spelling) as the Source language as a way of
preselecting most of the characters that we wanted to include in our language.
At this point the selection specifies only the base character inventory, and should not
be confused with selecting Russian (Old Spelling) as a language, which we
avoided because it would have resulted in incorporating preexisting knowledge about
character distribution that would have been erroneous for our purposes. Since the
Source language is just the starting point for specifying the character
inventory, we then selected the … button next to the Alphabet bar to customize
our inventory by including and excluding individual characters:

We found it frustrating that although the Alphabet editing feature lets the user identify
the characters to be recognized, certain ones cannot be excluded (e.g., Latin x)
and others cannot be included (such as the entire Unicode Cyrillic Extended-B range;
http://www.unicode.org/charts/PDF/UA640.pdf). We understand that the use of a
pretrained language might entail a commitment to a standard character inventory, but we
see no reason why a custom language setting should not make the entire Unicode Basic
Multilingual Plane (BMP) available. Because of this limitation, it is not possible simply
to copy and paste an unsupported character into the program during the recognition process.
What we did instead was use placeholders, e.g., я for ꙗ, and we then replaced the
placeholders through global search and replace operations after the initial OCR had been
completed and the document had been saved to a Microsoft Word file.
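
The placeholder replacement can be sketched in a few lines of Python. This is an
illustration only and not the procedure used in the project, where the replacement was
done in Microsoft Word. The names PLACEHOLDERS and replace_placeholders are hypothetical,
and the sketch assumes that the OCR output is available as plain Unicode text and that a
placeholder character never occurs legitimately in that text.

```python
# Hypothetical placeholder table: each key is a character chosen during OCR
# training, each value is the character it stands for. Only the я / ꙗ pair
# comes from the text above; any further pairs would have to be added by hand.
PLACEHOLDERS = {
    "я": "ꙗ",  # U+A657 CYRILLIC SMALL LETTER IOTIFIED A
}


def replace_placeholders(text: str, table: dict[str, str] = PLACEHOLDERS) -> str:
    """Replace every single-character placeholder with its target character."""
    return text.translate(str.maketrans(table))


print(replace_placeholders("яко"))
```

A single translation table applies all substitutions in one pass, so one replacement
cannot accidentally feed into another.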

The overall recognition rate was good. We didn’t count the errors per line, but after
training the system and letting it train us, we found that we could process a page of
input in approximately ten minutes, which was much quicker and more accurate than
keyboarding would have been. Nonetheless, the program’s behavior was inconsistent:
sometimes it would recognize a character flawlessly for a while and then begin to make
mistakes with the same character later. Processing the same page multiple times could
yield different results each time.

Where the program makes a consistent mistake due to a training error, such as regularly
mapping a glyph image to the same wrong character, it is possible to undo the erroneous
training by using the Pattern Editor (Tools → Pattern Editor). Begin by
selecting the language you’re using:

and then hit Edit …. A screen of glyph images that have been stored during
training, along with the characters to which they have been mapped, will open, and you
can then delete the glyphs that are being misrecognized consistently and retrain the
page or document:

Reading and training

After scanning each page we found it most efficient first to delete or edit the green
text boxes that define the areas to which OCR will be applied:

This enabled us to exclude running headers, footnotes, edition line numbers, and other
areas that we did not want to retain in our output.

Because we created a new language, the system began with no ability to recognize
any glyphs, and we had to train it. The training proceeds slowly, one glyph at a time,
at first, but the system learns quickly, after which we could just feed it pages and let
it read. The character recognition was never flawless, and we always had to read each
page of output carefully and correct the errors, but overall we were satisfied with
ABBYY’s ability to simplify and streamline the process of converting the printed text to
character data.

To train, go to Tools → Options, and check the box that says Read with
Training:

This box unchecks itself after every reading, so it must be checked again every time
you run a training scan. If you forget (as we frequently did), ABBYY will read all your
pages (or your selected page) automatically. If you have already manually corrected the
mistakes in the output panel on the right side and are rereading to improve the
training, this automatic reading will replace the page with the standard recognition
output and overwrite your corrections.

If you just hit Read, ABBYY will read the entire document, which probably isn’t
what you want, either at first (because you’ll need to train the system) or later
(because the output required fairly extensive editing and correction, we found it easiest
to work through the book a page at a time). To read selected pages, highlight them
and right-click. To read selected blocks of text, outline them with a green box and
right-click.

Training itself is a fairly straightforward process. As was noted above, it is not
possible to copy and paste new symbols into the Training field, and if there is not an
option for a certain letter, it is necessary either to edit the range of available
characters in Language Editor (if permitted) or select a placeholder character and then
replace it afterwards by employing a global search and replace operation to post-process
the output Word file. The system tries to isolate individual characters, but it
sometimes misses, whether because the glyph is discontinuous (e.g., ы) or because
the image has a faint part that makes it appear discontinuous, and it also sometimes
misreads what should be two separate characters as one. If the system has
erroneously selected only part of a letter, or more than one letter or symbol, it is
possible to adjust the recognition area with the help of the « and »
buttons, which can join or unjoin what appear to be discrete glyphs (separated by white
space). The training interface is illustrated below:

After training and reading, letters about which the system is uncertain are highlighted
on the right side of the screen in turquoise. The user can correct these manually, but
those corrections are not used to update the training. It is possible to correct any
character, whether highlighted in turquoise or not, but in general the system showed a
fairly good awareness of which recognition moments were uncertain, and it proved most
efficient to concentrate on reviewing just the turquoise sections at this stage. The
remainder is not error-free, but since the entire output will undergo comprehensive
proofreading, correction, and editing at a later stage in the development process, we
felt that any further correction should be deferred until that time.

One of the trickiest and most frustrating aspects of the training and reading workflow is
that correcting the output errors as described above does not feed back into the
training, and therefore does not improve subsequent recognition. It is possible to
retrain on a whole page, but this is tedious and sometimes counterproductive in that it
may correct one error while introducing another. Alternatively, it is possible to select
just a small area of text to be trained, select Read with Training, and right-click
on that selected text box to read only that area. The user can then find any
problematic symbol (select … next to the text box), copy it, close the trainer
without saving the pattern, and delete the unnecessary text area. We needed this workaround
fairly frequently for letters that were adorned with diacritics or superscript letters. A
related problem is that it is not possible to train the system on glyphs that it
incorrectly thinks it has recognized. During the training process the system will stop
and query the user when it is uncertain, but if it is certain but incorrect, there is no
way to select for training a glyph on which the training routine has not stopped on its
own.

Saving, opening, etc.

It is possible to save selected pages by highlighting them on the left side and
right-clicking. To save the entire project, the user must select Save FineReader
Document under File. Likewise, to open a project, the user must
specifically select Open FineReader Document under File:

Simply hitting Open will open only an individual file, and not an entire project.
Saving the project means saving not only the text, but also the training, so this is an
important step if the OCR is not going to be performed all in one session.

Recognition issues

The following recognition problems were particularly noticeable in the project:

The system mistook the letter б (b) for the digit 6 (six), the letter
o for the digit 0 (zero), and the letter з (z) for the
digit 3 (three). We tracked down many of these afterwards by performing a
regular expression search for sequences of characters that included both letters and
digits, since although the input document contained both, it was unlikely to include
them in the same word. We corrected single-character errors, which could not
be identified in this way, during the character-by-character proofreading and
editing stage.
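
A search of this kind can be sketched in Python as follows. The function name
find_mixed_tokens and the suggestion table are hypothetical, the sketch assumes plain
Unicode text as input, and it assumes that the letter confused with zero is Cyrillic о.
It only reports candidates and suggested readings; a human reader still decides whether
to accept each one.

```python
import re

# Digit-to-letter suggestions for the three confusions described above.
# Assumption: the letter confused with zero is Cyrillic о, not Latin o.
DIGIT_TO_LETTER = str.maketrans({"6": "б", "0": "о", "3": "з"})


def find_mixed_tokens(text):
    """Return (token, suggestion) pairs for tokens mixing letters and digits."""
    hits = []
    for match in re.finditer(r"\w+", text):
        token = match.group()
        has_digit = any(ch.isdigit() for ch in token)
        has_letter = any(ch.isalpha() for ch in token)
        if has_digit and has_letter:
            hits.append((token, token.translate(DIGIT_TO_LETTER)))
    return hits


for token, suggestion in find_mixed_tokens("с6орникь 1360 з0веть"):
    print(token, "->", suggestion)
```

Tokens that consist only of digits, such as a date, are left alone, which matches the
observation that letters and digits rarely share a word in the input.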

The system occasionally skipped or inserted spaces. To fix this we globally reduced
all sequences of space characters to a single space afterwards; we inserted missing
spaces manually during proofreading.
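
The reduction of space sequences is a one-line regular expression substitution. The
sketch below is a minimal illustration with a hypothetical function name, assuming the
output is handled as plain text; it touches only the space character and leaves tabs and
line breaks alone.

```python
import re


def collapse_spaces(text: str) -> str:
    """Reduce every run of two or more space characters to a single space."""
    return re.sub(r" {2,}", " ", text)


print(collapse_spaces("а  б   в"))
```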

The system frequently confused в (v) and б (b).

The system very frequently confused и, н, and п. Some of these
could be corrected through global search and replace operations (e.g., of those
three letters, only и can stand alone as a word, and only н occurs before
а in a two-letter word) or global search with verification before
replacement (e.g., in word-initial position before о, п is overwhelmingly the most
likely, н is possible, and и is impossible).
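
The two kinds of rule can be sketched in Python as follows. The function names are
hypothetical, the rules are only the three examples given above, and the sketch assumes
plain Unicode text in which Python's word boundary (\b) coincides with the word
divisions of the edition. The first function applies the unconditional replacements; the
second only lists words for a human to verify.

```python
import re


def fix_unambiguous(text: str) -> str:
    """Apply the replacements that need no verification."""
    # Of и, н, and п, only и can stand alone as a word.
    text = re.sub(r"\b[нп]\b", "и", text)
    # Of the three, only н occurs before а in a two-letter word.
    text = re.sub(r"\b[ип]а\b", "на", text)
    return text


# Word-initial и or н before о: и is impossible there and н is less likely
# than п, so both are reported for checking rather than replaced.
SUSPECT = re.compile(r"\b[ин]о\w*")


def suspects(text: str):
    """Return (offset, word) pairs that should be verified by hand."""
    return [(m.start(), m.group()) for m in SUSPECT.finditer(text)]


print(fix_unambiguous("п па градь"))
print(suspects("иоиде но покои"))
```

Keeping the automatic and the verify-first rules in separate functions makes it harder
to apply a risky replacement globally by mistake.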

The system mistook м for iл. This was easily corrected with a global
search and replace operation.

The system inconsistently recognized ы, often making it ьi. Ultimately
we remapped most occurrences of ы (jery) to ьї because that’s what occurs in the
manuscript, so this was easily handled during post-processing.
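
Multi-character corrections such as the last two can be expressed as an ordered list of
search and replace pairs. The sketch below is a simplified illustration with hypothetical
names. It assumes plain Unicode text, writes the misrecognized i as Latin i (the OCR
output may instead have used Cyrillic і), and replaces every occurrence of ы, whereas the
project remapped most but not all of them.

```python
# Ordered (wrong, right) pairs, applied one after another.
# Assumption: the misrecognized "i" is Latin i.
REPLACEMENTS = [
    ("iл", "м"),   # м misread as iл
    ("ьi", "ьї"),  # ы misread as ьi, normalized to the manuscript spelling
    ("ы", "ьї"),   # correctly recognized ы, also normalized
]


def apply_replacements(text: str, pairs=REPLACEMENTS) -> str:
    """Apply each literal replacement in order and return the result."""
    for wrong, right in pairs:
        text = text.replace(wrong, right)
    return text


print(apply_replacements("iлати бьiти ты"))
```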

Capitals in general were very poorly recognized. In addition, the system would
capitalize lowercase letters and convert to lowercase letters that should have been
capitalized. Neither error was a problem because we converted the entire output file
to lowercase anyway. The edition introduces case distinctions according to modern
orthographic conventions, capitalizing proper nouns and the first word of every
sentence, and we corrected these to conform to the spellings found in the
manuscript, reserving modern uppercase letters for decorative initials and headings
in the manuscript.