Update: I’ve turned off commenting on this article because it was just
a bunch of people asking for help and never getting any. If you need
help with these instructions, go to Stack Overflow and ask there. If you
have corrections to the article, please send them directly to me using
the Contact form.

Tesseract is a great and powerful OCR engine, but their instructions for
adding a new font are incredibly long and complicated. At CourtListener
we have to handle several unusual blackletter fonts, so we had to go
through this process a few times. Below I’ve explained the process so
others may more easily add fonts to their system.

Create training documents

To create training documents, open MS Word or LibreOffice and paste in
the contents of the attached file named ‘standard-training-text.txt’.
This file contains the training text that Tesseract uses for its
included fonts.

Set your line spacing to at least 1.5, and space out the letters by
about 1pt using the character spacing setting. I’ve attached a sample
doc too, if that helps. Set the text to the font you want to use, and
save it as font-name.doc.

Save the document as a PDF (call it [lang].font-name.exp0.pdf, with lang
being the three-letter ISO 639 abbreviation for your language), and then
use the following command to convert it to a 300 dpi TIFF (requires
ImageMagick):
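
A minimal sketch of that conversion, assuming ImageMagick's `convert` is on your PATH (reading PDFs usually also needs Ghostscript installed). The function name `pdf_to_tiff` and the `eng` language code are placeholders for illustration, and the only option used is `-density 300`; you may need others depending on your ImageMagick version and PDF.

```shell
# Sketch only: rasterize a training PDF to a 300 dpi TIFF with ImageMagick.
# "eng.font-name.exp0" is a placeholder base name that follows the
# [lang].font-name.exp0 naming pattern.
pdf_to_tiff() {
    base="$1"
    # -density is given before the input file so that it controls the
    # resolution at which the PDF is rasterized.
    convert -density 300 "${base}.pdf" "${base}.tif"
}

# Example call, run from the directory that holds the PDF:
#   pdf_to_tiff eng.font-name.exp0
```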

You’ll now have a good training image called lang.font-name.exp0.tif. If
you’re adding multiple fonts, or bold, italic or underline, repeat this
process multiple times, creating one doc → pdf → tiff per font variation.
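
If you end up with several font variations, the repetition can be scripted. The loop below is only a sketch: it assumes every training PDF sits in the current directory and ends in `.exp0.pdf`, and the function name `convert_training_pdfs` is made up for this example.

```shell
# Sketch only: convert every training PDF in the current directory.
# Assumes names like eng.font-name.exp0.pdf, eng.font-name-bold.exp0.pdf, ...
convert_training_pdfs() {
    for pdf in *.exp0.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        tif="${pdf%.pdf}.tif"            # same base name, .tif extension
        convert -density 300 "$pdf" "$tif"
    done
}

# Example call:
#   convert_training_pdfs
```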

Train Tesseract

The next step is to run tesseract over the image(s) we just created, and
to see how well it can do with the new font. After it’s taken its best
shot, we then give it corrections. It’ll provide us with a box file,
which is just a file containing x,y coordinates of each letter it found
along with what letter it thinks it is. So let’s see what it can do:
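
A sketch of the box-generation step, assuming a Tesseract 3.x install, where `batch.nochop` and `makebox` are the config names used to produce a box file. The wrapper function `make_box` and the `eng` prefix are placeholders, not part of Tesseract.

```shell
# Sketch only: ask Tesseract for its best-guess box file for one image.
# Assumes Tesseract 3.x; later versions use a different training workflow.
make_box() {
    base="$1"   # e.g. eng.font-name.exp0
    # Reads ${base}.tif and writes ${base}.box alongside it.
    tesseract "${base}.tif" "$base" batch.nochop makebox
}

# Example call:
#   make_box eng.font-name.exp0
```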

You’ll now have a file called lang.font-name.exp0.box, and you’ll need
to open it in a box-file editor. There are a bunch of these listed on
the Tesseract wiki. The one that works for me (on Ubuntu) is moshpytt,
though it doesn’t support multi-page tiffs. If you need to use a
multi-page tiff, see the issue on the topic for tips.

Once you’ve opened it, go through every letter and make sure it was
detected correctly. If a letter was skipped, add it as a row to the box
file. Similarly, if two letters were detected as one, split them into
two rows.
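
After hand-editing, a quick sanity check can catch malformed rows. The sketch below assumes the Tesseract 3.x box layout of six space-separated fields per row (symbol, left, bottom, right, top, page), with coordinates measured from the bottom-left corner of the image. The function name `check_box_file` is made up for this example.

```shell
# Sketch only: flag box-file rows that do not look like
#   <symbol> <left> <bottom> <right> <top> <page>
# Exits non-zero if any row looks wrong.
check_box_file() {
    awk '
        NF != 6 {
            printf "%s:%d: expected 6 fields, got %d\n", FILENAME, FNR, NF
            bad = 1
            next
        }
        # right must be greater than left, and top greater than bottom
        ($4 + 0) <= ($2 + 0) || ($5 + 0) <= ($3 + 0) {
            printf "%s:%d: empty or inverted box\n", FILENAME, FNR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example call:
#   check_box_file eng.font-name.exp0.box
```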

Next, you need to extract the character set used in all your box files:

unicharset_extractor *.box

When that’s complete, you need to create a font_properties file. It
should list every font you’re training, one per line, followed by five
0/1 flags that say whether the font has each of the following
characteristics:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>
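
To make the column order concrete, here is a small sketch that writes a two-line font_properties file and checks its shape. The font names `font-name` and `font-name-bold` and their flag values are illustrative placeholders, not entries from the file shipped with Tesseract; use the font names from your own box file names.

```shell
# Sketch only: write a two-line font_properties file.
# Column order: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# Names and flag values below are illustrative placeholders.
cat > font_properties <<'EOF'
font-name 0 0 0 0 1
font-name-bold 0 1 0 0 1
EOF

# Every line should be a font name followed by five 0/1 flags.
awk 'NF != 6 { print "bad line " FNR ": " $0; bad = 1 } END { exit bad }' font_properties
```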

So, for example, if you use the standard training data, you might end up
with a file like this:

Note that this is the standard font_properties file that should be
supplied with Tesseract and I’ve added the two bold rows for the
blackletter fonts I’m training. You can also see which fonts are
included out of the box.