This post is the first of a small series centred on optical character recognition applied to manuscripts. There are many interesting research projects dealing with this question, but my purpose here is quite different: I wish to demonstrate how it is now possible for a single researcher to quickly get usable results with some of the open source tools out there, starting with OCRopy and CLSTM.

The field of optical character recognition (OCR) applied to manuscripts (handwritten text recognition) is rapidly evolving, especially now that artificial intelligence methods, such as neural networks, are becoming more widespread. For the scholar of medieval manuscripts, it has interesting applications: it can help in the creation of textual databases or in the collation of witnesses.

OCRopy is a “collection of document analysis programs”1 that uses a Long Short-Term Memory (LSTM) architecture for recurrent neural networks2. It has been successfully applied to a variety of printed scripts, including old prints3. In particular, it is already being used to acquire the texts of incunabula. There is already some very good documentation on this subject, particularly Uwe Springmann’s slides from a workshop in Munich in 20154 and a paper about the Ocrocis environment5.

My interest in OCRopy started with the modest goal of digitising the text of XIXth century editions of Old French texts (namely, the collection of the Anciens poëtes de la France). With it, I got close to 1% error after some time, but soon, motivated by the successful use of OCRopy with incunabula and news I heard from a colleague6, I turned to trying it on manuscripts.

Setting up OCRopy

For the Ubuntu user (and probably for other Linux distributions as well), OCRopus is very easy to install (it may prove harder on Mac OS, though), following the four simple steps described on the repository. Once installed, the next step is getting the image files for the text you want to recognise, preferably in TIFF format and in good resolution (I used 600 DPI images, but the more common recommendation is 300 DPI). Some digital libraries offer such files with an open license, for instance the E-Codices platform7.
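For reference, installation at the time of writing boils down to something like the following (a sketch only; follow the repository’s README for the current, authoritative steps and package list):

$ git clone https://github.com/tmbdev/ocropy.git
$ cd ocropy
$ sudo apt-get install $(cat PACKAGES)   # Ubuntu dependencies listed by the project
$ # download the default English model into models/, as indicated in the README
$ sudo python setup.py install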

Fig. 1: screenshot of Scan Tailor. Here, splitting the pages.

Depending on the case, you might have to preprocess the images, to rectify orientation, clean the image, crop it, etc. There are many tools to do that, for instance ScanTailor (fig. 1). Once your images are ready, and in a ./tif folder, you are good to go.

Preparing data: from layout analysis to ground truth production

Binarisation and layout analysis: column and line segmentation

Before anything, you will need (if you haven’t done it before) to binarise your images, and to try to detect columns and lines8. This first command,

$ ocropus-nlbin tif/* -o book

will binarise your images and place them in a book folder, and,

$ ocropus-gpageseg book/*.bin.png

will try to identify columns and lines. For both commands, you might have to use the -n option to deactivate error checking, which would otherwise skip pages or lines in your data. This part of the tools, layout analysis, is not (yet) based on training and can only be configured through a few options, which are not always well documented. Some are related to scale, others to noise or baseline thresholds; I haven’t had much experience playing with those, so any feedback in the comments to this post will be appreciated. Indeed, with manuscripts, this phase is often quite problematic, since lines or columns are not necessarily very regular…
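For instance, one could limit the number of column separators and switch to the Gaussian mode with something like the following (an illustration only; the exact option names and defaults may vary between versions, so check ocropus-gpageseg --help):

$ ocropus-gpageseg -n --maxcolseps 1 --usegauss book/*.bin.png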

In my experiments on ms. Bodmer 68, using the uniform and Gaussian modes, I got the following error rates in column and line segmentation on fol. 134r, 136v and 142r:

fol.       Column errors   %     Line errors   %
Default
134r       25              60    6             7
136v       13              31    2             2
142r       22              52    1             1
Gaussian
134r       27              64    4             5
136v       16              38    4             5
142r       22              52    1             1

I do not count here errors related to lines truncated at the beginning or end, nor false positives (noise lines). In my experience, this is the part where OCRopy (as well as the other tools I tried) is least effective. In the end, I did the column segmentation myself, with an image processing tool, and corrected the line segmentation manually with Gimp. One can hope there will be improvements in this area in the near future9.

Fig. 2a: a successful segmentation

Fig. 2b: an extreme segmentation failure

Ground truth

The first step in training a model to OCR a manuscript is then to create some ground truth data, that is, a correct (or as correct as possible) transcription of a sample of the manuscript, on which the model will be trained so that it learns to recognise the handwriting. This can easily be done with OCRopy, by creating an HTML file with line images and text boxes:

$ ocropus-gtedit html -H 35 book/*/*.bin.png -o gt.html

and then,… well, then you have to transcribe a part of the text, to have something to train with (fig. 3).

Fig. 3: ground truth production

The question is: how many lines do you need to get an effective model? Usually, with deep learning, the answer is: the more data, the better. For my part, I have had good results with as few as 400 lines and have not had the chance to train with more than 2000 yet. According to Uwe Springmann and David Kaumanns10, good results on incunabula were obtained with between 1000 and 5000 training lines, the latter for harder cases. In any case, I would advise you to start small (for instance, 400 lines), then train, use the model to annotate 400 more, correct, re-train, and so on until you get a satisfying model.

Training on a medieval manuscript: the ms. Bodmer 68 of the Chanson d’Otinel

Training

Once you have some ground truth data, you can begin training (that can take some time). First, you will have to extract ground truth from the html file used for transcription:

$ ocropus-gtedit extract gt.html

The lines you transcribed will go into the book folder, into the subfolders created for each page, and will be labelled <image-name>.gt.txt.

Be careful: all the lines of the gt.html file that contain text will be extracted (empty lines will be skipped), not only the ones you edited. So, if you have some uncorrected lines containing text, you have to pay attention not to include them in the training or testing data.

So, in order to train, place 90% of your data in a train folder and 10% in a test folder, which will be used to estimate the performance of each model (a small shell sketch for this split is given below, after fig. 4). The last thing to do, if you have special characters, is to modify the file chars.py, located in your ocrolib installation (on Ubuntu, it will be in /usr/local/lib/python2.7/dist-packages/ocrolib/). The syntax is simple enough (fig. 4).

Fig. 4: the beginning of my chars.py
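As a minimal sketch of the 90/10 split (assuming the book/page/line layout produced above, and simply sending every tenth transcribed line to the test set):

# send every tenth transcribed line (image + ground truth) to test/, the rest to train/
i=0
for gt in book/*/*.gt.txt; do
    img="${gt%.gt.txt}.bin.png"
    page=$(basename "$(dirname "$gt")")
    if [ $((i % 10)) -eq 0 ]; then dest="test/$page"; else dest="train/$page"; fi
    mkdir -p "$dest"
    cp "$gt" "$img" "$dest/"
    i=$((i + 1))
done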

Once your data is split and chars.py edited, you can launch training with:

$ ocropus-rtrain -o myModel -d 1 train/*/*.bin.png

The -d 1 option allows you to visualise the training steps (fig. 5); you can safely remove it if you want to save time. In each training iteration, a line is read, and the output guessed by the neural network (OUT) is compared to the ground truth (TRU) and to the aligned text (ALN).

The visualisation shows you the ground truth and the training image, the predicted and aligned results, the character probabilities (green for space, blue for the character with the highest probability, and yellow for the absence of a character) and the evolution of the error11.

Every n iterations, the model is saved (the default is 1000, but you can modify it with the -F option). There are a few other useful options.

The --load option allows you to retrain an existing model (or, if you want, to split training into several sessions), and -N sets after how many iterations the training must end. But let’s have a look at two other parameters that have an important effect on the training results (and training time). -r is the learning rate: in my experience, the default value is quite good, but I would be happy to have feedback on it. -S is the number of state units of the model: increasing it to 200 or even 400 did wonders, as my models learned faster and/or achieved lower error rates. On the other hand, training was more computing-intensive, and it considerably increased the time and computing power needed for each iteration. I will develop this point in my next post, about CLSTM.
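As an illustration (the numbers here are arbitrary), a training run with a larger network, stopping after 50,000 iterations, could thus be launched with:

$ ocropus-rtrain -o myModel -S 200 -N 50000 train/*/*.bin.png

A previously saved model (e.g. myModel-00030000.pyrnn.gz, a hypothetical file name) can likewise be passed to --load to continue training it.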

The other question is how long a training is needed to get good results. In their experiments on incunabula, Uwe Springmann and David Kaumanns got their best results with 30,000 to 200,000 iterations. Expressed as a number of epochs (an epoch is a number of iterations equal to the number of training lines, so with 400 ground-truth lines an epoch is 400 iterations, and 100 epochs would amount to 40,000 iterations), they advise around 100 epochs12. As for me, the number of iterations that gave the best results varied from model to model.

For now, it seems that the raw number of iterations is a better indicator than the number of epochs, and that 30,000 iterations is a good starting point, but this remains largely to be explored.

If you have such feedback to share, please do not hesitate to put it in a comment or send it to me. If I get enough of it, I will try to statistically estimate better configurations and make a post about it.

Evaluating the results

Once you have trained and produced a few models, the next step is to comparatively assess their performance and error rate. To do that, you can use the following bash code:
# evaluate every saved model on the test set and append its error rate to "modeltest"
for i in *.pyrnn.gz; do
    echo "$i" >> modeltest
    ocropus-rpred -m "$i" test/*/*.bin.png
    ocropus-errs test/*/*.gt.txt 2>> modeltest
done

This way, you will get a modeltest file, with the error rate of all models. For each model, you will have something like this:
bodmer-00054000.pyrnn.gz
errors 615
missing 0
total 6340
err 9.700 %
errnomiss 9.700 %

that is, the raw number of errors, of missing characters, total number of characters, and percentage of error.

On manuscript Bodmer 68, I started a first training, with 1722 lines of my transcription of the Chanson d’Otinel, and the error followed the evolution presented in fig. 6.

Fig. 6: first training on manuscript Bodmer 68

As you can see, the model soon reached around 20% error; at iteration 46,000, it was at 16.32% error, and it did not improve afterwards.

I wasn’t quite satisfied with these results, which were not really usable. So, I went through a phase of checking the correctness of the training data and its alignment with the segmented lines, editing chars.py to include all the characters I needed (such as long ſ), etc. Then I trained again: you can see how effective the improved quality of the input data was on this new training (fig. 7).

Fig. 7: second training on manuscript Bodmer 68

This time, the model got down to 9.7% error at iteration 54,000, and the error did not decrease any further afterwards. You can also observe some isolated spikes in the error at iterations 23,000 and 57,000.

To understand the source of the errors, and possibly apply some post-processing to lower the error rate, you can have a look at the character confusions, using the command:

$ ocropus-econf test/*/*.gt.txt

This will give you the most frequent confusions, in a form similar to:

32 _
29 _
21 _
13 _ ı
10 _
8 z _
7 n m
6 _
6 _ t
6 _ u

As you can see, in my case, most errors were related to whitespace, to confusions between whitespace and characters or abbreviations, or to the succession of strokes constituting n, m or ı. If you also want to see whether the errors depend on a certain context, you can try the -C option:

$ ocropus-econf -C2 test/*/*.gt.txt

In my case, the most frequent confusions with context were more explicit:

As you can see, most errors were related to abbreviations (superscript a, tildes, etc.) or to successions of minim strokes (ı, n, m, u and their various combinations), which are also hard for humans to make out when they cannot perceive the full word.

On the other hand, the results obtained were by now quite usable. To give you an idea, here is a sample of the recognition, compared with the binarised folio of the manuscript (fig. 8); I underline errors and indicate missing characters with ø.

In some cases, it might be useful to do some automatic post-processing to correct the most frequent errors, for instance with a database of known or existing words. In any case, once you are satisfied with your model, it is time to OCR the rest of the manuscript.
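As a very rough sketch of such a word-list check (to be run once the whole text has been recognised, as described in the next section; wordlist.txt is hypothetical here, for instance a list of forms compiled from existing editions), one could simply flag the OCR word forms that do not appear in the list, for manual review:

# list the word forms in the OCR output that are absent from the reference word list
cat book/*/*.txt | tr -s '[:space:][:punct:]' '\n' | sort -u > ocr-words.txt
comm -23 ocr-words.txt <(sort -u wordlist.txt) > suspect-words.txt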

Acquiring the rest of the text and extracting it

To apply recognition to the full document, you can do:

$ ocropus-rpred -m myModel.pyrnn.gz book/*/*.bin.png

This will predict the text of the whole book folder. You can then extract it in several formats; to get it as HOCR, use:

$ ocropus-hocr book/*.bin.png

The HOCR format is interesting, as it gives you, for each text line, its alignment with the original document (that you can transform, with XSLT, to TEI facsimile elements, if you so wish):

The only problem with this format concerns manually corrected lines, which I fear may be omitted in the export procedure. You can also extract the text with:

$ ocropus-gtedit text book/*/*.bin.png

This will give you a correct.txt file, with the text, and a separate reference.html file, with the indexed line images.

If you want to correct the results of the prediction before extracting them, you can do instead:

$ ocropus-gtedit html -H30 book/*/*.bin.png

It will give you a correction.html file, which you can again edit and correct. Here, we have to be cautious, because, when extracting text, the predicted (.txt) lines are used by default instead of the ground truth ones (.gt.txt). So, once we have corrected, we need to remove all existing text files, extract, and rename the files from .gt.txt to .txt (yes, it is a bit of a bother):
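A minimal sketch of these three steps (assuming the book/page/line layout used above; double-check the globs against your own folders before deleting anything):

$ find book -name '*.txt' ! -name '*.gt.txt' -delete   # remove the predicted lines only
$ ocropus-gtedit extract correction.html
$ for f in book/*/*.gt.txt; do mv "$f" "${f%.gt.txt}.txt"; done   # rename .gt.txt to .txt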

And now you have your text! You can convert it to TEI or some other format, and start editing.

From one manuscript to the next: trying to reuse a model on another manuscript

So, from the beginning of this post, I have talked of training a model for a specific manuscript. One question remains: can this model be applied to another, similar manuscript, or retrained on another manuscript?

I tried to apply the model created for Bodmer 68 (last third of the XIIIth century, Gothic Textualis libraria, Anglo-Norman) to another, quite different manuscript I wanted to OCR, the Digby 23 (first half of the XIIth century, Praegothica, Anglo-Norman as well). I will talk more about the model developed for this manuscript with CLSTM in my next post, but in short, without retraining, I got a 33% error rate (instead of 9.7%). After a few epochs of retraining (with the --load option), it went down to 15%, but never below. At the same time, parts of the model started “exploding”, and lowering the learning rate13 did not solve the problem:

It soon turned out to be much more effective to train a new model from scratch. In fact, in my experience, retraining an existing model, be it for print or manuscript, though it might be faster, has always been less effective in the end.

That’s it for now! Do not hesitate to leave observations, remarks or feedback in the comments.
