Rationale of the CLaMM competition

The tasks evaluated in this competition, which is based on the CLaMM corpus (CLaMM: Classification of Latin Medieval Manuscripts), concern the classification of images of Latin scripts taken from handwritten books dated from 500 C.E. to 1600 C.E.

Automated analysis and classification of handwriting in the written production of the European Middle Ages is a new challenge and a frontier for handwriting recognition and for document analysis and recognition.

Context

Digital libraries of cultural heritage institutions contain tens of thousands of digitized manuscripts of the European Middle Ages.

The overwhelming majority of these manuscripts are written in Latin script.

In this context, there is a need for automated “tagging” or “cataloguing” of the handwriting in the images, not only to support historical research (when and how a given text was written), but primarily because it is a prerequisite for handwritten text recognition (HTR), automated indexing and data mining. To perform HTR on digitized manuscripts, a separate “numerical model” is needed to recognize the text of each script type, so identifying the script type is the first step.

This has already been stated for modern handwriting styles [1]. Over the medieval millennium, from 500 C.E. to 1600 C.E., the Latin script evolved and took very different forms, far more diverse than all the writing styles of the 19th to the 21st century.

Importance of the CLaMM dataset

The participants in this competition will receive the only available reference dataset that covers the European Middle Ages and is labelled with script type and production date.

In real-life conditions, and beyond the challenges of material degradation, segmentation, etc., one difficulty is the historical continuum in the evolution of scripts: there are mixed types, and many scripts could belong to two or more categories. In this regard, the classification of scripts comes up against the subjectivity of human judgment, so that, as in art history, all attributions remain subject to debate and discussion.

Related topics and previous work

The present competition on the Classification of Medieval Handwritings in Latin Script is related to, but differs from:

Performing scribal identification within a homogeneous corpus or within a particular manuscript.

The latter topic is the closest and has been dealt with by numerous competitions and publications [2]–[4].

As for the Classification of Medieval Handwritings in Latin Script specifically: the first attempt at automating the classification of medieval Latin scripts was made by the Graphem research project (Grapheme based Retrieval and Analysis for PalaeograpHic Expertise of medieval Manuscripts) funded by the French National Research Agency (ANR-07-MDCO-006, 2007-2011). The results are published in [5], [6].

Further research has been conducted on a theoretical level by one of the organizers and by several teams in Computer Science [7]–[15]. Nevertheless, none of these teams had access to the labelled dataset, which has not been made available anywhere.

[9] I. Siddiqi, F. Cloppet, and N. Vincent, “Contour Based Features for the Classification of Ancient Manuscripts,” presented at the 14th Conference of the International Graphonomics Society (IGS), Dijon, 2009.

CLaMM and HDRC-IR

This site presents the CLaMM (Classification of Latin Medieval Manuscripts) corpus, the basis for the Competitions on the Classification of Medieval Handwritings in Latin Script, jointly organized by computer scientists and humanists (paleographers) at ICFHR2016 and ICDAR2017. It also hosts the new HDRC-IR competition (Image Retrieval for Historical Handwritten Documents) at ICDAR2019.
It gives the handwriting analysis and recognition community access to a rich database of European medieval manuscripts.
Keywords: Historical documents; Image classification; Feature extraction