Accessing the full text of the Trésor des Chartes’ registers! (beta version)

HIMANIS project

Looking for the word “abbatisse” in the plain text of the Trésor des Chartes’ registers

The HIMANIS project is funded by the Joint Programming Initiative on Cultural Heritage and Global Change (JPI-CH) of the European Union. The partners are developing cost-effective solutions for querying large sets of handwritten document images. The project brings together Computer Science, Humanities and Cultural Heritage institutions to produce technology that generates new, research-based knowledge from historical manuscripts. Its case study, both challenging and particularly interesting, is the large collection of the Trésor des Chartes’ registers produced by the French royal chancery (Paris, Archives Nationales, JJ35 – JJ211). The expected outcomes of this project are:

a new indexing/searching technology for historical manuscripts

a new paradigm to study our historical heritage, as conveyed by manuscripts, by using full text search technology.

a new vision of the rise of nation states in Europe via a new study of the corpus under this paradigm.

You can search the plain text in the Trésor des Chartes’ registers and provide feedback!

How to search?

Query settings

Home page of the HIMANIS indexing and search engine

A query box is provided for typing queries.

The search engine is case insensitive (“Paris” and “paris” retrieve the same results). The search engine allows for “AND”, “OR” and “EXPRESSION” queries (see below).

Abbreviations are not queried as they appear in the images; instead the corresponding expanded forms are used. For example, use “paris” to find “Paris”, “par.”, “par^s”, “Pars.”, etc.

Query text must be diacritics-folded (e.g., “sebastien” rather than “Sébastien”).
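As an illustration of the case and diacritics rules above, here is a minimal Python sketch for preparing a query before typing it into the box. The function name `fold_query` is hypothetical and not part of the HIMANIS system; the sketch only handles accents that Unicode can decompose into a base letter plus a combining mark, so ligatures and similar characters would need separate handling.

```python
import unicodedata


def fold_query(text: str) -> str:
    """Lowercase the text and strip combining diacritical marks."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(fold_query("Sébastien"))
```

Since the engine is described as case insensitive, the lowercasing step is only there for consistency; the diacritics folding is the part the query box requires.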

A confidence box and slider are provided to set a confidence threshold, which determines the desired precision-recall tradeoff. The default is 50%. A higher value results in higher precision (few or no wrong spots) and lower recall (some, or many, existing instances of the query may not be retrieved); a lower value results in lower precision (some, or many, wrong spots) and higher recall (all or most of the existing instances can be retrieved).

A ‘Max. results’ box is provided to set the maximum number of spots to be retrieved and shown. For example, if this number is set to 20 and the confidence is set to 40%, the system shows at most 20 spots with confidence higher than 40%; and if no (or 0) confidence is set, the system retrieves at most 20 spots with the highest confidence.
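The interplay between the confidence threshold and ‘Max. results’ can be sketched as follows. This is only an illustration under stated assumptions, not the engine's actual code: the name `select_spots`, the spot identifiers and the confidence values are all invented, and the sketch assumes that the most confident spots are the ones kept when the cap applies.

```python
from typing import List, Optional, Tuple

# A spot is (hypothetical identifier, confidence in percent).
Spot = Tuple[str, float]


def select_spots(spots: List[Spot], confidence: Optional[float], max_results: int) -> List[Spot]:
    """Keep spots above the confidence threshold, best first, capped at max_results."""
    if confidence:  # None or 0 means "no threshold"
        spots = [s for s in spots if s[1] > confidence]
    # Assumption: the most confident spots are kept when the cap applies.
    ranked = sorted(spots, key=lambda s: s[1], reverse=True)
    return ranked[:max_results]


# Invented toy values, for illustration only.
toy_spots = [("s1", 90.0), ("s2", 35.0), ("s3", 55.0), ("s4", 41.0)]
print(select_spots(toy_spots, 40, 2))    # threshold 40%, at most 2 spots
print(select_spots(toy_spots, None, 3))  # no threshold, the 3 most confident spots
```

Raising the threshold shortens the list (higher precision, lower recall); lowering it lets more doubtful spots through, up to the cap.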

Query settings

Search: AND, OR, SEQUENCE

The search engine allows compound queries (boolean AND, OR, NOT) and “sequences”. For compound queries, the results are computed at Page level, and the number of matches is the total number of word occurrences matching the query.

Relations are interpreted at the full page-image level.

The operators are: AND: “&&” (or just a blank space), OR: “||”, NOT: “-” (placed before the words to be negated). Parentheses “(” and “)” can be used for grouping.
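Because these relations are interpreted at page level, each operator can be pictured as a set operation on the pages where a word was spotted. The sketch below uses Python sets and an invented toy index (the page identifiers are hypothetical); the real engine works with confidence scores rather than hard yes/no sets, and the example query strings in the comments are assumptions about how the operators combine.

```python
# Invented toy index: word -> set of page identifiers where the word was spotted.
index = {
    "abbatisse": {"p1", "p2", "p5"},
    "monasterii": {"p2", "p3", "p5"},
    "paris": {"p5", "p6"},
}


def pages(word: str) -> set:
    """Pages on which the word was spotted (empty set if the word is unknown)."""
    return index.get(word, set())


# Query: abbatisse && monasterii   (or simply: abbatisse monasterii)
both = pages("abbatisse") & pages("monasterii")

# Query: abbatisse || paris
either = pages("abbatisse") | pages("paris")

# Query: (abbatisse || monasterii) -paris
grouped = (pages("abbatisse") | pages("monasterii")) - pages("paris")

print(sorted(both), sorted(either), sorted(grouped))
```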

For “Sequence” queries, the number of matches is computed at Page level, and one match corresponds to the complete expression.

A query can be a sequence of words, expressed using the symbols “[” and “]”. Sequences are not exact strings; they allow (a few) extra words to appear among the stated words. For example, “[ludovico francorum]” (where “francorum” is an expanded abbreviation) would retrieve pages with strings such as “ludovico franc.”, “Ludovico rege franc.”, “ludovico dei gra. francorum”, etc. In the current system version, sequences cannot be mixed with boolean operators.
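The idea of a sequence that tolerates a few extra words can be sketched as an ordered match with a bounded gap. This is a simplified illustration, not the HIMANIS algorithm: the function names and the `max_gap` parameter are hypothetical, the value 2 is an arbitrary stand-in for “a few”, and the words are assumed to be already expanded and diacritics-folded.

```python
from typing import List


def _match_from(words: List[str], query: List[str], start: int, max_gap: int) -> bool:
    """Try to match the remaining query words, starting at position `start`."""
    if not query:
        return True
    # The next query word may be preceded by at most max_gap extra words.
    window_end = min(len(words), start + max_gap + 1)
    for i in range(start, window_end):
        if words[i] == query[0] and _match_from(words, query[1:], i + 1, max_gap):
            return True
    return False


def matches_sequence(page_words: List[str], query_words: List[str], max_gap: int = 2) -> bool:
    """True if query_words occur in order with at most max_gap words between neighbours."""
    if not query_words:
        return False
    return any(
        word == query_words[0] and _match_from(page_words, query_words[1:], i + 1, max_gap)
        for i, word in enumerate(page_words)
    )


query = ["ludovico", "francorum"]
print(matches_sequence(["ludovico", "rege", "francorum"], query))
print(matches_sequence(["ludovico", "dei", "gratia", "francorum"], query))
print(matches_sequence(["francorum", "ludovico"], query))
```

The helper backtracks over candidate positions instead of always taking the first occurrence, so a repeated word does not cause a valid sequence to be missed.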

Single words

You can search for single words (click on each word to see the results).

Navigation through the results

The current presentation of results is bound by the hierarchy of the collection and we know that it is not always easy to flip through the results. We ask for your patience and your help!


It is possible to:

browse all registers

browse all pages in one register

see the list of registers with the number of matching pages (not the number of matches in the volume)

see the list of pages with the number of matches

Queries are interpreted, and results are shown, hierarchically. For example, assume a collection is composed of many Volumes and each Volume of many Pages. Then, if the query is issued at the top collection level, the system first shows the Volumes where the query may appear with confidence above the threshold specified. Then if the same query is issued for a specific Volume, the system shows all the Pages where the query may appear with a confidence above the threshold. Finally if it is issued for a specific Page, the system shows locations where the words involved in the query may appear. In the PRHLT interface, default levels of the hierarchy are called: “Home”, “Collection”, “Chapter” and “Page”.
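The difference between the two counts shown in the lists (matching pages per volume versus matches per page) can be made concrete with a small sketch. The function name `summarize`, the volume labels and the confidence values are invented for illustration and do not come from the PRHLT interface.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A spot is (volume label, page number, confidence in percent); all values are toy data.
Spot = Tuple[str, int, float]


def summarize(spots: List[Spot], threshold: float) -> Tuple[Dict[str, int], Dict[Tuple[str, int], int]]:
    """Return (matching pages per volume, matches per page) above the threshold."""
    # Page level: count every spot above the threshold on each page.
    matches_per_page = Counter((vol, page) for vol, page, conf in spots if conf > threshold)
    # Volume level: count pages with at least one match, not the matches themselves.
    pages_per_volume = Counter(vol for vol, _page in matches_per_page)
    return dict(pages_per_volume), dict(matches_per_page)


toy_spots = [
    ("volA", 1, 80.0), ("volA", 1, 60.0), ("volA", 7, 55.0),
    ("volB", 3, 90.0), ("volB", 4, 20.0),
]
volumes, pages_with_counts = summarize(toy_spots, 50.0)
print(volumes)
print(pages_with_counts)
```

In this toy data, the first volume has three matches but only two matching pages, which is the number the volume list would report.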

The navigation from one page goes either to the previous or the next one in the same volume, or back to a higher level (volume or corpus). There is no possibility to jump directly to the next match.

To go back to a higher level, use the breadcrumb trail.

Number of volumes with at least one match (and average confidence)

List of volumes with at least one match and number of pages with at least one match

List of pages in the volume and number of matches on the page

Matches on the page

Miniatures. At each level of the hierarchy (except at the lowest one, the page level), the system shows the elements of this level by means of miniature images, with some information associated; namely,

an identifier of the element,

a bar representing the confidence of the system that the query appears somewhere in this element

the number of elements in the lower level where the query may appear with a confidence above the threshold specified.

By hovering the mouse over the confidence bar, the actual confidence value is shown.

Pan and Zoom: When a (part of a) page image is shown, it can be explored by moving the mouse while holding its right button. In addition, the mouse wheel can be used to zoom in and out.

If a word appears more than once in a text line, only the instance with greatest confidence is shown.
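This display rule amounts to keeping one spot per word and text line. A minimal sketch, assuming each spot carries a line identifier, a word and a confidence (the tuple layout and the name `best_per_line` are hypothetical):

```python
from typing import Dict, List, Tuple

# A spot is (line identifier, word, confidence in percent); toy layout for illustration.
Spot = Tuple[str, str, float]


def best_per_line(spots: List[Spot]) -> List[Spot]:
    """Keep only the most confident spot for each (line, word) pair."""
    best: Dict[Tuple[str, str], Spot] = {}
    for line_id, word, conf in spots:
        key = (line_id, word)
        # Replace the stored spot only if this one is more confident.
        if key not in best or conf > best[key][2]:
            best[key] = (line_id, word, conf)
    return list(best.values())


toy_spots = [("line1", "paris", 62.0), ("line1", "paris", 88.0), ("line2", "paris", 45.0)]
print(best_per_line(toy_spots))
```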

The approximate locations of spotted queries in each page are marked with rectangles (called “bounding boxes”) surrounding the corresponding words. The color of a rectangle expresses the degree of confidence the system has in the corresponding spot. Exact confidence values can be seen by hovering the mouse over the rectangles.

Your feedback is needed!

YOU CAN GIVE YOUR FEEDBACK

On the spotted words

When you see the results of a query at “word level”, you can click on the word and you are asked whether the word is correctly spotted. Please help us by answering “correct” or “incorrect”.

Giving feedback on spotted words

If you see a missed word

In the example below, the word “pedagiis” has been correctly spotted on the lower right part of the image.

Correct and missed matches on one page

If you double-click on the image, you can add the word to our index!

Inserting the correct transcription for a word

Hints for the Chancery collection

Click on the Chancery miniature with no query to see the 147 Volumes indexed, each showing the number of images it contains (ranging from fewer than 100 to more than 1000).

Queries can be formulated at any level: full Collection (Chancery), specific Chapter (Volume JJ…), or individual Page.

Remember to use the Confidence and/or ‘Max. results’ controls depending on the precision-recall tradeoff you wish to achieve.

The “page” number shown is the number of the image file.

There are 147 volumes; 32 additional volumes are being digitized by the French National Archives.

In the near future…

This is a presentation of the Beta Version. In the near future, you will be able to scroll through the results and provide feedback directly on a hit list of the results.

You will be able to get the correct foliation of each image and retrieve metadata from the inventories and text editions.

IRHT’s interface proposal to access the Chancery corpus: aligning and merging early modern inventories (tables of contents), 20th c. inventories and indexes, partial editions, and the result of automated indexing performed in the HIMANIS project

This is a fantastic tool. I would have liked to know if it is possible to consult whole pages without using the search engine. I tried to browse JJ 171, for instance, but I did not seem to find a way to zoom on the pages.

HIstorical MANuscript Indexing for user-controlled Search


The HIMANIS blog is devoted to image analysis and handwritten text recognition for medievalists. Based on a research program conducted on the registers of the French Royal Chancery in the 14th and 15th centuries, it covers several fields from the Humanities (paleography, diplomatics, history of government) and Computer Science. The research is funded by the European Commission through "Heritage Plus - JPI on Cultural Heritage and global change".