IPM Text Encoding Part 2: Indexes of people

In Part 1, I gave an overview of the process we are using to automate the structural and semantic XML markup of the IPM calendars. In this post, let's have a look at how we are dealing with the metadata about Person entities mentioned in the calendar entries.

Initially, we attempted to identify family relations directly from the narrative of the inquisition, a typical example being

the estate passed to John son of Richard heir of Robert son of Eleanor

which was fairly straightforward to process. However, the general case proved too complex, as in this example:

Richard, late earl of Arundel, father of Richard his father, and his male heirs by Eleanor daughter of Henry of Lancaster, senior, late earl of Lancaster

And this would only have worked on a calendar-by-calendar basis: cross-document coreference of person names is hard, particularly without any training data available!
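To make the simple case concrete, here is a minimal sketch of how a chain such as the first example could be split into relation triples. It is not the code we used: the function name `parse_chain`, the list of relation words and the left-to-right reading of the chain are all assumptions made for illustration, and it makes no attempt to handle the harder Arundel example.

```python
import re

# Assumed relation vocabulary; the real calendars use many more patterns.
RELATION = re.compile(r"\s+(son|daughter|heir|wife) of\s+")

def parse_chain(phrase):
    """Split 'A son of B heir of C' into (subject, relation, object) triples.

    Assumes each relation links the name on its left to the name on its
    right, which only holds for simple chains.
    """
    # re.split keeps the captured relation word between the names:
    # [name0, rel0, name1, rel1, name2, ...]
    parts = RELATION.split(phrase.strip())
    names = parts[0::2]
    relations = parts[1::2]
    return [
        (names[i], relations[i] + "_of", names[i + 1])
        for i in range(len(relations))
    ]

triples = parse_chain("John son of Richard heir of Robert son of Eleanor")
for triple in triples:
    print(triple)
```

Anything with titles, qualifiers or back-references ('his father') breaks the assumption that names and relation words simply alternate, which is where this kind of pattern matching stops being useful.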

Instead, we are making use of the hard work done by the volume indexer in disambiguating, de-duplicating and rationalising (where possible) the various references to people and their descendants in each inquisition. Both the structure of each index entry and the levels of indentation used are employed to identify the nature of the relationship between one index entry and another.

Text in parentheses indicates either an alternative spelling of a surname when it follows a surname, or a person's birth name when it follows a given name.

Indentation indicates that the person shares the same surname as the most recent person entry above the current entry that also has a lower level of indentation than the current person.

Family relations (‘son of’, ‘wife of’ etc.) indicate a relation with the most recent person entry above the current entry that also has a lower level of indentation than the current person, unless the family relation is immediately followed by another person (e.g. ‘Isabel wife of’ vs ‘Isabel wife of Richard’).

Where the relation is inverted (e.g. ‘his wife’), the relationship is to the most recently mentioned person in the same entry.

Clearly, the index is much more complex than this, and we have to keep track of the levels of indentation (as they snake in and out!), but this gives you an idea of how the approach works.
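The indentation and 'son of' rules above can be sketched with a simple stack of open entries. This is an illustration only, not our GATE pipeline: the `Entry` class, the `process_index` function, the `(indent, text)` input format and the sample lines are all hypothetical, and the sketch assumes that a top-level line is just a surname heading.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed pattern: a relation with no person after it, e.g. "Henry son of".
TRAILING_RELATION = re.compile(r"^(?P<name>.+?)\s+(?P<rel>son|daughter|wife) of$")

@dataclass
class Entry:
    ident: int
    indent: int
    name: str
    surname: str
    relation: Optional[Tuple[str, int]] = None  # (relation type, target ident)

def process_index(lines):
    """lines: list of (indent_level, text) pairs in index order."""
    stack = []    # open entries, each less indented than the one after it
    entries = []
    for ident, (indent, text) in enumerate(lines):
        # Drop entries that are not less indented than the current line;
        # whatever remains on top is the governing parent entry.
        while stack and stack[-1].indent >= indent:
            stack.pop()
        parent = stack[-1] if stack else None

        match = TRAILING_RELATION.match(text)
        name = match.group("name") if match else text
        # Assumption: a top-level entry is a surname; others inherit it.
        surname = name if parent is None else parent.surname
        relation = None
        if match and parent is not None:
            relation = (match.group("rel") + "_of", parent.ident)

        entry = Entry(ident, indent, name, surname, relation)
        entries.append(entry)
        stack.append(entry)
    return entries

# Hypothetical index fragment, not copied from a real volume.
sample = [
    (0, "Bourchier"),
    (1, "William, knight"),
    (2, "Eleanor daughter of"),
    (2, "Henry son of"),
    (1, "John"),
]
entries = process_index(sample)
```

Popping the stack whenever the indentation stops increasing is what copes with the levels snaking in and out: 'John' at level 1 discards 'William, knight' and his children and attaches directly to the surname heading.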

Here's what a portion of the processed index looks like, as visualised in a custom index-processing pipeline that we developed in GATE:

Index entry with automated entity and relational data identified

The metadata generated for the second 'William, knight' entry is shown in the pop-up. The entry has inherited the surname and the variant spelling of the surname from the parent 'Lord Bourchier' entry. The 'has_daughter' field contains the internal identifier of the 'Eleanor' entry that follows, and the 'has_son' field contains those of the 'Henry', 'John', 'Thomas' and 'William' entries. Similarly, these entries will have 'daughter_of' and 'son_of' pointers back to the internal identifier for 'William, knight'.
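As a rough sketch of how the paired pointers can be kept consistent, the snippet below derives 'has_son' and 'has_daughter' from 'son_of' and 'daughter_of'. The field names come from the example above; the dictionary layout, the identifiers ('p1', 'p2', ...) and the function `add_inverse_relations` are assumptions made for illustration and do not reflect how GATE stores its annotations.

```python
# Relation fields named in the post, mapped to their inverses.
INVERSE = {"son_of": "has_son", "daughter_of": "has_daughter"}

def add_inverse_relations(records):
    """Add has_* back-pointers for every *_of pointer.

    records maps an identifier to a dict of fields; relation fields
    hold lists of identifiers.
    """
    for ident, fields in records.items():
        for relation, inverse in INVERSE.items():
            for target in fields.get(relation, []):
                back = records[target].setdefault(inverse, [])
                if ident not in back:  # stay idempotent on repeated runs
                    back.append(ident)
    return records

# Hypothetical identifiers standing in for the internal ones.
records = {
    "p1": {"name": "William, knight"},
    "p2": {"name": "Eleanor", "daughter_of": ["p1"]},
    "p3": {"name": "Henry", "son_of": ["p1"]},
    "p4": {"name": "John", "son_of": ["p1"]},
}
add_inverse_relations(records)
print(records["p1"])
```

Deriving one direction from the other, rather than recording both by hand, means the two sets of pointers cannot drift apart.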

From this representation, an intermediate XML file can be exported from which a topic map can be created: