Over the last few weeks we’ve been testing our initial cut at an EAD-to-RDF transform against a range of data that extends beyond EAD documents prepared using the Hub data entry template to documents created using other tools – and varying somewhat in terms of the markup conventions used.

In the course of that, I’ve been pondering some of the choices we made in the model I described here and here, and we decided to make a couple of changes (one very minor and the second still relatively so, I think):

Archival Resource: We’ve changed the name of the class we were calling “Unit of Description” to “Archival Resource”. I think “Unit of Description” was problematic for two reasons. First, it was ambiguous, because it could be interpreted either as the unit (of archival material) being described (which is what was intended) or as a unit/part of the archival description (which is not what was intended). Second, I adopted it from the ISAD(G) standard, where the context is one in which the archival resources are considered to be the primary things being described. I’m less sure the label works in the “linked data” context where we’re providing statements, and sets of statements (descriptions), “about” not just the archival materials, but many other things. In this context, everything that is described (people, concepts, places, etc) might be seen, in some sense, a “unit of description”, and so using that label for one subset of them seems inappropriate. That left us with finding a suitable alternative, a generic term that covers archival material in general, at any level of description (fonds, collection, item etc), and “archival resource” seemed like a reasonable fit.

Origination as Concept: When I first sketched out the model, I raised some questions, including (as “question 3” in that post) whether it was useful/necessary to model the origination of the archival resource as a pair of concept and agent, following the pattern used for the <controlaccess> terms. Having experimented with that approach, we’ve decided it introduces unnecessary complexity and we’ve fallen back on treating <origination> as a simple relation between archival resource and agent. The use of concept and agent is retained for the <controlaccess> case, where names are typically drawn from an “authority file”, as it allows us to maintain the distinction between a conceptualisation of the agent (as reflected by the authority record/entry) and the agent itself (a distinction which is also made in the model underpinning datasets such as VIAF, which we will be making links to).

The revised model is summarised in the following diagram (an amended version of Figure 3 from the earlier post):

Amended data model for EAD

i.e. an Archival Resource and a Biographical History are now related directly to an Agent.

Below is a draft list of human-readable definitions for the classes in the model. Some are simply references to classes provided by existing vocabularies like Dublin Core, FOAF, event vocabularies:

A narrative or chronology that places the archival materials in context by providing information about their creator(s). A finding aid may contain several such narratives or chronologies pertaining to different archival materials and their creators.
Subclass of: bibo:DocumentPart, (bibo:Document), foaf:Document

Recorded information in any form or medium, created or received and maintained, by an organization or person(s) in the transaction of business or the conduct of affairs, and maintained for its long-term research value. An archival resource may be an individual item, such as a letter or photograph, or (more commonly) some aggregation of such items managed and described as a unit.

Level

An indicator of the part of an archival collection constituted by an archival resource, whether it is the whole collection or a sub-section of it.
Subclass of: skos:Concept

Having had a little more time to experiment with the Archives Hub EAD data, and to think about what sort of operations on the RDF data we might wish to perform or enable others to perform, I’ve introduced a few small extensions to the model I described a couple a few weeks ago.

Extents

At our last project meeting, we talked about some of the possibilities for visualisations of the data. One of the ideas (suggested by Jane) is to explore representing relative sizes of collections, perhaps on a map, so that, for example, a researcher could provide a geographic location and a subject area and get a visual representation of the relative sizes of collections within that area.

The EAD XML format provides an element called <extent> for “information about the quantity of the materials being described or an expression of the physical space they occupy”. Although the EAD Tag Library provides guidelines to try to encourage some uniformity of the content, the data in the Hub EAD documents is quite variable. Examples of the content in the samples I’ve looked at include:

In the initial model, this was just treated in RDF as a single triple with subject the URI of the unit of description (an archival collection or some part of it) and this string a literal object. I’m suggesting changing this to treat the “extent” as a resource with its own URI, rather than simply as a literal. Doing that enables us – for at least some of these cases – to make explicit that it is a value measured in some “unit” (linear metres, archival boxes), to “normalise” the way those units are represented (so e.g. “linear metres”, “metres” and “m” can be mapped to a single form in the RDF data), and possibly to make comparisons, albeit approximate ones, between extents measured in different units (for example, “archival boxes” and “linear metres”).

So we end up with patterns in the RDF graph like:

unit:123 dcterms:extent extent:123 .

extent:123 ex:metres “2.04”^^xsd:decimal .

Having said that, I recognise that the nature of the input data is such that such techniques are usefully applicable only to a subset of the data; I’m not sure there’s a great deal we can do with “composite” strings like the last one in the list above, other than present them to a human reader.

Events and Times

One of the other ideas for presenting data we’ve chewed around is that of some sort of “timeline” view. It’s something I’ve been quite keen to explore – though I’m conscious that the much of the most useful information is, in the EAD documents, in the form only of prose in the “biographical/administrative histories” provided for the originators of the archives.

As a first tentative step in this direction, I’ve introduced a notion of “event” into the model, where, in the first instance:

the Creation of a unit of description is modelled as an event taking place during a period of time

(where birth/death dates are provided in the input) the Birth and Death of a person are modelled as events taking place during a period of time

It’s possible to generate this just from simple processing of the input data. It may be possible to go further and generate a richer range of “events” through the use of some flavour of intelligent text analysis/”entity extraction” tools on the biographical/administrative history text, but that’s something for us to consider in the future.

Postcodes

Finally – and as I noted in the previous post this is something which goes beyond the content of the EAD documents themselves – prompted mainly by the recent announcement by John Goodwin that the Ordnance Survey had extended their linked data dataset to include “post code units”, I’ve added in a notion of “Postcode Unit” so that we can make links to resources from that dataset (and also to the UK Postcodes dataset).

So the revised model looks something like the figure below:

Figure 1

So, I’m hoping that – bug fixes aside – I can stop tinkering with this for a while 🙂 and that we can work with this version of the model, and test out what is possible and where any “pain points” are, and then think about where further changes might be useful.

With the Archives Hub data well under way, it was time to start looking at the Copac data. The first decision to be made was which version of Copac data to use – consolidated or unconsolidated. As part of the process of adding records to Copac they are de-duplicated, allowing different institutions’ records for the same item to be presented as one record, instead of several. For more info on Copac de-duplication, see this blog post.

So our first question was: deal with the individual records from each library, or with the consolidated records created for Copac? This made us think about the nature of what we were describing. The unconsolidated records (generally!) relate to the actual, physical ‘thing’ – what in FRBR would be the ‘item’.

The consolidated records are closer to (but by no means a perfect example of) the FRBR manifestation. That is to say, they are describing different physical instances of the same theoretical work; in linked data terms, they are ‘same as’. They aren’t perfect manifestation level records, as there may be other records on Copac for the same manifestation which haven’t been consolidated due to cataloguing differences. At Copac, we err on the side of caution, and would rather have this happen, than have records which aren’t the same consolidated into the same record.

So we could do our mapping and our transformations at unconsolidated level, and then use ‘same as’ to link together the descriptions that would later be consolidated in Copac. But as we’re accepting Copac’s judgement that they are describing the same set of items, why not save ourselves that trouble, and work from the consolidated description? We can then hang the individual bibliographic records off this central unit of description.

This means that all of the information provided by the different libraries is related to the same unit of description. The bibliographic records that go together to make up a consolidated Copac record may not contain all of the same information, but they won’t contain any contradictory information. Thus two records which are the same in all details except date of publication (say 1983 in one, 1984 in the other) will not consolidate, but records which are the same in all details except that one contains a subject where the other does not, will consolidate.

In fact, subjects are one of the things (along with notes) that don’t affect consolidation at all. We will combine all of the subjects that come in individual descriptions, so that a consolidated record might end up with the subjects:

Management

Management — theory

Management (theoretical)

Business & management

We will leave these in the linked data description for the same reason they are left in the Copac description – while such similar terms may seem superfluous, they actually increase discoverability, by providing multiple access points. They will link into the central ‘unit of description’, rather than the individual bibliographic records.

Once we’d decided on this central unit of description (name TBD, but likely to be ‘Copac record’ or something boringly similar), other aspects of the description started to fall into place. Some of these were straightforward – publication date, for instance, is fairly obviously a literal – while others took more thought and discussion.

Among the more complicated issues was that of creator. We are working with MODS data, which has come from MARC data, and MARC allows you to have only one ‘creator’. This creator sits in the 100s as the main access point, and all other contributors (including co-authors!) are relegated to the 700s, where they become what we have decided to call ‘other person associated with this unit of description’. Not very snappy, but hopefully fairly accurate. In theory, the role that this person has in the creation of the item should be reflected in a MARC indicator, but in practise this is not often included in descriptions. Where it is indicated that a person (or a corporate body) is an editor, contributor, translator, illustrator etc, we can build these into the modelling; where not, they will have to be satisfied with the vague title of ‘associated person’.

This will work for most situations, but it does still leave room for error. Where a person is named in the 700s with no indicator of role, it is possible that they are a person who was associated with one particular item, rather than the manifestation – a former owner or bookseller, for example. While we do want to present this information, which works as another access point, and may be of interest to users, we have the problem that this information should really be associated with the item, not our quasi-manifestation. This information only concerns one specific physical item, as described in one of the individual bibliographic records. Should it really have a link to our central unit of description? If not, where do we link it to? Our entries for individual bib records describe only the records themselves, not a physical real-world item. It’s an interesting point, and one we’ll be dicussing more as the project goes on.

We’re continuing to work with Copac data, and will discuss other issues here as they arise.

As mentioned by Jane in a couple of previousposts, she, Bethan and I met up in Manchester in August to share our thoughts about how to model the Archives Hub EAD data in a form that can be represented in RDF.

RDF in a nutshell

For the purposes of this discussion, the main point to bear in mind is that the “grammatical principle” underpinning RDF is one of making simple three-part statements, each of which makes an assertion of a relationship (of some particular type) between two things. So for example, in RDF I can “say” things like:

Document 123

has-title

“Arthur and George”

or

Document 123

is-authored-by

Person P

Person P

has-name

“Julian Barnes”

When considering how to represent EAD data in RDF, then, the first step is to try to take a step back from the “nitty-gritty” of the EAD XML markup, and think about the three part statements we might construct to represent the “information content” of that document. We need to think in terms, not of XML documents and elements and attributes and nesting/containment, but rather of what an EAD document is “saying” about “things in the world” (perhaps more accurately, in the “world” as conceptualised by the creator of the archival finding aid, shaped by archival description practices in general) and what sort of questions we want to answer about those “things”. What are the “things” – and here I use the term in a general sense to include concepts and abstractions as well as material objects – that an EAD document provides information about? What are the relationships between these things? What else does an EAD document say about those things?

Note: The discussion here does not cover the “document”/”description” side of the “Linked Data” picture i.e. for each “thing”, we’ll be providing a “description” of that “thing” in the form of a “document”. Metadata describing that “document” will be important in providing information about provenance and currency, for example, but that is not discussed here.

EAD as used by the Archives Hub

The EAD XML format was designed to cope with the “encoding” of a wide range of archival finding aids, including those constructed according to the (slightly different) cataloguing practices and traditions of different communities.

Further, many features of the EAD format are optional: one can construct a valid EAD document using only a fairly minimal level of markup, or one can use more detailed markup to represent more information.

This flexibility can be something of a “double-edged sword”: on the one hand, it enables data creators to accommodate a wide range of data, and it provides choice in the level of detail of markup (and human resources in creating that markup!) to be applied; on the other hand, it can make working with EAD data quite complex for a consumer, particularly when processing data from a range of sources which perhaps use a range of different conventions and features of the language.

In part to address this sort of issue (as well as to make things simpler for data providers by insulating them from the detail of EAD markup), the Archives Hub provides a forms-based EAD editor, based primarily on the information categories enumerated by the ISAD(G) archival description standard, which generates EAD documents following a consistent set of markup conventions. (I sometimes think of this as a “profile” of EAD, a narrower set of constraints than that imposed by the EAD DTD/schema itself, but I’m not sure that sort of terminology is in widespread use in this context.)

So, we made the “pragmatic” decision to work, in the first instance at least, on the basis of this particular set of EAD markup conventions, rather than trying to address the full EAD format, which means we can limit the number of variants we need to deal with. Having said that, even for the case of data created using the Hub editor, an element of variation is present, because although the data entry form generates a common high-level structure, data creators can apply different markup within those high-level structural components. In this first cut at a model, we have focused on analysing those common structural elements, with the intention of extending and refining our approach at a later stage.

In the course of this (or in thinking about it afterwards) we’ve come up with a few questions, which I’ll try to highlight in the course of the discussion below. Any feed back on these points (or indeed on any other aspect of the post!) would be very welcome.

The “world” as seen by EAD

Jane and I had both done some doodling before our meeting, and we started out by walking through our ideas, highlighting both those aspects which seemed pretty clear and uncontroversial, and aspects where we were uncertain or several alternatives seemed possible (and reasonable). Although we were using slightly different terminology, I think we had come up with quite similar notions, and after a bit of discussion, we arrived at a first cut at a “core” model which I’m representing graphically in Figure 1 below. This isn’t intended as a formal UML or E-R diagram, but each box represents a type of “thing” (a class) and each arrow represents a type of relationship between individual things (“instances” of those classes):

Figure 1

So the “core” types of things identified in this first stage were:

Unit of Description: these are the “units” of archival material, a document or set of documents, the actual stuff held in the repository and described by the finding aid. It’s a “generic” class to reflect the archival description principle of “multi-level description”. An archival finding aid typically has a “hierarchical” structure, in which one “unit of description” is (described as logically forming) “part of” another “unit of description”. A finding aid may provide a only a “collection-level” description of a collection which contains many thousands of individual records, without describing those records individually at all; or it may include descriptions of various component groupings and sub-groupings of records; or it may indeed go as far as describing individual records within such groupings. For each Unit of Description, information relevant to that particular unit is provided. EAD and ISAD(G)) allow for the provision of more or less the same set of information whatever the “level” of unit described, though in practice some elements are more commonly used for “aggregate/group” units.

Archival Finding Aid: these are the documents created by archival cataloguers to describe the archival materials. Often a single finding aid describes (or has as its topic/subject) several units of description, but it may be the case that a finding aid describes only a single unit – where only a description of the collection as a whole is provided.

Repository (Agent): the organisations who curate and provide access to the archival material, and who create and maintain the archival finding aids. (EAD allows for the possibility that two different agencies perform these two roles; the Hub EAD Editor works on the basis that a single agent is responsible for both).

Origination (Agent): the entity (individual, organisation or family) “responsible for the creation, accumulation, or assembly of the described materials before their incorporation into an archival repository” (from the description of the EAD <origination> element). Jane analysed the rather complex nature of the ISAD(G) Creator/EAD origination relationship, which encompases notions of both “item creator” and “collector”, in <a href="http://archiveshub.ac.uk/blog/?p=2401"an earlier post on the Archives Hub blog.

“Things” which are referenced in the form of names used as “access points” or “index terms” using the EAD <controlaccess> element. The Hub EAD Editor supports the provision of the following as <controlaccess> terms, and recommends the use of a number of thesauri or “authority files” from which they should be drawn: Names of “Subjects” (topics); Personal Names; Family Names; Corporate Names; Place Names; Book Titles; Names of Genres or Forms; Names of Functions. So the corresponding “things named” are: Concepts, Persons, Families, Organisations, Places, Books, Genres or Forms, and Functions. As Jane notes in her recent post the relationship between the Unit of Description and the entity named in the <controlaccess> element is not necessarily a relationship of “about”-ness, but a rather less specific one, which for the moment we’ve labelled as simply “associated with” (though a better label might be preferable!).

(I’ve shown the Origination and Repository as distinct classes in the diagram, rather than as a single Agent class, because, as I hope will become clearer below, it ends up that they participate in a slightly different set of relationships).

We went on to extend and refine this core model to accommodate more of the information from the EAD document.

First, we refined the way the “access points” are represented. I’d discussed this aspect of the model with Leigh Dodds of Talis and he suggested that we consider modelling the physical entities here as concepts, in turn related to physical entities, i.e. that we represent the “conceptualisation” of a person, family, organisation or place captured in a thesaurus entry or authority file record, as distinct from the actual physical entity. So, to take an example which I think Bethan used during our conversation, we can distinguish between a conceptualisation of William Blake as a poet and one of William Blake as an artist, each in turn related to William Blake the person.

Although I don’t plan to discuss the specifics of RDF vocabulary in this post, it’s worth noting that the FOAF RDF vocabulary has recently been extended with the addition of a property, foaf:focus, to represent the relationship between the conceptualisation and the thing conceptualised (person, place etc), to support exactly this convention.

For some of the <controlaccess> named entities – like the topics, genres/forms and functions – there is no “other thing conceptualised” and it is sufficient to model them simply as concepts (or as instances of a subclass); and for the book case, we’ll just treat it as a “book” (and for the moment, at least, sidestep any FRBR-ish issues).

In both cases, the notion that the concept is a member of a specific thesarus/authority file can be captured by introducing the notion (from SKOS) of a “Concept Scheme”.

Question 1: One question raised by this approach is whether, for the cases where there is a distinct entity involved, in transforming an EAD document into RDF, we should:

Coin URIs for, and generate “descriptions” of, both the concept and the person/family/organisation/place conceptualised (with a triple with a foaf:focus predicate relating the two? Or:

Coin a URI for, and generate a “description” of only the concept, and leave the relationship with the person/family/organisation/place conceptualised “out of scope” at the transform stage (though that relationship might be obtained at a later stage by linking the concept to external data)?

My inclination is to do the former, on the grounds that this enables us to capture more of the information present in the EAD document i.e. to capture the information that where a <persname> element is used, this is the name of a conceptualisation of a person, where a <corpname> element is used, this is the name of a conceptualisation of an organisation, and so on.

Question 2: Is it necessary/useful to also model the name itself as a distinct resource? I think we can manage without that, but we may revisit that point in the future.

Second, having made this choice for the <controlaccess> entities, we decided to apply it also to the case of the “origination” agent discussed above, with the “origination” relationship becoming one between a Unit of Description and a conceptualisation of an agent, rather than between a Unit of Description and the agent itself. I admit I’m still not completely sure this is necessary/useful/”the right thing to do”. The use of the <origination> element in the Hub EAD profile is described in the guidelines here. It allows for names to be presented in “the commonly used form of name”, rather than the form specified by an authority record (and indeed a survey of the data reveals a good deal of variation), so it’s a bit more difficult to argue that this corresponds directly to the name of an entry (concept) listed in an “authority file”.

Question 3: Is it necessary/useful to introduce a “conceptualisation” of the agent who “originated” the Unit of Description? For now, we’re working on the basis that it is, but we may revisit that choice.

This extended model is represented graphically in Figure 2:

Figure 2

A final stage of refinement gave us a few further extensions.

First the EAD Document is introduced as a particular “encoding of” the Finding Aid.

Second, I’ve suggested that we model the Biographical or Administrative History associated with each Unit of Description as a resource in its own right, distinct from the Finding Aid as a whole. I’m not sure this is strictly necessary, and again it’s something that we may revist in the future. But it enables us to provide information about the Biographical History as a distinct resource. One of the reasons this may be useful is that we’ve discussed (albeit somewhat vaguely at this point!) analysing/mining the text of the Biographical History as a source of further information, and having a URI for the Biographical History enables us to be explicit about the source of that data. We can also make the Biographical History the subject of triples to indicate that it is related not just to the Unit of Description but also to the entity who “originated” that unit (or, given the discussion above, to the conceptualisation of that entity). Also, we could associate it with different literal expressions (e.g. the original EAD fragment as XML Literal, but also an XHTML or plain text derivative). It also, of course, makes the Biographical History into a resource that others can refer to in their own assertions in their own data.</p

Third, we introduced the “level” of the Unit of Description as a distinct resource, a concept. This means that each “level” within the (relatively small) set used within the Hub data can each be assigned a distinct URI, and described in their own right, and – again – referenced by others.

Fourth, similarly, the “language” of the Unit of Description is treated as a distinct resource. (The plan here is that we’ll try to simply reference resources within an existing Linked Data dataset, such as lexvo.orga>.)

Fifth, the EAD <dao> and <daogrp< elements are mapped into a relationship between the Unit of Description and an external digital object (or group of objects). I’ve labelled the relationship here as “is represented by” as that is the description provided by the EAD documentation, but I think Jane and Bethan felt that in practice in the Hub data, the relationship might sometimes be rather less specific than that.

For the moment, the other EAD elements corresponding to ISAD(G) elements (i.e. to textboxes in the Hub data entry form) will be treated as properties with XML Literal values (though we could follow the <bioghist> approach and generate individual URI-identified resources if that proves to be useful).

Sixth – and here we stepped slightly beyond the scope of the EAD document itself (so I’ve greyed it out in the diagram below) – we’ve added a notion of the location of the Repository and a relationship between the Repository-as-Agent and that Place. Although details of repository location aren’t included in the Hub EAD documents, Jane and Bethan said they do have that data available, and it should be fairly easy to integrate it.

So we’ve ended up with the model illustrated in Figure 3.

Figure 3

Question 4: Are we missing any obvious “things” that we need to treat as resources?

Note: In this post, I haven’t gone as far here as to enumerate all the properties that will be used to describe instances of each of those classes, but I’ll provide that in a future post.

Multi-level description, context, “completeness” and “inheritance”

The one remaining question – and perhaps one of the thorniest to address fully – is that arising from one of the fundamental characteristics of the nature of archival description. As noted above, archival description is typically based on a “hierarchical”, “multi-level” approach, in which, within a single finding aid, information is provided about an aggregation of records, and then about component parts of that aggregation, and so on, perhaps down to the level of providing descriptions of individual records, but often stopping short of that.

The ISAD(G) standard presents principles of moving from the general to the specific, and providing information relevant to the particular unit of description (ISAD(G) 2.2):

Provide only such information as is appropriate to the level being described. For example, do not provide detailed file content information if the unit of description is a fonds; do not provide an administrative history for an entire department if the creator of a unit of description is a division or a branch.

And of “non repetition” (ISAD(G) 2.4):

At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.

In some cases, it may indeed be the case that if some descriptive attribute is not explicitly provided for the unit of description, then the information provided for its “parent” unit in the hierarchy is applicable; however, this is often not the case. The elements of the ISAD(G) Identity Statement Area (or the EAD <did> child elements), for example, are specific to the unit of description and do not apply to its “child” units; and for many other descriptive elements, a simple rule of “direct inheritance” may not be appropriate. For the <controlaccess> elements, for example, a “blunt” inference rule that the named entities “associated with” a unit of description are also “associated with” every “child” unit (and so on “down the tree”) may result in associations that are simply not useful to the consumer of the data.

In a post on the Archives Hub blog, Jane emphasised the value of the “Linked Data” approach in making things mentioned in our data into “first-class citizens”. One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context”, and that the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description (perhaps even that they are in some sense “incomplete” without that “contextual” data). In contrast, the “Linked Data” approach typically involves exposing “bounded descriptions” of individual resources. Now, certainly, yes, those “bounded descriptions” include assertions of relationships with other resources (including the sort of part-whole/member-of/component-of relationships present here), and those links can be followed by consumers to obtain further information on the other resources – however, there is no requirement or expectation that consumers will do so. So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”. Rather than seeing that as an insurmountable hurdle, however, I think it provides an area that the project can usefully explore and evaluate.

(If I remember correctly) we made the decision that, for now at least, the only piece of information for which we would implement an explicit “inheritance” from a “higher-level” Unit of Description to a “lower” one (and generate additional RDF triples in the data) would be that of the repository which provides access to the material (i.e. the EAD <repository> element).

Conclusion

As I said above, the model I’ve outlined here is intended as very much a first cut, not the “last word”, and something we’ll most likely revisit and refine further in the future, particularly as we see in practice what it enables us (and others) to do with the data generated, and where we might require some further tweaks to enable us to do more. For now, we feel it provides a basis for our initial work on transforming EAD data into RDF.

The next steps are:

to decide on URI patterns for the URIs we will be generating (i.e. URIs for instances of the classes in the diagram above)

to select terms from existing RDF vocabularies and to define any additional RDF terms required to create “descriptions” of these things based on information from the EAD document

to create a transformation that implements the model (in the first instance, an XSLT transform)

I’ve already done some work on all of these, and I’ll write about them in a separate post here – which hopefully will be rather shorter than this one and will take me rather less time to write!