Abstract:

This paper presents the work carried out within the EU-funded Collaborative EuropeaN Digital Archival Research Infrastructure (CENDARI) project to provide an appropriate customisation of the Encoded Archival Guide (EAG) that would fulfil the expectations of researchers in contemporary and medieval history with regard to describing where they could find collections and documents of specific interest. After describing the general data landscape that we have to deal with within the CENDARI project, we specifically address the data entry and acquisition scenario to identify how it affects the data structures that have to be handled. We then present how we implemented such constraints by means of a full TEI/ODD specification of EAG and point out the main changes we made, which we think could also contribute to the further evolution of EAG at large. We conclude with a wider picture of what we think the future of archival formats (EAG, EAD, EAC-CPF) could be if we want them to be more coherent and more sustainable, in the service of both archives and researchers.

Historians at work

The CENDARI[2] project is a research collaboration aiming at integrating digital archives and resources for research on medieval and modern European history. The project brings information and computer scientists together with leading historians and existing historical research infrastructures (archives, libraries and other digital projects) to improve the conditions for historical scholarship in Europe through active reflection on, and a considered response to, the impact of the digital age on scholarly and archival practice.

In this context, we accompanied a group of historians[3] specialising in the First World War (WWI) in their information-gathering activity, with the twofold objective of identifying the optimal set-up for their data input and specifying the target formats that should be used within the project. The content was planned to cover several levels of description of archival content:

General information about archives that maintain relevant material for WWI research;

Specific descriptions of relevant collections within such archives, without any constraints on coverage (strong sampling);

Description and possibly transcription of specific items within these collections when the content may be of utmost importance for the corresponding research.

The ultimate goal in this primary data gathering activity is the production of archival research guides, which combine the assessment by researchers of various archival contents as to their relevance for a specified research question. Such guides are comprehensive documents that can be made available to the scholarly community at large.

A complex data space

The main challenge in the CENDARI project is that the data space it encompasses is extremely complex. Indeed, we have to face a heterogeneous data world in terms of both input and output.

Data input takes place in multiple forms: some data are entered manually, either directly using an Extensible Markup Language (XML) editor or through editing environments such as ICA-AtoM[4]. Data sources can also be provided by partner archives on the basis of finding aids, which in turn, depending on how far the institution has progressed with digitisation, may be available as native XML or PDF documents, or only in print.

Data output is also made complex because of the multiple use and publication scenarios (for example entries need to be readable both in HTML on the server and in the ICA-AtoM environment). The various disseminators – The European Library (TEL) is part of the CENDARI project – or partners – Archives Portal Europe network of excellence (APEx) – also bring in their own constraints on the possible re-use of the CENDARI data.

When it comes to the formats that CENDARI has to handle (or connect to), the situation is equally complex. The partners of the project range from archives to museums, libraries and other research projects. We are therefore dealing with multiple data formats. The main ones are undoubtedly archival formats (EAG, Encoded Archival Description (EAD), Encoded Archival Context – Corporate Bodies, Persons, and Families (EAC-CPF)), as archives are the main source providers for the historians, but others have been encountered too: formats developed by libraries or Europeana (Machine-Readable Cataloguing (MARC), Europeana Data Model (EDM)) as well as formats developed by research communities (Text Encoding Initiative (TEI), Music Encoding Initiative (MEI)). The medieval manuscripts community represents half of the historians involved in CENDARI and has long been used to the TEI.

The formats mentioned above in turn span a wide standardisation spectrum and are anything but unified, which makes the data space even more complex.

First, the ubiquity of the existing standards has to be underlined. Though some of the standards provide one general version, they all allow for customisation, and the resulting profiles form a large spectrum of flavours, restrictions and extensions. Each project in the field has adapted the standards to fit its needs. Such customisations have so far been handled without a clear overall technical and editorial strategy for their specification.

Another crucial point in this fragmented landscape is the varying level of standardisation and maintenance: it ranges from strong maintenance environments managed by solid consortiums (TEI and EAD are good examples) to a much looser standardisation strategy (as has been the case for EAG).

Finally, the last challenge to tackle has to do with the transversality of some entities throughout the various standards. The standards correspond to different levels of description: the TEI deals with document-level information, EAD deals with collection descriptions, which may include document-level information, and EAG with institution descriptions. The issue is to coherently encode transversal entities such as locations, people or dates, which are to be found at institution, collection and artefact or document levels. Though the information is the same (and should be extracted from one level to be integrated in another), the granularity is very often different; elements related to an address are for example much more detailed in EAG and TEI (by means of <eag:street>, <eag:country> and other <tei:postCode> elements), whereas EAD and MARC have a looser way of encoding it. The difference in structuring the information may be huge, as is the case for bibliographic fields: of the three archival formats (EAG, EAD, EAC-CPF) only EAD, being directed at describing archival material, provides the possibility of structured bibliographic information with the element <bibref> and its very limited set of sub-elements. EAC-CPF and EAG deal with resources in general, without specifying their domain too much, and EAG only provides a <descriptiveNote> element with free text. By comparison, both the TEI and the Metadata Object Description Schema (MODS) offer very structured (and deep) ways of encoding bibliographical information (see below in chapter EAG(Cendari): the customisation architecture).

In the following sections, we will show the strategy we have adopted to take care of this complexity in the context of the specific data input scenario we adopted in the CENDARI project.

Data input in CENDARI

The CENDARI data workflow has been elaborated by taking into account several constraints. There was a strong need to track the sources and the responsibility for the data input activity. In a project like CENDARI, in which dozens of people work on the entries, it is crucial to identify who made what change, when and for what reason. Maintaining versions was also considered important, as was the need for a collaborative working environment. Finally, two ways of editing were adopted: a professional XML editing environment, allowing detailed encoding for historians who felt confident with XML, and a more user-friendly tool, so as not to exclude the less technically experienced partners.

Following the well-known principle that "simple is beautiful", we chose a workflow requiring no heavy development, based on the following three components: oXygen[5]/ICA-AtoM, Subversion[6], and XTF[7].

Image 1: The CENDARI data entry workflow

Once this technical workflow was agreed upon, the data input activities (led by a group of historians) could start. The first milestone of CENDARI was the elaboration of an archival directory gathering information on all institutions that were likely to provide interesting content for CENDARI end-users, historians in medieval and modern history. After a phase during which historians hunted the so-called hidden archives and came back with a rich list of contacts and cooperating institutions, encoding and storing this information was necessary. EAG was an obvious candidate to address this task.

The EAG model - history and scope

EAG was initially an initiative of the Spanish Ministry of Culture in 2002, intended to provide a format for encoding information about institutions that hold fonds. Since then, EAG has been widely applied in the Censo Guía de Archivos de España e Iberoamérica[8], but was never taken up by any wider standardisation committee. The initial proposal was made available in the form of a Document Type Definition (DTD) along with an EAG Tag Library (in Spanish).

In parallel to this initiative, the International Council on Archives (ICA), through its Committee of Best Practices and Standards (ICA/CBPS), released a description standard similar to the General International Standard Archival Description (ISAD(G)) in 2008: the International Standard for Describing Institutions with Archival Holdings (ISDIAH)[9], providing a precise description of all the components needed for describing holders of archives.

Interestingly, these two initiatives, which were carried out without explicit coordination, are quite aligned from a content point of view, but clearly reflect the absence of a global strategy for this archival standard. Besides, the quite outdated technological background of the initial EAG proposal made it clear that we had to go a step further.

Customising EAG

In order to fulfil the requirements of the CENDARI data input workflow, and because of the somewhat immature phase of development of the EAG model, we identified the necessity of designing a customised EAG schema that would be a compromise between three main constraints, as pictured in image 2.

First, we could not diverge too much from the existing standards and in particular had to ensure our compliance with description standards such as ISDIAH. Second, we had to take into account the request of researchers for more expressivity in order to associate reliability information or commentaries with third-party archival information. Finally, from a purely pragmatic point of view, we had to take into account the actual legacy data we had to deal with (digital information coming from any third party outside of CENDARI) as well as the existing practices with regard to EAG, in order to ensure maximum interoperability with other projects or initiatives.

Image 2: Constraints bearing upon the EAG(Cendari) proposal

EAG(Cendari): the customisation architecture

The work on EAG(Cendari) was done on a comprehensive editing platform based on the following components:

A specification of the main EAG components implemented in the TEI ODD language;

The TEI vocabulary properly complementing missing features in EAG;

A feature tracking environment to record, evaluate and validate the various customisation proposals made either by the technical team (ie the authors of this paper) or the users (the historians).

Indeed, the TEI guidelines can be used for two different purposes. First, as the basis of an XML format, they provide the technical constraints to control the validity of TEI documents. Second, they are delivered with an extensive prose description that informs users about the logic of the guidelines as well as the most appropriate way(s) to use them to represent specific textual phenomena[10]. Still, those two aspects are not split into two separate documents but indeed integrated within one single specification, from which both views can be automatically generated. This mechanism, in line with the concept of literate programming[11], is based on the existence of an underlying specification language named ODD (One Document Does it all), which is itself expressed in TEI.

In the TEI infrastructure, each element is thus defined as an ODD specification providing all the necessary information both to control its syntactic behaviour (XML) and to generate the corresponding documentation. Such information includes a gloss, a definition, the technical description of the element's content model, the various attributes it can bear and one or several examples of its usage.
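
As a minimal sketch of what such a specification can look like, the fragment below declares one element in ODD and reads back the parts listed above (gloss, definition, content model, attributes, example). The element chosen, the added source attribute, the urn:example:eag namespace and the summarise helper are assumptions made for illustration, not the actual EAG(Cendari) specification. The ODD fragment is wrapped in a few lines of Python only so that it can be checked for well-formedness.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical ODD declaration of one EAG element (illustration only).
ODD_FRAGMENT = f"""
<elementSpec xmlns="{TEI_NS}" ident="descriptiveNote" ns="urn:example:eag" mode="add">
  <gloss>descriptive note</gloss>
  <desc>contains a free-text description made of one or more paragraphs.</desc>
  <content>
    <elementRef key="p" minOccurs="1" maxOccurs="unbounded"/>
  </content>
  <attList>
    <attDef ident="source" mode="add">
      <desc>points to the source from which the content was taken.</desc>
      <datatype><dataRef key="teidata.pointer"/></datatype>
    </attDef>
  </attList>
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <descriptiveNote source="#src1"><p>Placeholder text.</p></descriptiveNote>
    </egXML>
  </exemplum>
</elementSpec>
"""


def summarise(odd_xml: str) -> dict:
    """Return the documentation and schema parts of one elementSpec."""
    ns = {"tei": TEI_NS}
    spec = ET.fromstring(odd_xml)
    return {
        "ident": spec.get("ident"),
        "gloss": spec.findtext("tei:gloss", namespaces=ns),
        "definition": spec.findtext("tei:desc", namespaces=ns),
        "content": [r.get("key") for r in spec.findall("tei:content/tei:elementRef", ns)],
        "attributes": [a.get("ident") for a in spec.findall("tei:attList/tei:attDef", ns)],
        "examples": len(spec.findall("tei:exemplum", ns)),
    }


summary = summarise(ODD_FRAGMENT)
print(summary["ident"], summary["attributes"], summary["examples"])
```

The point of the sketch is that one declaration carries both the schema-relevant parts (content model, attributes) and the documentation-relevant parts (gloss, definition, example).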

In CENDARI, we used this environment to facilitate the maintenance of the customisation as we made progress on it, with the advantage that we could generate on the fly and for each available version a complete set of schemas (W3C XML schema, RELAX NG or DTD) and documentation (in HTML, PDF or DOC(X) formats).

In addition to this, we used an instance of Jira, a project management tool kindly provided to us by the DARIAH e-Infrastructure, for receiving and discussing feature requests from the historians. After an open discussion with the historians, the technical team assesses whether a new element or attribute should be created, and accepted requests are implemented in a TEI/ODD document.

The context: EAG 2012

In parallel to CENDARI's work, the APEx project realised that, due to the already mentioned independent developments of EAG and ISDIAH around the years 2002 and 2008, EAG 0.2 does not comply in all parts with the recommendations given in ISDIAH. APEx therefore started to revise EAG 0.2, as created by the Spanish Ministry of Culture, to make it more compliant with the ISDIAH recommendations and to bring together archival information from all over Europe. In August 2012 a new EAG, revised by a consortium of 28 project partners, was published by the APEx project as EAG 2012.

Some selected new features of EAG 2012 will be presented below.

First, a <location> element was introduced, wrapping information related to the physical address of an institution, in order to better structure geographical location. It allows different types of addresses or locations to be recorded per institution (eg visitors address vs postal address). It also makes locating an institution easier through the use of geographical coordinates.

Secondly, the introduction of a wrapping <repositories> element makes it possible to encode the information about several repositories (eg for institutions with local branches) within a single EAG document. In this <repositories> element, multiple <repository> sub-elements are allowed. Before this feature was introduced, institutions with headquarters and local branches (like national archives in many countries) had to record information about their branches in several EAG documents.
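
The two features described above can be sketched together as follows. This is a hedged illustration, not a validated EAG 2012 instance: the urn:example:eag namespace is a placeholder, the attribute and child element names inside <location> and <repository> are assumptions modelled on the description above, and addresses_by_repository is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

EAG_NS = "urn:example:eag"  # placeholder, not the official EAG namespace

# Hypothetical record: one institution with a head office and a local branch.
EAG_FRAGMENT = f"""
<repositories xmlns="{EAG_NS}">
  <repository>
    <repositoryName>Example Archive, head office</repositoryName>
    <location localType="visitors address" latitude="52.52" longitude="13.40">
      <country>Germany</country>
      <street>Example Street 1</street>
    </location>
    <location localType="postal address">
      <country>Germany</country>
      <street>P.O. Box 0000</street>
    </location>
  </repository>
  <repository>
    <repositoryName>Example Archive, local branch</repositoryName>
    <location localType="visitors address">
      <country>Germany</country>
      <street>Branch Road 2</street>
    </location>
  </repository>
</repositories>
"""


def addresses_by_repository(eag_xml: str) -> dict:
    """Map each repository name to its {address type: street} entries."""
    ns = {"eag": EAG_NS}
    root = ET.fromstring(eag_xml)
    result = {}
    for repo in root.findall("eag:repository", ns):
        name = repo.findtext("eag:repositoryName", namespaces=ns)
        result[name] = {
            loc.get("localType"): loc.findtext("eag:street", namespaces=ns)
            for loc in repo.findall("eag:location", ns)
        }
    return result


addresses = addresses_by_repository(EAG_FRAGMENT)
for name, by_type in addresses.items():
    print(name, by_type)
```

A single document thus describes both the head office and the branch, each with its own typed addresses.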

The third main feature newly introduced in EAG 2012 was the ability to encode information in several languages by making text elements repeatable. Similarly to what happened in EAD when EAD3 was released, all elements that contain textual information allow the attributes @xml:lang and @xml:script to be used. The languages and scripts used in an element can therefore be encoded. This is for example the case for <descriptiveNote> in several elements such as <repositorhist> or <holdings>. Allowing multilingualism in most of the elements is a great asset for European projects like CENDARI and APEx, as it facilitates the exchange of EAG instances at European and international levels.
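
A minimal sketch of such a multilingual note, assuming that the language attribute sits on repeated <p> elements inside <descriptiveNote>. The placeholder namespace, the two-letter language codes, the sample sentences and the note_in helper are illustrative assumptions, not taken from the EAG 2012 schema.

```python
import xml.etree.ElementTree as ET

EAG_NS = "urn:example:eag"  # placeholder, not the official EAG namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical repository history recorded in two languages.
EAG_FRAGMENT = f"""
<repositorhist xmlns="{EAG_NS}">
  <descriptiveNote>
    <p xml:lang="en">The archive was founded after the war.</p>
    <p xml:lang="fr">Les archives ont été fondées après la guerre.</p>
  </descriptiveNote>
</repositorhist>
"""


def note_in(eag_xml: str, lang: str, fallback: str = "en"):
    """Return the note in the requested language, else in the fallback one."""
    ns = {"eag": EAG_NS}
    root = ET.fromstring(eag_xml)
    by_lang = {
        p.get(XML_LANG): p.text
        for p in root.findall("eag:descriptiveNote/eag:p", ns)
    }
    return by_lang.get(lang, by_lang.get(fallback))


print(note_in(EAG_FRAGMENT, "fr"))
print(note_in(EAG_FRAGMENT, "de"))  # no German text, falls back to English
```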

Finally, general contact information has been replaced by a <contact> element recording contact information for each service of an institution, so that each service can be described and contacted directly and separately. Having entry points in most of the departments of an institution makes it possible to quickly identify the right person.

Changes introduced in EAG(Cendari)

Similarly to APEx, CENDARI realised that EAG 0.2 was missing a series of essential features for a proper description of archival institutions. We thus decided to adopt EAG 2012 when it was released and created a customisation of our own focussing on specific needs (the focus being put on researchers). In this context, depth was favoured over wide coverage. This had two important consequences: a very limited set of elements and the introduction of sourcing and referencing mechanisms.

A reduced set of elements

Compared to a general EAG 2012 document covering mandatory as well as the numerous optional elements, an EAG(Cendari) document provides as much mandatory information in the <control> and <identity> sections, but has a more limited <desc> part.

Fields relating to administrative information have been put aside to focus on fields of interest for the historians: opening hours and accessibility information have been skipped, whereas historical information and details on holdings are strongly recommended (though not mandatory).

Providing source information to EAG description

Initially the usage of EAG was based on the assumption that the archives would generate the information out of their internal system or – manually – by themselves. In any case they were seen as the sole content creator of an EAG instance. In the CENDARI case on the contrary, most information is gathered by researchers from existing (mostly printed) sources. Providing source information is thus essential to trace back the validity of the precise content of an EAG record, but also to identify the origin of such information, when the record is indeed a compound of several sources, as well as the researcher's own assessment of the archive.

As a matter of fact, there is already an existing EAG <source> element to provide such a background, but only for the sake of qualifying the whole record. Instead of inventing a new mechanism for this purpose, we took up the existing @source attribute recently introduced within the TEI guidelines. This attribute points back to a bibliographical reference or includes a pointer to the website from which the information has been taken; in our case it points back to the <source> element[12].

The following TEI snippet illustrates this mechanism:
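
The fragment below is an illustrative sketch rather than the project's actual encoding. It assumes that each <source> element carries an xml:id and that individual paragraphs point to it with a source attribute holding a fragment identifier; the placeholder namespace, the element nesting and the statements_with_sources helper are hypothetical.

```python
import xml.etree.ElementTree as ET

EAG_NS = "urn:example:eag"  # placeholder, not the official EAG namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical record compiled from two sources plus the researcher's own note.
EAG_FRAGMENT = f"""
<eag xmlns="{EAG_NS}">
  <sources>
    <source xml:id="src1">Printed archive guide (placeholder)</source>
    <source xml:id="src2">Institution website (placeholder)</source>
  </sources>
  <holdings>
    <descriptiveNote>
      <p source="#src1">Statement taken from the printed guide.</p>
      <p source="#src2">Statement taken from the website.</p>
      <p>Assessment added by the researcher.</p>
    </descriptiveNote>
  </holdings>
</eag>
"""


def statements_with_sources(eag_xml: str) -> list:
    """Pair every paragraph with the source its @source pointer resolves to."""
    root = ET.fromstring(eag_xml)
    sources = {el.get(XML_ID): el.text for el in root.iter(f"{{{EAG_NS}}}source")}
    pairs = []
    for p in root.iter(f"{{{EAG_NS}}}p"):
        pointer = p.get("source")
        # "#src1" refers to the element whose xml:id is "src1"
        resolved = sources.get(pointer.lstrip("#")) if pointer else None
        pairs.append((p.text, resolved))
    return pairs


pairs = statements_with_sources(EAG_FRAGMENT)
for text, origin in pairs:
    print(text, "<-", origin)
```

Paragraphs without a pointer resolve to no source, which is how the researcher's own assessment can be told apart from third-party information in this sketch.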

Referencing mechanisms

Along the same lines, it was necessary to mark up references to Internet sources mentioned in repository descriptions. There again, the TEI guidelines offer the appropriate element (<ref>), which, by means of its @target attribute, may point to any kind of URL-defined location, as illustrated below[13]:
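
As a hedged sketch of this use of <ref>, the fragment below embeds a TEI <ref> in a free-text note and extracts its target. The placeholder EAG namespace, the decision to keep <ref> in the TEI namespace, the example.org URL and the links helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

EAG_NS = "urn:example:eag"  # placeholder, not the official EAG namespace
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical note mentioning an online resource.
EAG_FRAGMENT = f"""
<descriptiveNote xmlns="{EAG_NS}" xmlns:tei="{TEI_NS}">
  <p>A finding aid is available from the
    <tei:ref target="https://archive.example.org/finding-aids">online catalogue</tei:ref>
    of the institution.</p>
</descriptiveNote>
"""


def links(eag_xml: str) -> list:
    """Return (link text, target URL) for every <ref> in the fragment."""
    root = ET.fromstring(eag_xml)
    return [(ref.text, ref.get("target")) for ref in root.iter(f"{{{TEI_NS}}}ref")]


print(links(EAG_FRAGMENT))
```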

Bibliographical descriptions

The lack of appropriately structured bibliographical components allowing one to provide precise information about sources is an EAG 2012 weakness. We thus complemented the EAG vocabulary for CENDARI with a series of bibliographical elements from the TEI vocabulary, namely <title>, <author>, <date> and <publisher>.

For instance, a simplified reference to a published article would look as follows:
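
The following is a sketch under stated assumptions, not the reference given by the authors: it places the four TEI elements named above inside an EAG <source> element and renders them as a citation string. The wrapper element, the placeholder namespace, the bibliographic values and the format_reference helper are all hypothetical; @level and @when are regular TEI attributes.

```python
import xml.etree.ElementTree as ET

EAG_NS = "urn:example:eag"  # placeholder, not the official EAG namespace
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical, deliberately fictitious reference to a journal article.
EAG_FRAGMENT = f"""
<source xmlns="{EAG_NS}" xmlns:tei="{TEI_NS}" xml:id="src1">
  <tei:author>A. Author</tei:author>
  <tei:title level="a">Placeholder article title</tei:title>
  <tei:title level="j">Placeholder Journal</tei:title>
  <tei:publisher>Placeholder Press</tei:publisher>
  <tei:date when="2000">2000</tei:date>
</source>
"""


def format_reference(eag_xml: str) -> str:
    """Render the structured reference as a one-line citation."""
    ns = {"tei": TEI_NS}
    src = ET.fromstring(eag_xml)
    titles = {t.get("level"): t.text for t in src.findall("tei:title", ns)}
    author = src.findtext("tei:author", namespaces=ns)
    publisher = src.findtext("tei:publisher", namespaces=ns)
    year = src.find("tei:date", ns).get("when")
    return f'{author}, "{titles["a"]}", {titles["j"]}, {publisher}, {year}.'


print(format_reference(EAG_FRAGMENT))
```

Because each component is tagged separately, the same record can be rendered in different citation styles or matched against library catalogues, which free text in <descriptiveNote> does not allow.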

Customisation and standardisation

As shown in this paper, CENDARI uses a customisation of EAG 2012 that differs only slightly from the main schema. Both projects (APEx and CENDARI) aim to establish this new EAG as a standard in the archival domain. Still, it has become clear for both projects that going further in standardisation would only make sense if the coherence between the various archival standards (EAC-CPF, EAG, EAD) could also be improved in general. At the moment we cannot see a clear coordination in this respect, neither on a technical level (maintenance platform) nor on an editorial level (coherence of available features and documentation).

In order to contribute to this important debate, we consider here that although EAG(Cendari) has been developed for a specific project with precise needs, it was based upon a coherent maintenance environment that could be used to develop a future standard and even to provide a comprehensive framework for the whole group of standards. The building blocks of such a framework should indeed be designed in such a way that it:

Prevents incoherence and useless overlapping between the three,

Provides means for the community to report bugs and missing features (as possible in all open standards) and

Provides a strong technical environment both to keep track of the evolution of the specification (versioning) and to manage schemas and documentation adequately.

Even if we have focussed in this paper on the work carried out on customising EAG, the CENDARI project had the opportunity to experiment with a similar approach for EAD in order to provide collection descriptions that would fit the researchers' needs in the project. There again, we could see the advantage of using an ODD customisation to both make it easy to identify the most appropriate subset for the CENDARI project and complement, when necessary, the EAD vocabulary with the efficient constructs available in the TEI framework.

After this experience we developed a global vision for the future organisation of archival standards at large, comprising EAD, EAG, but also the EAC-CPF format. The idea, pictured in image 3, is to have an integrated platform for the specification of all three standards based on a set of coherent editorial and technical principles:

A joint technical committee that shares a global vision for all three standards;

An ODD based specification for all three standards so that any shared component between the three can be maintained in a coherent way and all by-products (schemas, documentation etc) are generated automatically from one core specification;

A maintenance mechanism by which requests for changes in the three standards are systematically documented and discussed and periodic releases are issued;

A general principle (inspired by the TEI guidelines) of customisation, so that projects applying archival standards can better identify which subset (or profile) they are using, and so that possible extensions can be used and identified. With customisations being systematically documented against the reference standards, this should make existing usages easier to compare;

Editorial mechanisms for the management of feature requests, versioning, releases, etc that allow any user to refer precisely to the actual version of the standard they have implemented.

Image 3: Maintenance architecture for archival standards

In the CENDARI project, we have three ODD specifications for EAC-CPF, EAD and EAG at hand today, in which we systematically try to align technical mechanisms and, when appropriate, reuse the available (and well-maintained) TEI components. We invite the community of archivists to consider the suggested workflows favourably, in order to offer better services to both archives and researchers.

In our opinion, data exchange between archives and their users (researchers but also Digital Humanities (DH) projects at large) as well as other institutions (museums and libraries) would be favoured. Interoperability would also benefit from such a scheme, where various needs and diverse data usage are considered. Finally, a more open framework where the three archival standards are centrally and consistently maintained could contribute to the emergence of a stronger DH user community, which in turn would be willing to get more involved in their maintenance and evolution.

2.

CENDARI is a four-year, European Commission-funded project (under the 7th Framework Programme for Research) led by Trinity College Dublin, in partnership with fourteen institutions across eight countries, to facilitate access to archives and resources in Europe for the benefit of researchers everywhere.

3.

Special thanks to Anna Bohn and Aleksandra Pawliczek from the Freie Universität Berlin team in CENDARI for their very useful contributions to this work.

4.

"ICA-AtoM is web-based archival description software that is based on International Council on Archives ('ICA') standards. 'AtoM' is an acronymn for 'Access to Memory'." ICA-AtoM homepage (https://www.ica-atom.org/) (viewed 21 March 2014).

9.

Committee of Best Practices and Standards (CBPS): ICA-ISDIAH, International Standard for Describing Institutions with Archival Holdings, First edition (http://www.wien2004.ica.org/en/node/38884) (viewed 21 March 2014).

10.

The TEI guidelines contain in particular a wealth of examples for each element and the major constructs they allow.

About the Author

Maud Medves is research assistant at Inria, France and associate at Humboldt University in Berlin. Originally trained in political sciences and communication, she has specialised in data modeling and elaboration of metadata profiles as digital curator in several European projects (PEER, CENDARI, EPISCIENCES). Scientific and technical information as well as open access thematics are the core topics of her current research work.

Laurent Romary is senior researcher at Inria, France and guest scientist at Humboldt University in Berlin. He carries out research on the modeling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He also leads or takes part in major standardisation activities at ISO or the Text Encoding Initiative. He is currently one of the directors of the European DARIAH eInfrastructure.

Comments

I very much welcome your attempt to establish a general principle of customisation regarding the use of standard profiles and possible extensions! Another point: what do you think about more depth in the information about the history of the repository? I would suggest inserting a subfield for "mandates" above . I think that one of the most important pieces of information about a repository would be, from the view of EAG as a standard for archival description, information about the repository's mandates for collecting and composing collections and fonds. From this kind of information a researcher may deduce which kind of material and contents he would find in a collection-holding institution. For instance, whoever knows the collecting mandates given to the International Tracing Service's archives can deduce which kinds of provenance he would be able to find there.