
I spent the last couple of days in Manchester at the “end of programme” meeting for the JISCexpo programme under which LOCAH is funded. It was a pretty busy couple of days, with representatives of all the projects talking about their work, their experiences and some of the issues arising.

Yesterday I found myself as “scribe” for a discussion on the “co-referencing” question, i.e. how to deal with the fact that different data providers assign and use different URIs for “the same thing”. And these are my rather hasty notes of that discussion.

the creation/use of co-references is inevitable; people will always end up creating URIs for things for which URIs already exist;

one approach to this problem has been the use of the owl:sameAs property. However, using this property makes a very “strong” assertion of equivalence with consequences in terms of inferencing

the actual use of properties sometimes introduces a dimension of “social/community semantics” that may be at odds with the “semantics” provided by the creator/owner of a term

the notion of “sameness” is often qualified by a degree of confidence, a “similarity score”, rather than being a statement of certainty

the notion of “sameness”/similarity is often context-sensitive: rather than saying “X and Y are names for the same thing in all contexts”, we probably want to say something closer to “for the purposes of this application, or in this context, it’s sufficient to work on the basis that X and Y are names for the same thing”

is there a contrast between approaches based on “top-down” “authority” and those based more on context-dependent “grouping”?

how do we “correct” assertions which turn out to be “wrong”?

we decide whether to make use of such assertions made by other parties, and those decisions are based on an understanding of their source: who made them, on what basis etc.

such assessment may include a consideration of how many sources made/support an assertion

it is easy for assertions of similarity to become “detached” from such information about provenance/attribution (if it is provided at all!)

It seemed clear to us that this is really about focussing on institutional administrative data, as it’s probably harder to sell the idea of providing research data in linked data form to the Pro VC. Linked data probably doesn’t allow you to do things that you couldn’t do by other means, but it is easier than other approaches in the long run, once you’ve got your linked data available. Linked Data can be of value without having to be open:

“Southampton’s data is used internally. You could draw a ring around the data and say ‘that’s closed’, and it would still have the same value.”

== Benefits ==

Quantifying the value of linked data efficiencies can be tricky, but providing open data allows quicker development of tools, as the data the tools hook into already exist and are standardised.

== Strategies ==

Don’t mention the term ‘linked data’ to the Pro VC, or get into discussing the technology. It’s about the outcomes and the solutions, not the technologies. Getting ‘Champions’ who have the ear of the Pro VC will help. Some enticing prototype mash-up demonstrators that help sell the idea are also important. Also, pointing out that other universities are deploying and using linked open data to their advantage may help. Your University will want to be part of the club.

Making it easy for others to supply data that can be utilised as part of linked data efforts is important. This can be via Google spreadsheets or e-mailed spreadsheets, for example. You need to offload the difficult jobs to the people who are motivated and know what they’re doing.

It will also help to sell the idea to other potential consumers, such as the libraries, and other data providers. Possibly sell on the idea of the “increasing prominence of holdings” for libraries. This helps bring attention and re-use.

It’s worth emphasising that linked data simplifies the Freedom of Information (FOI) process. We can say “yes, we’ve already published that FOI data”. You have a responsibility to publish this data if asked via FOI anyway. This is an example of a sheer curation approach.

Linked data may also reduce bureaucracy. There’s no need to ask other parts of the University for their data, wasting their time, if it’s already published centrally. Examples here are estates, HR, library, and student statistics.

The potential for increased business intelligence is a great sell, and Linked Data can provide the means to do this. Again, you need to sell a solution to a problem, not a technology. The University ‘implementation’ managers need to be involved and brought on board, as well as the Pro VC.

It can be a problem that some institutions adopt a ‘best of breed’ policy with technology. Linked data doesn’t fit too well with this. However, it’s worth noting that Linked Data doesn’t need to change the user experience.

A lot of the arguments being made here don’t just apply to linked data. Much is about issues such as opening access to data generally. It was noted that there have been many efforts from JISC to solve the institutional data silo problem.

If we were setting a new University up from scratch, going for Linked Data from the start would be a realistic option, but it’s always hard to change currently embedded practice. Universities having Chief Technology Officers would help here, or perhaps a PVC for Technology?

Please note: Although this is the ‘final’ formal post of the LOCAH JISC project, it will not be the last post. Our project is due to complete at the end of July, and we still have plenty to do, so there’ll be more blog posts to come.

We consider the Archives Hub EAD to RDF XSLT stylesheet to be a key product of the Locah project. The stylesheet both encapsulates the Locah-developed Linked Data model and provides a simple, standards-based means of transforming archival data into Linked Data RDF/XML. It can straightforwardly be re-used and re-purposed by anyone wishing to transform archival data in EAD form into Linked Data-ready RDF/XML.

The stylesheet is the primary source from which we were able to develop data.archiveshub.ac.uk, our main access point to the Archives Hub Linked Data. Data.archiveshub.ac.uk provides access to both human and machine-readable views of our Linked Data, as well as access to our SPARQL endpoint for querying the Hub data and a bulk download of the entire Locah Archives Hub Linked Dataset.

The stylesheet also provided the means necessary to supply data for our first ‘Timemap’ visualisation prototype. This visualisation currently allows researchers to access the Hub data by a small range of pre-selected subjects: travel and exploration, science and politics. Having selected a subject, the researcher can then drag a time slider to view the spread of a range of archive sources through time. If a researcher then selects an archive she/he is interested in on the timeline, a pin appears on the map below showing the location of the archive, and a call-out box appears providing some simple information such as the title, size and dates of the archive. We hope to include data from other Linked Data sources, such as Wikipedia, in these information boxes.

This visualisation of the Archives Hub data and links to other data sets provides an intuitive view to the user that would be very difficult to provide by means other than exploiting the potential of Linked Data.

Please note these visualisations are currently still work in progress:

Short description: A JISC-funded project working to make data from Copac and the Archives Hub available as Linked Data.

Longer description: The Archives Hub and Copac national services provide a wealth of rich interdisciplinary information that we will expose as Linked Data. We will be working with partners who are leaders in their fields: OCLC, Talis and Eduserv. We will be investigating the creation of links between the Hub, Copac and other data sources including DBPedia, data.gov.uk and the BBC, as well as links with OCLC for name authorities and with the Library of Congress for subject headings. This project will put archival and bibliographic data at the heart of the Linked Data Web, making new links between diverse content sources, enabling the free and flexible exploration of data and enabling researchers to make new connections between subjects, people, organisations and places to reveal more about our history and society.

Key deliverables: Output of structured Linked Data for the Archives Hub and Copac services. A prototype visualisation for browsing archives by subject, time and location. Opportunities and barriers reporting via the project blog.

As I’ve noted previously, we initially focused our efforts on processing the set of EAD documents held by the Archives Hub, and on the particular set of markup conventions recommended by the Hub for data contributors – what I sometimes referred to as the Archives Hub EAD “profile” – though in practice, the actual dataset we’ve worked with encompasses a good degree of variation. But it remains the case that the transform is really designed to handle the set of EAD XML documents within that particular dataset rather than EAD in general. (I admit that it also remains somewhat “untidy” – the date handling is particularly messy! And parts of it were developed in a rather ad hoc fashion as I amended things as I encountered new variations in new batches of data. I should try to spend some time cleaning it up before the end of the project.)

Over the last few months, I’ve also been working on another JISC-funded project, SALDA, with Karen Watson and Chris Keene of the University of Sussex Library, focusing on making available their catalogue data for the Mass Observation Archive as Linked Data.

I wrote a post over on the SALDA blog on how I’d gone about applying and adapting the transform we developed in LOCAH for use with the SALDA data. That work has prompted me to think a bit more about the different facets of the data and how they are reflected in aspects of the transform process:

aspects which are generic/common to all EAD documents

aspects which are common to some quite large subset of EAD documents (like the Archives Hub dataset, with its (more or less) common set of conventions)

aspects which are “generic” in some way, but require some sort of “local” parameterisation – here, I’m thinking of the sort of “name/keyword lookup” techniques I describe in the SALDA post: the technique is broadly usable but the “lookup tables” used would vary from one dataset to another

aspects which reflect very specific, “local” characteristics of the data – e.g., some of the SALDA processing is based on testing for text patterns/structures which are very particular to the Mass Observation catalogue data

What I’d like to do (but haven’t done yet) is to reorganise the transform to try to make it a little more “modular” and to separate the “general”/”generic” from the “local”/”specific”, so that it might be easier for other users to “plug in” components more suitable for their own data.

In this post, I’ll say a little bit more about what is involved in the “Expose” operation up in the top right of the diagram.

Cool URIs for the Semantic Web

In an earlier post, I discussed the URI patterns we are using for the URIs of “things” described in our data (archival resources, concepts, people, places, and so on). One of the core requirements for exposing our RDF data as Linked Data is that, given one of these URIs, a user/consumer of that URI can use the HTTP protocol to “look up” that URI and obtain a description of the thing identified by that URI. So as providers of the data, our challenge is to enable our HTTP server to respond to such requests and provide such descriptions.

The W3C Note Cool URIs for the Semantic Web lists a number of possible “recipes” for achieving this while also paying attention to the principle of avoiding URI ambiguity, i.e. of avoiding using a single URI to refer to more than one resource – and in particular to maintaining a distinction between the URI of a “thing” and the URIs of documents describing that thing.

These guidelines refer to the URIs used to identify “things” (somewhat tautologically, it seems to me!) as “Identifier URIs”, where they have the general pattern:

http://{domain}/id/{concept}/{reference}

where:

concept is a name for a resource type, like “person”;

reference is a name for an individual instance of that type or class

(The guidelines also allow for the option of using URIs with fragment identifiers (“Hash URIs”) as “Identifier URIs”.)

The document also recommends patterns for the URIs of the documents which provide information about these “things”, “Document URIs”:

http://{domain}/doc/{concept}/{reference}

These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in different formats, and each of those multiple “more specific” documents in a single concrete format may be available as a separate resource in its own right. So a third set of URIs, “Representation URIs”, name documents in a specific format, using a suggested pattern along the lines of:

http://{domain}/doc/{concept}/{reference}/doc.{extension}

(We’ve deviated slightly from the recommended pattern here in that we just add “.{extension}” to the “reference” string, rather than adding “/doc.{extension}”, but we’ve retained the basic approach of distinguishing generic document and documents in specific formats, which I think is the significant aspect of the recommendations.)

The Talis Platform

It is perhaps worth emphasising here that in the LOCAH case a “description” of any one of the things in our model may contain data which originated in multiple EAD documents e.g. a description of a concept may contain links to multiple archival resources with which it is associated, or a description of a repository may contain links to multiple finding aids they have published, and so on. A description may also contain data which originated from a source other than the EAD documents: for example, we add some postcode data provided by the National Archives, and most of the links to external resources, such as people described by VIAF records, are generated by post-transformation processes.

This aggregated RDF data – the output of the EAD-to-RDF transformation process and this additional data – is stored in an instance of the Talis Platform store. Simplifying things slightly, the Platform store is a “database” specialised for the storage and retrieval of RDF data. It is hosted by Talis, and made available as what in cloud computing terms is referred to as “Software as a Service” (SaaS). (Actually, a Platform store allows the storage of content other than RDF data too – see the discussion of the ContentBox and MetaBox features in the Talis documentation – but we are, currently at least, making use only of the MetaBox facilities).

Access to the store is provided through a Web API. Using the MetaBox API, data can be added/uploaded to the MetaBox using HTTP POST, updates can be applied through what Talis call “Changesets” (essentially “remove that set of triples” and “add this set of triples”) again using HTTP POST, and “bounded descriptions” of individual resources can be retrieved using HTTP GET. There are also “admin” functions like “give me a dump of the contents” and “clear the database”. In addition, the Platform provides a simple full-text search over literals (which returns result sets in RSS), a configurable faceted search, an “augment” function and a SPARQL endpoint.

A number of client software libraries for working with the Platform are available, developed either by Talis staff or by developers who have worked with the Platform.

Delivering Linked Data from the Platform

I’m going to focus here on retrieving data from the MetaBox, and more specifically retrieving the “bounded descriptions” of individual resources which provide the basis for the “Linked Data” documents.

This process involves a small Web application which responds to HTTP GET requests for these URIs:

For an “Identifier URI”, the server responds with a 303 status code and a Location header redirecting the client to the “Document URI”

For a “Document URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code, a document in a format selected according to the preferences specified by the client (i.e. following the principles of HTTP content negotiation), and a Content-Location header providing a “Representation URI” for a document in that format.

For a “Representation URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code and a document in the format associated with that URI.

The first step above is handled using a simple Apache rewrite rule. For the latter two steps, we’ve made use of the Paget PHP library created by Ian Davis of Talis for working with the Platform (Paget itself makes use of another library, Moriarty, also created by Ian). I’m sure there are many other ways of achieving this; I chose Paget in part because my software development abilities are fairly limited, but having had a quick look at the documentation and one of Ian’s blog posts, I felt there was enough there to enable me to take an example and apply my basic and rather rusty PHP skills to tweak it to make it work – at least as a short-term path to getting something functional we could “put out there”, and then polish in the future if necessary.

The main challenge was that the default Paget behaviour seemed to be to use the approach described in section 4.3 of the Cool URIs document, “303 URIs forwarding to Different Documents”, where the server performs content negotiation on the request for the “Identifier URI” and redirects directly to a “Representation URI”, i.e. a GET for an “Identifier URI” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist resulted in redirects to “Representation URIs” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.html or http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.rdf

If possible we wanted to use the alternative “recipe” described in the previous section, and after some tweaking we managed to get something that did the job. We also made some minor changes to provide a small amount of additional “document metadata”, e.g. the publisher of and license for the document. (I do recognise that the presentation of the HTML pages is currently pretty basic, and there is room for improvement!)

Finally, it’s maybe worth noting here that the Platform store itself doesn’t contain any information about the documents i.e. neither the Document URI nor the Representation URIs appear in RDF triples loaded to the store. So, in principle at least, we could add additional formats using additional Representation URIs simply by extending the PHP to handle the URIs and generate documents in those formats, without needing to extend the data in the store.

I’d started to write more here about extending what we’ve done to provide other ways of accessing the data, but having written quite a lot here already, I think that is probably best saved for a future post.

In addition, there are some useful posts around on techniques for “probing” a SPARQL endpoint, i.e. issuing some general queries to get a picture of the nature of the graph(s) in the dataset behind an endpoint. See, for example:

In this post, I’ll focus mainly on responding to the second point, by providing a few sample SPARQL queries. Inevitably, these can only give a flavour of what is possible, but I hope they provide a starting point for people to build on.

This isn’t intended to be a tutorial on SPARQL; there are various such tutorials available, but one I found particularly thorough and helpful is:

The data is hosted in an instance of the Talis Platform, which supports a few useful extensions to the SPARQL standard, some of which are used in the examples below.
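By way of illustration of the “probing” idea mentioned above, a couple of very general queries – which should work against more or less any SPARQL endpoint – simply list the classes and predicates used in the data:

# List the distinct classes used in the dataset
SELECT DISTINCT ?class WHERE { ?s a ?class }

# List the distinct predicates used in the dataset
SELECT DISTINCT ?predicate WHERE { ?s ?predicate ?o }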

Listing “top-level” archival “collections”

Following the principles of “multi-level” description of archives, archivists apply a conceptualisation of archival materials as constituting hierarchically organised “collections”, where one “unit of description” may contain others, which in turn may contain others. It is often the case that an archival finding aid provides descriptions of materials only at the “collection-level”, or perhaps at some “sub-collection” level, without describing items individually at all.
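As a starting point, a simple query listing archival resources and their labels might look something like the following. (The locah:ArchivalResource class name and the locah namespace URI are assumptions made for illustration; the data itself is the authority on the terms actually used.)

PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>   # assumed namespace URI

# List archival resources and their labels
SELECT ?resource ?label
WHERE {
  ?resource a locah:ArchivalResource ;
            rdfs:label ?label .
}
LIMIT 50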

This list includes archival resources at any “level”, from collections down to individual items.

We want to narrow down that selection so that it includes only “top-level” archival resources i.e. archival resources which are not “part of” another archival resource. This can be done by extending our pattern to allow for the optional presence of a triple with predicate dcterms:isPartOf, and filtering to select only those cases where the object in that optional pattern is “not bound” i.e. no such triple is present in the dataset:
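A sketch of such a query, again using the assumed class name and namespace from the previous example, might be:

PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX locah:   <http://data.archiveshub.ac.uk/def/>   # assumed namespace URI

# List only "top-level" archival resources, i.e. those that are not
# dcterms:isPartOf any other archival resource
SELECT ?resource ?label
WHERE {
  ?resource a locah:ArchivalResource ;
            rdfs:label ?label .
  OPTIONAL { ?resource dcterms:isPartOf ?parent }
  FILTER ( !bound(?parent) )
}
LIMIT 50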

Finding the location of the Repository holding an Archival Resource

For each archival resource, access to that resource is provided by a Repository (an agent, an entity with the ability to do things). This relationship is expressed using the property locah:accessProvidedBy. The Repository-as-Agent manages a place where the resource is held, a relationship expressed using the locah:administers property, and that place is associated with a postcode, both as a literal, and (perhaps more usefully) in the form of a link to a “postcode unit” in the dataset provided by the Ordnance Survey; by “following” that link, more information about the location can be obtained (e.g. latitude and longitude, relationships with other places) from the data provided by the OS.
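A hedged sketch of a query following that chain of properties (locah:accessProvidedBy and locah:administers are as described above; the postcode-related property names are invented for the example, and the actual predicates in the data may differ):

PREFIX locah: <http://data.archiveshub.ac.uk/def/>   # assumed namespace URI

# Follow the chain: archival resource -> repository (agent) -> administered place
# -> postcode literal / OS "postcode unit" link
SELECT ?resource ?repository ?postcode ?postcodeUnit
WHERE {
  ?resource locah:accessProvidedBy ?repository .
  ?repository locah:administers ?place .
  ?place locah:postcode ?postcode ;            # hypothetical property for the postcode literal
         locah:postcodeUnit ?postcodeUnit .    # hypothetical property linking to the OS postcode unit
}
LIMIT 20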

Listing the Archival Resources associated with a Person

In the EAD finding aids, the description of an archival resource may provide an association with the name of one or more persons associated with the resource as “index terms”. The person may be the creator of the resource, they may be the topic of it, or there may be some other association which is considered by the archivist to be significant for people searching the catalogue.

The following query provides a list of person names, the “authority file” form of the name, the identifiers of the archival resources with which they are associated, and the URI of a page on the existing Hub Web site describing the resource. I’ve limited it to a particular repository as without that constraint it potentially generates a quite large result set (and it helps me conceal the fact that some of the person name data is still a little bit rough and ready!)
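A rough sketch of the shape such a query might take follows; apart from locah:accessProvidedBy and foaf:name (both mentioned elsewhere in these posts), the property names and the repository URI are assumptions made purely for illustration:

PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>   # assumed namespace URI

# Persons associated with archival resources held by a single repository,
# with the "authority file" form of each name and the Hub web page for the resource.
# locah:associatedWith, locah:authorityName, the use of foaf:page and the
# repository URI are all hypothetical, for the sake of the example.
SELECT ?name ?authorityName ?resource ?hubPage
WHERE {
  ?resource locah:associatedWith ?person ;
            locah:accessProvidedBy <http://data.archiveshub.ac.uk/id/repository/example> ;
            foaf:page ?hubPage .
  ?person foaf:name ?name ;
          locah:authorityName ?authorityName .
}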

Listing Persons associated with Archival Resources, where Persons are born during a specified period

In an earlier post, I described the modelling of the births and deaths of individual persons as “events”.

Based on this approach, birth or death events occurring within a specified period can be selected. So, for example, the following query returns a list of persons born during the 1940s, with the archival resources with which they are associated:
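A sketch of what such a query might look like, assuming – purely for illustration – the Biographical vocabulary’s bio:Birth class and its bio:event and bio:date properties, plus a hypothetical locah:associatedWith property (the modelling in the actual data, with time periods as resources, is a little more complex, as noted above):

PREFIX bio:   <http://purl.org/vocab/bio/0.1/>
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>   # assumed namespace URI

# Persons with a birth event dated in the 1940s, and the archival resources
# associated with them (locah:associatedWith is an invented property name)
SELECT ?name ?birthDate ?resource
WHERE {
  ?person foaf:name ?name ;
          bio:event ?birth .
  ?birth a bio:Birth ;
         bio:date ?birthDate .
  ?resource locah:associatedWith ?person .
  FILTER regex(str(?birthDate), "^194")   # dates beginning 1940-1949
}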

(I use this to illustrate the “event” approach, but in this case, birth and death dates are also provided as literal values of properties associated with the person, so there are other (easier!) ways of getting that information.)

To close, I’ll just emphasise again that these are only a few simple examples, intended to give an idea of the structure/”shape” of the data, and a flavour of what sort of queries are possible. If you come up with any examples of your own you’d like to share, we’d be glad to hear about them in comments below. (Come to think of it, it’s probably not very easy to maintain formatting/whitespace etc in comments, so it might be easier to host any such examples elsewhere and just post links here).

P.S. If there are any “tweaks” that you think we could make that would make things easier for those consuming/querying the data, it would be good to hear about them. I can’t promise we’ll be able to implement them, but we are still at the stage where things can be changed and we do want the data to be as usable and useful as possible.

We’re very pleased to announce the release of http://data.archiveshub.ac.uk, the first Linked Data set produced by the LOCAH project. The team has been working hard since the beginning of the project on modelling the complex archival data and transforming it into RDF Linked Data. This is now available in a variety of forms via the data.archiveshub.ac.uk home page. A number of previous blog posts outline the modelling and transformation process, the RDF terms used in the data, and the challenges and opportunities arising along the way. A forthcoming post will provide some example queries for accessing data from the SPARQL query endpoint. The data and content is licensed under a Creative Commons CC0 1.0 licence.

We’re working on a visualisation prototype that provides an example of how we link the Hub Data with other Linked Data sources on the Web using our enhanced dataset to provide a useful graphical resource for researchers.

One important point to note is that this initial release is a selected subset, representative of the Hub collection descriptions as a proof of concept, and does not contain the full Archives Hub dataset at present, although we are very keen to explore this in the future.

We still have some work to do, this being the initial release of the Hub data. Some revisions for a later release will address a few issues including reconciling our internal person and subject names, and will also contain some further enhancements to the data to include links to Library of Congress subject headings and further links to DBPedia based on subject terms. We also hope to include links for place names using Geonames and Ordnance Survey.

We encourage feedback on the data, the model and any other aspect of data.archiveshub.ac.uk, so please leave comments or contact us directly.

We are also working hard on our other main LOCAH release, the Copac Linked Data. Our first version of the model for this is now finished, and we have the data in our test triple store. We hope to release this in about a month’s time.

I’d personally like to thank the LOCAH team for all their hard work on this exciting and challenging project. I’d also like to thank our technology partner, Talis, for kindly providing our Linked Data store.

In the previous post, I described some of the considerations in choosing RDF vocabularies to use for the LOCAH archival metadata. In the tables below, I’ve tried to summarise the properties used to “describe” an instance of each of the classes in our model: that is, for a particular thing URI in our dataset, one might expect to find triples with that URI as subject and these property URIs as predicates, and when our data is served as linked data and a thing URI is dereferenced, the “bounded description” provided will include those triples (and others) – though some may be optional, and so not necessarily present for all instances (and some may not be present at all until we add some more data…!)

This is really more of a “reference document” than a blog post, but I provide it in part as documentation of the data creation/transformation process, and in part as a guide for potential users of the actual data. Having said that, the data is liable (even likely) to change so consumers should always refer to the actual data for an up-to-date picture of the terms used. I’ve tried to highlight (dark grey background) below terms which I consider to be particularly “at risk” and liable to be removed/replaced, mostly terms from the “locah” vocabulary.

Most of this data is generated from the transformation of the EAD XML documents; a small proportion is added separately. Again, I’ve tried to indicate that in the tables below (light grey background).

For all of the following, the object is simply a copy of the XML element content from the EAD document as an XML Literal. This is a rather “dumb” and probably not terribly useful “translation” from the EAD; in a future iteration of the transform, we hope to extract further useful triples from this part of the EAD data, and we will probably remove some of these triples.

Extent

In the EAD XML doc, extent is expressed simply as a literal. Where possible we’ve tried to parse out a “unit of measurement” and a quantity, reflected in RDF as a triple where the predicate reflects the unit and the object the quantity, as a typed literal, to try to make comparisons easier. I need to catch up with what current “best practice” is for representing quantities/units of measurement so this may well change. Also, currently, “units” include things like “file”, “paper” and “envelope”, which may not be terribly useful.
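As a purely illustrative sketch of why typed literals make such comparisons easier – the locah:extentBox property below is invented for the example – a query can then filter and sort on the quantity numerically:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>   # assumed namespace URI

# Archival resources whose extent, in a hypothetical "box" unit, exceeds 100,
# relying on the quantity being a typed (numeric) literal
SELECT ?resource ?quantity
WHERE {
  ?resource locah:extentBox ?quantity .   # hypothetical unit-specific property
  FILTER ( ?quantity > 100 )
}
ORDER BY DESC(?quantity)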

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the person who is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the family that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the organisation that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the place that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

In this post and the next one, I’ll outline the RDF vocabularies we’re using to describe those “things”. This post covers some of the considerations in choosing the vocabularies and some of the “patterns” we’ve used in deploying them; the next lists the properties and classes you can expect to find in the LOCAH data.

Using existing RDF vocabularies

As far as possible, we’ve tried to make use of existing, deployed RDF vocabularies. These include:

Dublin Core Terms: http://purl.org/dc/terms/
A subset of properties from DC Terms is used, mostly for common properties of the “document-like” or “work-like” resources. (Note: for the Dublin Core properties, I’ve preferred the DC Terms vocabulary to the Dublin Core Metadata Element Set vocabulary, on the grounds that the former provides more precise definitions of its properties.)

Those distinctions between which vocabulary “describes” what are somewhat rough, particularly taking into account that the “directionality” of properties in RDF is somewhat arbitrary: a triple using the dcterms:creator property to link a created work to an agent is as much “about” the agent as it is “about” the thing created.

However, where we’ve seen a need to express a notion that is not well addressed by an existing vocabulary, we have defined the additional classes and properties required and provided URIs for them as a small “local” LOCAH RDF vocabulary. At this point in time, I consider most of these terms something of a “work in progress”, and likely to be revised (or even dropped completely) before the end of the project. But I suspect some will remain – which, given the bounded timescale of the project, leaves questions about the longer term management of such vocabularies.

Discovering Appropriate Vocabularies

Most of my knowledge of existing RDF vocabularies has come from lurking on good old-fashioned mailing lists, particularly the W3C Semantic Web Interest Group list and the Linked Open Data list. I don’t read every posting by any means, and the signal-to-noise ratio can be variable, but for me they remain an excellent source of information with a knowledgeable and active contributing community (and the archives are a great repository.)

In similar territory, Semantic Stackoverflow provides a “question-and-answer”-style service, though it tends to have a fairly technical focus.

Another useful source is to look at actual linked data datasets, particularly those which are in a similar “domain” to the one you’re working in and cover similar resource types, and check out what vocabularies they are using (and how they are using them). In the library/bibliographic domain in particular, there has been a fairly steady stream of linked data datasets appearing over the last couple of years, so there’s quite a bit to go on, rather less so for the archives case. For a few pointers, see e.g. this review post by Ed Summers (itself already nearly a year old).

There are some services which aim to provide disclosure/discovery services based on aggregations of information about vocabularies and their constituent terms, sometimes called “metadata registries” or “metadata schema registries”. I’ve had mixed experiences of using these services: in some cases the content is not current; in others the coverage is intentionally tailored to the requirements of a particular community, so the challenge becomes one of finding a registry whose coverage matches the task at hand. One service (with quite general coverage) which I have occasionally found useful is Schemapedia, a project by Ian Davis of Talis; it provides “vocabulary”-level descriptions, rather than descriptions of individual “terms” but it includes some examples of actual terms: see, e.g. the entry for the Biographical Vocabulary.

There are a number of services which provide search functions across aggregations of data gathered from the linked data Web/Semantic Web. Sindice crawls and aggregates a huge range of RDF data and provides a “Google”-like search across that aggregation. (I’ve also found navigating such an aggregation helpful in thinking about various aspects of linked data: the sig.ma browser highlights the consequences of merging data from multiple sources, and related issues of provenance, attribution and trust, for example).

Finally, at the risk of stating the obvious, plain old Web search engines can still be a useful entry point.

Having said all this, I admit that the discovery of RDF vocabularies is still something of a challenge, and I continue to come across useful things I’d missed. And having found something potentially useful often raises further questions: Is the vocabulary stable or still being developed? Is it described following “modern” good practice for RDF vocabularies? Is it being managed/curated? By an individual/institution/community? Does it have the support of a community of users? Particularly if the intention is for a dataset to have some longevity, these may be significant considerations.

Patterns for using RDF Vocabularies

While discovering RDF vocabularies capable of expressing the information you want to represent is a first step, it often raises issues of exactly how those vocabularies might best be deployed, or of choosing between several possible alternative solutions.

Leigh Dodds and Ian Davis of Talis have authored a booklet Linked Data Patterns which tries to address some of these challenges, by gathering together some common “patterns” of use, based on existing practice by linked data implementers – though perhaps inevitably at this stage, some aspects of that practice are something of a “moving target” as new challenges are identified and practice evolves to address them. (See, for example, a recent debate on the Linked Open Data mailing list covering the question of expectations for what the object of an rdfs:seeAlso triple might/should dereference to.)

I continue to find the reflections of linked data practitioners an excellent source, particularly those working in domains close to those I’m interested in. I regularly find myself referring to the series of posts by Jeni Tennison on creating linked data. In this context, the fifth post on “Finishing Touches” is particularly relevant, and in large part prompts my next couple of points.

Labelling

One of the principles I’ve tried to adhere to, following the guidance by Jeni, is that each resource we expose should have a human-readable label, provided using the rdfs:label property, and as far as possible that label should function as a useful “stand-alone” name for the thing.

In some cases this is a straightforward matter of using some text content node in the EAD XML document as an RDF literal. In other cases, a single element in the EAD document is mapped to a number of distinct resources in our model. In these cases, the transformation process typically prefixes or suffixes the source text to generate labels for the various different things. Perhaps unsurprisingly, this sometimes leads to some slightly “artificial” or “stilted” results, so it’s something we may need to refine.

Also, and perhaps more problematically, as I’ve noted in a previous post, the practice of archival description has traditionally relied heavily on a “multi-level description” approach which results in the presentation of resource descriptions “in the context of” the descriptions of other related resources. So it is common to find individual items within a collection labelled simply as something like “Letter”, on the basis that the reader of the finding aid will glean further information from the fact that the description of the item is presented within a context provided by a list of other “sibling” items, all “children” of a “parent” aggregation of some form. Currently our mapping generates the rdfs:label of an item using only the label (EAD unittitle element) of that item in the EAD document, with the result that we may indeed end up with many individual resources labelled “Letter” (though of course the description will also include other properties derived from other EAD data and links to “parent” resources). An alternative might be to try to generate a label by “qualifying” the item unittitle, say, by prefixing it with the label of a “parent” resource – though I suspect in practice this would generate some somewhat unwieldy results.

Where the source data makes it seem reasonable to express it, I’ve also indicated the use of a “preferred label”, using the skos:prefLabel property. I’m conscious here of the need to be careful: the SKOS specification includes a number of “integrity conditions”, rules which data using the SKOS vocabulary should follow. Amongst them is the requirement that

A resource has no more than one value of skos:prefLabel per language tag.

The important thing to remember is that this is intended to apply in an “open world” context, not simply as a condition scoped to a particular “document”. The EAD to RDF transform process is performed on a document-by-document basis. Within the Hub dataset, it is quite common that, for a single resource, labels for that resource are generated from the content of multiple EAD documents. While in theory naming within the set of EAD documents should be consistent, in practice the use of variants of names is widespread in our data – the names of archival repositories are one example. Generating an skos:prefLabel triple for each variant would result in a conflict with the integrity condition once the data was merged in the triple store.
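One way to check a store for breaches of that condition is a query of roughly the following shape, which looks for resources carrying two distinct preferred labels with the same language tag:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Resources with more than one skos:prefLabel in the same language,
# i.e. violations of the SKOS integrity condition quoted above
SELECT DISTINCT ?resource ?label1 ?label2
WHERE {
  ?resource skos:prefLabel ?label1 ;
            skos:prefLabel ?label2 .
  FILTER ( ?label1 != ?label2 && lang(?label1) = lang(?label2) )
}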

Bearing in mind that the “open world” extends beyond the boundaries of our own dataset, the same considerations apply in the case where we are exposing URIs for resources for which other parties already expose descriptions, including an skos:prefLabel triple, and we can’t guarantee that the names in our data correspond to those provided by that source.

Inferencing

Another issue to consider is that referred to by Leigh and Ian in their “Materialize Inferences” pattern, and by Jeni Tennison in her discussion of “Derivable Data”. One of the strengths of using the RDF model is that it is supported by a formal semantics, a framework for reasoning with data, i.e. given some set of data, it is often possible to apply some formalised set of rules to infer or derive additional triples. However, it should not be assumed that all consumers of the data will have access to the tools which support such reasoning, so it may be more appropriate for a data provider like LOCAH to explicitly include at least some of those “derivable” triples in the data we provide.

For a simple example of what I mean, the Friend of a Friend (FOAF) vocabulary provides a property called foaf:name (“A name for some thing.”). As part of their description of that property, the FOAF vocabulary owners provide the triple:

foaf:name rdfs:subPropertyOf rdfs:label .

The RDFS property rdfs:subPropertyOf is one of those properties which is associated with a set of rules. What those rules say is that, for any two properties linked by an rdfs:subPropertyOf relation, two resources related by the first property are also related by the second. So each time I find a triple using foaf:name as a predicate, I can infer (deduce, derive) a second triple using the rdfs:label predicate, e.g. if I find a triple saying that some resource has the foaf:name “Beverley Skinner”, I can infer a second triple saying that the same resource has the rdfs:label “Beverley Skinner”.

However, to reach that conclusion, my application needs (a) knowledge of the general rdfs:subPropertyOf inference rule, and (b) knowledge that foaf:name is a subproperty of rdfs:label – and (c) the processing capability to apply that rule!

By providing – “materializing” – both those triples in our source data, we relieve the consuming application of that responsibility – though that benefit comes at the cost of increasing the size of the descriptions we provide.

This tactic can be particularly useful, I think, for properties which are subproperties of “generic” vocabularies like the RDF Schema vocabulary or the Dublin Core vocabularies. Sometimes generic linked data tools have some “built-in knowledge” of, and/or specific behaviour associated with, some of these vocabularies (e.g. to obtain literal names/labels/titles for display to human readers). It may be perfectly reasonable to use a triple with some more specialised subproperty in our data to indicate some specific relationship, but where appropriate it is also helpful to “materialize” the triple using the more generic property as well, so that an application looking for RDF Schema or DC properties can easily access that data.
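As a rough sketch of how such triples might be materialized – say, as a post-processing step over the dataset, rather than a description of our actual workflow – a SPARQL CONSTRUCT query along the following lines would generate the rdfs:label triples from existing foaf:name triples:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Materialize the rdfs:label triples that follow from foaf:name
# being a subproperty of rdfs:label
CONSTRUCT {
  ?thing rdfs:label ?name .
}
WHERE {
  ?thing foaf:name ?name .
}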

Extending that slightly, Jeni suggests a “rule of thumb” that “if the result of the reasoning involves a resource from another vocabulary, then we should include it”.

The subproperty case is just one example: the inference of resource type based on rdfs:range and rdfs:domain is another case in point. In the LOCAH data, we’ve tried to provide fairly “generous” type data (e.g. including “super-classes”) where possible – again, on the grounds that such information is a commonly used “hook” in user queries (“Select resources of type T where [some other criteria]”).
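Purely by way of illustration, a consumer who knows only the generic foaf:Agent class could then still retrieve repositories (and other agents) with a query like this, provided the super-class triples have been materialized in the data:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# A query whose only "hook" is the generic foaf:Agent class; it matches
# any agent for which the super-class triple is present in the data
SELECT ?agent ?label
WHERE {
  ?agent a foaf:Agent ;
         rdfs:label ?label .
}
LIMIT 20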

The “cost” of this approach is that the dataset and the individual “bounded descriptions” served are larger – so there is a “trade-off” here which we may want to monitor and reconsider once we see how the data is being used.

Events

As I mentioned earlier, we extended our very initial draft model to include a notion of “event”. Currently, the application of this approach in our data is quite limited: it is applied to the “creation”/”origination” of the archival resources, and to the birth, death and “periods of activity” (floruit) of individuals. What we do is similar to the approach sketched by Ben O’Steen in his processing of the British Library’s British National Bibliography data – though with a little more complexity as we make use of event ontologies which model time periods as resources, rather than as literals.

This is probably best illustrated by means of an example. Given a person with birth date of 1901 and death date of 1985, we generate an RDF graph like the following:

What I haven’t illustrated on that diagram is that I’ve also included some data using the CIDOC CRM ontology – actually using the Erlangen CRM vocabulary. I’m feeling my way a bit with this, so it is somewhat partial/experimental at the moment, but I hope to refine/extend it in the future.

The point I wanted to highlight is that we’ve made use of multiple “overlapping” vocabularies here – again on the grounds that it may be useful to provide that flexibility to consumers of the data querying using a specific vocabulary. As above, this is a “trade-off” which we may want to monitor and reconsider in the future.

Summary

I’ve tried to cover here some of the issues around our choices of RDF vocabularies and how we’ve deployed them. The next post will summarise the actual terms used.