how the thesaurus content is modelled using a "concept-based" approach - what are the "things of interest", their attributes, and the relationships between them;

how those things (concepts, terms/labels) are named/identified using http URIs;

how those things can be "described" using the simple "triple" statement model of RDF, and using the SKOS and SKOS-XL RDF vocabularies;

an example of how an expresssion of the thesaurus using the term-based model is mapped or transformed into an SKOS RDF expression

What I didn't really address in that post is how that resulting RDF data is made available and accessed on the Web - which is more specifically where the "Linked Data" principles articulated by Tim Berners-Lee come into play.

(A good deal of the content of this post is probably familiar stuff for those of you already working with Linked Data, but I thought it was worth documenting it both to fill out the picture of some of the "design choices" to be made in this particular example, and to provide some more background to others less familiar with the approaches.)

Linked Data, URIs, things, documents and HTTP

The use of http URIs as identifiers provides two features:

a global naming system, and a set of processes by which authority for assigning names can be delegated/distributed;

through the HTTP protocol, a well understood and widely deployed mechanism for providing access to information "about", or descriptions of, the things identified by those URIs (in our case, the concepts and terms/labels).

As a user/consumer of an http URI, given only the URI, I can "look up" that URI using the HTTP protocol, i.e. I can provide it to a tool (like a Web browser) and that tool can issue a request to obtain a description of the thing identified by the URI. And conversely as the owner/provider of a URI, I can configure my server to respond to such requests with a document providing a description of the thing.

And the HTTP protocol incorporates features which we can use to "optimise" this process. So, for example, the "content negotiation" feature allows a client to specify a preference for the format in which it wishes to receive data, and allows a server to select - from amongst several it may have available - format which the server determines is most appropriate for the client. In the terminology of the Web Architecture, the description can have multiple "representations", each of which can vary by format (or by other criteria). In the context of Linked Data, this technique is typically used to support the provision of document representations in formats suitable for a human reader (HTML, XHTML) and in one or more RDF syntaxes (usually, at least as RDF/XML). (The emergence of the RDFa syntax, which enables the embedding of RDF data in HTML/XHTML documents, and the growing support for RDFa in RDF tools, offers the possibility, in principle at least, of a single format serving both purposes.)

The widespread use of the HTTP protocol and tools that support it mean that these techniques are widely available (in theory at least; experience suggests that in practice the ability (or authority) to set up things like HTTP redirects (more below) can create something of a barrier). It also means that the "Web of Linked Data" is part of the existing Web of documents that we are accustomed to navigating using a Web browser.

One of the principles underpinning RDF's use of URIs as names is that we should try to avoid ambiguity in our use of those names, i.e. we should use different URIs for different things, and avoid using the same URI as a name for two different things. One of the issues I've slightly glossed over in the last few paragraphs is the distinction between a "thing" and a document describing that thing as two different resources. After all, if I provide a page describing the Mona Lisa, both the page and the Mona Lisa have creators, creation dates, terms of use, but they have different creators, creation dates and terms of use. And if I want to provide such information in RDF, then I need to take care to avoid confusing the two objects - by using two different URIs, one for my document and one for the painting, and citing those URIs in my RDF statements accordingly.

However, as I emphasise above, we also want to be in a position where, given only a "thing URI", I can obtain a document describing that thing: I shouldn't need in advance information about a second URI, the URI of a document about that thing.

the use of URIs containing "fragment identifiers" ("hash URIs") (i.e. URIs of the form http://example.org/doc/123#thing). In this case, the "fragment identifier" part of the URI is always "trimmed" from the URI when the client makes a request to the server, and this allows the use of the URI with fragment identifier as "thing URI", leaving the trimmed URI without fragment id as a document URI.

the use of a convention of HTTP "redirects". In this case, when a server receives a request for a URI which it "knows" is a "thing URI" rather than a document URI, it returns a response which provides a second URI as a source of information about the thing, and the client then sends a second request for that second URI. Formally, the initial response uses the HTTP "303 See Also" status code, which sometimes leads to these being called colloquially "303 URIs", even though there's nothing special about the URIs themselves.

I'm conscious that I'm skipping over some of the details here; for a more detailed description, particularly of the "flow" of the interactions involved, and some consideration of the pros and cons of the two approaches, see Cool URIs for the Semantic Web.

URI Sets for the UK Public Sector

The Cool URIs note focuses mainly on the patterns of "interaction" for handling the two approaches to moving from "thing URI" to document URI. Its examples include example URIs, but the exact form of those URIs is intended to be illustrative rather than prescriptive. I think it's important to note that in the redirect case, it is the server's notification to the client of the second URI that provides the client with that information. There is no technical requirement for a structural similarity in the forms of the "thing URI" and the document URI, and consumers of the URIs should rely on the information provided to them by the server rather than making assumptions about URI structure.

Having said that, the use of a shared, consistent set of URI patterns within a community can provide some useful "cues" to human readers of those URIs. It can also simplify the work of data providers - for example by facilitating the use of similar HTTP server configurations or the reuse of scripts/tools for serving "Linked Data" documents. With this (and other factors such as URI stability) in mind, the UK Cabinet Office has provided a set of guidelines, Designing URI Sets for the UK Public Sector which build on the W3C Cool URIs note, but offer more specific guidance, particularly on the design of URIs.

For the purposes of this discussion, of particular interest is the document's specification (in the "Definitions, frameworks and principles" section) of several distinct "types of URI", or perhaps more accurately, URIs for different categories of resource, and (in the "The path structure for URIs" section) of suggested structural patterns for each:

Identifier URIs (what I have been calling above "thing URIs") name "real-world things" and should use patterns like:

http://{domain}/{concept}/{reference}#id or

http://{domain}/id/{concept}/{reference}

where:

{concept} is "a word or string to capture the essence of the real-world 'Thing' that the set names e.g. school". (I think this is roughly what I think of as the name of a "resource type" - note this is a more generic use of the word "concept" than that of the SKOS concept)

{reference} is "a string that is used by the set publisher to identify an individual instance of concept".

The document allows for the use of a hierarchy of concept-reference pairs in a single URI if appropriate, so for a specific class within a specific school, the path might be /id/school/123/class/5

These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in different formats, and each of those multiple "more specific" concrete formats may be available as a separate resource in its own right (see "Representation URIs" below). If descriptions vary over time, and those variants are to be exposed then a series of "date-stamped" URIs can be used, with the pattern

http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}

Representation URIs name a document in a specific format. The suggested pattern is

The guidelines also distinguish a category of "Ontology URIs" which use the pattern http://{domain}/def/{concept}. I had interpreted "Ontology URIs" as applying to the identification of classes and properties, and I was treating the terms/concepts of a thesaurus as "conceptual things" which would fall under the /id/ case. But I do notice that in an example in which she refers to these guidelines, Jeni Tennison uses the /def/ pattern for an SKOS example. I don't think it's really much of an issue - and pretty much all the other points I discuss apply anyway - but any advice on this point would be appreciated.

So, applying these general rules for the thesaurus case, where, as I discussed in the previous post, the primary types of thing of interest in our SKOS-modelled thesaurus are "concepts" and "terms":

Term URI Pattern: http://example.org/id/term/T{termid}

Concept URI Pattern: http://example.org/id/concept/C{termid}

However, if we bear in mind that within the URI space of the example.org domain, we're likely to want to represent, and coin URIs for the components of, multiple thesauri, and our "termid" references (drawn from the term numbers in the input) are unique only within the scope of a single thesaurus, then we should include some sort of thesaurus-specific component in the path to "qualify" those term numbers. Let's use the token "polthes" for this example:

where each concept and term in the thesaurus is linked to this concept scheme by a triple using the skos:inScheme property. (I omitted this from the example in the previous post so that it was easier to focus on the concept-term and concept-concept relationships, and to try to keep the already rather complex diagrams slightly readable!)

Aside: An alternative for the concept and term URI patterns would be to use the "hierarchical concept-reference" approach and use patterns like:

My only slight misgiving about this approach is that (bearing in mind that strictly speaking the URIs should be treated as opaque and such information obtained from the descriptions provided by the server) in the (non-hierarchical) form I suggested initially, the string indicating the resource type ("concept", "term") is fairly clear to the human reader from its position following the "/id/" component in the URI (e.g. http://example.org/id/concept/polthes/C2). But with the hierarchical form, it perhaps becomes slightly less clear (e.g. http://example.org/id/concept-scheme/polthes/concept/C2). But that is a minor gripe, and really the hierarchical form would serve just as well. For the remainder of this document, in the examples, I'll continue with the initial non-hierarchical pattern I suggested above, but it may be something to revisit if the hierarchical form is more in line with the intent - and current usage - of the guidelines. (So again, comments are welcome on this point.)

For each of these "Identifier URIs", there should be a corresponding "Document URI" naming a document describing the thing, and following the /doc/ pattern:

Description of Term (HTML): http://example.org/doc/term/polthes/T{termid}.html

Description of Term (RDF/XML): http://example.org/doc/term/polthes/T{termid}.html

Descriptions and "boundedness"

The three documents I've mentioned so far (Berners-Lee's Linked Data Design Issues note, the W3C Cool URIs document, or the Cabinet Office URI patterns document) don't have a great deal to say on the topic of the content of the document which is returned as a description of a "thing". This is discussed briefly in the "Linked Data Tutorial" document by Chris Bizer, Richard Cyganiak and Tom Heath, How to Publish Linked Data on the Web.

In principle at least, it is quite possible to provide a single document which describes several resources. This approach has been quite common in association with the use of "hash URIs" in a pattern where a number of "thing URIs" differ only by fragment identifier, and share the same "non-fragment" part (http://example.org/school#1, http://example.org/school#2, ... http://example.org/school#99), and a number of common ontologies make use of this sort of approach. One consequence is that a client interested only in a single resource always retrieves the full set of descriptions. If my thesaurus really did consist only of the half-dozen concepts and terms I described in the example in my previous post, retrieving a document describing them all would probably not be a problem, but for the "real world" case where there are several thousand terms involved, it would represent a significant overhead if every request results in the transfer of several megabytes of data.

Generally, the approach taken is for the data provider to generate some set of "useful information" "about" the requested resource - though saying that rather begs the question of what constitutes "useful" (and whether there is a single answer to that question that is applicable across different datasets dealing with different resource types). Typically the generation of a description is based on some set of rules which, for a specified node in the dataset RDF graph (a specified "thing URI"), selects a "nearby" subgraph of the graph, representing a "bounded description" made up of triples/statements "about" the thing itself and maybe also "about" closely related resources.

Various such algorithms for generating such descriptions are possible and I don't intend to attempt any sort of rigorous analysis or comparison of them here - for further discussion see e.g. Patrick Stickler's CBD - Concise Bounded Description or Bounded Descriptions in RDF from the Talis Platform wiki. But there is one aspect which I think it is worth mentioning in the context of the thesaurus example. One of the key differences between the algorithms used to generate descriptions is how they treat the "directionality" of arcs in the RDF graph, i.e. whether they base the description only on arcs "outbound from" the selected node (RDF triples with that URI as subject), or whether they take into account both arcs "outbound from" and "inbound to" the node (triples with the URI as either subject or object).

That probably sounds like a very abstract point, and the significance is perhaps best illustrated through a concrete example. Let's take the graph for the example from my previous post (tweaked to use the slightly amended URI patterns above - I've continued to leave out the concept scheme links to keep things simple) and suppose this is the dataset to which I'm applying the rules.

(In the figures below, I've tried to represent the idea that a subgraph is being selected by "fading out" the parts which aren't selected, and leaving the selected part fully visible. I hope the images are sufficiently clear for this to be effective!)

Let's first take the approach known as the "concise bounded description (CBD)" - formally defined here, but essentially based on "outbound" links. For the concept C2 (http://example.org/id/concept/polthes/C2), the CBD would consist of the following subgraph (i.e. the document http://example.org/doc/concept/polthes/C2 would contain this data):

Note that for the two terms, the "concise bounded description" is quite minimal (though remember I've simplified it a bit): in particular, it does not include the relationship between the term and the concept. This is because using the SKOS-XL vocabulary that relationship is expressed as a triple in which the concept URI is the subject and the term URI is the object - an "inbound arc" to the term URI in the graph - which the CBD approach does not take into account when constructing the description of the term.

But the fact that the relationship is represented only in this way - a link from concept to term, without an inverse term to concept link - is arguably slightly arbitrary.

An alternative approach, the "symmetric bounded description" seeks to address this sort of issue, by taking into account both "outbound" and "inbound". For the same three cases, it produces the following results:

For the concept case, the difference is relatively minor (for the skos:broader and skos:narrower relationships, the inverse triples are ow also included). But for the term cases, the relationship between concept and term is now included.

So (rather long-windedly, I fear!), I'm trying to illustrate that it's worth thinking a little bit about the content of descriptions and how they work as "stand-alone" documents (albeit linked to others). And for this dataset, I think there's an argument that generating "symmetric" descriptions that include inbound links as well as outbound ones probably results in more "useful information" for the consumer of the data.

(Again, I'm simpifying things slightly here to illustrate the point: I've omitted type information and the links to indicate concept scheme membership. Typically the descriptions might (depending on the algorithm) include labels for related resources mentioned, rather than just the URIs, and would include some metadata about the document - its publisher, last modification date, licence/rights information, a link to the dataset of which it is a member, and so on.)

Summary

What I've tried to do in this post is expand on some of the "linked data"-specific aspects of the project, and to examine some of the design choices to be made in applying some of those general rules to this particular case, shaped both by external factors (like the Cabinet Office guidelines on URIs) and by characteristics of the data itself (like the directionality of links made using SKOS-XL). In the next post, I'll move on, as promised, to the questions of how the data changes over time, and any implications of that.