March 10, 2011

Term-based thesauri and SKOS (Part 3): Change over time (i)

This is the third in a series of posts (previously: part 1, part 2) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In this post, I'll examine some of the ways the thesaurus can change over time, and how such changes are reflected when applying the mapping I described earlier.

A note on "workflow"

In the case I'm working on, the term-based thesaurus is managed in a purpose-built application, from which a snapshot is exported (as an XML document) at regular intervals. These XML documents are the inputs to a transformation process which generates an SKOS/SKOS-XL RDF version, to be exposed as linked data.

Currently at least, each "run" of that transformation operates on a single snapshot of the thesaurus "stand-alone", i.e. the transform process has no "knowledge" of the previous snapshot, and the expectation is that the output generated from processing will replace the output of the previous run (either in full, or through a process of establishing the differences and then removing some triples and adding others). This "stand-alone" approach may be something I have to revisit.
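The "establish the differences" option can be sketched with plain set arithmetic. This is only a minimal illustration, assuming each run's output is held in memory as a set of (subject, predicate, object) tuples; the function name diff_graphs and the sample triples are hypothetical, and language tags and blank nodes are ignored.

```python
# Triples are modelled as plain (subject, predicate, object) string tuples.
def diff_graphs(previous, current):
    """Return (to_remove, to_add): the changes that turn `previous` into `current`."""
    to_remove = previous - current  # triples that only the old run produced
    to_add = current - previous     # triples that only the new run produced
    return to_remove, to_add


# Hypothetical output of two consecutive runs of the transformation.
previous_run = {
    ("con:C2", "skos:prefLabel", "Political violence"),
    ("con:C2", "skos:altLabel", "Civil violence"),
    ("con:C6", "skos:prefLabel", "Collective violence"),
}
current_run = {
    ("con:C2", "skos:prefLabel", "Political violence"),
    ("con:C6", "skos:prefLabel", "Collective violence"),
    ("con:C6", "skos:altLabel", "Civil violence"),
}

to_remove, to_add = diff_graphs(previous_run, current_run)
print("remove:", sorted(to_remove))
print("add:   ", sorted(to_add))
```

Note that the diff step needs the previous run's output, even though the transform itself does not.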

The mapping

To summarise the transformation described in the previous post, a single preferred term and its set of zero or more non-preferred terms are treated as labels for a single concept. For each such set:

a single SKOS concept is created with a URI based on the term number of the preferred term

the concept is related to the literal form of the preferred term by an skos:prefLabel property

an SKOS-XL label is created with a URI based on the term number of the preferred term

the label is related to the literal form of the preferred term by an skosxl:literalForm property

the concept is related to the label by an skosxl:prefLabel property

the concept is related to the literal form of the non-preferred term by an skos:altLabel property

an SKOS-XL label is created with a URI based on the term number of the non-preferred term

the label is related to the literal form of the non-preferred term by an skosxl:literalForm property

the concept is related to the label by an skosxl:altLabel property
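The steps above can be sketched as a small function. This is an illustration under stated assumptions rather than the actual transform: triples are plain tuples of prefixed names, the TermSet class and map_term_set function are hypothetical, and the rdf:type triples and the skosxl:prefLabel link from the concept to its preferred label (the standard SKOS-XL counterpart of skosxl:altLabel) are assumed.

```python
from dataclasses import dataclass, field


@dataclass
class TermSet:
    """A preferred term plus its non-preferred terms, keyed by term number."""
    pref_no: int
    pref_form: str
    non_pref: dict[int, str] = field(default_factory=dict)


def map_term_set(ts: TermSet) -> list[tuple[str, str, str]]:
    """Map one preferred term and its non-preferred terms to SKOS/SKOS-XL triples."""
    concept = f"con:C{ts.pref_no}"      # concept URI from the preferred term number
    pref_label = f"term:T{ts.pref_no}"  # label URI from the same term number
    triples = [
        (concept, "rdf:type", "skos:Concept"),
        (concept, "skos:prefLabel", ts.pref_form),
        (pref_label, "rdf:type", "skosxl:Label"),
        (pref_label, "skosxl:literalForm", ts.pref_form),
        (concept, "skosxl:prefLabel", pref_label),  # assumed link, see note above
    ]
    for term_no, form in sorted(ts.non_pref.items()):
        alt_label = f"term:T{term_no}"  # label URI from the non-preferred term number
        triples += [
            (concept, "skos:altLabel", form),
            (alt_label, "rdf:type", "skosxl:Label"),
            (alt_label, "skosxl:literalForm", form),
            (concept, "skosxl:altLabel", alt_label),
        ]
    return triples


political = TermSet(2, "Political violence", {1: "Civil violence"})
for triple in map_term_set(political):
    print(triple)
```

The point to notice is that the concept URI depends only on the term number of whichever term is currently preferred, while each label URI depends on its own term number.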

In the discussion below, I'll take the following "snapshot" of a notional thesaurus - it's another version of the example used in the previous posts, extended with an additional preferred term - as a starting point:

Once our resource URIs are generated and published, they will be used/cited by other agencies in their data - in other linked data datasets, in other thesauri, or in simple Web documents which reference terms or concepts using those URIs. From the linked data perspective, it is important that once generated and published the resource URIs, which will be http: URIs, remain stable and reliable. I'm using the terms "stable" and "reliable" as they are used by Henry Thompson and Jonathan Rees in their note Guidelines for Web-based naming, which I've found very helpful in breaking down the various aspects of what we tend to call "persistence". And for "stability", I'm thinking particularly of what they call "resource stability". So

once a URI is created, we should continue to use that URI to denote/identify the same resource

it should continue to be possible to obtain some information "about" the identified resource using the HTTP protocol - though that information obtained may change over time

For our particular case, the requirement is only that the "current version" of the thesaurus is available at any point in time, i.e. for each concept and for each term/label, at any point in time, it is necessary to serve only a description of the current state of that resource.

So, in my previous post, I mentioned that the Cabinet Office guidelines Designing URI Sets for the UK Public Sector allow for the case of creating a set of "date-stamped" document URIs, to provide variant descriptions of a resource at different points in time. I don't think that is required for this case, so for each term and concept, we'll have a URI for that "thing", a "Document URI" for a "generic document" (current) description of that thing, and "Representation URIs" for each "specific document" in a particular format.

The formats provided will include a human-readable HTML version, an RDF/XML version and possibly other RDF formats. Over time, additional formats can be added as required through the addition of new "Representation URIs".
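As a rough sketch of how the three kinds of URI relate, the function below derives them from a single local identifier. The /doc/ pattern matches the document URIs used later in this post; the /id/ pattern for the thing itself and the file-extension suffix for Representation URIs are assumptions made for illustration, as is the function name uris_for.

```python
BASE = "http://example.org"  # example host, as in the document URIs in this post


def uris_for(kind, scheme, local_id, formats=("html", "rdf")):
    """Return the thing URI, Document URI and Representation URIs for a resource."""
    thing = f"{BASE}/id/{kind}/{scheme}/{local_id}"      # assumed "/id/" pattern
    document = f"{BASE}/doc/{kind}/{scheme}/{local_id}"  # generic (current) description
    # Assumed pattern: one Representation URI per format, by file extension.
    representations = {fmt: f"{document}.{fmt}" for fmt in formats}
    return {"thing": thing, "document": document, "representations": representations}


uris = uris_for("concept", "polthes", "C4")
print(uris["thing"])
print(uris["document"])
for fmt, uri in uris["representations"].items():
    print(fmt, uri)
```

Adding a format later only adds an entry to the representations; the thing URI and the Document URI stay as they are.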

My primary focus here is the changes to the thesaurus content. Over time, various changes are possible. New terms may be added, and the relationships between terms may change. Terms are not deleted from the thesaurus, however.

The most common type of change is the "promotion" of an existing non-preferred term to the status of a preferred term, but all of the following types of change can occur, even if some are infrequent:

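Addition of new semantic relationships between existing preferred terms

Removal of existing semantic relationships between existing preferred terms

Addition of new preferred terms

Addition of new non-preferred terms for existing preferred terms

An existing non-preferred term becomes a new preferred term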

An existing non-preferred term becomes a non-preferred term for a different existing preferred term

An existing non-preferred term becomes a non-preferred term for a newly-added preferred term

An existing preferred term becomes a non-preferred term for another existing preferred term

An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)

An existing preferred term becomes a non-preferred term for a newly added preferred term

Below, I'll try to walk through an example of each of those changes in turn, starting from the example thesaurus above, showing the results using the mapping suggested above, and examining any issues which arise.

Case 1: Addition of new semantic relationship

The addition of new broader term (BT), narrower term (NT) or related term (RT) relationships is straightforward, as it involves only the creation of additional assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, not the creation of new resources.

So if the example above is extended to add a BT relation between the "Collective violence" (term no 6) and "Violence" (term no 4) terms (and the inverse NT relation):

The addition of the triples means that, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change. They each include one additional triple for the concise bounded description case; two triples for the symmetric bounded description case (see the previous post for the discussion of different forms of bounded description). So the contents of the representations of documents http://example.org/doc/concept/polthes/C4 and http://example.org/doc/concept/polthes/C6 change - but no new resources are generated, and no new URIs required.
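The triple counts can be checked with a simplified model of the two forms of description. In this sketch a concise bounded description is reduced to "all triples with the resource as subject" and a symmetric one to "subject or object"; a full CBD also follows blank nodes, which this data does not contain. The cbd and scbd helper names and the tuple representation are assumptions for illustration.

```python
def cbd(graph, resource):
    """Simplified concise bounded description: triples with `resource` as subject."""
    return {t for t in graph if t[0] == resource}


def scbd(graph, resource):
    """Simplified symmetric bounded description: `resource` as subject or object."""
    return {t for t in graph if t[0] == resource or t[2] == resource}


before = {
    ("con:C4", "skos:prefLabel", "Violence"),
    ("con:C6", "skos:prefLabel", "Collective violence"),
}
# The BT relation and its inverse NT relation.
after = before | {
    ("con:C6", "skos:broader", "con:C4"),
    ("con:C4", "skos:narrower", "con:C6"),
}

for resource in ("con:C4", "con:C6"):
    cbd_gain = len(cbd(after, resource)) - len(cbd(before, resource))
    scbd_gain = len(scbd(after, resource)) - len(scbd(before, resource))
    print(resource, "CBD +%d" % cbd_gain, "SCBD +%d" % scbd_gain)
```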

Case 2: Removal of existing semantic relationship

The removal of existing broader term (BT), narrower term (NT) or related term (RT) relationships is similarly straightforward, as it involves only the deletion of assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, without the removal of existing resources.

I won't bother writing out an example in full for this case, but imagine the case of the previous example reverting to its initial state.

Again, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change, with each containing one triple fewer for the concise bounded description (CBD) case or two triples fewer for the symmetric bounded description (SCBD) case, but we still have the same set of term URIs and concept URIs.

Case 3: Addition of new preferred terms

The addition of a new preferred term is again a matter of extending the graph with new information, though in this case some new URIs are also introduced.

Suppose a new preferred term "Revolution" (term no 7) is added to our initial example:

The RDF representation now includes an additional concept and label, each with a new URI. So now there are two new resources, with new URIs (con:C7 and term:T7), and a corresponding set of new Document URIs and Representation URIs for descriptions of those resources.

It is quite probable that the addition of a new preferred term is accompanied by the assertion of semantic relationships with other existing preferred terms. This is equivalent to performing this step, followed by a second step of the type shown in Case 1.

Case 4: Addition of new non-preferred terms

Suppose a new non-preferred term (term no 8) is added for the existing preferred term "Violence" (term no 4).

So from a linked data perspective, there is a new resource with a new URI (term:T8) (and its own new description with a new Document URI), and the existing URI con:C4 is the subject of two new triples, an skos:altLabel for the literal, and an skosxl:altLabel link to the new label, so the graph served as description of that existing resource changes to include additional triples.

Case 5: Existing non-preferred term becomes new preferred term

Suppose the existing term "Civil violence", initially a non-preferred term for "Political violence", is "promoted" and made a preferred term in its own right:

So from a linked data perspective, there is a new resource with a new URI (con:C1) (and its own new description with a new Document URI), and the graph served as description of the existing resources con:C2 and con:C4 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter includes a new skos:narrower triple. If symmetric bounded descriptions are used, the description of term:T1 changes too.
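The label-related part of this change can be sketched by generating the concept triples before and after the promotion and comparing them. This is a cut-down, hypothetical version of the mapping: the concept_triples helper emits only the triples whose subject is the concept, the skosxl:prefLabel link is assumed, and the new skos:narrower triple on con:C4 is left out.

```python
def concept_triples(pref_no, pref_form, non_pref=()):
    """Concept-subject triples for one preferred term (simplified mapping)."""
    concept = f"con:C{pref_no}"
    triples = {
        (concept, "skos:prefLabel", pref_form),
        (concept, "skosxl:prefLabel", f"term:T{pref_no}"),  # assumed link
    }
    for term_no, form in non_pref:
        triples.add((concept, "skos:altLabel", form))
        triples.add((concept, "skosxl:altLabel", f"term:T{term_no}"))
    return triples


# Before: "Civil violence" (term 1) is non-preferred for "Political violence" (term 2).
before = concept_triples(2, "Political violence", [(1, "Civil violence")])
# After: both terms are preferred terms in their own right.
after = concept_triples(2, "Political violence") | concept_triples(1, "Civil violence")

print("removed:", sorted(before - after))
print("added:  ", sorted(after - before))
```

The label URI term:T1 appears on both sides of the change; only the concept it is attached to differs, and con:C1 is the one new URI.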

Case 6: Existing non-preferred term becomes non-preferred term for a different existing preferred term

Suppose we decide that "Civil violence", initially a non-preferred term for "Political violence", is to become a non-preferred term for "Collective violence".

The graphs served as descriptions of the existing resources con:C2 and con:C6 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter gains skos:altLabel and skosxl:altLabel triples. If symmetric bounded descriptions are used then the description of term:T1 also changes.

Case 7: Existing non-preferred term becomes non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 6 (existing non-preferred term becomes non-preferred term for a different existing preferred term) in sequence. We've seen above that these changes can be made without problems, so the "composite" case should be OK too, and I won't bother working through a full example here.

Case 8: An existing preferred term becomes a non-preferred term for another existing preferred term

Suppose the current preferred term "Political violence" is to be "relegated" to become a non-preferred term for "Collective violence", with the latter becoming the participant in hierarchical relations previously involving the former. (I appreciate that these two terms probably don't constitute a great example, but let’s suppose it works, for the sake of the discussion!)

So the graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one); and that for concept con:C6 changes with the addition of several triples.

So far, so good.

However, the URI con:C2 has now completely disappeared from the graph. If this new graph simply replaces the previous graph, then there will be no description available for resource con:C2.
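A check for this situation is easy to sketch if the output of the previous run is kept for comparison: collect the subject URIs of each graph and report those that no longer appear. The tuple representation, the helper names and the reduced sample data (literal labels only) are assumptions for illustration.

```python
def subjects(graph):
    """All URIs that appear as the subject of at least one triple."""
    return {s for s, _, _ in graph}


def vanished_uris(previous, current):
    """URIs described in the previous output but absent from the current one."""
    return subjects(previous) - subjects(current)


previous = {
    ("con:C2", "skos:prefLabel", "Political violence"),
    ("con:C6", "skos:prefLabel", "Collective violence"),
}
# After the relegation, "Political violence" is only a label of con:C6.
current = {
    ("con:C6", "skos:prefLabel", "Collective violence"),
    ("con:C6", "skos:altLabel", "Political violence"),
}

print(vanished_uris(previous, current))
```

A non-empty result means that simply replacing the old graph with the new one would leave some published URI without a description.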

Case 9: An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)

Suppose that the current non-preferred term "Civil violence" is to become preferred to "Political violence", and the latter is to become a non-preferred term for the former - both "relegation" and "promotion" taking place together, if you like.

The graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one). A new concept con:C1 is created. But again the URI con:C2 has completely disappeared from the graph, with the same consequences that no description will be available.
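Because concept URIs are derived from the term number of the preferred term, the effect of the swap can be predicted from the term records alone. In this sketch a snapshot is reduced to a hypothetical dictionary mapping each term number to the number of its preferred term (None when the term is itself preferred); only the two terms involved are included.

```python
def concept_uris(snapshot):
    """Concept URIs generated for a snapshot: one per preferred term."""
    return {f"con:C{no}" for no, use in snapshot.items() if use is None}


before = {1: 2, 2: None}  # "Civil violence" is non-preferred for "Political violence"
after = {1: None, 2: 1}   # the roles are swapped

created = concept_uris(after) - concept_uris(before)
vanished = concept_uris(before) - concept_uris(after)
print("created: ", created)
print("vanished:", vanished)
```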

Case 10: An existing preferred term becomes a non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 8 (existing preferred term becomes a non-preferred term for another existing preferred term) in sequence.

The same problem will arise with the URI of the existing concept disappearing from the new output graph.

Summary

I've walked through in detail the different types of changes which can occur to the content of the thesaurus. This highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, exemplified by my cases 8, 9 and 10 above, the results of the suggested simple mapping into SKOS had problematic consequences: the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of this post.

In the next post, I'll suggest one way of (I hope!) addressing this problem.