Terms from controlled vocabularies

Codes or values from "controlled vocabularies" frequently appear as the values of GML properties.
In many cases the enumerations are defined by an external authority.
GML CodeType implements ISO 19103 GenericName which (optionally) allows a "codeSpace" to be linked with a value, in order to scope it.
For example

a particular license number issued by the transport department of my government in Australia

a biological species defined by the museum shown.

The key feature of the pattern is that the authority for the term or code is identified alongside the term used.

This is particularly important when taking advantage of gml:name, which is a standard property on GML Objects and Features, and where the cardinality is set to maxOccurs="unbounded".
In many contexts an Object or Feature instance has multiple names or identifiers assigned by independent agencies.
For example, a Borehole or Well may have names assigned by the driller, the regulator, the logging company, the operator, the lease-holder, etc.
The names may be disambiguated through each name carrying a different codeSpace value.
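
As a minimal sketch of how a consumer might use this, the Python below parses an illustrative instance and indexes each gml:name by its codeSpace. The Borehole element, the example.org authority URIs and the name values are all assumptions made up for illustration; only gml:name, the codeSpace attribute and the GML namespace come from GML itself.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# Hypothetical instance: the feature element, authorities and names are illustrative only.
BOREHOLE_XML = """
<Borehole xmlns:gml="http://www.opengis.net/gml" gml:id="bh1">
  <gml:name codeSpace="http://example.org/driller">DDH-27</gml:name>
  <gml:name codeSpace="http://example.org/regulator">WA-1999-0042</gml:name>
  <gml:name codeSpace="http://example.org/operator">Lucky Strike 3</gml:name>
</Borehole>
"""


def names_by_authority(xml_text):
    """Map each codeSpace value to the name that it scopes."""
    root = ET.fromstring(xml_text)
    return {
        name.get("codeSpace"): name.text
        for name in root.findall(f"{{{GML_NS}}}name")
    }


names = names_by_authority(BOREHOLE_XML)
print(names["http://example.org/regulator"])
```

A dictionary keyed on codeSpace assumes each authority assigns at most one name to the feature; if that does not hold, a list of (codeSpace, name) pairs would be the safer structure.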

In some cases it is sufficient to merely identify the authority for the term or code used, with no assumption about the form of the definition or the mode in which it is made available.
However, in general it is necessary for both codes and the collections from which they are drawn to be formally described in a mode which allows them to be network accessible.

Given a term, within the set identified by a codeSpace value, it is necessary to resolve the term to retrieve the value.

In the case of ApplicationProfiles, and notably MetadataProfiles, a profile will typically bind value domains to feature attributes. This binding may be to a subset of the terms defined within a dictionary. In order to achieve this, a resolving mechanism (see TermResolutionMechanisms) for terms must be identified and adopted, and this should be powerful enough to meet the complete set of common use cases.

The rest of this page discusses specific methods available in GML/XMML. These are adequate for limited dictionaries (where it is reasonable to incur the overhead of loading the entire dictionary across the network when encountered) but not for a general solution.

Unique identifiers for terms and concepts

An alternative syntax is for the term itself to be identified through a URI, for example

This fits nicely with the proposed Dictionary model described below, in which a set of definitions are encapsulated as elements in a collection, each one with a separate handle through its gml:id attribute.
However, use of a URL does not restrict the definition to one from a GML Dictionary - any web service might be configured to deliver a definition, and it might arrive in any format (e.g. OWL/RDF).
Since the term is "loosely coupled" into the data instance, typically via an xlink:href, it is the responsibility of the consuming application to make sense of it.

There is also a case for the use of URNs to identify terms, e.g.

urn:x-ogc:def:nil:OGC:inapplicable

urn:x-ogc:def:uom:UCUM:[n_mi]

urn:x-seegrid:def:xmml:Analyte:Al2O3
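
URNs of this shape can be split into their colon-separated fields without any network access. The sketch below is one possible way to do that in Python; the TermUrn type, its field names and the parse_term_urn function are hypothetical, and no particular meaning is assumed for the intermediate fields because the examples above do not all use them in the same way.

```python
from typing import NamedTuple


class TermUrn(NamedTuple):
    namespace: str  # URN namespace identifier, e.g. "x-ogc"
    scope: tuple    # intermediate fields, e.g. ("def", "uom", "UCUM")
    term: str       # final field, e.g. "[n_mi]"


def parse_term_urn(urn):
    """Split a term URN into namespace, scoping fields and the term itself."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0].lower() != "urn":
        raise ValueError(f"not a term URN: {urn!r}")
    return TermUrn(parts[1], tuple(parts[2:-1]), parts[-1])


parsed = parse_term_urn("urn:x-seegrid:def:xmml:Analyte:Al2O3")
print(parsed.namespace, parsed.scope, parsed.term)
```

Splitting on ":" relies on the term itself containing no colon, which holds for the three examples listed here but is an assumption in general.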

The following URN stems have been used in various XMML example instances and indicate the scope of a possible series of vocabularies for use with XMML:

urn:x-seegrid:definition:xmml:Analyte

GML representation of vocabularies

There are two standard methods for recording controlled vocabularies in GML/XMML.
These are exemplified by examples in the CVS as follows:

the instance documents in the "dictionaries" directory - XmmlSVN:swe/trunk/ExampleInstances/dictionaries

GML Dictionaries

The following advice is included in GML 3.1 regarding the use of GML dictionary/definition components:

It will often be convenient to use definitions provided by external authorities.
These may already be packaged for delivery in various ways, both online and offline.
In order that they may be referred to from GML documents it is generally necessary that a URI be available for each definition.
Where this is the case then it is usually preferable to refer to these directly.

Alternatively, it may be convenient or necessary to capture definitions in XML, either embedded within an instance document containing Features or as a separate document.
The definitions may be transcriptions from an external source, or may be new definitions for a local purpose.
In order to support this case, some simple components are provided in GML in the form of

a generic Definition, which may serve as the basis for more specialized definitions

a generic Dictionary, also known as DefinitionCollection, which allows a set of definitions or references to definitions to be collected

These components may be used directly, but also serve as the basis for more specialised definition elements in GML, in particular: coordinate operations (clause 12), coordinate reference systems (clause 12), datums (clause 12), temporal reference systems (clause 14), units of measure (clause 16).

Note that the GML definition and dictionary components implement a simple nested hierarchy of definitions with identifiers.
The latter provide handles which may be used in the description of more complex relationships between terms.
However, the GML dictionary components are not intended to provide direct support for complex taxonomies, ontologies or thesauri.
Specialised XML tools are available to satisfy the more sophisticated requirements.

When using a dictionary a set of definitions is collected in an XML instance document.
Each individual definition carries a gml:id attribute, which acts as the handle for the definition in the context of the dictionary.
The standard fragment-identifier/abbreviated-XPointer method (e.g. http://my.big.org/dictionary#itemA) may be used to turn this into a URL, allowing reference to this definition in its dictionary context.
This addressing method is also broadly scalable to online registries where a URL may be used to identify a definition in the context of a particular register.
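
The lookup described above can be sketched in Python as follows. The dictionary content is invented for illustration, and the documents mapping is a hypothetical stand-in for fetching the dictionary over the network, so that the example runs offline; the resolve function is not part of any GML library.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

GML_NS = "http://www.opengis.net/gml"
GML_ID = f"{{{GML_NS}}}id"

# Illustrative dictionary; the definitions are made up for this sketch.
DICTIONARY_XML = """
<gml:Dictionary xmlns:gml="http://www.opengis.net/gml" gml:id="dict1">
  <gml:name>Example dictionary</gml:name>
  <gml:dictionaryEntry>
    <gml:Definition gml:id="itemA">
      <gml:description>First illustrative definition</gml:description>
      <gml:name>Item A</gml:name>
    </gml:Definition>
  </gml:dictionaryEntry>
  <gml:dictionaryEntry>
    <gml:Definition gml:id="itemB">
      <gml:description>Second illustrative definition</gml:description>
      <gml:name>Item B</gml:name>
    </gml:Definition>
  </gml:dictionaryEntry>
</gml:Dictionary>
"""


def resolve(url, documents):
    """Split the URL at '#', then find the element whose gml:id equals the fragment."""
    doc_url, fragment = urldefrag(url)
    root = ET.fromstring(documents[doc_url])
    for element in root.iter():
        if element.get(GML_ID) == fragment:
            return element
    return None


# Stand-in for an HTTP fetch: maps a document URL to its XML text.
documents = {"http://my.big.org/dictionary": DICTIONARY_XML}
definition = resolve("http://my.big.org/dictionary#itemA", documents)
print(definition.findtext(f"{{{GML_NS}}}name"))
```

Note that the whole dictionary is parsed in order to retrieve one entry, which is the overhead that makes this approach suitable only for limited dictionaries.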

Dictionaries (or registries) are required when

reference to a definition requires that a structured description is available at run time (e.g. CRSs, units of measure)

the list is likely to be changed relatively frequently by its maintenance agency.

Note, however, that values stored in an external "instance document" or resource do not allow schema validation of them in the referring instance document.

Note: GML Dictionaries cover some of the same territory as "Registered Items" for which a model is given in ISO 19135.

Note: A GML Dictionary could be seen as a static representation of the response to a register request - i.e. "back-pocket-register".

Enumerated code-lists

Enumerations may be recorded in XML Schemas.
The LUT examples contain values derived by restriction of gml:CodeType.
Using this method the value of the codeSpace attribute may be "fixed" to the source of the code-list.

(Perhaps explore whether the schema could be autogenerated by harvesting/screen-scraping from definitions provided by the source organisation?)

Schema enumeration allows schema validation, and usually provides greater support for building interfaces.
However, changes to vocabulary involve changes to the schema, so this has implications for versioning of the language itself, rather than just the codelist.
Usually it is undesirable for language versioning to be frequent, so using schema enumerations implies rather stable vocabularies.

Note that for most of the codelists in XMML's LUT* schema documents, limited extensibility has been built in through the availability of the pattern "other:anotherCode".
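
The effect of an enumerated code-list with this escape hatch can be imitated outside the schema. The Python sketch below is an illustration only: the ANALYTES set is a hypothetical code-list, and the exact regular expression used in the LUT* schemas is not reproduced here, so the pattern (the prefix "other:" followed by at least one word character) is an assumption.

```python
import re

# Hypothetical code-list; the real lists live in the XMML LUT* schema documents.
ANALYTES = {"Al2O3", "SiO2", "Fe2O3"}

# Assumed form of the extensibility pattern: "other:" plus one or more word characters.
OTHER_PATTERN = re.compile(r"other:\w+")


def is_valid_code(value, enumeration=ANALYTES):
    """Accept an enumerated code, or any value matching the 'other:' pattern."""
    return value in enumeration or OTHER_PATTERN.fullmatch(value) is not None


for candidate in ("SiO2", "other:TiO2", "TiO2"):
    print(candidate, is_valid_code(candidate))
```

A code outside the list is rejected unless it is explicitly flagged with the prefix, so the extension remains visible to anyone reading the instance document.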

Discussion

Some General Concepts

I have written a paper on the handling of Dictionaries (reference data) which attempts to organize some of the comments above. This is because different business needs call for different solutions.

By way of comment on the above, I find key/keyref to be very useful in XSL. Their use in schema is a way to generalize the ID concept of DTDs in terms of uniqueness and existence within a context. Within XSL, they allow you to use the key( ) function to jump to the referenced element.

Finally, I would note that UN/CEFACT is close to recommending a way that they will use lists. It corresponds roughly to the Section 3 of the [Reference Data] document in that they say to use a value, and code list attribute. For example, AU, where AU comes from the ISO3166 2 character list. Thus, you would not restrict the values in the schema itself.

Relationships between definitions

An example of a hierarchical dictionary is the BGS Rock Classification Scheme. I have attached a small subset encoded how I think it should be as a GML dictionary. The nesting of definitions reflects the hierarchy. This seems a reasonable fit. There are a small handful of entries in the scheme with more than one parent (not in the attached example). This could be handled in a GML dictionary instance by using href attributes to link from more than one parent (or the whole dictionary could be presented as a flat single level list with href attributes used for all child relationships). This scheme really has computer codes and English names for each rock type but I've put the code both as the gml:id attribute and as a gml:name element with the English name as the gml:description element. Not sure if this is the intended way it should be used.

Something like the AMF Thesaurus has a multiple hierarchy with terms having possibly several parents as well as several children. Also in the AMF Thesaurus there are different kinds of relationship ("Broader terms", "Narrower Terms", "Related Terms"). Maybe this could be implemented using simple xlinks with the role attribute but I don't think this is the right way to do it as the xlinks are just meant to be an alternative way of implementing the instance documents?

In general I'm not sure that the semantic meaning of nesting a definition inside another one is entirely obvious, presumably it implies some kind of narrower term but maybe we need to have named relationships for things like the AMF Thesaurus?

I think I would suggest putting the BGS codeSpace on the dictionary, rather than on each entry. The semantics of gml:name within a definition need to be clarified, but I would suggest that one name without an explicit codeSpace should be found, which is the term being defined in this context. Then other names, which should carry a codeSpace, should be explicitly understood to be synonyms. Perhaps the one name should have a different name and use a substitution group somehow. More experiments needed.

I can't put the codeSpace on the dictionary as Dictionary does not have this attribute. I can put it on the name property of the dictionary as below. This confuses the meaning, however, as usually the codeSpace attribute identifies the set of possible values contained in the name element.

Your comment about the semantics of nesting is important. This dictionary syntax was created fresh for GML, and did not follow a thesaurus model. The basic structure is hierarchical. So some other mechanism will be needed to capture any other relationships, if needed.

Some nested hierarchies may involve different attributes at different levels (following a make>model pattern). For example, a rock classified under the hierarchy Intrusive Rock>Ultramafic>Pyroxenite comprises the attributes Mechanism>Relative Content of Mafic Minerals>Mafic Mineral Composition. Does this point to a requirement to enable nesting across different code spaces? To classify all attributes in this example as Rock Descriptor may lose richness or require duplication of the codes and values elsewhere. (Richard Batson)

From the GML 3.1 spec above:
However, the GML dictionary components are not intended to provide direct support for complex taxonomies, ontologies or thesauri. Specialised XML tools are available to satisfy the more sophisticated requirements.

I have found the following listing of thesauri formats: http://www.w3c.rl.ac.uk/SWAD/thes_links.htm. Has anyone any experience of utilising any of these (or can suggest others) for their thesauri (or dictionaries with complex hierarchies)?

Time varying definitions

Time-variation of code-lists might be managed by time-stamping the value of the "codeSpace" attribute.

Gazetteers

These are a special form of code-list which associates a name with a geospatial area. The most common use-case requires an "authoritative" definition, which resolves a name to a unique spatial extent. There are OGC projects in this area (RobAtkinson).

There are other use-cases, such as the one assumed by the ADL gazetteer project, which allows a single "conceptual" place to have many names and many extents. This is more about knowledge-capture, and would often occur in the definition of geological entities, provinces, etc. This is well managed by the RDF model.