Note on Minting, Defining, and Using URIs (Sputnik draft)

Abstract

[Abstract, to be written based on what ends up going into this document.]

How to comment on this draft

There is a shorter version: [[/../ShorterSputnikDraft]] - you might want to look at that first.

For now, please put your comments on the /DraftTalk page. I am
editing this file off line, and keeping comments on a separate page
helps ensure that your comments are heard and tracked. I will attempt
to address all concerns and record dissenting views fairly.

There is obviously still a lot of work to be done. Right now
I'm most interested in hearing about problems with the organization of
the document, sections that should be cut, sections that should be
expanded (except for the obvious ones), and, especially, claims I make
that you disagree with.

[Brackets] usually indicate work to be done, information to be
integrated, and questions to be answered. Please answer questions if
you have answers. I will process or, if necessary, remove all
remaining bracketed sections before final publication.

Status of this document

This is an editor's draft with no official standing.

I am hoping to ask HCLS at its October 25 teleconference for approval to publish this
note (in whatever state it's in at the time) on the W3C web site. The
note will still be just an editor's draft at that time.

It is my intent to publish preprints of this note on my employer's web
site under a Creative Commons Attribution 3.0 license in advance of
publication on W3C's web site. W3C's version will be published under
the more restrictive W3C license. The non-W3C version will be clearly
marked as being a preprint. I promise to do my utmost not to
misrepresent anyone or anything.

Introduction

This note is about problems surrounding the choice and use of
URIs, including the choice of new URIs ("minting") and the publication of
definitions - part of the larger problem of how to use RDF well. URI
choice is a problem for a variety of reasons, including definition
quality, stability of meaning, and accessibility of defining
documents.

The problems around URIs have been so vexing, and the argument so
heated, that it is worthwhile to step back and consider what we're
trying to accomplish. This is what the next section attempts to do.
With this background, it becomes possible to formulate strategies
rationally, by reference to the goal.

The first section needn't be read if you just want to know what we
recommend.

What we're trying to do

Technical disciplines such as life sciences research and health care
are awash in information, most of which is locked up in a combination
of written reports and formal objects such as tables, databases, and
structured files written in a variety of notations. The transition
from natural language and diverse formal notations to a uniform
declarative language, rendering large amounts of this information in a
common formal notation, will enable these multiple sources to be combined
and processed together, enabling a broad range of computational uses,
including

summarization and display - reviewing large amounts of selected information

query - obtaining precise answers to questions

validation - determining whether a body of information makes sense or is consistent

discovery via statistical methods - locating information that is unusual by some measure

To this end, RDF [cite] has been proposed as a common
representation language.
RDF is a simple formal declarative language intended to
augment and substitute, in certain applications,
for other information-carrying languages, including natural
language and other formal notations. Its key features are its
blandness, which helps to ensure its generality and neutrality, and
its ability to gracefully combine information coming
from multiple sources.

Technical disciplines place particular demands on the use of RDF and
related [cite] technologies. The activities of
practitioners range from the highly exploratory, which place a premium
on spontaneity, flexibility, and inclusiveness, to the highly
rigorous, with premiums on chains of inference, repeatability, and
durable documentation. Inaccuracies in transcribing information
into RDF and mismatches in combined sources may have serious
consequences when information is being used -- for example,
consider the cases where RDF-encoded information is used as part of a
grant application or in deciding on the correct treatment for a
medical condition.

RDF-encoded information is organized into graphs. Each graph is a
set of statements employing a vocabulary of terms relevant to the
graph's subject matter. [Footnote explaining why I say "term" and not
"URI": "term" is evocative of requirements; much of what needs to be said sounds ridiculous when you say "URI"; URIs do not "identify" according to Pat Hayes; "resource" is undefined and its ordinary meaning is too restrictive.] Statements are supposed to declare something about the world
-- that is, they are supposed to have the capacity to be true or
false. [Footnote: definitions of terms are true "by definition" - as in mathematics, a definition can be meaningless or inconsistent, but it cannot be false.] The meaning of a graph -- what it says and what it
implies, both logically and socially -- ultimately depends, in large
part, on how an agent will understand the terms occurring in the
graph's statements. It is therefore important to have, at the very
least, an understanding of how terms are coined and used in RDF used
in our domain.

We have no direct control over how an unknown agent will interpret a
graph.
W3C recommendations and other guidelines will be used to determine the interpretation of a graph, but other social and political processes will also influence interpretation, just as for natural language.
The best a document
such as the current one can do is to inspire practice that will
increase efficiencies and reduce confusion among those who take its
advice.

The most essential characteristic of scientific endeavor is
skepticism. This does not mean that we restrict ourselves to uttering
only well-established truths. Instead, skepticism is reflected in
careful attention to the logical and bibliographic support for
assertions. The chain of support helps an agent processing some
information to form its own judgments of the validity and usefulness
of the information for the application at hand. Inference and
citation are therefore at the heart of any use of natural or formal
language in science.

Scope of a graph

One axis along which to classify communications is according to scope
- basically, the standards and expectations surrounding a message's
applicability, and in particular the expected separation in time
between the writer and the reader. In RDF this question has bearing
on choice of terms and the manner in which definitions and other
graphs are published.

For example, the following situations imply different
scope:

Message passing - assertions valid only in the context of a conversation - writing and reading separated by minutes

Time-sensitive communication - assertions that might not be true tomorrow - separation by hours to months

Knowledge curation - true to the best of one's knowledge, today - separation by months to years

Archiving - presented with the hope that it still will make sense a long time from now - separation by years to decades

Scope determines, in part, how much effort one needs to invest in
careful preparation and use of RDF graphs.
A context requiring only short-lived RDF graphs may be more forgiving
than an archival context.
For example, a short-lived graph may refer unambiguously to states of affairs
that change infrequently ("the president"), while a long-lived RDF
graph needs to avoid such context-sensitive reference.

Threats to the successful use of RDF

[or of any kind of scientific information, really]

Given the goals of efficient communication and integration using a
common language as stated above, we can set out potential
problems that need to be addressed. The advice given here is aimed
at mitigating these particular threats.

Threats to common interpretation

[This is just a summary of the advice that follows. I think this should be condensed or flushed in favor of better organization of the advice section.]

The intended use of a term can be established in several ways:

by explicit published definition ([cite Booth on declarations])

written in natural language

written in RDF

implicitly according to statements made in what is supposed to be an authoritative document

according to how it is used (reverse engineering or inference required)

For brevity I'll say "defining document" to mean any of the above.
(Not quite the same as a Boothian "declaration", but close.)

When a common understanding of a term fails to be established, the reasons for
failure include

No defining document

Defining document difficult to locate

Poor quality definition - vague, ambiguous, or unclear

Definition/use inconsistency - the term is used differently from how it's defined

There are multiple credible defining documents

change in concept over time (example from natural language: 'transient ischemic attack')

accidental collisions

disagreement over formulation of definition

Threats to integration (graph combination)

Two terms for same thing (missed opportunities for unification)

Failure to recognize or state relationships (e.g. missing subclass/subproperty assertion)

Incompatible logical systems (inconsistent entailments). This is an important subject but not in the scope of this note

Advice

[Alan says: "Some of the advise seems different from the
other... Ensure unique definition, find the best definition, are
something like what to do when something isn't perfect and rest are
how to make things perfect" -- maybe this idea should be turned into
an introductory paragraph]

Publish defining descriptions

When you mint a new term, make sure you write a defining description
(DD) and publish it somehow. [I don't like the acronym DD, but got
tired of writing the two words out so many times. Please advise.]
Publish the DD in a venue that will last at least as long as all uses
of the term. Ideally, publish it archivally, e.g. in a journal
article or persistent web archive.

A special case is that of network resources [cite RFC 2616]. By
convention advocated in AWWW and elsewhere, a network resource [JAR's
coinage] is automatically named by its URI (see [[/../StatusOfHttpScheme]]).
If a term is an http: URI, and the following hold:

the responsible web server yields a 2xx response when responding to any GET request using the URI

the server handling the request observes the httpRange-14 recommendation

then you may take the term to be defined to mean the network resource
defined by the server's responses to requests using the URI.

Make the defining description easy to locate

Making a DD available is of primary importance. Manual
research of the old-fashioned kind is effective in scholarly work
using natural language, and will be the most general and robust way to
track down the author's or community's intended use of a term. But
making a DD easy to find, and in particular making it easy for
machines to find, is also important (not that an automated agent will
necessarily know what to do with the DD, but hey). This can
be done either by the DD's original publisher, by choosing URIs for
terms that enable sufficiently durable "follow your nose" access, or
by subsequent users of terms, who can facilitate interpretation either
by repetition or by citation.

The conveyor of some RDF might:

Include the DD in a graph that uses the term

Or, cite a document containing a DD

(!!! We need to agree on a way to do citation. compare owl:imports.)

The publisher of a term's DD might:

Publish the DD in a private or public location [check MeaningOfaTerm] that readers are likely to know about

Mint terms that are locators (understood by a wide variety of browser-like things) and publish DD's at related locations (see below).

Make sure the chosen URI is sufficiently durable as a locator, or at least has a good shot at being durable. Since it is often difficult to predict the lifetime of a term, that means being conservative (talk about purls/handles here?).

In the http: scheme, two methods have been advocated for publishing DD's.

Use a 303 redirect to send the agent to a DD for the term

Mint # URI's and place a DD at the #-racine (note that # relinquishes server control) (avoid sharing the same #-racine among multiple terms as this breeds confusion over when one definition stops and the next begins, overcommitting and risking versioning headaches)
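
As a rough illustration of the # convention, the following Python sketch computes the #-racine of a term and groups terms by racine, which makes any sharing of a racine among several terms visible. The example.org URIs and the helper names racine and group_by_racine are hypothetical and chosen only for illustration; the only library call used is urldefrag from the Python standard library.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def racine(term: str) -> str:
    """Return the term with everything from the '#' onward removed."""
    return urldefrag(term).url


def group_by_racine(terms):
    """Map each #-racine to the list of terms that share it."""
    groups = defaultdict(list)
    for term in terms:
        groups[racine(term)].append(term)
    return dict(groups)


# Hypothetical terms, for illustration only.
terms = [
    "http://example.org/terms/gene#this",
    "http://example.org/terms/protein#this",
    "http://example.org/ontology#Gene",
    "http://example.org/ontology#Protein",
]

for document, members in group_by_racine(terms).items():
    note = "shared racine" if len(members) > 1 else "one term per racine"
    print(document, "->", len(members), "term(s),", note)
```

In this sketch the first two terms each get a document of their own, while the last two share one document, which is the situation the advice above suggests avoiding.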

(Must acknowledge the LSID criticisms of http: URIs here. Most agents
don't know better and will foolishly follow the URI, and stop there.
The argument is that it's better to use a non-http URI, so that you
will get no answer (go to no location) rather than what might be the
wrong answer (or the right answer from the wrong location).)

(Finding a DD is not the same as finding all information relating to what
the term names. Address Mark W's LSID use case here.)

Compose clear, unambiguous definitions

No magic here; good definitions are difficult to craft.

Definitions should specify single and particular usage. For example,
a term should be used for a document, or a thing described by the
document, but never both (except in the unlikely event that the
document is intentionally self-describing). This applies even if the
document is a database record: Some statements are true of the record
but not the thing, and vice versa. If both have names, they need to have different names.

[Of course you may be able to avoid making a name for one or the other,
using blank node notation specifying the relation between the two.]

Although any DD is better than no DD, it is better if a DD is
expressed in RDF. This is obviously not a guarantee of quality, nor a
guarantee that any particular automated agent can do anything useful
with it. Certain formal aspects of a definition, such as a subclass
relationship, can be expressed well in RDF. It is recommended that
DD's contain either an rdfs:comment, OWL constraints adequate to
uniquely determine the referent, or some well-justified alternative.
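
To make this concrete, here is a minimal Python sketch of a DD written as RDF triples and serialized as N-Triples. The term URIs under example.org, the comment text, and the helper names are all hypothetical; the check covers only the rdfs:comment option mentioned above, not OWL constraints or other alternatives, and says nothing about the quality of the definition.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical terms, for illustration only.
TERM = "http://example.org/terms/Kinase"
PARENT = "http://example.org/terms/Enzyme"


def uri(value: str) -> str:
    """Write a URI in N-Triples syntax."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Write a plain string literal in N-Triples syntax, escaping as needed."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


# A small defining description: a type, one formal relationship, one comment.
dd = [
    (uri(TERM), uri(RDF_TYPE), uri(RDFS_CLASS)),
    (uri(TERM), uri(RDFS_SUBCLASS_OF), uri(PARENT)),
    (uri(TERM), uri(RDFS_COMMENT),
     literal("Illustrative definition text for the example class Kinase.")),
]


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) strings, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def has_comment(term: str, triples) -> bool:
    """True if the triples include an rdfs:comment about the term."""
    return any(s == uri(term) and p == uri(RDFS_COMMENT) for s, p, o in triples)


print(to_ntriples(dd))
print("has rdfs:comment:", has_comment(TERM, dd))
```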

Don't issue a 2xx response unless you intend for the term to denote
the network resource defined by responses to requests that use the
term. [httpRange-14]

[The "Banff Manifesto" [cite] insists that a certain set of properties be
specified. Alan suggests that at the very least there should be an
rdf:type.]

Definitions (which should be small and extremely stable) should be
separated from other RDF related to the term (such as statements that
describe the denoted resources). The non-definitional RDF is likely
to be less stable than the definition, and those attempting to
understand the commitment associated with the term may be confused as
to what information constitutes part of the definition and what parts
don't. [But how to delimit, exactly? Consult D Booth's memo. Put
them in separate documents, yes? What a pain.]

Use a term in a manner consistent with definition

There is no magic solution. You must be willing to do some research
to make sure the way you're using a term is consistent with how the
community is using it. If a definition is unclear, figure out how
the term is used in practice and attempt agreement with that.

Ensure unique definition

The existence of multiple credible DD's forces an unpleasant choice on
those who would use a term. Choosing the most recent or most
"authoritative" DD may lead to misunderstanding of a graph since
the graph may have been composed using an earlier or different DD.

Multiple definitions can arise in various ways:

Accidental collisions

Make sure, when you mint a new term, that it's not already in use. Collisions are addressed by the "URI owner" or "naming authority" mechanism of web architecture [cite]. Not covered by this convention is the problem of URI reassignment as a result of domain name loss and capture, but as this possibility seems rather unlikely at the present time, there is little call right now for solving this problem.

Change in concept over time (example from natural language: 'transient ischemic attack')

Be clear on expectations. Some of what you say involving the term may be intended to be defining (true indefinitely) while other statements are observations related to the term's referent and subject to correction or other kinds of revision. E.g. a definition of a term for a particular mountain should not include a statement of the mountain's height, which would be subject to change or correction.

Don't change a definition - mint a new term instead.

Disagreement over "correct" interpretation leading to multiple "clarifying" DD's

Another remote problem for now, but likely to arise as it has in the technical literature. This can only be solved through community process, just as disagreement over a popular term in natural language would be. Statements by the "URI owner" should be given special weight but if the owner is not available for consultation or is as fallible as the rest of us then they should not necessarily be considered an "authority" on the term's meaning.

Disagreement over methods of definition accessibility

If a term must get a new DD for some reason, at least
publish the new DD under a distinct document name (URL, etc.), so that
the correct DD can be cited unambiguously. Relate versions of
DDs to one another using a suitable ontology [which one?] [where to
put such assertions - effectively a citation of a previous document
version by a new version? in the later DD?].

Find the best definition

It is important for a consumer of RDF to obtain the best definition of a term.
Ideally, there is only one definition, but one must defend against
instability and disagreement.

As the goal is communication, "best" means the definition closest to
the intent of the author(s) of the RDF you're trying to use. In the
absence of inline definition or citation, this may be difficult to
track down, so a heuristic search may be required.

In the event one chooses to consult the web (don't skip the steps of
seeking definitions closer at hand or more definitely cited), do so
carefully [see ../MeaningOfaTerm]. The following heuristics may
be helpful [per TAG/Cool URIs]:

Some servers will observe the httpRange-14 recommendation. If so, then a 2xx response implies that the term refers to a network resource.

Some servers will provide defining RDF in a document reached by following a 303 redirect.

If the spelling of the term contains a #, the servers may provide defining RDF via a network resource named by the URI that is the "racine" of the term (the truncation of everything starting with the #).
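
The three heuristics above can be sketched as a single decision function. This is only an illustration under stated assumptions: the function name where_to_look and the example.org URIs are hypothetical, the status code and Location value are assumed to come from an HTTP GET performed elsewhere (for a term containing a #, a GET on its racine), and nothing here establishes that a given server actually follows these conventions.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin


def where_to_look(term: str, status: int, location: Optional[str] = None) -> str:
    """Suggest how to read a server response when looking for a term's definition."""
    base, fragment = urldefrag(term)
    if fragment:
        # Hash URI: the response is assumed to be for the racine, not the term.
        if 200 <= status < 300:
            return f"look for defining RDF in the document at {base}"
        return "no definition found at the racine; seek other sources"
    if 200 <= status < 300:
        # httpRange-14: a 2xx response suggests the term names a network resource.
        return "term taken to name a network resource (httpRange-14)"
    if status == 303 and location:
        # The Location header may be relative, so resolve it against the term.
        return f"look for a defining description at {urljoin(term, location)}"
    return "no guidance from the server; seek other sources"


# Hypothetical responses, for illustration only.
print(where_to_look("http://example.org/id/gene42", 200))
print(where_to_look("http://example.org/id/gene42", 303, "/doc/gene42"))
print(where_to_look("http://example.org/onto#Gene", 200))
print(where_to_look("http://example.org/id/gene42", 404))
```

The function returns advice rather than a definition on purpose: whatever is found this way is still only a candidate, to be weighed against definitions cited or included closer to the graph being read.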

It would certainly be nice to know which servers obey which rules,
yes? Certainly TAG/AWWW recommendations are not part of the HTTP
protocol, so there is no obligation to follow them. On the other
hand, if a term occurs in RDF, there is a better than even chance that the named
server is aware of this architecture and is using these publication conventions.

But so far the only methods for determining conformance are informal [future work].

It is important to be skeptical of definitions, as they can fail in a
variety of ways. For example, an author may use terms in
contradiction to definitions supplied in the same graph, a web server
may provide a definition that differs from the one consulted by the
author of a graph you're trying to understand, a definition found in a
standard location may be unclear while community practice around the
term's use is not, etc.

Make an effort to re-use terms

It is undesirable to have two distinct terms in use for the same
thing.

There is no magical solution. Please try to be aware of what others in your field are doing terminologically and replace terms if necessary in order to build community consensus.

But be careful: The value of term reuse is so high that one may be tempted to use an existing term when not completely appropriate. This leads to overloading and confusion.

There may sometimes be an awful tradeoff between stability and popularity: a popular term may be unstable as a locator, while the application at hand may require durability. In this case make sure that there are other ways to locate a DD, as described above (sort-of example: BMC's new practice of caching web pages).

Documents and database records whose publishers have not provided a URI, or who have provided a URI that is unstable or difficult to use, present an important special case.

[Talk about ontology version: when to re-use and when to mint?]

Existing well-documented terms may be rejected because either (a) the corresponding definitions are difficult to locate by certain applications, or (b) because they are not "browser friendly". This presents an as yet unsolved quandary for the community; see below. My advice is to assess suitability of a term based on criteria other than the infrastructure's ability to deal with it. I will give dissenting views in future versions of this document.

Similarly, existing well-documented terms may be rejected because the access infrastructure that they imply in other contexts (such as web browsing) appears to collide with the goal of establishing the terms' credentials as "identifiers". In particular, you can't tell whether an http: URI means something other than a network resource without locating a definite statement to the contrary (e.g. via a 303 redirect). It is probably impossible to reverse this practice, especially given that it is used at the heart of RDF practice (e.g. rdf:type) and has an influential constituency (the TAG), so it is not clear what is gained by avoiding these terms and replacing them with redundant terms in a different region of URI space.

Seek out and state relationships

Failure to recognize or state relationships (e.g. missing
subclass/subproperty assertion) can lead to incomplete answers to
questions. For example, a graph containing mother assertions combined
with a graph containing parent assertions is less connected than it
should be if the mother-parent subproperty relationship is unstated.
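
The mother/parent example can be sketched in a few lines of Python. The triples, the ex: prefixed names, and the helper functions are hypothetical; the sketch applies only the RDFS subproperty entailment (if P is a subproperty of Q, every P statement implies the corresponding Q statement) and is not a general RDFS reasoner.

```python
SUBPROPERTY_OF = "rdfs:subPropertyOf"

# Hypothetical data: two independently published graphs.
graph_a = {("ex:alice", "ex:mother", "ex:carol")}
graph_b = {("ex:bob", "ex:parent", "ex:dave")}

# The relationship that is easy to leave unstated.
link = ("ex:mother", SUBPROPERTY_OF, "ex:parent")


def with_subproperty_entailments(graph):
    """Return the graph plus all statements implied by subproperty assertions."""
    inferred = set(graph)
    changed = True
    while changed:  # repeat until no new statement appears (handles chains)
        changed = False
        subproperties = [(s, o) for s, p, o in inferred if p == SUBPROPERTY_OF]
        for s, p, o in list(inferred):
            for sub, sup in subproperties:
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


def has_parent(graph):
    """Subjects of any ex:parent statement in the graph."""
    return {s for s, p, o in graph if p == "ex:parent"}


combined = graph_a | graph_b
print("without the relationship:", sorted(has_parent(combined)))
print("with the relationship:",
      sorted(has_parent(with_subproperty_entailments(combined | {link}))))
```

Without the subproperty assertion, a query for ex:parent statements sees only the second graph; once the assertion is published and applied, the same query also finds the statement that was made with ex:mother.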

So, if after publishing a graph you discover a related ontology, make
an attempt to establish relationships between your terms and theirs,
and publish the relationships.

But be careful: The value of such relationships is so high that one
may be too eager to relate. The correct relationship should be sought
and if necessary defined; there is no need to latch on to "loose fits"
that are less than accurate, and correct relationships can themselves
often be related to the obvious ones via subproperty assertions or
using OWL. The effort will pay off in query accuracy.

Use of owl:sameAs, even when legitimate, should be avoided except as a
way to bridge independently created graphs neither of which can be
modified. Simply using the best term of the two (the one judged most
likely to rally consensus; usually the one published first) is
preferable since it allows linking through the term even when
inferences using owl:sameAs assertions are not made (e.g. when
inference is limited to RDF entailment [cite]).

Future work

Here is what we as a community need to do in order to make the above
advice easier to follow.

(Alan: Points in this section have to link back to the previous
arguments. The motivation needs to be clear.)

Citation is central to scientific discourse. We need to develop a theory - an ontology - of document reference that is both principled (ontologically web-independent) and as harmonious as possible with current practice.

Terms should have the potential to outlive any domain mentioned in their spelling. We need to figure out how. BMC has taken one small step in this direction [cite]. URI resolution ontology is another approach.

It is recommended that systems be established, similar to journals, for quality control of RDF graphs. If a graph meets requirements for documentation, consistency, citation, and coordination with other graphs, it should be recognized as such.

It is recommended that the community figure out what terms should be used for public database records (such as those in Entrez Gene), and come up with a versioning story for them. The terms should be impeccably hosted - that is, made available durably and consistently - by an organization that the community can trust.

Figure out whether to use published terms that are not browser-friendly or HTTP-friendly (e.g. belong to non-"locator" URI schemes). Personally I sympathize with both sides of the debate, but I don't see this issue as being important enough to warrant the creation of a second set of terms when terms that work perfectly well already have published definitions. (e.g.: the new info:inchi/ URIs - although one might want to avoid these just because they're not very well specified.)

An independent document repository would be nice, as a way to adopt abandoned projects' documents and otherwise do durably-named and durably-served publication (a la Genbank), especially of defining documents.

Change log

2007-10-08 Added remarks attempting to explain how the locator/identifier debate fits into this framework.

Rough notes

Only the truly committed should read beyond this point.

TBD:

Most advice is to application developers; say so

Talk about genbank and/or NIH permanence policy

Distinguish between *potentially* durable names (e.g. purls) and credibly durable service (e.g. libraries)? Persistence of a *name* is not the same as persistence of the *information*.

Controversies

2xx-responders (network resources): do
their URIs denote actual server behavior (as TimBL suggests), or ideal
server behavior? If the former, the LSID proponents and librarians
will not want to use them to refer to document-like things (e.g. for
citation). If the latter, new mechanisms will need to be developed to
direct agents to definitions (specifications, promises) so that
readers and writers of RDF know exactly what's meant by the URI.

The LSID/HTTP battle: It is a goal of mine to get LSID and http: users
to collaborate and start worrying about issues more important than
what naming scheme to use. Both naming schemes are in use, so all
diligent clients will need to be able to deal with both kinds of
terms. Minting new URIs for things that have perfectly good ones
already is a bad idea (although "perfectly good" is a high bar). The
question is whether, for new terms, there is any reason to prefer one
scheme over the other. The LSID spec has plausible versioning and
metadata stories; but these can be replicated in http-land (using 303s
or #-racines) once we have a versioning ontology. LSID has a
plausible location independence story; I have tried to show that the
HTTP space does too -- no sensible agent should limit itself to
what it finds at the http: URI. I know this is a tough pill to
swallow but as I say it's already been swallowed, e.g. any time you
use a non-LSID ontology such as RDFS. What http: has to offer is the
ability to get at definitions using ubiquitous software (http GET) via
the follow-your-nose heuristics. Yes, this is unreliable, but so is
consulting the DNS record for the LSID authority -- and so far there
is no central registry of LSID authorities that might deal with loss
of the authority's LSID DNS record, while at least purl.org provides
for forwarding.

SW FAQ

Following is a list of questions about SW, some of which we might try
to answer in this document or in a companion document.

Questions that HCLS members might have:

What is the semantic web? (collective RDF, or an access apparatus?)

What can I use it for?

How do I use the semantic web?

How do I figure out what a term means?

How do I browse the semantic web (esp. for HCLS content)?

How do I search the semantic web?

What is a resource? document? information resource? representation? (TimBL has ontologies)

What naming scheme should I use - http:, LSID, handle, other urn:, or info: ?

If HTTP - Should I use # or / URIs in ontologies?

What host should I name in my URIs (LSID or otherwise) - that of my own project, or one that can provide persistence?

How do I find a persistent definition provider?

How do I use purl.org?

Where should my stuff be hosted?

Why should I trust it with my stuff?

What are my civic responsibilities as author/publisher?

[Provide definitions somewhere - anywhere.

Definition should include at least an rdf:type, and either an rdfs:comment or a rigorous formal definition.

Provide definitions at the location implied by the URI, using the protocol implied by the URI.

Don't confuse definitions with (other) discourse.

For HTTP: Follow httpRange-14, if you can figure out what it means.

If you inherit naming authority to a namespace (e.g. domain name), don't reuse previously circulated URIs for new purposes. (but how do I know if there were any, and what these were?)]

This is hard. Where do I get help?

Definition quality hierarchy

(The following is an idea I'm trying out - an analysis inspired by
Latour's book Laboratory Life.)
Quality of definition/description (of term) hierarchy.

Consensus definition. (everyone knows, goes without saying)

High-quality citation. (positively identified document)

Adequate information in same graph.

Ad hoc citation (web reference). (unreproducible)

Reverse engineered.

via protocol specified by spelling (303)

spelling = hint

Unconstrained / undefined...

Etc

... central authority for technical terms not possible, but particular
namespaces will have authorities (e.g. CAS numbers, genbank ids,
pubmed ids...). Different kinds of authority: authority over the
namespace (what a term means) vs. authority w.r.t. a subject matter
(to coin a term of a certain kind, coordinate with the authority)...

... automated agents will vary in their expertise at locating
DD's... some will only be as smart as a web browser, while others will
consult a wide variety of sources and speak a wide variety of
protocols (LSID, info, wayback machine & other 3rd-party archives,
etc)... we encourage the latter of course.

... relate "decision to use or not use RDF" to "trust" layer in semweb
layer cake (Sandro)

... would the document be improved with presentation of detailed use
cases, as opposed to examples in the running text? e.g. web browser,
web application, SPARQL (RDF cache), computation. I think not, but
the text needs many more examples than it has.

... talk about publisher conflict of interest ... no interest in
stability or consistency... example: out of business, mergers, ISBN
recycling.

(from outline.html)

Establishing connections with other works requires vigilance in the selection of terms. Before a new term is coined, a literature search should be conducted to find candidate terms that can satisfy the need.

Because of the possibility of creating fresh terms at almost no cost, RDF has the potential to eliminate confusions that ordinarily plague scientific discourse. Any time a new meaning is needed that is at variance with meanings of known terms, a new term can be created.

... talk about the problem of using 2xx-responders for citation.
2xx-responder is defined to be web behavior. Tempting to use them for citation, but such a term does not have the intended meaning (according to TAG) - it refers to the way the server actually behaves, rather than the way it is supposed to behave. Server policy statements are important so that the latter is transparent. Work in progress.