Uniform Access to Metadata

This document surveys the problem of specifying a uniform method
for obtaining information pertaining to a resource without
necessarily having to parse a representation of the resource. It is an
attempt to rationalise several discussions that have taken place in a
variety of e-mail fora. More background and links to e-mail threads
are available on the wiki
page.

The borders of "information pertaining to a resource" are left
intentionally fuzzy.
"Pertaining to a resource" could mean document metadata, information
about how a document is accessed or should be accessed, or even a
description of a resource that is not a document.
The "information" could be just a single link to another document that
contains more information, or it could be something more involved.

The following ideas motivate this effort:

We want uniform access to metadata
because the specific method for extracting metadata from content
will vary wildly from one
media type to the next, and many media types (e.g.
application/x-compressed) have no place to put
metadata at all.

We want to be able to obtain metadata without necessarily loading
the content, because the resource might
be something we don't want to load (for reasons of size, license,
or other kind of application suitability).

Sometimes metadata is generated independently of content, and
we don't want to (or can't) modify existing content streams by inserting
metadata into them.

The first section, which forms the bulk of this report, presents
several use cases. The following section presents solutions that have been
put forth. Finally, some critique of the idea and of the various solutions
is given.

Use Cases

The POWDER Use Case

Here a gateway server wants to use metadata ("pertinent information")
from an origin server
to decide whether content is to be passed through unmodified or must
be transformed first.
The generic use case for the Protocol for Web Description Resources [POWDER] is as follows:

Step 1

User requests Web content via their device.

Step 2

A gateway server resolves the URI and determines that there is metadata associated with the resource that asserts access conditions.

Step 3

The gateway matches the assertions in the metadata to the user's delivery context.

Then either

Step 4a

The gateway interprets that there are no constraints on the user accessing the content,

Step 5a

The gateway responds to the User with the full Web content from
the origin server.

or

Step 4b

The metadata asserts that the requested content is not
appropriate to the current delivery context [as is].

Step 5b

The gateway adapts the content and responds to the User.

The POWDER use case document applies this abstract concept
to more real-world applications. For example, the profile may indicate that the user's device is a mobile phone and that therefore
'appropriate' means mobileOK [OKBASIC, OKPRO]. That is, only content that is likely to provide a functional user experience on a mobile
device would be displayed without adaptation. This would avoid the expense, latency and frustration, for example, of downloading
a 4MB file to a mobile phone only to find that the device couldn't process it.

Other use cases revolve around accessibility, child protection, trust and licensing. In each case, some form of processing takes
place to ensure that only content that is suitable for the delivery context is delivered.

POWDER offers an optimisation route to this scenario as it separates the description from the described resource and allows
a single description to be applied to multiple resources, typically 'everything on a Web site.' Step 2 in the use case
above would be more efficient if the link to the metadata (the Description Resource) were available through an HTTP
header, thus obviating the need to parse the content before deciding whether it can/should be displayed directly or adapted
in some way. The same would apply to any service that wished to aggregate content that met particular criteria, discoverable through
POWDER Description Resources. The service would be seeking and authenticating Description Resources as a means of discovering
relevant content, preferably without having to parse the content itself.
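
The gateway's decision in the use case above can be sketched as follows. This is only an illustration under stated assumptions: the relation name "describedby", the (name, value) header layout, the "no-metadata" outcome, and the caller-supplied is_appropriate check are all hypothetical and are not defined by POWDER or by this survey.

```python
import re

# Assumed relation name for "link to a Description Resource" (hypothetical).
DESCRIPTION_REL = "describedby"


def find_description_link(headers):
    """Return the target of the first Link header item whose rel matches
    DESCRIPTION_REL, or None. `headers` is a list of (name, value) pairs."""
    for name, value in headers:
        if name.lower() != "link":
            continue
        # Naive split: assumes no commas inside URIs or quoted strings.
        for item in value.split(","):
            m = re.match(r'\s*<([^>]*)>(.*)', item)
            if not m:
                continue
            target, params = m.groups()
            rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params, re.I)
            if rel and DESCRIPTION_REL in rel.group(1).split():
                return target
    return None


def gateway_decision(headers, is_appropriate):
    """Steps 2 to 5 of the use case, given headers from a HEAD response.
    `is_appropriate(description_uri)` is a caller-supplied check of the
    description against the user's delivery context (hypothetical)."""
    description = find_description_link(headers)
    if description is None:
        return "no-metadata"  # outcome not covered by the use case
    return "pass-through" if is_appropriate(description) else "adapt"
```

Because only response headers are inspected, the same check applies to images and other media types without a per-format parser.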

Related Experience

Setting up a pointer from a resource to a related description at the HTTP level has another practical advantage in that for some content providers
it is significantly easier to achieve. In a large publishing company, responsibility for content production and content description will often
be allotted to different individuals or, in some cases, different departments in different countries. Authorisation to edit the page template
for a Web site will usually be in the hands of yet another individual or department. Description is seen as an editorial role, rather than
a content production role. Presentation is a job for the marketing department. Therefore including descriptions directly within a document,
or document-like resource, may not be technically permitted for the person whose job it is, from a policy point of view, to provide them.

A company-wide policy of including a common pointer from all content to the location of descriptive data is easier to implement if
that company has a choice of whether this is done at page level, document template level or network level.

These assertions derive from experience with PICS which has led the POWDER WG to think of HTTP Link and HTML <link>
as equivalent (rightly or wrongly). In PICS, you would set a specific HTTP Header or use an http-equiv meta tag in the HTML.
As an example, the ICRA label tester [ICRA] which makes use of Perl's LWP module [LWP],
makes no distinction between a PICS label delivered as HTTP or HTML. Neither does it distinguish
between a Link delivered as HTTP or HTML when looking for links to the RDF-based labels used by ICRA and Segala now.

The GRDDL Use Case

This text is taken almost verbatim from the GRDDL Use cases document [GRDDL].

Oceanic is part of a consortium of airlines that have a group
arrangement for the shared supply and use of aircraft spares. The
availability and nature of parts at any location are described by
AirPartML, an internationally-agreed XML dialect constrained by a series
of detailed XML Schema. Each member of the consortium publishes the
availability of their spares on the web using AirPartML. These
descriptions can subsequently be searched and retrieved by other
consortium members when seeking parts for maintenance. The protocol for
use of the descriptions requires invalid documents to be rejected.
Oceanic wishes to also publish RDF descriptions of their parts and would
prefer to reuse the AirPartML documents which are produced by systems
that have undergone exhaustive testing for correctness. There is no
provision in the existing schemas for extension elements and changing
the schemas to accommodate RDF would require an extended international
standardisation effort, likely to take many years.
This means they cannot alter their XML documents to use GRDDL.

A network level means of associating XML instances with a GRDDL transform would allow Oceanic Consortium to serve RDF via GRDDL without altering their XML documents.

The Atom Use Case

Atom defines two types of collections of resources: Entry Collections and Media Collections. In both cases, new
members are added to the collection by POSTing a representation of the resource to the URI of the collection. The server
responds to the POST with an HTTP Location header that gives the URI of the newly created resource.

A feature of ATOM is that resources may subsequently be edited and the URI at which this is possible may differ from that
given by the server as its Location. It follows that the edit URI needs to be declared in the HTTP Response to the POST by
a means other than the Location header. The use of HTTP Link: (see below) is suggested in the relevant documentation [ATOM-PACE]. This
section has had to be withdrawn for several reasons, among them the lack of certainty over the status of HTTP Link.

The Mobile Web BP, Content Transformation Use Case

Alternative representation of a resource

Step 1:

User sends a request from a mobile device to a server.

Then either:

Step 2a:

The request is routed via a content transformation gateway, and passed untouched to the server.

Step 3a:

The server has a mobile specific representation of the resource, but fails to notice that the user is using a mobile device, and responds with the generic desktop representation of the resource.

or

Step 2b:

The request is routed via a content transformation gateway, which changes the HTTP headers to fake a desktop browser.

Step 3b:

The server has a mobile specific representation of the resource, but cannot tell the user is using a mobile device, and responds with the generic desktop representation of the resource.

Step 4:

The gateway receives the response and sees that there's an alternative "handheld" representation of it. Instead of passing the response to the user, it redirects the user to the alternative "handheld" representation of the resource.

François Daoust adds: I say it's a clumsy use case, because we wrote the Content Transformation Guidelines to ensure that the "b"
path (2b and 3b) should not appear in practice. But "should not" is still may... Anyway, the "a" path (2a and 3a) may be a valid use
case, although probably not an existing one (how many web sites would have a mobile version and simply don't know how to identify a mobile user-agent?)

Anyway, the TAG, in: http://www.w3.org/2001/tag/doc/alternatives-discovery.html#id2261787
… suggests using "linking mechanisms provided by the
representation being served". But images, audio, video, …
are also subject to transformation, and either don't include such
"linking mechanisms" or don't do so uniformly (without a
separate parser per content type).

POWDER applied to content transformation

Obviously, the mobileOK example typically fits here! It would serve as a flag that tells the content transformation proxy not to
transform resources (and again, that includes images…), that are labeled as such.

Another similar use case:

Step 1:

User sends a request from a mobile device to a server.

Step 2:

The request is routed via a content transformation gateway to the server.

Step 3:

The gateway responds with the mobile specific representation of the resource. It includes a link to a POWDER file describing the resource in terms of content transformation: what kind of transformation is allowed on the document, what is explicitly forbidden.

Again, the Link element could be used, but that doesn't work with images and the like.

Example: a site with images may be optimized for mobile-browsing, and images are adapted to most screen widths. But it
may still not know everything about all devices, and leaves the possibility to recode the images if the device doesn't support a given
format. To do that, it flags the images with a POWDER file describing that recoding of the images is allowed if the device does not support the format.

Semantic Web Use Case

It is desirable to be able to find documentation for a URI given just
the URI. The documentation assists humans in understanding uses of
the URI (e.g. in RDF or OWL) and in considering an existing URI for
use in some application. People may explore this kind of
documentation using a semantic web browser such as Tabulator. There
may also be applications that are able to make use of formally stated
assertions (RDF, OWL) that help to define, declare [BOOTH], or otherwise document the URI, either in the
general case or in the situation where they have some idea of what to
expect from this information (which properties are supposed to be
asserted for particular kinds of things, such as a person's name).

The accepted method for finding such information on the semantic web
is the "follow your nose" algorithm, which (in simple form) says

If the URI contains a fragment identifier, then the racine of the
URI
(i.e., the part before the #) should lead to a documentation page.

If the URI does not contain a fragment identifier, then an
attempt to
dereference the URI should yield a 303-redirect that leads to
a documentation page.
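
The two rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name and the caller-supplied fetch_status_and_location helper (which would perform a GET without following redirects) are assumptions made here so that the logic can be read and tested without network access.

```python
from urllib.parse import urldefrag


def documentation_uri(uri, fetch_status_and_location):
    """Simple form of "follow your nose".

    `fetch_status_and_location(uri)` is a hypothetical helper that performs
    a GET without following redirects and returns
    (status_code, location_header_or_None).
    """
    racine, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the part before '#' should lead to documentation.
        return racine
    status, location = fetch_status_and_location(uri)
    if status == 303 and location:
        # 303 redirect: the Location target should be documentation.
        return location
    # A 200 (or anything else) gives no documentation pointer under
    # these two rules.
    return None
```

Note that the final branch returns nothing for URIs that answer 200, so the simple form of the algorithm has no answer for ordinary documents.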

Specificity of resource/documentation relationship

A 303 response carries no implication that the redirect will lead
to the documentation that an application wants - in fact, 303 was not
designed with "follow your nose" in mind at all. For example, an
application looking for a URI declaration has no assurance that the
document found by following the 303 will be one, and an application
looking for RDF has no assurance that the 303 document has an RDF
representation. For these purposes it is desirable to relate the
resource to the documentation not via the nonspecific Location:
header but via a more specific relationship such as "is described by"
(specified by a URI of course). The response would still be a 303
with a Location: header, but more specific information could be
conveyed via an additional HTTP header or some other mechanism.

Providing metadata for URIs that yield 200 responses

Metadata (i.e. "pertinent information" for information resources)
has many purposes on the semantic web.
Metadata such as author, title, creation date, and license is valuable in
a semantic web context.
However, it is also useful to be able to accurately describe
an information resource - in particular, to characterize it
by specifying its class(es) and properties.
When using a URI on the semantic web it is important to know what it
denotes, and this may not be possible just by examining its
representations. (For example, consider an "RDF hall of shame" web
site containing examples of incorrect RDF. Representations of the
example resources from this site are by definition not good sources
of information about the resources.)
A link to metadata outside the HTTP
response would be one way to convey such information.

303 responses are incompatible with 200 responses, so
semantic-web-related documentation is not available for URIs that
denote information resources and for which servers (and clients)
would like to obtain representations via 200 responses. While in
principle nothing rules out the use of # and 303 for information
resources, they do not provide a graceful migration path for
providing metadata for existing resources, because in the # case the
URI must change and in the 303 case responses (and client behavior)
would have to change.

It is argued that a uniform method for access to metadata could have
the effect of lowering the barrier to entry to the semantic web
for existing documents and could be a boost to the semantic web by
bringing large numbers of entities onto the semantic web
with relatively little pain.

Use case: Someone browses to an interesting document (HTML, PDF,
PPT, DOC, PNG, etc). A browser plugin
and/or document authoring tool plugin provides a "citation" feature
that fetches information for the document, e.g. bibliographic information and
durable location (if available). The information is communicated in
RDF and the tool needn't know the details of all formats.
The information is placed in a triple store and/or something like an
Endnote or Bibtex database.

Use case: A URI occurs in some interesting RDF and someone
wants to know what it denotes. Browsing to the URI takes one to a blog.
How does one know that it is a blog (its class) and which blog it is
(other than the one whose URI is ...)?
What other statements can be made about the blog - author(s), license,
permanence policy?

Bibliographic metadata

The following develops the bibliographic metadata case in a bit more
detail.

Acme Publishing is an established publisher of academic journals
serving thousands of hits on its corpus of PDF files daily. It has
learned about RDF and in order to promote its journals wishes to
provide bibliographic information for its articles in RDF, to assist
automated agents that are RDF-aware.

Although the PDF files have a place to put metadata, this is deemed
an unsuitable location as (1) many of its millions of PDF articles
are quite old and regenerating them is so risky as to be infeasible,
and (2) Acme judges that it is unreasonable to expect that client
software will know how to parse a PDF to get at the metadata.

Acme's first approach is to create a CGI script that takes the
article's URI as input and returns the bibliographic RDF for that
article. This gets few adopters and the publisher realizes that
monolithic action will not be very effective. At a trade conference
they realize that other publishers are having the same thought, and
there is discussion of how they can standardize so that agents can be
generic across various publishers - indeed over the whole web.

Minimal modifications to its web server, such as CGI scripts, special
response headers, or new HTTP request methods are within budget.
Asking existing customers to change the URLs they're already using,
or to change the way they use HTTP, is not acceptable.

Cross-site communication of end user information

User (T. C. Mits) wants personal information such as public key or buddies list to
be known by many sites (ecommerce, social networking). User does not
want to have to enter information separately at each site. User wants a
single 'key' (login) that will lead all participating sites, even
newly encountered ones, to the information.

User registers with a chosen 'home' server that will
keep personal information

User chooses, or is provided with, a URI, controlled by home server,
that the user can
remember easily. Perhaps the URI of a login page, home
page, profile, mailbox, buddy list URI.

Home server receives and retains personal information

User goes to site 2. Provides URI.

(Site 2 authenticates with site 1 somehow if info is to be private)

User authenticates with site 2 based on personal information.

Information is not fetched by doing a GET of the URI - that
would require parsing HTML, understanding microformats, or coordination
with CMS, and these hurdles may be too high
("equality principle")

Information is fetched using a side protocol of the kind we are
considering here (link header, etc) (XRD uses CN, see below, now
acknowledged to be wrong)

Information is in some easily parsed form such as XML or RDF

JAR anticipates that some will object that the information should
have been put in the representation (i.e. found via GET) in a
standard way - after all the same entity controls both the
representations and the "side protocol" (link, etc.).
This is the approach taken by OpenID 1.0, which failed to get the
uptake desired, in part because this didn't work often enough.
The whole point here is that sometimes the way a site is deployed,
the department that wants to provide the personal info in a sensible
format
has no influence over the department or vendor that's arranging the
GET responses.

Suggested mechanisms

Link: HTTP header

Mark Nottingham's RFC draft seeks to clear up confusion over the status of HTTP
Link (included in [RFC2068], but removed from [RFC2616]). It notes that ATOM defines a linking mechanism that is similar, but not identical,
to HTML's link element and specifically does not map an XLink header to HTTP Link. It suggests that relationship types be declared as URIs,
with IANA as the single registry for relative URIs (e.g. next, prev, stylesheet etc.).

Formally, it proposes:

The Link header field is semantically equivalent to the <LINK> element in HTML, as well as the atom:link element in Atom [RFC4287].
    Link           = "Link" ":" #("<" URI-Reference ">"
                     *( ";" link-param ) )
    link-param     = ( ( "rel" "=" relationship )
                     | ( "rev" "=" relationship )
                     | ( "title" "=" quoted-string )
                     | ( link-extension ) )
    link-extension = token [ "=" ( token | quoted-string ) ]
    relationship   = URI-Reference |
                     ( <"> URI-Reference *( SP URI-Reference) <"> )
The title parameter MAY be used to label the destination of a link
such that it can be used as identification within a human-readable
menu.
Examples of usage include:
Link: <http://www.cern.ch/TheBook/chapter2>; rel="Previous"
Link: <mailto:timbl@w3.org>; rev="Made"; title="Tim Berners-Lee"
…
Relationship values are URIs that identify the type of link. If the
relationship is a relative URI, its base URI MUST be considered to be
"http://www.iana.org/assignments/link-relations.html#", and the value
MUST be present in the link relation registry.

Note that Link: responses can be put in HEAD and POST as well as GET
responses, and that they work just as well for 303 responses as for
200 responses.
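
To give a feel for what a client has to do with such a header, here is a rough parser for the grammar quoted above. It is a sketch under assumptions, not the draft's algorithm: the function names are invented here, backslash escapes inside quoted strings are not unescaped, and relative relationship names are expanded by simple concatenation with the IANA base, which is only one reading of the draft (strict RFC 3986 resolution against that base would give a different result).

```python
import re
from urllib.parse import urlsplit

IANA_BASE = "http://www.iana.org/assignments/link-relations.html#"

_LINK = re.compile(r'\s*<([^>]*)>')
_PARAM = re.compile(
    r'\s*;\s*([^\s=;,]+)\s*(?:=\s*("(?:[^"\\]|\\.)*"|[^\s;,]*))?')
_COMMA = re.compile(r'\s*,')


def parse_link_header(value):
    """Parse a Link header value into a list of (target, params) pairs.
    `params` maps lower-cased parameter names to lists of values."""
    links, pos = [], 0
    while pos < len(value):
        m = _LINK.match(value, pos)
        if not m:
            break  # malformed input: stop rather than guess
        target, params = m.group(1), {}
        pos = m.end()
        while True:
            p = _PARAM.match(value, pos)
            if not p:
                break
            name, raw = p.group(1).lower(), p.group(2) or ""
            if raw.startswith('"'):
                raw = raw[1:-1]  # strip quotes; escapes are left as-is
            params.setdefault(name, []).append(raw)
            pos = p.end()
        links.append((target, params))
        sep = _COMMA.match(value, pos)  # comma between list items
        if not sep:
            break
        pos = sep.end()
    return links


def relationship_uris(rel_value):
    """Expand a rel/rev value into relationship URIs. Names without a
    scheme are appended to IANA_BASE (an assumption of this sketch)."""
    return [r if urlsplit(r).scheme else IANA_BASE + r
            for r in rel_value.split()]
```

For instance, parse_link_header applied to the first example above yields one link whose params are {"rel": ["Previous"]}, and relationship_uris("Previous") then appends the name to the IANA base URI.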

Working code

The draft has continued to be discussed on the HTTP mailing list (see Specific headers for Specific Tasks below). As
part of this, Julian Reschke raised an issue concerning
current implementation of HTTP Link. Following on from that a test page was set up that uses
HTTP Link (and only HTTP Link) to associate a stylesheet. Firefox 2 and Opera 9 both apply the stylesheet (IE 7 doesn't).

Tabulator supports Link: with rel="meta".

Mozilla has another use for HTTP Link - to allow its browser to pre-fetch resources
in its idle time and so speed up page-rendering.

Specific headers for specific tasks

Brian Smith (active in the ATOM community) has suggested that
parsing HTTP Link Headers is hard and that a more efficient solution would be to create new application-specific headers. In
essence, make the relationship type part of the header. So, rather than

Link: <http://foo.org>; rel=edit

one would use:

Edit-Links: <http://foo.org>

Brian Smith says:

This could be done by changing the registration rules for HTTP headers so
that header fields with a "-Links" suffix must have the above syntax, with
the definitions of the "media", "type", and "title" parameters
fixed to be the same as in HTML 4 (or 5) and Atom 1.0. Each link header
would have to define the processing rules for when multiple links are
provided, and applications must be prepared to handle multiple links of the
same type, even when they are not expected (that is why I chose "-Links"
instead of "-Link").

The core advantage of this method appears to be the ease of parsing - you only take notice of headers you know you're interested in.
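
The parsing argument can be made concrete with a small sketch. The header name Edit-Links comes from the example above; the function name and the (name, value) header layout are assumptions, and link parameters after the URI are simply ignored here.

```python
def edit_links(headers):
    """Collect the targets of every Edit-Links header.
    `headers` is a list of (name, value) pairs."""
    targets = []
    for name, value in headers:
        if name.lower() != "edit-links":
            continue  # every other header is skipped without parsing
        # Naive split: assumes no commas inside the URIs themselves.
        for item in value.split(","):
            item = item.strip()
            if item.startswith("<") and ">" in item:
                targets.append(item[1:item.index(">")])
    return targets
```

A client interested only in edit links never has to look inside a generic Link header, or at its rel values, at all. The list return type reflects the point in the quotation that applications must be prepared for multiple links of the same type.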

Alternative HTTP request type: PROPFIND

WebDAV (RFC 4918)
defines a PROPFIND HTTP method for obtaining metadata.
The request details and response are both encoded in XML using
elements from the DAV namespace.
The RFC gives
several examples.
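
As a rough illustration of what such a request looks like, the sketch below builds a PROPFIND body and headers using only the Python standard library. The function names are invented here, and the two properties requested in the usage note (getlastmodified, getcontenttype) are just examples of standard DAV: properties.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"


def propfind_body(property_names):
    """Build a PROPFIND request body asking for the named DAV: properties."""
    ET.register_namespace("D", DAV)
    root = ET.Element(f"{{{DAV}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV}}}prop")
    for name in property_names:
        ET.SubElement(prop, f"{{{DAV}}}{name}")
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def propfind_request(path, property_names):
    """Return (method, path, headers, body), in a shape that could be
    passed on to http.client.HTTPConnection.request(method, path, body,
    headers)."""
    headers = {
        "Depth": "0",  # only the resource itself, not its children
        "Content-Type": 'application/xml; charset="utf-8"',
    }
    return "PROPFIND", path, headers, propfind_body(property_names)
```

Calling propfind_request("/papers/1.pdf", ["getlastmodified", "getcontenttype"]) produces a request for two properties of a single resource. The server's answer would be a multistatus XML document, which this sketch does not attempt to parse.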

New HTTP request type: MGET (URIQA)

Patrick Stickler has developed The URI Query Agent Model which proposes a new
HTTP method of 'MGET' that returns a concise bounded description of the resource available at the given URI. The full paper suggests
support for adding to and deleting from those descriptions with MPUT and MDELETE.
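
On the wire, the difference from an ordinary retrieval is small. The sketch below assumes, following the description above, that MGET is sent to the same URI as GET and differs only in the method token; the helper name and the example URI are hypothetical.

```python
from urllib.parse import urlsplit


def build_request(method, uri):
    """Return the raw bytes of a minimal HTTP/1.1 request for `uri`."""
    parts = urlsplit(uri)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [f"{method} {target} HTTP/1.1", f"Host: {parts.netloc}", "", ""]
    return "\r\n".join(lines).encode("ascii")


get_request = build_request("GET", "http://example.org/doc")
mget_request = build_request("MGET", "http://example.org/doc")
```

The two requests are identical apart from the first token, which is why the approach needs no second URI but does need servers and intermediaries that accept the new method.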

URIQA is fully developed and implemented in the Nokia URIQA semantic web service. Patrick Stickler includes a good summary of
several of the arguments surrounding this issue - such as why not use conneg, why not use HTML's link element and so on.

The URIQA approach has some similarities with PICS which defines an HTTP header of 'Protocol-request' which is sent with a
GET request when seeking a PICS label describing a resource.

URI manipulation

Instead of going out to a server for URI1 to obtain the name URI2 of a
second resource that carries metadata for URI1, we could adopt a client-side
convention for obtaining URI2 systematically from URI1. There is a
faint resemblance here to favicon.ico and robots.txt, although those
are site-specific secondary URIs instead of resource-specific
secondary URIs. Here are two rules that have been suggested in
particular contexts. It might be possible to pursue this idea to
generalize beyond either of these cases.

/about/

For a given URI http://a.b/c/d/e, construct a new URI http://purl.org/about/a.b/c/d/e

Configure the purl.org server so that
http://purl.org/provide-about/a.b/c/d/e redirects to something akin to
a structured wiki page or a REST service. (Let us assume for the moment
that whoever currently provides the LSID WSDL that contains this
information is the provider of this service.)

Archival Resource Key (ARK)

The Archival Resource Key has been developed by the California Digital Library.
To obtain a metadata link for an ARK one simply appends "?" to the URI.
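
Both rules are pure string transformations that a client can apply without contacting the origin server first. A minimal sketch, in which the function names and the example ARK are hypothetical:

```python
from urllib.parse import urlsplit

ABOUT_PREFIX = "http://purl.org/about/"


def about_uri(uri):
    """/about/ rule: drop the scheme and prefix what remains.
    Note that the scheme of the original URI is lost in the process."""
    parts = urlsplit(uri)
    rest = parts.netloc + parts.path
    if parts.query:
        rest += "?" + parts.query
    return ABOUT_PREFIX + rest


def ark_metadata_uri(ark_uri):
    """ARK rule: append "?" to the ARK's URI."""
    return ark_uri + "?"


print(about_uri("http://a.b/c/d/e"))
# prints http://purl.org/about/a.b/c/d/e
print(ark_metadata_uri("http://example.org/ark:/12345/x6np1wh8k"))
# prints http://example.org/ark:/12345/x6np1wh8k?
```

The /about/ rule depends on a third-party server, while the ARK rule keeps the metadata URI on the original host; in both cases the convention has to be known to the client in advance.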

Put it in the content of the response

One way to uniformly transmit metadata is to designate one or more
particular representation types (media types) to be the one(s)
that are supposed to carry the resource's metadata. We could decide,
for instance, that among a
resource's many representations, if there is to be metadata, the
metadata should reside in the RDF/XML representation (another
alternative would be HTML, using <link> elements perhaps). If no
RDF/XML representation exists, it should be created for the purpose of
carrying the metadata.

Unfortunately this approach is likely to clash with the idea that if
there are multiple representations then they should all carry the same
"abstract information". The software responsible for providing the
metadata is unlikely to be competent at translating arbitrary media
types (e.g. a compressed "tar" file containing Erlang code with French
comments) into HTML or RDF/XML.

Using a multipart media type has also been suggested. Metadata could
go in one part, and the true content in another. Among the relevant
documents turned up by a web search for "multipart metadata" is
a 1999 IETF Internet-draft proposing an "ancillary" value for the
multipart Content-disposition header, addressing exactly the present need.
We have no information on the potential viability of this approach.
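
For illustration only, a two-part message of this kind can be assembled with Python's standard email package. The function name is invented here, the "ancillary" disposition value is taken from the description of the Internet-draft above and is not, as far as we know, a registered value, and a real HTTP response would need its own framing rather than e-mail conventions.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart


def bundle(content, content_subtype, metadata_rdf):
    """Wrap content bytes and RDF/XML metadata bytes in one multipart
    message: metadata first, true content second."""
    msg = MIMEMultipart("mixed")
    meta = MIMEApplication(metadata_rdf, _subtype="rdf+xml")
    # "ancillary" follows the draft mentioned above (assumed, unregistered).
    meta.add_header("Content-Disposition", "ancillary")
    body = MIMEApplication(content, _subtype=content_subtype)
    msg.attach(meta)
    msg.attach(body)
    return msg
```

A client that understands the convention reads the first part for metadata. A client that does not is left with a multipart body it may be unable to render, which is one reason the viability of the approach is unclear.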

Others

Put the information off site, in external metadata repositories,
brokers, forwarding services, etc. In the library world metadata is
never the business of the information provider (publisher, printer,
etc) - you can't rely on them to care, to do the right thing, or even to have
the necessary information. (Mackenzie Smith)

Use a search engine (Google, or a hypothetical Semantic Google) to
find the information you're looking for. (Roy Fielding)

Issues

More than one commentator has stated that it is inappropriate, or
even wrong, to use HTTP headers to transmit important information of
this kind. It's not what GET is for.

Mike Linksvayer of Creative Commons points out that some metadata,
such as licensing information, needs to stay very close to
the content. (This is why a copyright notice is printed inside a book
instead of on the dust jacket.)
For this reason he would discourage any mechanism that
would facilitate separation of metadata from content. Were such
a mechanism to become available, he
would discourage its use when an acceptable alternative exists.

Mechanisms that require special server configuration may not be
accessible to all author/publishers; there will be an interaction
with the way content is hosted and so on. (E.g. Pat
Hayes's message.)

Mechanisms that require special client configuration in order to
access metadata are also
problematic. For this reason it is urged that when
metadata (or link to same) can be communicated via the
content, it should be, even when this is redundant with other channel(s).

On the other hand, if information is communicated in multiple ways,
there are greater chances for inconsistency and confusion. Clients
may be forced to decide who they trust more, the entity generating
the metadata (link) or the entity that originated the content.
Certain applications may be forced to decide which information
source is more "authoritative".

A concern has been raised about potential confusion over whether
the metadata pertains to the resource or to one particular
representation. For example, a list of keywords (or even
authors) might be
specific to a language variant or to a particular draft, while a
permanence policy might be meant to apply to all
"representations" varying across time, language, and format.
Different mechanisms may have different implications, e.g. putting
metadata in a representation might predispose providers and clients
to thinking that metadata is representation-specific while

200/303 symmetry:
There is an appeal to being able to access pertinent
information (metadata, documentation, description)
in a uniform way that is (a) insensitive to the question of whether the
resource is an information resource or not and also (b) completely
orthogonal to content negotiation.

Some details of Mark's Link: header proposal have been
critiqued.

It
is observed that the relationships listed for initial
inclusion in the registry do not include "meta" in spite of its
informal use beginning with early RDF specifications (1989??); this should be
fixed.

The use of a default base URI for relationship
URIs has been criticized as not reflecting semantic web best practices -
biasing to the IANA registry might steer people away from expressing
the correct relationship (between resource and meta-resource) in favor
of a relationship that has a short name.

N.b. as Link: and <link> are supposed to be alternative
delivery mechanisms for the same kind of information, any
changes to HTTP Link: need to be coordinated with HTML <link>
and vice versa.

As metadata standards and repositories are a big deal outside the
Web, uninvestigated efforts such as ISO's may merit attention.