Eric,
I had some discussions with Eric Miller, Bertram Ludaescher and others
after the meeting and, in the text below, I've tried to summarize some
of that discussion. While I want to acknowledge their contributions and
they may recognize some of their words below, I've made enough changes
to summarize multiple emails, recall conversations, and add new material
and references, that it shouldn't be taken as properly representing
anyone's position or a consensus of any sort (not sure our discussion(s)
ever reached a conclusion...). Nevertheless, I thought it would be
useful to add it into this discussion as background information.
Jim
James D. Myers
Chief Scientist, Computational Sciences and Mathematics Division
Computational and Information Sciences Directorate
Pacific Northwest National Laboratory
Phone: 610-355-0994
Fax: 208-474-4616
Jim.Myers@pnl.gov
LSID discussions: There are three main sections below - issues, general
comments, and some potential alternatives/directions.
Issues:
The "LSID" name:
Are life science identifiers different enough that they need to be
treated separately? Do we then need a physical science identifier, a
computer science identifier, etc.?
LSID as a protocol as well as a name:
Similar issue, but one that can also be described as death-by-plugins -
if everyone who wants to control a namespace for identifiers makes a new
protocol requiring a plug-in...
Persistence policy as part of the name/protocol:
Is persistence such a unique and overriding piece of metadata that it
should be part of the name and/or require a separate protocol? Does the
name of data change when a researcher decides it is valid and should be
kept forever? There seem to be problems analogous to the 'don't encode
location in the name because it might move' issue.
Persistence policy as a binary option:
There are many shades of grey in persistence - How long is the
guarantee? What happens to data with a 5, 10, or 50 year retention
schedule, after which it is to be deleted? Is access also guaranteed, or
just unique naming? Is the guarantee best effort? Does it apply to bits
or an 'equivalent' (by whose definition?) item, e.g. the PDF copy of an
obsolete MS Word 1.0 document? Is persistence policy handled better as
metadata defined by a schema(s)?
Metadata retrieval as part of a persistent identifier protocol:
Is metadata unique to persistent resources? Is there a reason to
balkanize metadata access by tying the mechanism to a type of resource?
Or should the semantic web provide a mechanism allowing metadata
association with 'any' resource, persistent or not, via a standard
mechanism?
General Commentary:
1) A model for naming resources that a community can agree on is
a good / powerful thing; LSID has defined such a model and has a large
growing community behind it.
Yes, but...
the issues above could limit growth and lead to fragmentation of
the community as it raises awareness of what globally unique IDs can do
and encourages other "my community's ID" protocols, and/or modifications
that attempt to get around the issues noted above. Will chemists all
adopt LSID simply because some of the molecules they work on are related
to biology rather than materials science? Will a pharmaceutical company
adopt LSID for data with retention schedules?
2) Persistence identification and the ability to persistently
resolve names are not artifacts of any technology - they are an
organization / community investment. It is unclear what investment the
LS community has at this point for supporting resolution services (DNS,
HTTP, or other).
Shouldn't expectations of persistence be managed by
naming convention rather than protocol - http://persistent.my.org/
addresses or the use of Handle-style/meaning-free URLs (e.g.
http://456.10123.name.org/myname - see below)? The convention of "www.*"
for web servers seems to have worked very well for conveying the
expectation that these machines support HTTP.
3) The non-http URI approach requires an extra level of
infrastructure for resolving objects. For use in browsers this requires
an additional plug-in. There seem to be very few available; and then
only on certain browsers. Further, I don't think many realize that
browsers are perhaps 1/10th of the applications that follow links (e.g.
robots, etc.), and this is a different issue completely, one that the
DOI / publishers are unfortunately finding out at this very moment.
A Handle-style proxy mechanism helps a bit here, but it is
certainly not as clean/clear as specifying HTTP redirect as *the*
resolution mechanism.
4) Non-http URIs put up barriers to adoption by other
communities. There are reasons (sometimes) to do this, but has this been
explored for LSID and the implications understood?
And since science is becoming more interdisciplinary, the
protocol really needs to be science-wide or pervasive even if namespaces
are controlled by smaller orgs.
5) The LSID community has socially agreed that the use of LSID
will point to an immutable resource - the thing one points at will be
the same 5, 10, n years later. How can this be enforced socially or
technically? What's the penalty for reusing an LSID? If the LSID, bits
to persist, and the hash are all owned by one organization, the bits and
hash could be changed together.
This requirement is science-wide - it's been the argument
against allowing any URLs as references in the literature, and everyone
is moving to treat data in the same way. Life science is ahead in the
number of individual data items to be tracked and in how large the
community is that needs to persistently refer to things, hence they have
the biggest problem right now, but everyone in science (and beyond) has
it at some level. Socially, it isn't clear that LSID provides any more
leverage than, for example, a naming convention as in #2. Technically,
without a means to make name/hash pairs non-repudiable (e.g. by
registering them with a neutral third party or using a digital
signature), LSID cannot detect reuse of names.
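
As a rough sketch of the third-party registration idea (this is not part of the LSID spec; the NameRegistry class and the example LSID are made up for illustration), a neutral registry could remember the hash first seen for each name and flag any later attempt to bind that name to different bits:

```python
import hashlib


class NameRegistry:
    """Hypothetical neutral third-party registry of name/hash pairs."""

    def __init__(self):
        self._hashes = {}  # name -> hex digest first registered for it

    def register(self, name, data):
        """Record the SHA-256 digest of data under name.

        Returns True if the name is new or the bits match the earlier
        registration, and False if the name is being reused for
        different bits.
        """
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first digest ever registered for a name.
        first = self._hashes.setdefault(name, digest)
        return first == digest


registry = NameRegistry()
lsid = "urn:lsid:example.org:samples:1234"  # made-up identifier
first_ok = registry.register(lsid, b"original bits")
same_ok = registry.register(lsid, b"original bits")
reuse_ok = registry.register(lsid, b"changed bits")
```

Because the registry is held by someone other than the naming authority, the authority cannot quietly change the bits and the hash together; a digital signature over the name/hash pair would be another way to get a similar effect.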
6) It is unclear how best to use LSID; more specifically *when*
to use it and when *not* to. There was talk at the meeting of using
these for documents, reports, concepts declared on the Semantic Web,
etc.
There's a slippery slope here and it will be hard to have a
clear convention. I may want to name my raw data, the average of my raw
data, a calibrated version of my data, my latest/best data, a graph of
my data, the paper about the data, etc. From various discussions of
versioning, it is clear that there are use cases that need to
name/expose both the individual versions and the 'latest' version,
whatever number that currently is, which means bit-level persistence
will probably not meet all life-science needs, which may lead to 'abuse'
of LSIDs with 0-byte data to refer to things with dynamics.
7) Is LSID bad?
No. The level of adoption of LSID is impressive (though it isn't
clear how much of that is simply attaching LSIDs for future use versus
actively producing and consuming them). While the discussions at the
Semantic Web for Life Sciences workshop were negative at times, one
should not criticize LSIDs without acknowledging that they are a step
forward and are definitely enabling and educating the community.
However, the semantic web and the life sciences will need more general
mechanisms for naming and associating metadata with resources, and a
means to provide more detailed persistence information; promoting LSIDs
as a short-term solution may not be the best option if progress on these
issues can be made quickly.
Potential Alternatives:
Naming:
The Handle System - similar to LSID, with its own protocol and resolution
mechanism. Used in DOIs. Has a proxy mechanism so no plug-in is required
- http://hdl.handle.net/<some-handle> will invoke a resolver service
and redirect you to the resource. The Handle System has its own protocol
with its own metadata methods and thus shares those issues with LSIDs,
but its proxy and the fact that the protocol and namespaces are separate
(i.e. the LSID community could organize part of handle space for
themselves) seem like advantages over LSID. Handles are also being
proposed as part of the Grid naming mechanism (see
http://www.globusworld.org/program/abstract.php?id=33,
https://forge.gridforum.org/projects/ogsa-wg/document/draft-charter-naming-wg/en).
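
To make the proxy idea concrete, here is a small sketch of how a client could turn a handle into an ordinary HTTP URL that any browser or robot can follow. The proxy_url helper and the example handle are made up for illustration; only the hdl.handle.net proxy address comes from the discussion above.

```python
from urllib.parse import quote

HANDLE_PROXY = "http://hdl.handle.net/"


def proxy_url(handle):
    """Build the HTTP proxy URL for a handle of the form prefix/suffix."""
    # Keep the "/" between prefix and suffix; percent-encode anything
    # that is not safe to place in a URL path.
    return HANDLE_PROXY + quote(handle, safe="/")


# Made-up handle, used only to show the encoding.
url = proxy_url("10123/my data set")
```

The client needs nothing beyond plain HTTP; the proxy does the Handle-protocol lookup and answers with a redirect.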
Persistent URLs - standard URLs maintained by authorities that use HTTP
Redirect to provide access to resources. The PURL website has extensive
documentation and FAQ information: http://purl.oclc.org
Naming convention only - Use standard URLs and DNS resolution.
Resolvers/authorities could be identified via a convention such as
addresses starting with "uid", e.g. http://uid.my.org/. If URIs used as
persistent names are "meaning-free" addresses, e.g.
http://456.10123.name.org/myresourcename, it would be easy to
transfer resolution duties between organizations, i.e. to reassign
10123.name.org from my organization to yours if my org doesn't want to
maintain things anymore. Use HTTP redirects as a resolution mechanism.
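
A minimal sketch of what such a convention-only resolver might do, assuming a simple lookup table (the RESOLVER_TABLE contents, the target location, and the resolve function are all hypothetical): the meaning-free name is looked up and answered with an HTTP redirect to wherever the resource currently lives.

```python
from urllib.parse import urlsplit

# Hypothetical resolver table: meaning-free host -> {name: current location}.
# Handing 10123.name.org to another organization only means pointing its
# DNS entry at a server that holds this table.
RESOLVER_TABLE = {
    "456.10123.name.org": {
        "myresourcename": "http://data.my.org/archive/2005/run42.dat",
    },
}


def resolve(url, table=RESOLVER_TABLE):
    """Return the (status, headers) a resolver might send for a name."""
    parts = urlsplit(url)
    names = table.get(parts.hostname or "", {})
    target = names.get(parts.path.lstrip("/"))
    if target is None:
        return 404, {}
    # 302 (temporary) so clients keep citing the persistent name rather
    # than the current location.
    return 302, {"Location": target}


status, headers = resolve("http://456.10123.name.org/myresourcename")
```

Moving the data only requires updating the table entry; the cited name stays the same.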
Metadata:
Protocols such as LSID and The Handle System have their own extensible
metadata mechanisms. For URL-based options, there are proposals for ways
to add metadata capabilities to URLs:
The Nokia MPUT/MGET/MDELETE methods proposed as part of their URI Query
Agent Model (URIQA) (http://sw.nokia.com/uriqa/URIQA.html). GET/POST
mechanisms for requesting/setting metadata about third-party resources
are also defined. URIQA defines the concept of a Concise Bounded
Description of a resource (http://swdev.nokia.com/uriqa/CBD.html) as the
set of RDF statements accessible via these methods.
Clark et al. propose an alternate mechanism using XPointer and HTTP in
"A Semantic Web Resource Protocol: XPointer and HTTP"
(http://www.mindswap.org/papers/swrp-iswc04.pdf).
Persistence Policy:
With any of these naming and metadata combinations, persistence could be
treated in the same way as other metadata - statements about persistence
policy could be standardized and accessed via the same mechanism used to
discover authors, type, creation date, provenance, etc. Persistence
policy could be as simple (binary) or as complex (retention schedules,
definition of identity/equivalence used, ...) as desired by various
sub-communities.
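
As an illustration only, persistence statements of this kind could be carried as an ordinary metadata record. The PersistencePolicy class and its field names below are assumptions made up for this sketch, not an existing schema; they simply mirror the questions raised under "Persistence policy as a binary option" above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PersistencePolicy:
    """Hypothetical persistence statements attached to a resource."""

    name_is_unique: bool = True          # name will never be reassigned
    access_guaranteed: bool = False      # resolution promised, not just naming
    best_effort: bool = True             # guarantee is best effort only
    equivalence: str = "bits"            # "bits" or e.g. "rendered-content"
    retain_until: Optional[date] = None  # None means no scheduled deletion

    def is_retained_on(self, day):
        """True if the retention schedule still covers the given day."""
        return self.retain_until is None or day <= self.retain_until


# The simple, "binary" case: unique name, nothing else promised.
simple = PersistencePolicy()

# A richer case: guaranteed access under a fixed retention schedule.
scheduled = PersistencePolicy(
    access_guaranteed=True,
    best_effort=False,
    retain_until=date(2015, 3, 14),
)
```

A sub-community that only cares about the binary case can ignore the extra fields, while one with retention schedules can query them through the same metadata mechanism.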
Additional URLs:
Handles: www.handle.net
Tim B-L musings on names from '96:
http://www.w3.org/DesignIssues/NameMyth.html
Meaning-free DNS names:
http://www.frankston.com/public/essays/DNSSafeHaven.asp
Comparison of Handles and PURLs (by a Handle advocate?):
http://web.mit.edu/handle/www/purl-eval.html
LSID spec: http://www.omg.org/docs/dtc/04-05-01.pdf
"Persistent Indentification (sic): A Key Component of an
E-Government Infrastructure, Updated July 26, 2004" - discusses PURLs
and Handles and other alternatives:
http://cendi.dtic.mil/publications/04-2persist_id.html
-----Original Message-----
From: public-semweb-lifesci-request@w3.org
[mailto:public-semweb-lifesci-request@w3.org] On Behalf Of
Eric.Neumann@sanofi-aventis.com
Sent: Monday, March 14, 2005 6:29 PM
To: public-semweb-lifesci@w3.org
Subject: LSID: What's still needed to make it work within the
semantic web?
We had some very productive discussions on the value of the LSID
specification at the workshop in October, and many of us would like to
see it reach a functional conclusion. Much of the discussion was around
what still needs to be done with the specification, so that LSIDs
become a beneficial and practical element of the life science community.
I would like to suggest those interested in seeing the LSID
specification come to completion, to participate in this thread, and try
and define some critical next steps for its success in being adopted by
most data sources.
I would also recommend people to re-read the 3 position papers
on LSID from last October's workshop:
http://www.w3.org/2004/07/swls-agenda.html . Steve Chervitz's paper from
Affymetrix has some very useful insights in it that I think many would
appreciate.
To quickly review, LSID offers both a unique identifier model
for authoritative life science data, and a mechanism by which they can
be resolved to actual (immutable) data bytes and metadata (mutable).
Some lingering questions include:
* What metadata accessible through LSID should be
standardized; this may be more about general info-descriptive semantics
like Dublin Core and RSS, than biological or chem semantics.
* A precise way to handle versioning, derivation, some
other relationship types for provenance
* Are URN-aware resolvers an acceptable means for data
retrieval for all members of the life science community? Are there any
alternatives that are simpler?
* Guidelines for encoding data for common bioinformatics
data types in LSID; are we all clear what is data and what is metadata?
Would this include all kinds of RDF graphs that relate to the original
data item? Do we need best practices on utilizing common ontologies such
as GO within a data entry?
* How to specify Dynamic data (latest version) effectively
(minimal http calls of LSIDs)
I hope other members of the LSID specification are able to
participate on this thread, to help clarify the issues, and identify
where most value can be gained.
Eric