ADS Dataset Verification and Resolution Services

This document contains information about the Dataset Verification and
Linking efforts underway among the
NASA Archives and Data Centers and the University
of Chicago Press (publisher of ApJ, AJ and PASP).
This activity has taken place
under the auspices and guidance of the
NASA Astrophysics Data Centers
Executive Council (ADEC)
and aims at fulfilling the promise
of further integrating astronomical literature and the on-line data
it is based upon.

The NASA Astrophysics Data System
(ADS) will provide the tools necessary to publishers and users
at large for both dataset verification and linking through
stable, top-level services that can be maintained for the foreseeable future.
Links created to datasets from on-line manuscripts will always refer to
a dataset via a URI created using a well-defined identifier, and the
URI will be turned into one or more URLs in real-time by a central
resolver provided by ADS.
This will provide a high level of reliability and persistence to the links,
as well as providing an upgrade path into any future VO efforts in this
direction.

Overview

Dataset citation, verification and linking will work as follows:

Astronomy data centers and archives will start attaching
permanent dataset identifiers to the data they distribute.

Astronomers will write papers referencing the datasets they have
used. Following the instructions given to them by the AAS, they will
start using the appropriate markup to identify datasets in the papers.

During the publishing pipeline, UCP will extract the identifiers
and send a query to a central dataset identifier service (hosted by
ADS) to find out if (a) the dataset is valid and (b) a URL can be
associated with it.

The central dataset identifier verification service will query a number of
(relevant) datacenters using its own protocol, will cache the results,
and will return a status flag to the UCP query, indicating if a
dataset is known or not.

For the dataset identifiers that are known, URLs can be built
by using the base URL of a dataset identifier resolver and the dataset
identifier itself, e.g.
http://vo.ads.harvard.edu/dv/DataResolver.cgi?ADS/Sa.ROSAT#X/701576n00
If the verification is successful, UCP should include such a URL in its
on-line article.
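The URL construction above can be sketched in a few lines of Python. This is an illustration, not the actual UCP pipeline code; in particular, percent-encoding the '#' inside the identifier (which a browser would otherwise read as a URL fragment separator) is an assumption about what the resolver accepts:

```python
# Sketch of building a resolver URL from a dataset identifier.
# The base URL is taken from the example above; percent-encoding the
# '#' is an assumption about what the resolver accepts, not part of
# the specification.
from urllib.parse import quote

RESOLVER = "http://vo.ads.harvard.edu/dv/DataResolver.cgi"

def resolver_url(dataset_id):
    """Return the resolver URL for e.g. 'ADS/Sa.ROSAT#X/701576n00'."""
    return RESOLVER + "?" + quote(dataset_id, safe="/")

print(resolver_url("ADS/Sa.ROSAT#X/701576n00"))
```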

When the article goes on-line, a user clicking on the link
associated with the dataset will be taken initially to the URL above.
What happens next depends on whether ADS has one or more datacenters
claiming to have data related to this dataset (there could even be
different mirror sites for a given data center). If only one final
URL is available for the dataset in question, our cgi script can
simply forward the user to it. If more than a single URL is
available, we could display a simple menu listing all the information
we have about the available links.
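The resolver's dispatch logic described above might be sketched as follows. The function and status names are hypothetical; the actual cgi script is not reproduced here:

```python
# Hypothetical sketch of the resolver's dispatch logic.  'locations'
# stands in for whatever the resolver knows about which datacenters
# (and mirror sites) hold the dataset.
def resolve(dataset_id, locations):
    """locations: list of (datacenter_name, url) pairs for dataset_id."""
    if not locations:
        return ("error", "unknown dataset identifier: " + dataset_id)
    if len(locations) == 1:
        # a single holding: forward the user straight to it
        return ("redirect", locations[0][1])
    # multiple holdings: present a menu listing all available links
    return ("menu", [url for _name, url in locations])
```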

ADS will take the responsibility of maintaining services that are
aware of all relevant datacenters that may have datasets available
on-line, and which datasets are available from which data centers.

Dataset Identifiers

In order to allow easy integration of this effort in the emerging VO
framework, the ADEC has decided to adopt a syntax for the dataset
identifiers which is consistent with the current
IVOA
Identifier Proposed Recommendation
(Plante et al 2003).
This adoption will facilitate integration of these identifiers and the
tools that manipulate them in the VO.

IVOA Identifiers

According to the IVOA Identifiers Draft, the general URI format
for an individual dataset identifier is a string of the kind:

ivo://AuthorityId/ResourceKey#PrivateId

While we refer the reader to the recommendation
for a full explanation of the syntax, a few things are worth pointing out:

Use of the ivo:// scheme denotes the fact that the rest of the
identifier should be interpreted as a string abiding by the
IVOA Identifiers specification, and that
the identifier and the resource it refers to have been registered with
an IVOA-compliant registry.

AuthorityId is a naming authority registered within the IVOA
community; the use of this string within the identifier
establishes a namespace within which the rest of the identifier
can be considered unique. In general, the AuthorityId does not need
to correspond to a specific institution but rather to an entity that has
been granted use of the namespace.

ResourceKey is a name for a resource that is unique within the
namespace established by the AuthorityId. In general it will
correspond to a unique resource made available to the VO by or on
behalf of the AuthorityId. A typical example of a ResourceKey
in this context is a data collection generated by a particular project
or mission.

PrivateId represents a unique string within the ResourceKey and it
denotes a particular dataset belonging to the collection.
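The three components can be pulled apart mechanically. As an illustration, a simplified parser (this regular expression is a sketch, not the full grammar from the IVOA recommendation):

```python
import re

# Simplified pattern for a dataset-level identifier of the form
#   ivo://AuthorityId/ResourceKey#PrivateId
# The full grammar is given in the IVOA Identifiers recommendation.
IVO_ID = re.compile(r"^ivo://([^/#]+)/([^#]+)#(.+)$")

def parse_ivo_id(uri):
    m = IVO_ID.match(uri)
    if m is None:
        raise ValueError("not a dataset-level IVOA identifier: " + uri)
    authority, resource_key, private_id = m.groups()
    return {"authority": authority,
            "resource_key": resource_key,
            "private_id": private_id}
```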

Using Dataset Identifiers in the Literature

Given the fact that much of the VO infrastructure is still under
design and development, the ADEC has decided on a specific recommendation
for referring to dataset identifiers in the astronomical literature.
The general form of these identifiers is:

ADS/FacilityId#PrivateId

Comparing these identifiers with the general IVOA syntax we can make
the following observations:

No protocol scheme has been specified. This is due to the fact
that until IVOA-compliant registries are available, and AuthorityIds
can be established by them, it would be incorrect to claim that these
identifiers are in fact IVOA compliant.
However, it is to be expected that these identifiers can be resolved
as IVOA identifiers in the not too distant future by a simple
syntactic operation.

The AuthorityId string "ADS" has been specified. This simply
recognizes the current role of ADS in managing the namespace used for
these identifiers, in the
absence of a community-wide namespace granting authority. It does not
suggest nor imply that ADS controls or manages the dataset itself.

The ResourceKey token will be interpreted as a Facility. An
ever-growing list of facilities is
maintained by ADS. Data centers should contact ADS should they need
to register new entries.

The PrivateId string can be anything that the data center
desires, with the provision that the identifier string as a whole
should abide by
the general syntax of a URI,
as required by the IVOA identifiers specification.
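The "simple syntactic operation" mentioned above amounts to prefixing the scheme. A sketch, assuming the ADS namespace is eventually registered unchanged with an IVOA-compliant registry:

```python
def ads_to_ivo(ads_id):
    """Turn 'ADS/FacilityId#PrivateId' into an IVOA-style identifier.
    Assumes the 'ADS' authority will be registered as-is with an
    IVOA-compliant registry (a hypothetical future step)."""
    if not ads_id.startswith("ADS/"):
        raise ValueError("expected an identifier in the ADS namespace")
    return "ivo://" + ads_id
```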

Generating Dataset Identifiers

All Data Centers and Archives which provide public access to their
data should structure their databases and interfaces so that when a
particular dataset is released to the public, it is uniquely tagged by
an identifier ID created as discussed above. Users who download one
of such datasets should be made aware of the identifiers associated
with it and how it should be referenced in the published literature.

In order for a datacenter to ensure that the identifiers it is
generating comply with the syntax endorsed by the ADEC, the following
conditions must be met:

The PrivateId is a unique identifier within the
FacilityId, and its association with the dataset will not change.

A profile for the datacenter has been registered with the
ADS, and in it FacilityId has been listed as
one of the resources that the center has data for.

The datacenter provides a dataset verification
service which will be used to verify the validity and location
of identifiers published in the literature.

Once a datacenter has published a dataset ID, it should provide access
to it. Ideally this will be a human-readable page on its web server
displaying the dataset's relevant metadata and offering the user the
option to download the dataset itself in some form or fashion. It is
left up to the datacenter to decide what to do if and when a revised
version of a particular dataset is published. In general, however,
it is understood that access to the latest revision of a dataset should
be an option if not the default.

Registering a Datacenter Profile

In order for ADS to coordinate the verification and linking of dataset
identifiers to the appropriate datacenters, it is necessary for the
datacenters to provide some basic metadata about their data holdings
and services. While it is expected that the appropriate metadata will
one day be made available by a public VO registry, its format and
access methods are at this time not available. As an intermediate
solution to the problem, we require that the data centers maintain
a simple profile which will provide ADS with the necessary metadata.

The data center profile is a simple XML document that lists the
data center name and description, the name and email address of
the person responsible for the maintenance of the profile,
the URL of the web service to be used for dataset verification,
and the list of facilities that the datacenter has data for.
For more information, please see
this simple example.
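A profile along these lines might look like the following. The element names here are illustrative only; the example linked above defines the actual format.

```xml
<!-- Hypothetical profile sketch; element names are illustrative,
     not the actual schema. -->
<datacenter>
  <name>Example Data Center</name>
  <description>Archive of data from the Example mission.</description>
  <maintainer>
    <name>Jane Doe</name>
    <email>jdoe@example.edu</email>
  </maintainer>
  <verificationService>
    http://archive.example.edu/cgi-bin/verify.cgi
  </verificationService>
  <facilities>
    <facility>Sa.ROSAT</facility>
  </facilities>
</datacenter>
```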

Two options are available for creating and maintaining such a profile
document:

Create the appropriate XML document and make it available
at a stable URL on your web site.

Install the ITWG SOAP data verification toolkit that has a built-in
option to generate such a profile when invoked with the proper syntax
(please see below for more details).

Once the datacenter profile document has been created, the person responsible
for it should let ADS know
of the profile's location by submitting its URL to the
Datacenter profile
registration form.
ADS will review the profile and merge it into its list of datacenters
to be used for the verification of dataset identifiers. Also, ADS will
periodically harvest the URL corresponding to the datacenter's
profile and will update its list of datacenters and supported facilities
accordingly. Once registered, a datacenter can update its profile
without further intervention from the ADS.

Providing Data Verification Capabilities

In order to promote an open framework that can be used for the
distributed verification of dataset identifiers across data centers,
the ADEC ITWG (Interoperability Technical Working Group) has created the
specification for a SOAP-based web service. The corresponding
WSDL file can be used to
generate client and server interfaces to the service.
Each datacenter
providing data verification services should provide and maintain a
service that abides by this specification.
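Conceptually, the operation behind such a service maps a batch of identifiers to status flags. Here is a hedged sketch of the server-side logic; the function name, status strings, and holdings table are assumptions for illustration, not the interface defined by the ITWG WSDL:

```python
# Hypothetical server-side logic behind a dataset verification service.
# HOLDINGS stands in for the datacenter's own catalog of released
# datasets, keyed by facility.
HOLDINGS = {
    "Sa.ROSAT": {"X/701576n00"},   # facility -> set of known PrivateIds
}

def verify(dataset_ids):
    """Map each 'ADS/FacilityId#PrivateId' string to a status flag."""
    result = {}
    for ds in dataset_ids:
        try:
            prefix, private_id = ds.split("#", 1)
            _authority, facility = prefix.split("/", 1)
        except ValueError:
            result[ds] = "malformed"
            continue
        known = private_id in HOLDINGS.get(facility, ())
        result[ds] = "found" if known else "not found"
    return result
```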

ADS provides a central verification service that fans out queries
to the appropriate datacenters. A diagram showing the architecture
of the system and the datacenters currently providing the verification
services (as of mid-2003) is
available.

To facilitate the deployment of verification services, the
ADS also developed a
PERL toolkit
that greatly simplifies the
creation of a compliant web service.
Among other things, by defining a few variables and installing a simple CGI
script based on this toolkit, you can automatically generate the
site profile described above.
For more information, please see the
README file.

Miscellanea

What follows is a list of additional resources and information
regarding this effort.
Please bear in mind that this is a work in progress, so some of the
things listed below may not yet be working as intended.

A Dataset Identifier Verification
form that can be used to find out whether a particular identifier
is known to any of the datacenters that participate in the dataset
verification "network." This is a very bare-bones implementation of
what will become the standard way for a user to perform the
verification (via a web browser) and for the publishers to perform
batch-verification during manuscript preparation.

The following resources are mostly of interest to the Data Centers
maintainers:

A PERL SOAP Toolkit
that can be used by Data Centers to set up a dataset verification
service. The README file
contains some information about installation and deployment.

The WSDL file for the ADS test
SOAP service that performs dataset verification. Each datacenter
providing data verification services should have a similar profile
(except of course for the location of the service itself).

A number of things still remain to be done:

Bring on-line verifiers for all NASA centers.

Enhance the central data verifier by threading the fan-out queries.

Have the ADS central data verification script cache the
verification results so that we don't need to constantly query all
datacenters for every verification task.

Create a simple data verification resolver that can act
as a central proxy to all the instances of a verified dataset, as
described above. This is a trivial task once the caching of results
is implemented.

Update ADS's central dataset verification service so that we can use a
standard API for it (I'm thinking of using the same SOAP service that is
used to verify identifiers with the individual data centers).
Obviously, this needs to be worked out between ADS, UCP and other
publishers and can become a simple system upgrade once all the other
pieces are running.
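Two of the items above (threading the fan-out queries and caching the results) can be sketched together. This is illustrative only; query_datacenter is a stand-in for the actual per-center SOAP verification call:

```python
# Illustrative sketch of a threaded fan-out verifier with a result
# cache.  query_datacenter(center, dataset_id) -> bool stands in for
# the real per-datacenter SOAP call.
from concurrent.futures import ThreadPoolExecutor

def make_fanout_verifier(datacenters, query_datacenter):
    cache = {}  # dataset_id -> list of datacenters holding it

    def verify(dataset_id):
        if dataset_id in cache:          # answer repeats from the cache
            return cache[dataset_id]
        # query every datacenter concurrently instead of one at a time
        with ThreadPoolExecutor(max_workers=len(datacenters)) as pool:
            answers = list(pool.map(
                lambda dc: (dc, query_datacenter(dc, dataset_id)),
                datacenters))
        holders = [dc for dc, found in answers if found]
        cache[dataset_id] = holders
        return holders

    return verify
```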

If you have comments or questions about the contents of this document,
please send me an email.