Access and Query Proposal for the F2F1

This page concerns provenance access and query in preparation for the first F2F meeting, focusing on a limited range of aims, proposing solutions and presenting open issues regarding those aims. The wider objectives of the Access and Query Task Force will be clarified in discussion after the meeting.

Plan

By 9/Jun: Agree on scope for F2F1 (mailing list)

By 23/Jun: Populate proposals for solutions to the scope questions (wiki)

2 and 3 may be interleaved. 4 may just be the outcome of the process, with some editorial cleaning up.

Scope

Consider the following scenario. A user gains access to an online resource by browsing the web and downloading it, receiving it by email, transferring it via FTP, or through some other protocol. The client software (browser, email client etc.) offers an "Oh yeah?" button, by which the provenance of the resource will be retrieved and displayed. What does the client do when the button is clicked, what information does it need in order to perform the retrieval, and where does that information come from?

To meet time constraints before the F2F meeting, this document will concern two questions only:

Given the identity, I, of a resource state representation and a location, L, from which to retrieve provenance, how do we obtain the provenance of the representation from the location?

How can a browser find I and L (as above) for an HTML document that was downloaded, so that its provenance may be retrieved?

Initial Points for Decision

Below are some initial, limited points on which we may be able to decide at the F2F1 before opening up discussion about the proposals.

P1. There may be data regarding the provenance of a thing accessible from multiple sources.

P2. The information required to obtain access to some provenance of a thing may be supplied in many different ways, and we do not aim to enumerate them all.

P3. The WG effort will concern how the provider of a thing can supply information required to obtain access to some provenance of that thing (which may, as a side effect, include recommendations on how others can do the same).

P4. Regarding some provenance data obtained by dereferencing a provenance URI, calling a provenance service, or some other means, which of the following is true?

(a) It is apparent from the data itself what single thing it describes the provenance of.

(b) Provenance data documents a set of things and how they are related by past occurrences, so to extract the provenance of any one thing requires knowing how it is identified in the data.

(c) Something else.

P5. Regarding some provenance data obtained by dereferencing a provenance URI, calling a provenance service, or some other means, which of the following is true?

(a) To meet the standard, it should be immutable.

(b) It can change over time without restriction.

(c) To meet the standard, there are particular ways in which it should not change, e.g. any one account should remain as it was.

use the POWDER describedBy link relation type, or register a new one? I tend to favour the latter as, while POWDER and provenance information are both descriptions of a resource, the intent and coverage is different. Thus, rel="describedBy" in the above example might become rel="provenance", subject to registration of the new link relation type (cf. http://tools.ietf.org/html/rfc5988#section-6.2.1).

I do not include the type= option in this example as shown in the POWDER example. This is in keeping with the notion that the mechanism for accessing provenance should be independent of its format.
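As a minimal sketch of how a client might pick out such a link, the following Python function scans an HTTP Link header value (RFC 5988 syntax) for a given relation type. The relation name "provenance" is the unregistered value suggested above, the function name and the example URIs are hypothetical, and the parsing is deliberately simplified: it assumes no commas or semicolons inside URIs or quoted parameter values.

```python
import re


def find_links(link_header, rel):
    """Return the target URIs in a Link header whose rel includes `rel`.

    Simplified: assumes no ',' or ';' inside URIs or quoted parameter values.
    """
    targets = []
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]*)>(.*)', part)
        if not match:
            continue
        uri, params = match.group(1), match.group(2)
        rel_match = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params, re.IGNORECASE)
        # A rel parameter may hold several space-separated relation types.
        if rel_match and rel.lower() in rel_match.group(1).lower().split():
            targets.append(uri)
    return targets


# Hypothetical header value; no type= parameter is given, as argued above.
header = (
    '<http://example.com/prov/doc1>; rel="provenance", '
    '<http://example.com/style.css>; rel="stylesheet"'
)
print(find_links(header, "provenance"))
```

Because the function returns a list, it would also cope with a response that carries more than one "provenance" link, if that turns out to be allowed.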

Comments

As indicated below for HTML Link, I don't think it's necessary to refer to POWDER, but we can take the same approach as POWDER. Indeed, I don't think that the relationship "describedby" is appropriate to link to provenance. Also, we need to have a reference to a provenance service (see below).

I agree we can use POWDER approach without referencing POWDER. Not sure why we need a reference to a provenance service. -- GK

this approach is good if we can't embed provenance information in the document itself

embedded provenance is not excluded - your phrasing implies it would be preferred, but I think the non-embedded case is more fundamental. --GK

phrasing was not intended to express a preference. It is an observation. --Luc Moreau 23:23, 17 June 2011 (UTC)

could also apply to RSS feeds

trivially true to the extent it could apply to any HTTP retrieval. I wouldn't make explicit mention of RSS.

What is provenance-uri? is it unique? is provenance found at one and only one place? is there one and only one authority to provide provenance information about something? -- Luc

Provenance can be found in many places - but the provenance linked from the resource itself can often be seen as more authoritative. I think I would combine this approach with the Provenance Service - you should be able to find and provide provenance without knowing of or having to set up any provenance service. -- Stian

Can more than one "provenance" Link be provided? Imagine the resource knows of some of the (possibly conflicting) accounts of the provenance, distributed across different web resources. -- or would the link go to a meta-provenance that describes the other links? --Stian

Proposal: Use HTTP to retrieve provenance

(proposal by GK - 2011-06-09)

A general presumption is that provenance information is accessed in the same way as any web resource. Typically, this will be via HTTP. Thus, any given provenance information will be associated with a URI, and may be accessed by dereferencing that URI using normal Web mechanisms.

The problem of accessing required provenance information then reduces to the problem of finding its URI.
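To illustrate how little machinery this presumes, here is a minimal sketch in Python using only the standard library. The function names and the example URI are hypothetical, and the sketch assumes the provenance URI is an ordinary dereferenceable HTTP URI.

```python
import urllib.request


def build_provenance_request(provenance_uri):
    """Prepare (but do not send) a plain HTTP GET for a provenance URI."""
    return urllib.request.Request(provenance_uri, method="GET")


def fetch_provenance(provenance_uri):
    """Dereference the URI and return (content type, body bytes)."""
    request = build_provenance_request(provenance_uri)
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type"), response.read()


# Hypothetical provenance URI; only the request is built here, nothing is sent.
request = build_provenance_request("http://example.com/prov/doc1")
print(request.get_method(), request.full_url)
```

Nothing in the request is specific to provenance, which is the point of the proposal: once the URI is known, retrieval is a normal GET.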

Comments

This assumes that the URI is dereferenceable, which may not be always the case.
Also, one needs to distinguish the URL where a resource is located, from the identity (probably expressed as URI) of the state representation.

True, this assumes the provenance URI is dereferenceable. I think this should be considered the default case. I'd want to see clear use-cases before considering alternatives. -- GK

Your notion of provenance-uri is not clearly defined. Is it a URI indicating where provenance can be found, or is it a URI assigned to the entity for the purpose of tracking its provenance. If the latter, your own email http://lists.w3.org/Archives/Public/public-prov-wg/2011May/0131.html makes it clear that it could be a non-dereferenceable URI (e.g. a UUID URN) --Luc Moreau 23:57, 17 June 2011 (UTC)

I strongly disagree about any need to distinguish between URL and URI. There be dragons. Just say nothing about this. -- GK

I think that there is a fundamental difference between our approaches, which I summarise as follows.

Your approach seems to associate a provenance-uri with every representation returned (it can be passed in the message header or in the representation itself). I cannot ascertain whether there is one provenance-uri returned, or whether there can be multiple.

My approach is to:

uniquely name a 'thing' (I) (which may be different from the resource URL in the case of a stateful resource)

provide a location (L) where provenance can be retrieved

so, I think L combined with I would give a provenance-uri

this approach is more flexible, since it allows us to ask other services, do you know about thing I?
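One way to read "L combined with I" is as a query URI built from the two parts. The sketch below is only an illustration under that reading: the function name, the query parameter name `entity`, and both service locations are assumptions, not something agreed in this discussion.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def provenance_uri(location, identity, param="entity"):
    """Combine a service location L and a thing identity I into one URI.

    The parameter name is an assumption made for illustration only.
    """
    scheme, netloc, path, query, fragment = urlsplit(location)
    pairs = parse_qsl(query, keep_blank_values=True)
    pairs.append((param, identity))  # urlencode percent-escapes the identity
    return urlunsplit((scheme, netloc, path, urlencode(pairs), fragment))


# I need not be dereferenceable, e.g. a UUID URN (hypothetical value).
identity = "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"

# The same I can be put to several locations L (hypothetical services).
for location in ("http://prov.example.org/service", "http://other.example.net/prov"):
    print(provenance_uri(location, identity))
```

Keeping I separate from L is what allows the same question ("do you know about thing I?") to be asked of any number of services.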

Could you not just say that the provenance URI identifies 'a' provenance resource (dereferenceable or not) which *should* provide a minimal provenance of the Thing - but might also give provenance of many other things and links to deeper provenance about the Thing? --Stian

To locate statements about the Thing in the provenance, and to avoid separate provenance identifiers for each Thing, you need to carry over the identifier - are you suggesting that as the Thing is not necessarily the current Resource State then you also need a "provenance-about" identifier in the link? --Stian

This allows us to use HTTP content negotiation, and return serializations of the provenance in multiple formats (here rdf+xml).

When the URI is dereferenceable, there is also the possibility that the provenance service is the same host as the host that delivered the document/resource state representation. In that case, the uri can be given a default, which is the one for the target:

Note that the original proposal used ?provenance, but then, it wasn't clear it was the same mechanism, where simply the parameter entity was given a default value.

Comments

-- I don't think we should go here. It adds complication for which there is no clear need. And I think your proposed use of HTTP host may be wrong in any case (but I haven't checked it carefully). There may be a variety of ways to pull information from a third party service. I don't think we should begin to define a mechanism until we have a clear requirement. If there is a mechanism to specify, the charter suggests a SPARQL query, IIRC -- GK, 2011-06-17

Problem: The client / user has a representation of a Web resource without any indication of provenance and it wants to obtain provenance information about the representation (or about the represented Web resource).

It's because it leads to introduction of "requirements" like this that I was concerned about the XG pronouncements on access and query. The user requirement as stated is fair enough, but the assumption of a new specification to achieve it is not. I think this is introducing a range of complexities for which there is no clear need of a standard. I strongly hold the view that provenance should be treated as a resource like any other web resource, and the introduction of special mechanisms for provenance should be kept to an absolute minimum. You make reference to "protocol similar to the SPARQL" -- I say that "similar to" is not good enough: it should *be* SPARQL. I see no compelling case that SPARQL cannot do any reasonably required job. -- GK.

--Comment by Daniel G (18:17, 16 June 2011). Of course, but my concern was whether this design will involve some automatic recursive redirection from the server instead of doing it manually. For example, if I type uri_path?provenance?provenance?provenance in the browser will it take me directly to the provenance information of the provenance information of the provenance info of the uri_path? Or do I have to access those URIs myself step by step?

Your browser should do this for you.

This said, it does not make much sense:

the result of a GET for uri_path?provenance may vary over time, since more provenance information (from multiple observers) can become available

hence, each result of a GET for uri_path?provenance needs to have its own identity (which could be embedded in the result, or returned in the http header, as described above)

it is this second identity that needs to be used to obtain the provenance of the provenance

uri_path?provenance?provenance would fail to identify which provenance information we want to obtain provenance for.

You are right. I had not considered to add multiple provenance providers for the same resource. --DanielG 2011-06-23

-- I don't particularly like this approach - I would very much like to see a way to have a standalone Provenance Service similar to how SPARQL endpoints and search engines exist today - but I don't like the ?provenance or fake-host approach. A RESTful approach would be to have a single resource which describes how to search for the provenance from the service, a "fill in the blanks" approach. In (X)HTML this would be a <form> with the field "URI" - but multiple other parameters might be desired, such as author, time, source. (I'll put this in as a separate proposal) Is it within the scope of this WG to define such a service, or can we just define a minimal provenance service which could be extended depending on capabilities? --Stian, 2011-06-20

Proposal: RESTful provenance service, by Stian

A RESTful approach where a resource will present a kind of form for the client to fill in to perform a search for provenance information. Different representational formats might provide different fields and mechanisms for how to perform the search. It is out of the scope of this WG to define all such formats and possible extensions, but we probably want to suggest a minimal standard representation.

The main idea is to follow the "hypermedia as the engine of application state" principle, so the representation should say how to perform the search, instead of the WG trying to formalize URI patterns or HTTP mechanisms.

XHTML representation

An example of how an XHTML representation with an XML microformat can provide a form which programmatically can be recognized as a way to access a Provenance Service.

In this example the form has been annotated with microdata using the prov namespace which the WG can define for the minimal properties required. In this case:

prov:searchType="form" means this <form> describes how to access a Provenance Service.

prov:thingURI="uri" means the given <input> field can be filled in to search for the provenance of the Thing identified with the given URI. (The field-name can be chosen by the service)

Other (extensible) fields might be present - but not required if prov:thingURI is filled in.

The search-form is just a regular application/x-www-form-urlencoded submission (this can be overridden by <form enctype>) with the given field 'uri' set to the escaped version of the URI. If the form is using GET instead of POST, or returns with a redirect to a Search Result resource (as above), then you can also search for the provenance of a search result.

prov:results="3" means the service knows of 3 provenance resources matching the search parameters. These might or might not be dereferenceable or still exist.

Search parameters used can optionally be repeated, here prov:thingURI is shown.

prov:source="internal" means that the <a href> is a link to a provenance resource, and the provenance is considered "internal" to this service

prov:source="external" means that the <a href> is a link to a provenance resource, and the provenance is considered "external" to this service - so the service takes no responsibility for the correctness of said provenance.
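The form submission described above can be sketched from the client side as follows. This is a hedged illustration in Python: the function name, the action URL and the thing URI are hypothetical, and the client is assumed to have already read the form's action, method and the name of the field annotated with prov:thingURI (here 'uri') from the XHTML.

```python
import urllib.request
from urllib.parse import urlencode


def build_search_request(action_url, field_name, thing_uri, method="POST"):
    """Build the form submission for a provenance search, without sending it."""
    body = urlencode({field_name: thing_uri})  # escapes the URI value
    if method.upper() == "GET":
        separator = "&" if "?" in action_url else "?"
        return urllib.request.Request(action_url + separator + body, method="GET")
    return urllib.request.Request(
        action_url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical values that a client would have read from the form.
post = build_search_request(
    "http://example.com/provenance/search", "uri", "http://example.org/thing?id=1"
)
get = build_search_request(
    "http://example.com/provenance/search", "uri", "http://example.org/thing?id=1", "GET"
)
print(post.get_method(), post.data)
print(get.get_method(), get.full_url)
```

With GET the search itself has a URI, which is what makes it possible to ask for the provenance of a search result.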

JSON representation

An example of how a JSON representation can provide a URI template which programmatically can be recognized as a way to access a Provenance Service.

The returned JSON might contain other keys than "org.w3.www.2011.prov.service" which the client might or might not understand.

In this example the initial 'form' JSON contains a URI template where the client can fill in the URI for the thing it wants the provenance of, and then perform a GET. Other URI templates might be available under other keys, and might have additional parameters.

"org.w3.www.2011.prov.service" means that keys under here are following the WG service specifications. JSON does not have namespaces or schemas, so this is just an attempt to allow mixing of standardised data and third-party data.

"search-templates" contains a list of URI templates for performing provenance search. The client can choose any template where it understands all the {parameters}. There should be exactly one search template which requires only the parameter {thing-uri}.

In the result, "search" repeats the parameters of the search.

"result-count" shows the (possibly approximate) total number of results found.

"results" lists the found provenance resources. Each is described with a relative or absolute "provenance-uri" reference (similar to <a href>) and optionally the "source" to say if this is an "internal" or "external" provenance.
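The interaction described above might look like this from the client side. The sketch is in Python and uses only the key names mentioned in the text; the nesting of those keys, the extra vendor key, and every URI are assumptions made for illustration. Template expansion is limited to simple {name} substitution with percent-encoding.

```python
import json
import re
from urllib.parse import quote, urljoin

SERVICE_KEY = "org.w3.www.2011.prov.service"


def pick_template(service_doc, known_params):
    """Return the first URI template whose {parameters} are all known."""
    for template in service_doc[SERVICE_KEY]["search-templates"]:
        needed = set(re.findall(r"\{([^{}]+)\}", template))
        if needed <= set(known_params):
            return template
    return None


def expand(template, values):
    """Fill each {name} with its percent-encoded value (simple expansion only)."""
    return re.sub(
        r"\{([^{}]+)\}", lambda m: quote(values[m.group(1)], safe=""), template
    )


def provenance_links(result_doc, base_url):
    """Resolve each result's provenance-uri against the URL it was fetched from."""
    return [
        (urljoin(base_url, item["provenance-uri"]), item.get("source"))
        for item in result_doc[SERVICE_KEY]["results"]
    ]


# Hypothetical 'form' document; keys outside SERVICE_KEY are simply ignored.
form = json.loads("""
{
  "org.w3.www.2011.prov.service": {
    "search-templates": [
      "http://example.com/provenance/search?uri={thing-uri}&author={author}",
      "http://example.com/provenance/search?uri={thing-uri}"
    ]
  },
  "x-vendor-extension": {"note": "not understood by this client"}
}
""")

# Hypothetical result document, as might be returned by the GET.
result = json.loads("""
{
  "org.w3.www.2011.prov.service": {
    "search": {"thing-uri": "http://example.org/thing"},
    "result-count": 2,
    "results": [
      {"provenance-uri": "/provenance/17", "source": "internal"},
      {"provenance-uri": "http://other.example.net/prov/9", "source": "external"}
    ]
  }
}
""")

values = {"thing-uri": "http://example.org/thing"}
template = pick_template(form, values)
search_url = expand(template, values)
print(search_url)
print(provenance_links(result, search_url))
```

The client never hard-codes a URI pattern: it only needs to recognise the service key and the {thing-uri} parameter, and everything else comes from the representation.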

Comments

I quite like the way your provenance service allows for multiple provenances (or more precisely provenance-uris) to be returned. The provenance service I described above could only return one provenance representation.

Looking at a REST view of a provenance service, isn't a POST supposed to create a new resource? It seems that this interaction is not intended to create anything: instead can we use a GET?

Indeed it does create a new resource, http://example.com/some/search/results/7227 - but the XHTML form could also have method="GET". I mainly wanted to show how the format defines the interaction, and HTML is fairly well understood, although we should recommend a subset to avoid clients being complete browsers! --Stian

OK, is the resource created by the POST, or was it created when a new provenance-uri was registered in the provenance service for this thing-uri? all things being equal, would a new resource be created for another POST (with same parameters?) --Luc

Good question, but REST does not guarantee you that the resource did not exist before the POST. (Use POST for creation != POST means creation). This is just the regular Post/Redirect/Get pattern - I can change it to do a GET submission instead, but I wanted to avoid that, because it will look too much like you can only do URI patterns. --Stian

Note: I only added prov:source="internal" to tag the <a> link (instead of prov:hit="hit") - I don't think the internal/external distinction is particularly important. -- Stian

I like this approach; it seems to cover all the problems when accessing the provenance of a resource. But what if we have several RESTful services, each providing different provenance? (I don't believe that we assume every provenance source to be in one centralized service, am I right?) Would we have to include these services in the HTML document instead of the URIs directly? (for example using another of the approaches proposed on this page) -- Daniel G 06/23/2011, 13:58.

Embedding information necessary to retrieve provenance in an HTML document

Proposals to meet the second aim

The identity of the HTML document and the location from which to download its provenance can be embedded in the HTML itself.

I propose an adaptation of this mechanism, probably using a different link rel= value. The POWDER specification makes no explicit reference to the RFC 5988 registry of link relations, but I expect they could apply also to HTML header <Link> elements.
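As a rough sketch of the browser side of this proposal, the following Python class collects the targets of <link> elements carrying a given rel value. The rel value "provenance" is a placeholder for whatever relation type ends up being registered, and the class name and the HTML document are hypothetical. The sketch recovers only the location (L); how the identity (I) is carried is left open here.

```python
from html.parser import HTMLParser


class ProvenanceLinkFinder(HTMLParser):
    """Collect href values of <link> elements with a given rel value."""

    def __init__(self, rel="provenance"):
        super().__init__()
        self.rel = rel
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if self.rel in rels and attributes.get("href"):
            self.targets.append(attributes["href"])


# Hypothetical downloaded document.
document = """<html><head>
<title>Example</title>
<link rel="stylesheet" href="style.css">
<link rel="provenance" href="http://example.com/prov/doc1">
</head><body><p>Hello</p></body></html>"""

finder = ProvenanceLinkFinder()
finder.feed(document)
print(finder.targets)
```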

Documents MAY also include any of the attribution data from the POWDER document in meta tags. In particular, the issuedby field is likely to be useful to user agents deciding whether or not to fetch the full POWDER document. Any attribution data encoded in meta tags within an HTML document should be the same as that in the POWDER document. In case of discrepancy, the POWDER document should be taken as more authoritative.

Proposal: Use HTML Link (without POWDER)

Same as above example, but using a provenance profile since POWDER does not include any provenance-related metadata, per se.

Comment

Why? I don't see the added value in the profile -- GK, 2011-06-17

Proposal: Use POWDER to provide a default provenance service

POWDER allows for some metadata to be associated with all resources within a host (or sub-path within).
The following example illustrates a powder.xml file that associates provenanceservice.com as the provenance service for all resources in example.com.

Comments

I don't see the point in this. This looks to me like added complexity without any clear use-case. I would suggest focusing first on the simple cases and then figuring out where the important gaps are. If you want to have the group complete on an aggressive schedule, introducing complex mechanisms without a clear need isn't going to help. GK, 2011-06-17

Proposal: Use of the <Meta> HTML tag or microdata format

(By Khalid Belhajjame)

As a proposal for the second question, the identity "I" and the location of the provenance "L" can be embedded in the HTML file. This can be achieved using the construct "Meta", e.g.:

Or by using microdata. For example, the following provides an example of an HTML document in which the properties "identifiedBy" and "provenance" are used to give the identity of the HTML document and the location where its provenance can be found.

I agree with the thrust of Luc's comment. I see no need for link and meta as alternatives - pick one. Personally, I think the link tag approach is cleaner and more appropriate, and more widely applicable, and can be introduced more easily using existing mechanisms (via the link tag registry). -- GK, 2011-06-17

At this time, I think Microdata is controversial. Cf. http://lists.w3.org/Archives/Member/tag/2011Jun/0021.html under consideration by W3C TAG (member-only-link at this time). I don't discount its consideration at a later date, but I wouldn't expend effort on it right now. -- GK, 2011-06-17

Issues regarding the above

Open discussions, informative notes, criticisms, etc. regarding the proposals above. These would ideally be articulated as short specific questions which can be discussed individually at the F2F meeting.

Issue: Access mechanism should be independent of provenance format

(by GK - 2011-06-09)

The mechanisms for accessing provenance are independent of the format of the provenance information itself. While there is an expectation that provenance information will be returned as RDF or RDFa, other formats are possible and the access mechanisms should as far as possible avoid any dependence on the format of information received.

That's why we can use content negotiation when retrieving provenance: for instance, using the HTTP Accept header to indicate whether we want JSON, RDF, or another representation.
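The following sketch shows, in simplified form, how a provenance server might choose a representation from the client's Accept header. It is an illustration only: the function names and the list of available formats are assumptions, and the negotiation is reduced to exact media type matches, q-values and the */* wildcard (partial wildcards such as application/* are not handled).

```python
def parse_accept(accept_header):
    """Return the media types in an Accept header, most preferred first."""
    weighted = []
    for item in accept_header.split(","):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        weighted.append((quality, media_type))
    # sorted() is stable, so equal weights keep their order in the header.
    ordered = sorted(weighted, key=lambda pair: -pair[0])
    return [media_type for quality, media_type in ordered if quality > 0]


def choose_format(accept_header, available):
    """Pick the first acceptable media type the server can produce, or None."""
    for media_type in parse_accept(accept_header):
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            return available[0]
    return None


# Hypothetical set of serializations a provenance server can produce.
available = ["application/rdf+xml", "application/json"]
print(choose_format("application/json;q=0.5, application/rdf+xml", available))
print(choose_format("application/json", available))
```

The access mechanism stays the same whichever format is chosen; only the Accept header and the returned Content-Type differ.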

Issues beyond scope

Issues about provenance access and query that go beyond the two specific aims above. These will either be discussed at the F2F meeting if there is time, or else postponed until after.

Comment by Daniel G (16-06-2011): Guidelines for provenance publishers, based on the consensus reached in the F2F meeting. (How can I publish my content in order to make it accessible to anyone else?)

How do we address situations where: the location L does not have the requested provenance information for I? The provenance has "moved" to a different location L2?

How do we specify multiple locations for the provenance location, L1 and L2, in the downloaded document?

(By Yogesh S. 2011-06-16)

Some scientists that I have talked to are worried about the concept of having provenance separate from their data. They believe it is a recipe for disaster. They want to make sure the data and any provenance info are within the same file, so that it is always traveling together as it moves from hand to hand. (by Yolanda G., 2011-06-23)

I agree this is a concern, but one that I think is part of a larger problem. There are any number of composite package formats that can achieve this. In the Wf4Ever project, we are exploring the notion of research object to compose data, provenance and much much more in the context of workflow preservation. -- GK.