Best Practices: Pragmatic Provenance for Government LOD

Status

Dec 2011 - Initial revisions by John Erickson (RPI)

Overview

Provide best practice recommendations for stakeholders on how to document the provenance of their linked government data and how to interpret that data, so that consumers know what they are looking at. (Suggested by Hadley Beeman)

Background

In 1997 Tim Berners-Lee called for pervasive provenance on the Web:

At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to the Web, 'so how do I know I can trust this information?'. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons.

W3C GLD therefore seeks to recommend practices that enable government providers to create the metadata necessary to answer their users' "oh yeah?" questions about the linked data they publish. Our recommendations may include processes as well as the application of specific vocabularies/ontologies.

What do we mean by "Provenance?"

The W3C's Provenance Incubator Group (2010) provides this simple definition of provenance:

Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

More recently, the W3C Provenance WG (PROV-WG) defined "provenance" for its work:

The provenance of digital objects represents their origins. The PROV Data Model (PROV-DM) is a proposed standard to represent provenance records, which contain assertions about the entities and activities involved in producing and delivering or otherwise influencing a given object. By knowing the provenance of an object, we can make determinations about how to use it. Provenance records can be used for many purposes, such as understanding how data was collected so it can be meaningfully used, determining ownership and rights over an object, making judgments about information to determine whether to trust it, verifying that the process and steps used to obtain a result complies with given requirements, and reproducing how something was generated... As a standard for provenance, PROV-DM accommodates all those different uses of provenance. Different people may have different perspectives on provenance, and as a result different types of information might be captured in provenance records.

What do we mean by "Pragmatic Provenance?"

The W3C Government Linked Data WG (GLD WG) accepts the PROV WG's definition of provenance but recognizes that PROV-DM is a powerful, general-purpose tool. The GLD WG therefore seeks to provide best practice recommendations that are useful to government data stakeholders, make sense for GLD use cases, and are easily adopted by practitioners.

W3C GLD could recommend a simple provenance scoring system for GLD, analogous to Tim Berners-Lee's 5-star scheme for linked data. Such a system might include:

One star: Using basic W3C DCAT for Linked Data at the catalog and dataset levels

Two stars: DCAT enhanced with more complete Dublin Core and other metadata

Three stars: As above, but with basic provenance metadata "within" the datasets
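
To make the tiers above concrete, here is a minimal Python sketch of how a publisher might compute such a score. It covers only the three tiers listed so far and assumes each tier requires every tier below it. The class name, field names, and function name are hypothetical and are not part of DCAT, Dublin Core, or any W3C recommendation; deciding whether a dataset actually satisfies a tier is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """Hypothetical checklist for one published dataset (illustrative names only)."""
    has_dcat_description: bool = False     # basic DCAT at catalog and dataset level
    has_rich_dublin_core: bool = False     # fuller Dublin Core and other metadata
    has_embedded_provenance: bool = False  # provenance metadata "within" the dataset


def provenance_stars(meta: DatasetMetadata) -> int:
    """Return 0 to 3 stars, assuming each tier builds on the one before it."""
    tiers = (
        meta.has_dcat_description,
        meta.has_rich_dublin_core,
        meta.has_embedded_provenance,
    )
    stars = 0
    for satisfied in tiers:
        if not satisfied:
            break  # a missing lower tier caps the score
        stars += 1
    return stars


if __name__ == "__main__":
    example = DatasetMetadata(has_dcat_description=True, has_rich_dublin_core=True)
    print(provenance_stars(example))
```

Treating the tiers as cumulative means a dataset with embedded provenance but no DCAT description would score zero, which mirrors the "as above, but with" wording of the third tier.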