Overview

The idea of this document is to sketch what aspects of the provenance model can be formalized and how they can be formalized, as a first step towards establishing a consensus on the (intended) meaning of the components of the model and the consistency constraints or inferences that can be applied to the model to distinguish valid from invalid provenance records.

The PROV-CONSTRAINTS document contains formal content specifying a notion of validity (approximately, logical consistency) for PROV documents. The formal semantics, PROV-SEM, is planned for release as a W3C Note that will complement the procedural specification in PROV-CONSTRAINTS with a declarative specification formulated in terms of first-order logic. The formal semantics is work in progress. The drafts below are intermediate stages and some of them are out of date.

The current version (editor's draft) of the formal semantics can always be found at: FormalSemanticsED.

Status

This is work in progress. The semantics is being updated to be consistent with the Candidate Recommendation of PROV. The plan is to release the semantics as a Note over the next few months. At that point, the wiki pages containing drafts of the semantics will be superseded.

Idea of the semantics

As a starting point, I will assume that the assertions made in a PROV-DM instance are intended to describe one consistent state of the world, much as a logical formula is said to be satisfied in a mathematical model. That is, I propose an approach similar to that taken in model theory, where the PROV-DM instance corresponds to a formula or theory of a logic, and the semantics corresponds to what logicians call a model.

For example, the formula <math>\forall x. P(x) \Rightarrow Q(x)</math> is satisfied in a mathematical model where the relation <math>P</math> denotes a set of elements that is contained in that denoted by <math>Q</math>. Here, the goal is to come up with a plausible "intended model" for interpreting PROV-DM instances, where the formulas are assertions in PROV-DM and the individuals are things and agents. This is complicated by the fact that many statements about provenance involve talking about objects that change over time.
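As a minimal illustration of satisfaction in a finite structure, the containment reading of the formula can be checked directly. This is only a sketch: the domain and the sets `P` and `Q` are made-up examples and are not part of PROV.

```python
# Hypothetical finite structure: a domain and interpretations of P and Q.
domain = {1, 2, 3, 4}
P = {1, 2}
Q = {1, 2, 3}

def satisfies_forall_implies(domain, p, q):
    """True iff every x in the domain with P(x) also has Q(x)."""
    return all((x not in p) or (x in q) for x in domain)

holds = satisfies_forall_implies(domain, P, Q)       # P is contained in Q
fails = satisfies_forall_implies(domain, {1, 4}, Q)  # 4 is in P but not in Q
```

Here `holds` is true because the set denoted by `P` is contained in that denoted by `Q`, while the second structure is a counter-model.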

The word "world" is used in PROV-DM to talk about the actual state of affairs that the PROV-DM instance describes, which is what I would usually call a "model". The word "model" is used in PROV-DM mainly in the sense of "data model", that is, to talk about what I would otherwise call the syntax of PROV-DM. To avoid confusion with the uses of terms in PROV-DM, I will use "world model" to describe the mathematical structure that corresponds to the actual state of affairs, and will try to avoid ambiguous, unqualified uses of the word "model".

Axiomatization and relationship to PROV-CONSTRAINTS

One goal of the semantics is to link the procedural specification of validity and equivalence with traditional notions of logical consistency and equivalence of theories, for example in first-order logic. A first-order axiomatization that corresponds to the formal constraints and is sound for reasoning about the models described below is in progress at the end of the document.

Basics

I will use syntax for PROV-DM records (which I will usually call formulas) as described in the Candidate Recommendation of PROV-DM (PROV-DM CR).

A PROV-DM instance, or set of atomic formulas <math>\phi_1</math>...<math>\phi_n</math>, is interpreted as a conjunction, that is, the overall instance is considered to hold in a given structure if each atomic formula in it holds.

The rest of the document will discuss the structures and define when an atomic assertion holds in a given world.

Identifiers

A lowercase symbol <math>x,y,...</math> on its own denotes an identifier. Identifiers may or may not be URIs. I view identifiers as being like variables in logic (or blank nodes in RDF): just because we have two different identifiers <math>x</math> and <math>y</math> doesn't tell us that they denote different things, since we could discover that they are actually the same later. We write <math>Identifiers</math> for the set of identifiers of interest in a given situation (typically, the set of identifiers present in the PROV instance of interest).

Times and Intervals

We assume a set <math>(Times,\leq)</math> of time instants. For convenience we assume that the order <math>\leq</math> is total (linear), corresponding to a single linear timeline; however, PROV itself does not assume that time is linear: events could be partially ordered and need not be reconciled to a single global clock.

We also consider a set <math>Intervals</math> of closed intervals of the form <math>\{t \mid t_1 \leq t \leq t_2\}</math>.
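A closed interval can be sketched as a small data type. The class below is illustrative only: the name `Interval`, the use of integers as time instants, and the decision to reject empty intervals are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Closed interval {t | start <= t <= end} over linearly ordered times."""
    start: int
    end: int

    def __post_init__(self):
        # Assumption for this sketch: empty intervals are ruled out.
        if self.start > self.end:
            raise ValueError("interval must satisfy start <= end")

    def contains(self, t):
        """Membership test: start <= t <= end."""
        return self.start <= t <= self.end

    def is_subinterval_of(self, other):
        """Containment between two closed intervals."""
        return other.start <= self.start and self.end <= other.end
```

A single time instant <math>t</math> is then the degenerate interval `Interval(t, t)`.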

Attributes and Values

We assume a set <math>Attributes</math> of attribute labels and a set <math>Values</math> of possible values of attributes.

Formulas

The following atomic formulas correspond to the statements of PROV-DM. We assume that definitions 1-4 of PROV-CONSTRAINTS have been applied in order to expand all optional parameters; thus, we use uniform notation <math>r(id,a_1,\ldots,a_n)</math> instead of the semicolon notation <math>r(id;a_1,\ldots,a_n)</math>.

Each parameter is either an identifier, a constant (e.g. a time or other literal value in an attribute list), or a null symbol "-". Null symbols can only appear in the specified arguments in <math>wasAssociatedWith</math> and <math>wasDerivedFrom</math>, as shown in the grammar below.

Note that this description does not say what the structure of an object is, only how it may be described in terms of its time interval and attribute values. An object could just be a record of fixed attribute values; it could be a bear; it could be the Royal Society; it could be a transcendental number like <math>\pi</math>. All that matters from our point of view is that we know how to map the object to its time interval and attribute mapping.

The range of the <math>value</math> function is <math>Values_\bot</math>, that is, <math>Values \uplus \{\bot\}</math>, the set of values with an additional element <math>\bot \notin Values</math>. When <math>value(x,a,t) = \bot</math>, we say that attribute <math>a</math> is undefined for <math>x</math> at time <math>t</math>.

It is possible for two Things to be indistinguishable by their attribute values and lifetime, but have different identity.

Objects

An Object is described by a time interval and attributes with unchanging values. Objects encompass entities, interactions, and activities.

To model this, a world includes

a set <math>Objects</math>

a function <math>lifetime : Objects \to Intervals</math> from objects to time intervals

Intuitively, <math>lifetime(e)</math> is the time interval during which object <math>e</math> exists. The value <math>value(e,a)</math> is the value of attribute <math>a</math> during the object's lifetime.

As with Things, the range of <math>value</math> includes the special undefined value <math>\bot</math>, making <math>value</math> effectively a partial function. It is also possible to have two different objects that are indistinguishable by their attributes and time intervals. Objects are not things, and the sets of <math>Objects</math> and <math>Things</math> are disjoint; however, certain objects, namely entities, are linked to things.

Entities

An entity is a kind of object that describes a time-slice of a thing, during which some of the thing's attributes are fixed. We assume:

a set <math>Entities \subseteq Objects</math> of entities, disjoint from <math>Activities</math> and <math>Events</math> below.

a function <math>thingOf : Entities \to Things</math> that associates each Entity with a Thing, such that for each <math>t \in lifetime(obj)</math>, and for each attribute <math>a</math> such that <math>value(obj,a) \neq \bot</math>, we have <math>value(obj,a) = value(thingOf(obj),a,t)</math>.

<math>lifetime(e) \subseteq lifetime(thingOf(e))</math> for each entity <math>e</math>.

Remark: Although both entities and things can have undefined attribute values, their meaning is slightly different: for a thing, <math>value(x,a,t) = \bot</math> means that the attribute <math>a</math> has no value at time <math>t</math>, whereas for an entity, <math>value(e,a) = \bot</math> only means that the entity does not record a fixed value for <math>a</math>. This does not imply that <math>value(thingOf(e),a,t) = \bot</math> when <math>t \in lifetime(e)</math>. In particular, if <math>value(thingOf(e),a,t)</math> takes more than one value as <math>t</math> ranges over the lifetime of <math>e</math>, then <math>value(e,a)</math> must be <math>\bot</math>, since assigning a value to <math>value(e,a)</math> would violate the condition on <math>thingOf</math> above.
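The condition linking an entity's fixed attribute values to the time-varying values of its thing can be checked mechanically on a small example. The encoding below is a hypothetical sketch: `None` plays the role of <math>\bot</math>, a thing is represented by a dictionary from (attribute, time) pairs to values, and the function names are invented for illustration.

```python
def thing_value(history, attr, t):
    """value(x, a, t) for a thing; returns None (standing for ⊥) if undefined."""
    return history.get((attr, t))

def entity_consistent(entity_values, lifetime, history):
    """Check value(e,a) = value(thingOf(e),a,t) for every t in lifetime(e)
    and every attribute a for which value(e,a) is defined."""
    return all(
        thing_value(history, a, t) == v
        for a, v in entity_values.items() if v is not None
        for t in lifetime
    )

# A thing whose 'colour' changes at time 2 while its 'owner' stays fixed.
history = {("colour", 1): "red", ("colour", 2): "blue",
           ("owner", 1): "alice", ("owner", 2): "alice"}
lifetime = range(1, 3)  # the time instants 1 and 2

# Leaving 'colour' undefined on the entity is consistent ...
ok = entity_consistent({"owner": "alice", "colour": None}, lifetime, history)
# ... but fixing it to "red" is not, because the thing is "blue" at time 2.
bad = entity_consistent({"owner": "alice", "colour": "red"}, lifetime, history)
```

The second check fails for exactly the reason given in the remark: an attribute that varies during the entity's lifetime cannot be given a fixed value by the entity.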

Plans

We identify a specific subset of the entities called plans, <math>Plans \subseteq Entities</math>.

Agents

An agent is an entity that can act, by controlling, starting, ending, or participating in activities. Agents can act on behalf of other agents. We introduce:

a set <math>Agents \subseteq Objects</math> of agents.

Activities

An activity is an object that encompasses a set of events. We introduce

a set <math>Activities \subseteq Objects</math> of activities, disjoint from <math>Entities</math> and <math>Events</math>

Interactions

We consider a set <math>Interactions \subseteq Objects</math> of interactions, which are split into Events between entities and activities, Associations between agents and activities, and Derivations that describe chains of generation and usage steps. (The first two sets may overlap.) Interactions are disjoint from entities, activities and agents.

Events

An Event is an interaction whose lifetime is a single time instant, and relates an activity to an entity (which could be an agent). Events have types including usage, generation, starting and ending (possibly more may be added such as destruction/invalidation of an entity). Events are instantaneous. We introduce:

A function <math>type: Events \to \{start,end,use,generate\}</math> assigning each event its type.

Associations

An Association is an interaction relating an agent to an activity. Associations can overlap with events; for example, a start event is also an association. To model associations, we introduce:

A set <math>Associations \subseteq Interactions</math>, such that every event <math>evt \in Events</math> that is a start or end event is also an association. That is, <math>type(evt) \in \{start,end\}</math> implies <math>evt \in Associations</math>
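The closure condition on associations can be stated as a one-line check. This is a sketch under assumed encodings: events are plain strings, `event_type` is a dictionary standing in for the <math>type</math> function, and the function name is invented.

```python
def start_end_events_are_associations(event_type, associations):
    """True iff every event of type start or end is also an association."""
    return all(
        evt in associations
        for evt, ty in event_type.items()
        if ty in {"start", "end"}
    )

event_type = {"ev1": "start", "ev2": "use", "ev3": "end"}

satisfied = start_end_events_are_associations(event_type, {"ev1", "ev3", "assoc9"})
violated = start_end_events_are_associations(event_type, {"ev1"})  # ev3 is missing
```

Use and generate events are unconstrained here, and an association such as `assoc9` need not be an event at all.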

Associations are used below in the <math>ActsFor</math> and <math>AssociatedWith</math> relations.

TODO: Add types for association or delegation?

Derivations

A Derivation is an interaction chaining one or more generation and use steps. Because derivations can carry attributes, we introduce an explicit kind of interaction for them:

A set <math>Derivations \subseteq Interactions</math>, disjoint from <math>Events \cup Associations</math>.

See below for the associated derivation path and DerivedFrom relation.

Relations

Simple relations

The entities, interactions, and activities in a world model are related in the following ways:

A relation <math>Used \subseteq Events \times Entities</math> saying when an event used an entity. An event can use at most one entity, and if <math>(evt,e)\in Used</math> then <math>time(evt) \in lifetime(e)</math> and <math>type(evt) = use</math> must hold.

A relation <math>Generated \subseteq Events \times Entities</math> saying when an event generated an entity. An event can generate at most one entity, and if <math>(evt,e)\in Generated</math> then <math>min(lifetime(e)) = time(evt)</math> and <math>type(evt) = generate</math> must hold.

A relation <math>Invalidated \subseteq Events \times Entities</math> saying when an event invalidated an entity. An event can invalidate at most one entity, and if <math>(evt,e)\in Invalidated</math> then <math>max(lifetime(e)) = time(evt)</math> and <math>type(evt) = invalidate</math> must hold.

A relation <math>AssociatedWith \subseteq Associations \times Agents \times Activities \times Plans^?</math> indicating when an agent is associated with an activity, and giving the identity of the association relationship, and an optional plan.

A relation <math>ActsFor \subseteq Agents \times Agents \times Activities</math> indicating when one agent acts on behalf of another with respect to a given activity.
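The side conditions on the simple relations can be checked on a candidate world model. The sketch below covers only <math>Generated</math>, using the label `generate` from the type set of events. The dictionary encodings and the function name are assumptions, and lifetimes are represented as (start, end) pairs.

```python
def check_generated(generated, event_type, event_time, lifetime):
    """Check the constraints stated for the Generated relation.

    generated:   set of (event, entity) pairs
    event_type:  dict mapping event -> type label
    event_time:  dict mapping event -> time instant
    lifetime:    dict mapping entity -> (start, end) closed interval
    """
    events = [evt for evt, _ in generated]
    # An event can generate at most one entity.
    at_most_one = len(events) == len(set(events))
    # Generating events must have type 'generate'.
    well_typed = all(event_type[evt] == "generate" for evt, _ in generated)
    # The generation time is the start of the entity's lifetime.
    at_start = all(lifetime[e][0] == event_time[evt] for evt, e in generated)
    return at_most_one and well_typed and at_start

types = {"g1": "generate"}
times = {"g1": 3}

valid = check_generated({("g1", "e1")}, types, times, {"e1": (3, 7)})
two_entities = check_generated({("g1", "e1"), ("g1", "e2")}, types, times,
                               {"e1": (3, 7), "e2": (3, 9)})
wrong_time = check_generated({("g1", "e1")}, types, times, {"e1": (2, 7)})
```

Analogous checks for <math>Used</math> and <math>Invalidated</math> would differ only in the type label and in the time condition.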

Derivation paths and DerivedFrom

Recall that above we introduced a subset of interactions called Derivations. These identify paths of the form

Note: The reason why we need paths and not just individual derivation steps is that imprecise wasDerivedFrom formulas can represent multiple derivation steps.

Putting it all together

A world model W is a structure containing all of the above described data. If we need to talk about the objects or relations of more than one world model then we may write <math>W_1.Objects</math>; otherwise, to decrease notational clutter, when we consider a fixed world model then the names of the sets, relations and functions above refer to the components of that model.

TODO: List the components.

Semantics

In what follows, let <math>W</math> be a fixed world model with the associated sets and relations discussed in the previous section, and let <math>I</math> be an interpretation of identifiers as objects in <math>W</math>.

The annotations [WF] refer to well-formedness constraints that correspond to typing constraints.

Interpretations

We need to link identifiers to the objects they denote. We do this using a function which we shall call an interpretation.

The mapping from identifiers to objects does not change over time. Thus, we consider interpretations as follows:

An interpretation function <math>I : Identifiers \to Objects</math> describing which object is the target of each identifier.
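Concretely, an interpretation can be pictured as a finite map. In this sketch the class `Obj` and the identifiers `ex:report` and `ex:doc1` are hypothetical and serve only to show that distinct identifiers may share a denotation.

```python
class Obj:
    """Opaque stand-in for an object of a world model."""
    def __init__(self, label):
        self.label = label

report_v1 = Obj("report, first version")

# An interpretation I : Identifiers -> Objects. Two distinct identifiers
# may denote the same object, just as two variables may have the same value.
I = {"ex:report": report_v1, "ex:doc1": report_v1}

same_denotation = I["ex:report"] is I["ex:doc1"]
```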

Satisfaction

Consider an atomic formula <math>\phi</math>, a world <math>W</math> and an interpretation <math>I</math>. We define notation <math>W,I \models \phi</math> which means that <math>\phi</math> is satisfied in <math>W,I</math>. For basic assertions, the definition of the satisfaction relation is given in the next few subsections. For a conjunction of assertions <math>\phi_1,\ldots,\phi_n</math> we write <math>W,I \models \phi_1,\ldots,\phi_n</math> to indicate that <math>W,I \models \phi_1</math> and ... and <math>W,I \models \phi_n</math> hold.

For example, the following formulas both hold if <math>x</math> denotes an entity <math>e</math> such that <math>value(e,a) = 4, value(e,b) = 5, value(e,c) = 6</math> hold:

entity(x,[a=4,b=5])
entity(x,[a=4,b=5,c=6])
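The example can be replayed in a small executable sketch. It assumes one particular reading of satisfaction for entity formulas, namely that the identifier denotes an entity and every listed attribute has the stated value. The dictionary encodings and function names are likewise assumptions.

```python
def satisfies_entity(value, I, entities, ident, attrs):
    """Assumed reading of W,I |= entity(id, attrs): I(id) is an entity and
    value(I(id), a) = v for every pair a=v in attrs."""
    e = I[ident]
    return e in entities and all(
        value.get((e, a)) == v for a, v in attrs.items()
    )

def satisfies_all(value, I, entities, formulas):
    """A conjunction holds iff every atomic formula in it holds."""
    return all(
        satisfies_entity(value, I, entities, ident, attrs)
        for ident, attrs in formulas
    )

entities = {"e"}
I = {"x": "e"}
value = {("e", "a"): 4, ("e", "b"): 5, ("e", "c"): 6}

both = satisfies_all(value, I, entities, [
    ("x", {"a": 4, "b": 5}),
    ("x", {"a": 4, "b": 5, "c": 6}),
])
wrong = satisfies_entity(value, I, entities, "x", {"a": 7})
```

Under this reading a formula may mention only some of the entity's attributes, which is why both formulas hold in the same world.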

Activity Records

An activity record is of the form <math>activity(id,st,et,attrs)</math> where <math>id</math> is an identifier referring to the activity, <math>st</math> is a start time and <math>et</math> is an end time.

We say that <math>W,I \models activity(id,st,et,attrs)</math> if and only if:

Semantics of Relations

Entity-Activity

Generation

The generation assertion is of the form <math>wasGeneratedBy(id,e,a,t,attrs)</math> where <math>id</math> is an event identifier, <math>e</math> is an entity identifier, <math>a</math> is an activity identifier, <math>attrs</math> is a set of attribute-value pairs, and <math>t</math> is an optional time.

<math>W,I \models wasGeneratedBy(id,e,a,t,attrs)</math> if and only if:

Use

The use assertion is of the form <math>used(id,a,e,t,attrs)</math> where <math>id</math> denotes an event, <math>a</math> is an activity identifier, <math>e</math> is an object identifier, <math>attrs</math> is a set of attribute-value pairs, and <math>t</math> is an optional time.

Invalidation

The invalidation assertion is of the form <math>wasInvalidatedBy(id,e,a,t,attrs)</math> where <math>id</math> is an event identifier, <math>e</math> is an entity identifier, <math>a</math> is an activity identifier, <math>attrs</math> is a set of attribute-value pairs, and <math>t</math> is an optional time.

<math>W,I \models wasInvalidatedBy(id,e,a,t,attrs)</math> if and only if:

The two Entities refer to the same Thing, that is, <math>thingOf(ent_1) = thingOf(ent_2)</math>.

The lifetime of <math>ent_1</math> is contained in that of <math>ent_2</math>, i.e. <math>lifetime(ent_1) \subseteq lifetime(ent_2)</math>.

For each attribute <math>attr</math> such that <math>value(ent_2,attr) \neq \bot</math> we have <math>value(ent_1,attr) = value(ent_2,attr)</math>.

The second criterion says that the two Entities present aspects of the same Thing. Note that the third criterion allows <math>ent_1</math> and <math>ent_2</math> to have the same lifetime (or that of <math>ent_2</math> can be larger). The last criterion allows <math>ent_1</math> to have more defined attributes than <math>ent_2</math>, but they must agree on the attributes defined by <math>ent_2</math>.

Remark: There has been discussion whether <math>specializationOf</math> is transitive and/or antisymmetric:

Transitivity: If <math>specializationOf(a,b)</math> and <math>specializationOf(b,c)</math> hold then <math>specializationOf(a,c)</math> holds. This holds for the above definition.

Antisymmetry: If <math>specializationOf(a,b)</math> and <math>specializationOf(b,a)</math> hold then <math>a=b</math>. This doesn't follow from the current definition (but it would if we stipulated that two entities that have the same interval, attributes and thing are equal).
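Both observations can be tested on a toy encoding of the criteria stated above (same thing, lifetime containment, agreement on defined attributes). The `Entity` class is a hypothetical sketch: things are named by strings, lifetimes are (start, end) pairs, and only defined attributes are stored.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality: equal data, distinct entities
class Entity:
    thing: str
    lifetime: tuple                            # (start, end), closed interval
    attrs: dict = field(default_factory=dict)  # defined attributes only

def specialization_of(e1, e2):
    same_thing = e1.thing == e2.thing
    contained = (e2.lifetime[0] <= e1.lifetime[0]
                 and e1.lifetime[1] <= e2.lifetime[1])
    # e1 must agree with e2 on every attribute that e2 defines.
    agree = all(e1.attrs.get(a) == v for a, v in e2.attrs.items())
    return same_thing and contained and agree

a = Entity("doc", (2, 3), {"author": "bob", "version": 1})
b = Entity("doc", (1, 4), {"author": "bob"})
c = Entity("doc", (0, 9))
twin = Entity("doc", (2, 3), {"author": "bob", "version": 1})

transitive_case = (specialization_of(a, b) and specialization_of(b, c)
                   and specialization_of(a, c))
antisymmetry_fails = (specialization_of(a, twin) and specialization_of(twin, a)
                      and a is not twin)
```

The pair `a` and `twin` specialize each other yet remain distinct objects, which is the situation that the stipulation mentioned under antisymmetry would exclude.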

Alternate

The <math>alternateOf</math> relation indicates when two entity records present (possibly different) aspects of the same thing. The two entities may or may not overlap in time.

<math>alternateOf(a, b)</math> if and only if there exists c such that <math>specializationOf(a,c)</math> and <math>specializationOf(b,c)</math>? This does not necessarily hold without further assumptions about the Entities.

Axiomatization

Definitional Rules

The definitions are essentially used to map the compact notation and implicit placeholder notation used in PROV-N to the abstract syntax of PROV-DM used in the rest of this semantics. We can formalize the definitional expansion rules as follows:

Definition 4 (optional-placeholders)

<math>activity(id,t1,-,attrs) \iff \exists t2.~activity(id,t1,t2,attrs)</math>. Here, t1 must not be a placeholder.

For each <math>r \in \{ used, wasGeneratedBy, wasStartedBy, wasEndedBy, wasInvalidatedBy, wasAssociatedWith, actedOnBehalfOf \}</math>, if the <math>i</math>th parameter of <math>r</math> is an expandable parameter of <math>r</math> as specified in Table 3 of PROV-CONSTRAINTS then the following definition holds:

Constraint 52 (impossible-specialization-reflexive)

Constraint 53 (impossible-property-overlap)

For each r and s in { used, wasGeneratedBy, wasInvalidatedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf} such that r and s are different relation names, the following constraint holds:

Inference 12 (revision-is-alternate-inference)

In this inference, any of <math>a</math>, <math>g</math> or <math>u</math> may be placeholders.

<math>\forall id,e_1,e_2,a,g,u.~wasDerivedFrom(id,e_2,e_1,a,g,u,[prov:type=prov:Revision]) \Longrightarrow alternateOf(e_2,e_1)</math>
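Read operationally, the inference adds an <math>alternateOf</math> fact for each matching derivation. The sketch below is illustrative only: the tuple encoding, the use of the string `"-"` for placeholders, and the function name are assumptions, and attribute lists are modelled as dictionaries.

```python
def apply_revision_is_alternate(derivations):
    """derivations: iterable of (id, e2, e1, a, g, u, attrs) tuples, where
    attrs is a dict and a, g, u may be the placeholder '-'.
    Returns the set of inferred alternateOf(e2, e1) pairs."""
    return {
        (e2, e1)
        for (_id, e2, e1, _a, _g, _u, attrs) in derivations
        if attrs == {"prov:type": "prov:Revision"}
    }

facts = [
    ("d1", "ex:v2", "ex:v1", "-", "-", "-", {"prov:type": "prov:Revision"}),
    ("d2", "ex:v3", "ex:v1", "ex:act", "-", "-", {}),
]
inferred = apply_revision_is_alternate(facts)
```

Only the first derivation carries the revision type, so only `("ex:v2", "ex:v1")` is inferred.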
