IIT Database Group

Ariadne

Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also to provide advanced functionality such as revision processing or query debugging. Ariadne is a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained stream provenance. In addition to applying this technique to compute provenance eagerly during query execution, we also investigated how to decouple provenance computation from query processing to reduce runtime overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance, which allows lazily replaying a query network and reconstructing its provenance, as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We have developed stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne is implemented as an extension of the Borealis DSMS.

Features

On-demand provenance generation for streaming queries

Instrument parts of the query network to propagate provenance

Temporary preservation of inputs and reconstruction for provenance retrieval

Eager generation of provenance

Lazy generation of provenance through replay

Compression techniques for provenance

Operator Instrumentation

We call the main approach that Ariadne uses to generate provenance for a streaming query network operator instrumentation. The key idea behind this approach is to extend each operator implementation so that the operator annotates its output with provenance information derived from the provenance annotations on its inputs. We are not aware of any other system that actually implements such an approach. Under operator instrumentation, provenance annotations are processed in line with the regular data; the structure of the original query network is kept as is (operators are simply replaced with their instrumented counterparts). Thus, most issues caused by non-determinism are dealt with in a rather natural way, since the actual execution of the original query network is traced. Furthermore, provenance for a subnetwork can be traced by instrumenting only the operators in that subnetwork. The main drawback of operator instrumentation is the need to extend all operators. However, this extension can be implemented with reasonable effort.
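The idea of instrumented operators can be sketched as follows. This is a minimal illustration, not Ariadne's actual implementation: the tuple layout, operator names, and annotation scheme are assumptions. Each tuple carries an annotation holding the identifiers of the input tuples in its provenance, and each instrumented operator produces its regular output while combining the annotations of the inputs it consumed.

```python
# Hypothetical sketch of operator instrumentation (illustrative names,
# not Ariadne's API). Each tuple carries a provenance annotation: the
# set of identifiers of the input tuples it was derived from.
from dataclasses import dataclass, field

@dataclass
class Tuple:
    tid: int                     # unique tuple identifier
    values: tuple                # regular payload
    prov: frozenset = field(default_factory=frozenset)  # provenance annotation

def instrumented_filter(pred, stream):
    """A filter whose outputs inherit the provenance of the passing input."""
    for t in stream:
        if pred(t.values):
            # The provenance of a filter output is the input tuple itself
            # plus anything already recorded in its annotation.
            yield Tuple(t.tid, t.values, t.prov | {t.tid})

def instrumented_join(left, right, on):
    """A toy (blocking) join that unions the provenance of both inputs."""
    right_list = list(right)
    for l in left:
        for r in right_list:
            if on(l.values, r.values):
                yield Tuple(
                    tid=hash((l.tid, r.tid)) & 0x7FFFFFFF,  # fresh id for the output
                    values=l.values + r.values,
                    prov=(l.prov | {l.tid}) | (r.prov | {r.tid}),
                )
```

Because each instrumented operator only reads and writes annotations locally, the original network structure and its execution order are untouched, which is what makes tracing robust under non-determinism.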

Lazy and Eager Provenance Generation and Retrieval

With operator instrumentation, provenance can be generated either eagerly during query execution (our default approach) or lazily upon request. We support both types of generation because they differ in their storage, runtime, and retrieval overhead. This enables the user to trade runtime overhead on the original query network against storage cost and runtime overhead when retrieving provenance.

The figure above shows an overview of the reduced-eager operator instrumentation approach. We temporarily store the input tuples for the instrumented parts of the network (e.g., for input streams \(S_1\) and \(S_2\), since we want provenance for the entire query network). The tuples in the output stream of the instrumented network carry the provenance annotations as described above. Recall that a provenance annotation is the set of identifiers for the tuples in the provenance of an output.
Provenance is reconstructed for retrieval from the annotations using a new operator called p-join (\(\ltimes\)). For each output tuple \(t\), this operator retrieves all input tuples in the provenance using the set of identifiers from the provenance annotation and outputs all combinations of \(t\) with a tuple from its provenance. Each of these combinations is emitted as a single tuple on stream \(P\).
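The retrieval step can be sketched as follows. This is an illustration of the p-join idea under assumed data structures (a dict archive keyed by tuple identifier), not Ariadne's actual implementation.

```python
# Illustrative sketch of the p-join retrieval step, assuming the inputs of
# the instrumented network were archived in a dict keyed by tuple id.
# Names and data layout are assumptions, not Ariadne's implementation.

def p_join(outputs, archived_inputs):
    """For each output tuple, emit one (output, input) pair per tuple
    identifier in its provenance annotation."""
    for out_tuple, prov_ids in outputs:
        for tid in sorted(prov_ids):
            # Look up the archived input tuple by its identifier and emit
            # the combination as a single result tuple on stream P.
            yield (out_tuple, archived_inputs[tid])

archive = {1: ('S1', 'a'), 2: ('S1', 'b'), 7: ('S2', 'x')}
results = [(('out1',), {1, 7}), (('out2',), {2})]
stream_P = list(p_join(results, archive))
# stream_P pairs each output with every archived input in its provenance
```

Note that only the outputs the user actually asks about need to flow into the p-join, which is what makes lazy, selective reconstruction possible.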

We call this approach Reduced-Eager because we eagerly propagate a reduced form of provenance (the tuple identifier sets) during query execution and lazily reconstruct full provenance independent of the execution of the original network. Compared with using sets of full tuples as annotations, this approach pays a price for storing inputs and reconstructing tuples. However, because compressed representations can be used, this cost is offset by a significant reduction in provenance generation cost (in terms of both runtime and latency). Since reconstruction is separate from generation, we can often avoid reconstructing complete provenance tuples during provenance retrieval, e.g., when the user requests provenance for only some results (a query over provenance).