The overwhelming flow of information in many data stream applications forces
many companies to outsource to a third-party the deployment of a Data Stream
Management System (DSMS) for performing desired computations. Remote
computations intrinsically raise issues of trust, making query execution
assurance on data streams a problem with practical implications. Consider a
client observing the same data stream as a remote server (e.g., network
traffic),
that registers a continuous query on the server's DSMS, and receives answers
upon request. The client needs to verify the integrity of the results using
significantly fewer resources than evaluating the query locally. Towards that
goal, we propose a {\em probabilistic algorithm} for selection and
aggregate/group-by queries, that uses constant space irrespective of the
result-set size, has low update cost, and arbitrarily small probability of
failure. We generalize this algorithm to allow some tolerance on the number of
errors permitted (irrespective of error magnitude), and also discuss the
hardness of permitting arbitrary errors of small magnitude. We also perform an
empirical evaluation using live network traffic.

Update: (05/14/08)Fixed some bugs and newly released the PIRS* and FM-PIRS!

This version includes the followings: PIRS,
PIRS test with World Cup Data Set, PIRS test with IPs data set, PIRS^\gamma (PIRSET,
the exact solution) , PIRS^\pm\gamma (PIRSAT, the approximate solution) and
their corresponding buffering techniques (using LRU to explore locality), and
PIRS-attack to simulate attacks to PIRS. Now also includes PIRS* (to identify which groups contain errors and) and FM-PIRS that could indentify how many errors in all groups!

Quick Install

The subfolder's names are self-explain (pirsgi refers to PIRS*, gi stands for group identifier). Each
subfolder contains a Makefile for easy-compilation. All the main test program
has a verbose help output to explain what parameters it expects.

We have tested PIRS (and its variants) for two sets of real data streams. One
is the World Cup data
set which consists of the access log for 1998 world cup web sites. You may
find detailed description of this data set from
here. The second
one is the live IP traces from AT&T backbone
network. Due to privacy and legal requirement, we are not able to release the
IPs dataset. However, the World Cup data stream is public. Our experiment consists of the traffic/request from
Day 46 to Day 47. You may download the original data from
WCUP_data_day4647.
However, you need to combine the individual small data files into one file. For
testing purpose, you may grab any one small file from
ITA website and
test-run PIRS and its variants.