Probabilistic Database Summarization
for Interactive Data Exploration

Abstract

We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques and give three critical optimizations to improve preprocessing time and query accuracy. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the United States and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values.

Interactive data exploration allows a data analyst to browse, query, transform, and visualize data at “human speed” [7]. It has long been recognized that general-purpose DBMSs are ill suited for interactive exploration [19]. While users require interactive responses, they do not necessarily require precise responses because either the response is used in some visualization, which has limited resolution, or an approximate result is sufficient and can be followed up with a more costly query if needed. Approximate Query Processing (AQP) refers to a set of techniques designed to allow fast but approximate answers to queries. All successful AQP systems to date rely on sampling or a combination of sampling and indexes. The sample can either be computed on-the-fly, e.g., in the highly influential work on online aggregation [12] or systems like DBO [14] and Quickr [16], or precomputed offline, as in BlinkDB [2] or Sample+Seek [9]. Samples have the advantage that they are easy to compute, can accurately estimate aggregate values, and are good at detecting heavy hitters. However, sampling may fail to return estimates for small populations; targeted stratified samples can alleviate this shortcoming, but stratified samples need to be precomputed to target a specific query, defeating the original purpose of AQP.

In this paper, we propose an alternative approach to interactive data exploration based on the Maximum Entropy principle (MaxEnt). The MaxEnt model has been applied in many settings beyond data exploration; e.g., the multiplicative weights mechanism [11] is a MaxEnt model for both differentially private and, by [10], statistically valid answers to queries, and it has been shown to be theoretically optimal. In our setting of the MaxEnt model, the data is preprocessed to compute a probabilistic model. Then, queries are answered by doing probabilistic inference on this model. The model is defined as the probabilistic space that obeys some observed statistics on the data and makes no other assumptions (Occam’s principle). The choice of statistics boils down to a precision/memory tradeoff: the more statistics one includes, the more precise the model and the more space required. Once computed, the MaxEnt model defines a probability distribution on possible worlds, and users can interact with this model to obtain approximate query results. Unlike a sample, which may miss rare items, the MaxEnt model can infer something about every query.

Despite its theoretical appeal, the computational challenges associated with the MaxEnt model make it difficult to use in practice. In this paper, we develop the first scalable techniques to compute and use the MaxEnt model, and we illustrate them with interactive data exploration. Our first contribution is to simplify the standard MaxEnt model to a form that is appropriate for data summarization (Sec. 3). We show how to simplify the MaxEnt model to a multi-linear polynomial that has one monomial for each possible tuple (Sec. 3, Eq. (5)) rather than its naïve form that has one monomial for each possible world (Sec. 2, Eq. (2)). Even with this simplification, the MaxEnt model starts out larger than the data: the flights dataset is 5 GB, but the number of possible tuples is approximately 10^10, which is larger than the original data. Our first optimization is a compression technique for the polynomial of the MaxEnt model (Sec. 4.1); for example, for the flights dataset, the summary is below 200 MB, while for our larger dataset of 210 GB, it is less than 1 GB. Our second optimization is a new technique for query evaluation on the MaxEnt model (Sec. 4.2) that only requires setting some variables to 0; this reduces the query runtime to, on average, below 500 ms and always below 1 s.

We find that the main bottleneck in using the MaxEnt model is computing the model itself, i.e., computing the values of the variables of the polynomial such that it matches the existing statistics over the data. Solving the MaxEnt model is difficult; prior work on multi-dimensional histograms [18] uses an iterative scaling algorithm for this purpose. It is well understood that the MaxEnt model can be solved by reducing it to a convex optimization problem [23] over a dual function (Sec. 2), which can be solved using Gradient Descent. However, even this is difficult given the size of our model. We adapt a variant of Stochastic Gradient Descent called Mirror Descent [5], which, combined with our optimized query evaluation technique, can compute the MaxEnt model for large datasets in under a day.

In summary, in this paper, we develop the following new techniques:

A closed-form representation of the probability space of possible worlds using the Principle of Maximum Entropy, and a method to use the representation to answer queries in expectation (Sec 3).

A new method for selecting 2-dimensional statistics based on a modified KD-tree (Sec 4.3).

We implement the above techniques in a prototype system that we call EntropyDB and evaluate it on the flights and astronomy datasets. We find that EntropyDB can answer queries faster than sampling while introducing no more error, on average, and does better at identifying small populations.

We summarize data by fitting a probability distribution over the active domain. The distribution assumes that the domain values are distributed in a way that preserves given statistics over the data but are otherwise uniform.

For example, consider a data scientist who analyzes a dataset of flights in the United States for the month of December 2013. All she knows is that the dataset includes all flights within the 50 possible states and that there are 500,000 flights in total. She wants to know how many of those flights are from CA to NY. Without any extra information, our approach would assume all (origin, destination) pairs are equally likely and estimate that there are 500,000/50^2 = 200 flights.

Now suppose the data scientist finds out that flights leaving CA only go to NY, FL, or WA. This changes the estimate because instead of there being 500,000/50 = 10,000 flights leaving CA and uniformly going to all 50 states, those flights are only going to 3 states. Therefore, the estimate becomes 10,000/3 ≈ 3,333 flights.
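The two uniformity estimates can be reproduced with a few lines of arithmetic (a sketch of the reasoning only; the numbers come from the running example):

```python
# Back-of-envelope uniformity estimates from the flights example.
total_flights = 500_000
num_states = 50

# No information: flights uniform over all 50*50 (origin, dest) pairs.
est_ca_ny = total_flights / num_states**2
assert est_ca_ny == 200

# Knowing CA-origin flights go only to NY, FL, or WA:
est_ca_total = total_flights / num_states       # 10,000 flights leave CA
est_ca_ny_refined = est_ca_total / 3            # uniform over 3 destinations
assert round(est_ca_ny_refined) == 3333
```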

This example demonstrates how our summarization technique would answer queries, and the rest of this section covers its theoretical foundation.

2.1 Possible World Semantics

To model a probabilistic database, we use the slotted possible world semantics where rows have an inherent unique identifier, meaning the order of the tuples matters. Our set of possible worlds is generated from the active domain and size of each relation. Each database instance is one possible world with an associated probability such that the probabilities of all possible worlds sum to one.

In contrast to typical probabilistic databases where the probability of a relation is calculated from the probability of each tuple, we calculate a relation’s probability from a formula derived from the MaxEnt principle and a set of constraints on the overall distribution. This approach captures the idea that the distribution should be uniform except where otherwise specified by the given constraints.

2.2 The Principle of Maximum Entropy

The Principle of Maximum Entropy (MaxEnt) states that, subject to prior data, the probability distribution that best represents the current state of knowledge is the one with the largest entropy. This means that, given our set of possible worlds PWD, the probability distribution Pr(I) is the one that agrees with the prior information on the data and maximizes

−∑_{I∈PWD} Pr(I) log Pr(I)

where I is a database instance, also called a possible world. The above probability must be normalized, ∑_I Pr(I) = 1, and must satisfy the prior information represented by a set of k expected-value constraints:

s_j = E[ϕ_j(I)],   j = 1, …, k

(1)

where s_j is a known value and ϕ_j is a function on I that returns a numerical value in ℝ.
One example constraint is that the number of flights from CA to WI is 0.

Following prior work on the MaxEnt principle and solving constrained optimization problems [4, 23, 20], the MaxEnt probability distribution takes the form

Pr(I) = (1/Z) exp(∑_{j=1}^{k} θ_j ϕ_j(I))

(2)

where θj is a parameter and Z is the following normalization constant:

Z ≝ ∑_{I∈PWD} exp(∑_{j=1}^{k} θ_j ϕ_j(I)).

To compute the k parameters θ_j, we must solve the non-linear system of k equations in Eq. (1), which is computationally difficult. However, it turns out [23] that Eq. (1) is equivalent to ∂Ψ/∂θ_j = 0, where the dual Ψ is defined as:

Ψ ≝ ∑_{j=1}^{k} s_j θ_j − ln(Z).

Furthermore, Ψ is concave, which means solving for the k parameters can be achieved by maximizing Ψ. We note that Z is called the partition function, and its log, ln(Z), is called the cumulant.

This section explains how we use the MaxEnt model for approximate query answering. We first show how we use the MaxEnt framework to transform a single relation R into a probability distribution represented by P. We then explain how we use P to answer queries over R.

3.1 Maximum Entropy Model of Data

We consider a single relation with m attributes and schema R(A_1,…,A_m), where each attribute A_i has an active domain D_i, assumed to be discrete and ordered. Let Tup = D_1×D_2×⋯×D_m = {t_1,…,t_d} be the set of all possible tuples. Denoting N_i = |D_i|, we have d = |Tup| = ∏_{i=1}^{m} N_i.

An instance of R is an ordered bag of n tuples, denoted I. For each I, we form a frequency vector, a d-dimensional vector n^I = [n^I_1,…,n^I_d] ∈ ℝ^d, where each number n^I_i represents the count of the tuple t_i ∈ Tup in I (Fig. 1). The mapping from I to n^I is not one-to-one because the instance I is ordered, and two distinct instances may have the same counts. Further, for any instance I of cardinality n, ||n^I||_1 = ∑_i n^I_i = n. The frequency vector of an instance consisting of a single tuple {t_i} is denoted n^{t_i} = [0,…,0,1,0,…,0], with a single value 1 in the i-th position; i.e., {n^{t_i} : i = 1,…,d} forms a basis for ℝ^d.

While the MaxEnt principle allows us, theoretically, to answer any query probabilistically by averaging the query over all possible instances, in this paper we limit our discussion to linear queries. A linear query is a d-dimensional vector q = [q_1,…,q_d] in ℝ^d. The answer to q on instance I is the dot product ⟨q,n^I⟩ = ∑_{i=1}^{d} q_i n^I_i. With some abuse of notation, we will write I when referring to n^I and t_i when referring to n^{t_i}. Notice that ⟨q,t_i⟩ = q_i and, for any instance I, ⟨q,I⟩ = ∑_{i=1}^{d} n^I_i ⟨q,t_i⟩.

Fig. 1 illustrates the data and query model. Any counting query is a vector q where all coordinates are 0 or 1 and can be equivalently defined by a predicate π such that ⟨q,I⟩ = |σ_π(I)|; with more abuse, we will use π instead of q when referring to a counting query. Other SQL queries can be modeled using linear queries, too. For example, SELECT A, COUNT(*) AS cnt FROM R GROUP BY A ORDER BY cnt DESC LIMIT 10 corresponds to several linear queries, one for each group, where the outputs are sorted and the top 10 returned.
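The data and query model can be illustrated concretely (a toy sketch with names of our choosing; d = 4 possible tuples over two binary attributes):

```python
# Toy illustration of the linear-query model: an instance is a frequency
# vector over Tup, and a counting query is a 0/1 vector over Tup.
tuples = [(0, 0), (0, 1), (1, 0), (1, 1)]   # Tup = D1 x D2, d = 4
n_I = [3, 0, 2, 5]                          # frequency vector of an instance, n = 10

# counting query with predicate A1 = 1, expressed as a 0/1 vector q
q = [1 if t[0] == 1 else 0 for t in tuples]

answer = sum(qi * ni for qi, ni in zip(q, n_I))   # <q, n_I>
assert answer == 7                                # |sigma_{A1=1}(I)| = 2 + 5
```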

Our goal is to compute a summary of the data that is small yet allows us to approximately compute the answer to any linear query. We assume that the cardinality n of R is fixed and known. In addition, we know k statistics, Φ = {(c_j, s_j) : j = 1,…,k}, where c_j is a linear query and s_j ≥ 0 is a number. Intuitively, the statistic (c_j, s_j) asserts that ⟨c_j,I⟩ = s_j. For example, we can write 1-dimensional and 2-dimensional (2D) statistics like |σ_{A_1=63}(I)| = 20 and |σ_{A_1∈[50,99] ∧ A_2∈[1,9]}(I)| = 300.

Next, we derive the MaxEnt distribution for the possible instances I of a fixed size n. We substitute θ_j = ln(α_j) so that Eq. (2) becomes

Pr(I) = (1/Z) ∏_{j=1}^{k} α_j^{⟨c_j,I⟩}.

(3)

We prove the following about the structure of the partition function Z:

Lemma 3.1

The partition function is given by

Z = P^n

(4)

where P is the multi-linear polynomial

P(α_1,…,α_k) ≝ ∑_{i=1}^{d} ∏_{j=1}^{k} α_j^{⟨c_j,t_i⟩}.

(5)

Proof.

Fix any vector n = [n_1,…,n_d] such that ||n||_1 = ∑_{i=1}^{d} n_i = n. The number of instances I of cardinality n with n^I = n is n!/∏_i n_i!. Furthermore, for each such instance, ⟨c_j,I⟩ = ⟨c_j,n⟩ = ∑_i n_i ⟨c_j,t_i⟩. Therefore, by the multinomial theorem,

Z = ∑_I ∏_{j=1}^{k} α_j^{⟨c_j,I⟩} = ∑_{n : ||n||_1 = n} (n!/∏_i n_i!) ∏_{j=1}^{k} α_j^{∑_i n_i ⟨c_j,t_i⟩} = (∑_{i=1}^{d} ∏_{j=1}^{k} α_j^{⟨c_j,t_i⟩})^n = P^n.

The data summary consists of the polynomial P (Eq. (5)) and the values of its parameters αj; the polynomial is defined by the linear queries cj in the statistics Φ, and the parameters are computed from the numerical values sj.
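The lemma can be sanity-checked by brute force on a toy model (our own construction: one point statistic per tuple, so that ∏_j α_j^{⟨c_j,t_i⟩} collapses to a single weight α_i):

```python
import itertools
from math import prod

# Brute-force check of the lemma Z = P^n on a tiny slotted database:
# d = 3 possible tuples, n = 3 ordered tuple slots, and one point
# statistic per tuple, so the per-tuple weight is just alpha_i.
alpha = [0.5, 1.5, 2.0]
d, n = len(alpha), 3

# Z sums the weight of every ordered instance I in Tup^n
Z = sum(prod(alpha[i] for i in inst)
        for inst in itertools.product(range(d), repeat=n))

P = sum(alpha)                      # the multi-linear polynomial, Eq. (5)
assert abs(Z - P**n) < 1e-9
```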

Example 3.1

Consider a relation with three attributes R(A,B,C), and assume that the domain of each attribute has 2 distinct elements. Assume n=10 and the only statistics in Φ are the following 1-dimensional statistics:

(A=a_1, 3)   (B=b_1, 8)   (C=c_1, 6)
(A=a_2, 7)   (B=b_2, 2)   (C=c_2, 4).

The first statistic asserts that |σ_{A=a_1}(I)| = 3, etc. The polynomial P is

P = α_1β_1γ_1 + α_1β_1γ_2 + α_1β_2γ_1 + α_1β_2γ_2 +
    α_2β_1γ_1 + α_2β_1γ_2 + α_2β_2γ_1 + α_2β_2γ_2

where α_1, α_2 are variables associated with the statistics on A, β_1, β_2 with those on B, and γ_1, γ_2 with those on C.

Now suppose we add 2-dimensional statistics on AB and BC. We use [αβ]_{1,1} to denote a single variable corresponding to a 2D statistic on the attributes AB, and similarly [βγ]_{1,1} for BC; each such variable multiplies the monomials of P that its statistic covers. Notice that each 2-dimensional variable only occurs with its related 1-dimensional variables: [αβ]_{1,1}, for example, appears only in terms that also contain α_1 and β_1.

Now consider the earlier instance I. Its probability becomes Pr(I) = α_1^{10} β_1 β_2^{9} γ_1 γ_2^{9} [αβ]_{1,1} [βγ]_{1,1} / P^{10}.

To facilitate analytical queries, we choose the set of statistics Φ as follows:

Each statistic ϕj=(cj,sj) is associated with some predicate πj such that ⟨cj,I⟩=|σπj(I)|. It follows that for every tuple ti, ⟨cj,ti⟩ is either 0 or 1; therefore, each variable αj has degree 1 in the polynomial P in Eq. (5).

For each domain D_i, we include a complete set of 1-dimensional statistics in our summary. In other words, for each v ∈ D_i, Φ contains one statistic with predicate A_i = v. We denote by J_i ⊆ [k] the set of indices of the 1-dimensional statistics associated with D_i; therefore, |J_i| = |D_i| = N_i.

We allow multi-dimensional statistics to be given by arbitrary predicates. They may be overlapping and/or incomplete; e.g., one statistic may count the tuples satisfying A1∈[10,30]∧A2=5 and another count the tuples satisfying A2∈[20,40]∧A4=20.

We assume the number of 1-dimensional statistics dominates the number of attribute combinations; i.e., ∑_{i=1}^{m} N_i ≫ 2^m.

If some domain D_i is large, it is beneficial to reduce the size of the domain using equi-width buckets. In that case, we assume the elements of D_i represent buckets, and N_i is the number of buckets.

We enforce our MaxEnt distribution to be overcomplete [23, pp. 40] (as opposed to minimal). More precisely, for any attribute A_i and any instance I, we have ∑_{j∈J_i} ⟨c_j,I⟩ = n, which means that some statistics are redundant since they can be computed from the others and from the size n of the instance.

Note that as a consequence of overcompleteness, for any attribute Ai, one can write P as a linear expression

P = ∑_{j∈J_i} α_j P_j

(7)

where each P_j, j ∈ J_i, is a polynomial that does not contain the variables (α_j)_{j∈J_i}. In Example 3.1, the 1-dimensional variables for A are α_1, α_2, and indeed each monomial in Eq. (3.1) contains exactly one of them. One can write P = α_1P_1 + α_2P_2, where α_1P_1 collects the monomials containing α_1 and α_2P_2 those containing α_2. P is also linear in β_1, β_2 and in γ_1, γ_2.

3.2 Query Answering

In this section, we show how to use the data summary to approximately answer a linear query q by returning its expected value E[⟨q,I⟩]. The summary (the polynomial P and the values of its variables α_j) uniquely defines a probability space on the possible worlds (Eq. (3) and (5)). We start with a well-known result in the MaxEnt model: if c_ℓ is the linear query associated with the variable α_ℓ, then

E[⟨c_ℓ,I⟩] = (n/P) α_ℓ (∂P/∂α_ℓ).

(8)

We review the proof here. The expected value of ⟨cℓ,I⟩ over the probability space (Eq. (3)) is

E[⟨c_ℓ,I⟩] = (1/P^n) ∑_I ⟨c_ℓ,I⟩ ∏_j α_j^{⟨c_j,I⟩} = (1/P^n) ∑_I α_ℓ (∂/∂α_ℓ) ∏_j α_j^{⟨c_j,I⟩}
 = (1/P^n) α_ℓ (∂/∂α_ℓ) ∑_I ∏_j α_j^{⟨c_j,I⟩} = (1/P^n) α_ℓ (∂P^n/∂α_ℓ) = (n/P) α_ℓ (∂P/∂α_ℓ).
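Eq. (8) can likewise be verified numerically on a toy model (our own setup: point statistics over d = 3 tuples, so each tuple's weight is a single α_i):

```python
import itertools
from math import prod

# Numeric check of Eq. (8): E[<c_l, I>] = (n/P) * alpha_l * dP/dalpha_l,
# on a toy model with d = 3 possible tuples, n = 3 tuple slots, and one
# point statistic per tuple with weight alpha_i.
alpha = [0.5, 1.5, 2.0]
d, n, l = len(alpha), 3, 1          # check the statistic of tuple t_2 (index 1)

P = sum(alpha)
worlds = list(itertools.product(range(d), repeat=n))
weights = [prod(alpha[i] for i in w) for w in worlds]

# expectation of the count of tuple t_l under Pr(I) = weight(I) / P^n
expected = sum(wt * w.count(l) for w, wt in zip(worlds, weights)) / sum(weights)

# Eq. (8): P is linear here, so dP/dalpha_l = 1
assert abs(expected - n * alpha[l] * 1.0 / P) < 1e-9
```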

To compute a new linear query q, we add it to the statistical queries c_j, associate it with a fresh variable β, and denote by P_q the extended polynomial:

P_q(α_1,…,α_k,β) ≝ ∑_{i=1}^{d} (∏_{j=1}^{k} α_j^{⟨c_j,t_i⟩}) β^{⟨q,t_i⟩}

(9)

Notice that Pq[β=1]≡P; therefore, the extended data summary defines the same probability space as P. We can apply Eq. (8) to the query q to derive:

E[⟨q,I⟩] = (n/P) (∂P_q/∂β).

(10)

This leads to the following naïve strategy for computing the expected value of q: extend P to obtain Pq and apply formula Eq. (10). One way to obtain Pq is to iterate over all monomials in P and add β to the monomials corresponding to tuples counted by q. As this is inefficient, Sec. 4.2 describes how to avoid modifying the polynomial altogether.

3.3 Probabilistic Model Computation

We now describe how to compute the parameters of the summary. Given the statistics Φ = {(c_j,s_j) : j = 1,…,k}, we need to find values of the variables {α_j : j = 1,…,k} such that E[⟨c_j,I⟩] = s_j for all j = 1,…,k. As explained in Sec. 2, this is equivalent to maximizing the dual function Ψ:

Ψ ≝ ∑_{j=1}^{k} s_j ln(α_j) − n ln(P).

(11)

Indeed, maximizing Ψ reduces to solving the equations ∂Ψ/∂α_j = 0 for all j. Direct calculation gives ∂Ψ/∂α_j = s_j/α_j − (n/P)(∂P/∂α_j) = 0, which is equivalent to s_j = E[⟨c_j,I⟩] by Eq. (8). The dual function Ψ is concave; hence it has a single maximum value that can be obtained using convex optimization techniques such as Gradient Descent.

In particular, we achieve the fastest convergence using a variant of Stochastic Gradient Descent (SGD) called Mirror Descent [5], where each iteration chooses some j ∈ {1,…,k} and updates α_j by solving (n α_j / P)(∂P/∂α_j) = s_j while keeping all other parameters fixed. In other words, the step of SGD is chosen to solve ∂Ψ/∂α_j = 0. Denoting P_{α_j} ≝ ∂P/∂α_j and solving, we obtain:

α_j = s_j (P − α_j P_{α_j}) / ((n − s_j) P_{α_j}).

(12)

Since P is linear in each α_j, neither P − α_j P_{α_j} nor P_{α_j} contains the variable α_j.

We repeat this for all j and continue this process until all differences |s_j − (n α_j P_{α_j})/P|, j = 1,…,k, are below some threshold. Algorithm 1 shows pseudocode for the solving process.
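A minimal sketch of this update loop, assuming only complete 1-D statistics so that P factorizes into per-attribute sums (the setup and names are ours, not Algorithm 1's exact pseudocode):

```python
# Coordinate updates of Eq. (12) for a toy summary over two attributes
# with complete 1-D statistics only, so P = (sum of A-vars)*(sum of B-vars).
n = 10
stats = [[3.0, 7.0], [8.0, 2.0]]             # s_j for attributes A and B
alphas = [[1.0] * len(s) for s in stats]     # initial alpha_j = 1

def P(alphas):
    p = 1.0
    for group in alphas:
        p *= sum(group)
    return p

def dP(alphas, a):
    # P is linear in each variable of attribute a, so dP/dalpha_{a,j}
    # is the product of the other attributes' sums
    p = 1.0
    for b, group in enumerate(alphas):
        if b != a:
            p *= sum(group)
    return p

for _ in range(50):                          # sweep until convergence
    for a, group in enumerate(alphas):
        for j, s in enumerate(stats[a]):
            Pa = dP(alphas, a)
            rest = P(alphas) - group[j] * Pa      # contains no alpha_{a,j}
            group[j] = s * rest / ((n - s) * Pa)  # solve dPsi/dalpha_j = 0

# the fitted model reproduces the statistics: E[<c_j,I>] = n*alpha_j*Pa/P
for a, group in enumerate(alphas):
    for j, s in enumerate(stats[a]):
        assert abs(n * group[j] * dP(alphas, a) / P(alphas) - s) < 1e-6
```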

We now discuss three optimizations: (1) summary compression in Sec. 4.1, (2) optimized query processing in Sec. 4.2, and (3) selection of statistics in Sec. 4.3.

4.1 Compression of the Data Summary

The summary consists of the polynomial P that, by definition, has |Tup| monomials, where |Tup| = ∏_{i=1}^{m} N_i. We describe a technique that compresses the summary to a size closer to O(∑_i N_i).

We start by walking through an example with three attributes, A, B, and C, each with an active domain of size N_1 = N_2 = N_3 = 1000. Suppose first that we have only 1D statistics. Then, instead of representing P as a sum of 1000^3 monomials, P = ∑_{i,j,k∈[1000]} α_i β_j γ_k, we factorize it as P = (∑_i α_i)(∑_j β_j)(∑_k γ_k); the new representation has size 3·1000.

Now, suppose we add a single 3D statistic on ABC: A=3 ∧ B=4 ∧ C=5. The new variable, call it δ, occurs in a single monomial of P, namely α_3β_4γ_5δ. Thus, we can compress P to (∑_i α_i)(∑_j β_j)(∑_k γ_k) + α_3β_4γ_5(δ−1).
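This compression step can be checked numerically by comparing against the naive sum of monomials on a scaled-down domain (sizes, indices, and weights below are our own choices):

```python
import random

# Check the compressed polynomial against the naive sum of monomials for
# a single 3-D point statistic, with domains shrunk from 1000 to 5 so the
# naive sum is enumerable.
random.seed(0)
N = 5
alpha = [random.uniform(0.5, 1.5) for _ in range(N)]
beta = [random.uniform(0.5, 1.5) for _ in range(N)]
gamma = [random.uniform(0.5, 1.5) for _ in range(N)]
delta = 1.7                                  # variable of the 3-D statistic
pt = (2, 3, 4)                               # the single tuple it covers

naive = sum(alpha[i] * beta[j] * gamma[k] * (delta if (i, j, k) == pt else 1.0)
            for i in range(N) for j in range(N) for k in range(N))

compressed = (sum(alpha) * sum(beta) * sum(gamma)
              + alpha[2] * beta[3] * gamma[4] * (delta - 1))
assert abs(naive - compressed) < 1e-9
```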

Instead, suppose we add a single 2D range statistic on AB, say A∈[101,200] ∧ B∈[501,600], and call its associated variable δ_1. This affects 100·100·1000 monomials. We can avoid enumerating them by noting that they, too, factorize. The polynomial compresses to (∑_i α_i)(∑_j β_j)(∑_k γ_k) + (∑_{i=101}^{200} α_i)(∑_{j=501}^{600} β_j)(∑_k γ_k)(δ_1 − 1).

Finally, suppose we have three 2D statistics: the previous one on AB plus the statistics B∈[551,650] ∧ C∈[801,900] and B∈[651,700] ∧ C∈[701,800] on BC. Their associated variables are δ_1, δ_2, and δ_3. Now we must account for the fact that 100·50·100 monomials contain both δ_1 and δ_2. Applying the inclusion/exclusion principle, P compresses to the following:

P = (∑_i α_i)(∑_j β_j)(∑_k γ_k)
 + (∑_{i=101}^{200} α_i)(∑_{j=501}^{600} β_j)(∑_k γ_k)(δ_1 − 1)
 + (∑_i α_i)[(∑_{j=551}^{650} β_j)(∑_{k=801}^{900} γ_k)(δ_2 − 1) + (∑_{j=651}^{700} β_j)(∑_{k=701}^{800} γ_k)(δ_3 − 1)]
 + (∑_{i=101}^{200} α_i)(∑_{j=551}^{600} β_j)(∑_{k=801}^{900} γ_k)(δ_1 − 1)(δ_2 − 1).

The size, counting only the αs, βs, and γs for simplicity, is 3000 + 1200 + 1350 + 250.
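The inclusion/exclusion compression of this example can be checked numerically on a scaled-down version (domains of size 10; the ranges below are ours, chosen to reproduce the same overlap pattern: the B-ranges of the AB statistic and the first BC statistic intersect, while the second BC statistic is disjoint from both):

```python
# Scaled-down check of the inclusion/exclusion compression with three
# overlapping 2-D range statistics. Ranges (inclusive): d1 on A in [2,4],
# B in [1,3]; d2 on B in [3,6], C in [7,9]; d3 on B in [7,8], C in [4,6].
# The B-overlap of d1 and d2 is {3}.
N = 10
a = [1.0 + 0.1 * i for i in range(N)]
b = [1.0 + 0.2 * i for i in range(N)]
c = [1.0 + 0.3 * i for i in range(N)]
d1, d2, d3 = 1.5, 0.7, 2.0

def inr(x, lo, hi):
    return lo <= x <= hi

naive = sum(a[i] * b[j] * c[k]
            * (d1 if inr(i, 2, 4) and inr(j, 1, 3) else 1.0)
            * (d2 if inr(j, 3, 6) and inr(k, 7, 9) else 1.0)
            * (d3 if inr(j, 7, 8) and inr(k, 4, 6) else 1.0)
            for i in range(N) for j in range(N) for k in range(N))

S = sum
compressed = (S(a) * S(b) * S(c)
              + S(a[2:5]) * S(b[1:4]) * S(c) * (d1 - 1)
              + S(a) * (S(b[3:7]) * S(c[7:10]) * (d2 - 1)
                        + S(b[7:9]) * S(c[4:7]) * (d3 - 1))
              + S(a[2:5]) * S(b[3:4]) * S(c[7:10]) * (d1 - 1) * (d2 - 1))
assert abs(naive - compressed) < 1e-8
```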

Before proving the general formula for P, note that this compression is related to standard algebraic factorization techniques involving kernel extraction and rectangle coverings [13]; both techniques reduce the size of a polynomial by factoring out divisors. The standard techniques, however, are unsuitable for our use because they require enumeration of the product terms in the sum-of-product (SOP) polynomial to extract kernels and form cube matrices. Our polynomial in SOP form is too large to be materialized, making these techniques infeasible. It is future work to investigate other factorization techniques geared towards massive polynomials.

We now make the following three assumptions for the rest of the paper.

Each predicate has the form π_j = ⋀_{i=1}^{m} ρ_j^i, where m is the number of attributes and ρ_j^i is the projection of π_j onto A_i. If j ∈ J_i, then π_j ≡ ρ_j^i. For any set S ⊂ [k] of indices of multi-dimensional statistics, we denote ρ_S^i ≝ ⋀_{j∈S} ρ_j^i and π_S ≝ ⋀_i ρ_S^i; as usual, when S = ∅, ρ_∅^i ≡ true.

Each ρ_j^i is a range predicate A_i ∈ [u,v].

For each attribute set I, the multi-dimensional statistics whose attributes are exactly those in I are disjoint; i.e., for j_1, j_2 whose attributes are I (that is, ρ_j^i ≢ true for i ∈ I and ρ_j^i ≡ true for i ∉ I), we have π_{j_1} ∧ π_{j_2} ≡ false.

Using this, define J_I ⊆ P([k]) for I ⊆ [m] to be the set of sets of multi-dimensional statistics whose combined attributes are {A_i : i ∈ I} and whose intersection is non-empty (i.e., not false). In other words, for each S ∈ J_I, ρ_S^i ∉ {true, false} for i ∈ I and ρ_S^i ≡ true for i ∉ I.

For example, suppose we have the three 2D statistics from before: π_{j_1} = A_1∈[101,200] ∧ A_2∈[501,600], π_{j_2} = A_2∈[551,650] ∧ A_3∈[801,900], and π_{j_3} = A_2∈[651,700] ∧ A_3∈[701,800]. Then, {j_1} ∈ J_{{1,2}} and {j_2},{j_3} ∈ J_{{2,3}}. Further, {j_1,j_2} ∈ J_{{1,2,3}} because ρ_{j_1}^2 ∧ ρ_{j_2}^2 ≢ false. However, {j_1,j_3} ∉ J_{{1,2,3}} because ρ_{j_1}^2 ∧ ρ_{j_3}^2 ≡ false. Using these definitions, we now give the compression.

The proof uses induction on the size of I, but we omit it for lack of space.

To give intuition, when I = ∅, we get the sum over the 1D statistics because when S = ∅, part (ii) equals 1. When I is not empty, part (ii) has one summand for each set S of multi-dimensional statistics whose attributes are I and whose intersection is non-empty. For each such S, the summand sums all 1-dimensional variables α_j, j ∈ J_i, that are in the i-th projection of the predicate π_S (this is what the condition (π_j ∧ ρ_S^i) ≢ false checks) and multiplies them with the terms α_j − 1 for j ∈ S.

At a high level, our algorithm computes the compressed representation of P by first computing the summand for I = ∅ by iterating over all 1-dimensional statistics. It then iterates over the multi-dimensional statistics and builds a map from each attribute set I to the statistics defined on I; i.e., I → J_I such that |S| = 1 for S ∈ J_I. It then iteratively loops over this map, taking cross products of different entries J_I and J_{I′} to see if any new J_{I∪I′} can be generated. If so, J_{I∪I′} is added to the map. Once done, it iterates over the keys of this map to build the summands for each I.

The algorithm can be used during query answering to compute the compressed representation of P_q from P (Sec. 3.2) by rebuilding part (ii) for the new q. However, as this is inefficient and may increase the size of our polynomial, our system performs query answering differently, as explained in the next section.

We now analyze the size of the compressed polynomial P. Let B_a denote the number of non-empty J_I; i.e., the number of unique multi-dimensional attribute sets. Since B_a < 2^m and ∑_{i=1}^{m} N_i ≫ 2^m, B_a is dominated by ∑_{i=1}^{m} N_i. For some I, part (i) of the compression is O(∑_{i=1}^{m} N_i). Part (ii) is more complex. For some S ∈ J_I, the summand has size O(∑_{i=1}^{m} N_i + |S|). As |S| ≤ B_a ≪ ∑_{i=1}^{m} N_i, the summand is only O(∑_{i=1}^{m} N_i). Putting it together, for some I, the size is O(∑_{i=1}^{m} N_i + |J_I| ∑_{i=1}^{m} N_i) = O(|J_I| ∑_{i=1}^{m} N_i).

|J_I| is the number of sets of multi-dimensional statistics whose combined attributes are {A_i : i ∈ I} and whose intersection is non-empty. A way to think about this is that each A_i defines a dimension in |I|-dimensional space, and each S ∈ J_I defines a rectangle in this hyper-space. This means |J_I| is the number of rectangle coverings defined by the statistics over {A_i : i ∈ I}. If we denote R = max_I |J_I|, then the size of the summand is O(R ∑_{i=1}^{m} N_i).

Further, although there are 2^m possible sets I, J_I is non-empty for only B_a + 1 of them (the 1 accounts for I = ∅). Therefore, the size of the compression is O(B_a R ∑_{i=1}^{m} N_i).

Theorem 4.2

The size of the polynomial is O(B_a R ∑_{i=1}^{m} N_i), where B_a is the number of unique multi-dimensional attribute sets and R is the largest number of rectangle coverings defined by the statistics over some I.

In the worst case, if one gathers all possible multi-dimensional statistics, this compression will be worse than the uncompressed polynomial, which has size O(∏_{i=1}^{m} N_i). However, in practice, B_a < m, and R depends on the number and type of statistics collected, resulting in a significant reduction of the polynomial size to one closer to O(∑_{i=1}^{m} N_i) (see Fig. 4 discussion).

4.2 Optimized Query Answering

In this section, we assume that the query q is a counting query defined by a conjunction of predicates, one over each attribute A_i; i.e., ⟨q,I⟩ = |σ_π(I)|, where

π = ρ^1 ∧ ⋯ ∧ ρ^m

(16)

and ρ^i is a predicate over the attribute A_i. If q ignores A_i, then we simply set ρ^i ≡ true. Our goal is to compute E[⟨q,I⟩]. In Sec. 3.2, we described a direct approach that constructs a new polynomial P_q and returns Eq. (10). However, as discussed in Sec. 3.2 and Sec. 4.1, this may be expensive.

We describe here an optimized approach that computes E[⟨q,I⟩] directly from P. The advantage of this method is that it does not require any restructuring or rebuilding of the polynomial; instead, it can use any optimized oracle for evaluating P on given inputs. Our optimization has two parts: a new formula for E[⟨q,I⟩] and a new formula for derivatives.

New formula for E[⟨q,I⟩]: Let π_j be the predicate associated with the j-th statistical query; in other words, ⟨c_j,I⟩ = |σ_{π_j}(I)|. The next lemma applies to any query q defined by some predicate π. Recall that β is the new variable associated with q in P_q (Sec. 3.2).

Lemma 4.2

For any ℓ variables α_{j_1}, …, α_{j_ℓ} of P_q:

(1) If the logical implication π_{j_1} ∧ ⋯ ∧ π_{j_ℓ} ⇒ π holds, then

α_{j_1}⋯α_{j_ℓ} ∂^ℓP_q / (∂α_{j_1}⋯∂α_{j_ℓ}) = α_{j_1}⋯α_{j_ℓ} β ∂^{ℓ+1}P_q / (∂α_{j_1}⋯∂α_{j_ℓ} ∂β)

(17)

(2) If the logical equivalence π_{j_1} ∧ ⋯ ∧ π_{j_ℓ} ⇔ π holds, then

α_{j_1}⋯α_{j_ℓ} ∂^ℓP_q / (∂α_{j_1}⋯∂α_{j_ℓ}) = β ∂P_q/∂β

(18)

Proof.

(1) The proof is immediate by noting that every monomial of P_q that contains all variables α_{j_1},…,α_{j_ℓ} also contains β; therefore, all monomials on the LHS of Eq. (17) contain β and thus remain unaffected by applying the operator β ∂/∂β.

(2) From item (1), we derive Eq. (17); we now prove that the RHS of Eq. (17) equals β ∂P_q/∂β. We apply item (1) again to the implication π ⇒ π_{j_1} and obtain β ∂P_q/∂β = β α_{j_1} ∂²P_q/(∂β ∂α_{j_1}) (the role of β in Eq. (17) is now played by α_{j_1}). As P_q is linear in each variable, the order of partials does not matter, and this allows us to remove the operator α_{j_1} ∂/∂α_{j_1} from the RHS of Eq. (17). Repeating the argument for π ⇒ π_{j_2}, π ⇒ π_{j_3}, etc., we remove α_{j_2} ∂/∂α_{j_2}, then α_{j_3} ∂/∂α_{j_3}, etc., from the RHS.

Corollary

(1) Assume q is defined by a point predicate π = (A_1=v_1 ∧ ⋯ ∧ A_ℓ=v_ℓ) for some ℓ ≤ m. For each i = 1,…,ℓ, denote by j_i the index of the 1-dimensional statistic associated with the value v_i; in other words, π_{j_i} ≡ (A_i = v_i). Then,

E[⟨q,I⟩] = (n/P) α_{j_1}⋯α_{j_ℓ} ∂^ℓP / (∂α_{j_1}⋯∂α_{j_ℓ}).

(19)

(1) Eq. (19) follows from Eq. (10), Eq. (18), and the fact that P_q[β=1] ≡ P.
(2) Follows from (1) by expanding q as a sum of point queries as in item (1) of Lemma 4.2.
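Eq. (19) can be verified by brute force on a toy two-attribute model with complete 1-D statistics (our own setup; since P is linear in every variable, the mixed second partial ∂²P/∂α∂β equals 1 here):

```python
import itertools
from math import prod

# Brute-force check of Eq. (19): for a point query A1=v1 ^ A2=v2,
# E[<q,I>] = (n/P) * alpha_{j1} * alpha_{j2} * d^2 P/(dalpha_{j1} dalpha_{j2}).
# Domains of size 2, n = 3 tuple slots.
A = [0.6, 1.4]                      # 1-D variables for attribute A1
B = [2.0, 0.5]                      # 1-D variables for attribute A2
tuples = list(itertools.product(range(2), range(2)))
n = 3

def weight(t):
    return A[t[0]] * B[t[1]]        # product of the tuple's 1-D variables

P = sum(weight(t) for t in tuples)  # = (A[0] + A[1]) * (B[0] + B[1])

# brute-force expectation of the count of the point (A1=1, A2=0)
v = (1, 0)
worlds = list(itertools.product(tuples, repeat=n))
weights = [prod(weight(t) for t in w) for w in worlds]
expected = sum(wt * w.count(v) for w, wt in zip(worlds, weights)) / sum(weights)

# Eq. (19): the mixed partial d^2 P / (dA[1] dB[0]) equals 1 here
assert abs(expected - n * A[1] * B[0] * 1.0 / P) < 1e-9
```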

In order to compute a query using Eq. (20), we would have to examine all m-dimensional points that satisfy the query's predicate, convert each point into the corresponding 1D statistics, and use Eq. (19) to estimate the count of tuples at that point. Clearly, this is inefficient when q contains range predicates covering many points.

New formula for derivatives: To compute E[⟨q,I⟩], one has to evaluate several partial derivatives of P. Recall that P is stored in a highly compressed format, and therefore computing the derivative may involve nontrivial manipulations. Instead, we use the fact that our polynomial is overcomplete, meaning that P = ∑_{j∈J_i} α_j P_j, where P_j, j ∈ J_i, does not depend on any variable in {α_j : j ∈ J_i} (Eq. (7)). Let ρ^i be any predicate on the attribute A_i. Then,