Factorised Representations of Query Results

Abstract

Query tractability has been traditionally defined as a function of
input database and query sizes, or of both input and output sizes,
where the query result is represented as a bag of tuples. In this
report, we introduce a framework that allows us to investigate
tractability beyond this setting. The key insight is that, although
the cardinality of a query result can be exponential, its structure
can be very regular and thus factorisable into a nested representation
whose size is only polynomial in the size of both the input database
and query.

For a given query result, there may be several equivalent
representations, and we quantify the regularity of the result by its
readability, which is the minimum over all its representations
of the maximum number of occurrences of any tuple in that
representation. We give a characterisation of select-project-join
queries based on the bounds on readability of their results for any
input database. We complement it with an algorithm that can find
asymptotically optimal upper bounds and corresponding factorised
representations.

This paper studies properties related to the representation of results of select-project-join queries under bag semantics. In approaching this challenge, we depart from the standard flat representation of query results as bags of tuples and consider nested representations of query results that can be exponentially more succinct than a mere enumeration of the result tuples. The relationship between a flat representation and a nested, or factorised, representation is on a par with the relationship between logic functions in disjunctive normal form and their equivalent nested forms obtained by algebraic factorisation. When compared to flat representations of query results, factorised representations are both succinct and informative.

Cust
      ckey  name
c1    1     Joe
c2    2     Dan
c3    3     Li
c4    4     Mo

Ord
      ckey  okey  date
o1    1     1     1995
o2    1     2     1996
o3    2     3     1994
o4    2     4     1993
o5    3     5     1995
o6    3     6     1996

Item
      okey  disc
i1    1     0.1
i2    1     0.2
i3    3     0.4
i4    3     0.1
i5    4     0.4
i6    5     0.1

Figure 1: A TPC-H-like database.

Example 1.

Consider a simplified TPC-H scenario with customers, orders, and discounted line items, as depicted in Figure 1. Each tuple is annotated with an identifier. The query $\mathit{Cust}\bowtie_{ckey}\mathit{Ord}\bowtie_{okey}\mathit{Item}$ reports all customers together with their orders and line items per order. A flat representation of the result is presented below:

Q
          ckey  name  okey  date  disc
c1o1i1    1     Joe   1     1995  0.1
c1o1i2    1     Joe   1     1995  0.2
c2o3i3    2     Dan   3     1994  0.4
c2o3i4    2     Dan   3     1994  0.1
c2o4i5    2     Dan   4     1993  0.4
c3o5i6    3     Li    5     1995  0.1

For each result tuple, the identifiers of tuples that contributed to it are shown. For instance, the input tuples with identifiers c1, o1, and i1 contribute to the first result tuple. Our factorised representation is based on an algebraic factorisation of a polynomial that encodes the result. This encoding is constructed as follows. Each result tuple is annotated with a product of identifiers of tuples contributing to it. The whole result is then a sum of such products. For this example, the sum of products of identifiers is:

ψ1=c1o1i1+c1o1i2+c2o3i3+c2o3i4+c2o4i5+c3o5i6.

An equivalent nested expression would be:

ψ2=c1o1(i1+i2)+c2(o3(i3+i4)+o4i5)+c3o5i6.

A factorised representation of the result is an extension
of this nested expression with values from the result tuples:

c1⟨1,Joe⟩o1⟨1,1995⟩(i1⟨0.1⟩+i2⟨0.2⟩)
+ c2⟨2,Dan⟩(o3⟨3,1994⟩(i3⟨0.4⟩+i4⟨0.1⟩)+o4⟨4,1993⟩i5⟨0.4⟩)
+ c3⟨3,Li⟩o5⟨5,1995⟩i6⟨0.1⟩.

To correctly interpret this representation as a relation, we also need
a mapping of identifiers to schemas. For instance, the identifiers
c1 to c3 are mapped to (ckey,name), which serves as schema
for tuples ⟨1,Joe⟩, ⟨2,Dan⟩, and ⟨3,Li⟩.□

We can easily recover the result tuples from the factorised
representation with polynomial delay, i.e., the delay between
two successive tuples is polynomial in the size of the representation.
For this, consider the parse tree of the representation. The inner
nodes stand for product or sum, and the leaves for identifiers with
tuples. A result tuple is a concatenation of the tuples at the leaves
after choosing one child for each sum and all children for each
product. We assume here that from a user perspective, iterating over
the result with small delay is more important than presenting the
whole result at once.
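The traversal described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the tuple-based node encoding and the names `leaf` and `enumerate_tuples` are assumptions made here, and for brevity the sketch materialises the sub-results of each product, so it does not achieve the polynomial-delay guarantee. The representation `phi` is built from the nested expression ψ2 and the values in Figure 1.

```python
from itertools import product

# Hypothetical parse-tree encoding (not from the paper):
#   ("leaf", identifier, values), ("sum", [children]), ("prod", [children])
def leaf(ident, *values):
    return ("leaf", ident, values)

def enumerate_tuples(node):
    """Yield (identifiers, values) for every tuple encoded by the node."""
    kind = node[0]
    if kind == "leaf":
        yield (node[1],), node[2]
    elif kind == "sum":
        # A sum contributes the tuples of exactly one of its children.
        for child in node[1]:
            yield from enumerate_tuples(child)
    else:
        # A product concatenates one tuple from each of its children.
        for combo in product(*(list(enumerate_tuples(c)) for c in node[1])):
            ids = tuple(i for part in combo for i in part[0])
            vals = tuple(v for part in combo for v in part[1])
            yield ids, vals

# Factorised representation of Example 1 (ψ2 extended with tuple values).
phi = ("sum", [
    ("prod", [leaf("c1", 1, "Joe"), leaf("o1", 1, 1995),
              ("sum", [leaf("i1", 0.1), leaf("i2", 0.2)])]),
    ("prod", [leaf("c2", 2, "Dan"),
              ("sum", [("prod", [leaf("o3", 3, 1994),
                                 ("sum", [leaf("i3", 0.4), leaf("i4", 0.1)])]),
                       ("prod", [leaf("o4", 4, 1993), leaf("i5", 0.4)])])]),
    ("prod", [leaf("c3", 3, "Li"), leaf("o5", 5, 1995), leaf("i6", 0.1)]),
])

result = list(enumerate_tuples(phi))
for ids, vals in result:
    print("".join(ids), vals)
```

The six pairs produced correspond to the six rows of the flat result Q, each with the product of identifiers that annotates it.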

Factorised representations can be more informative than flat
representations in that they better explain the result and spell out
the extent to which certain input fields contribute to result tuples
either individually or in groups with other fields. This enables a
shift in the presentation of the result from a tuple-by-tuple view to
a kernel view, in which commonalities across result tuples are
made explicit by exploiting the factorised representation. We can
depict it graphically as its parse tree or textually as a
serialisation of this tree in tabular form.

Example 2.

The textual presentation of our factorised representation in
Example 1 could be the left one below:

ckey  name  okey  date  disc          name      items
                                      ----------------
1     Joe   1     1995  0.1           Joe       LCD
                        0.2           Dan   x   LED
2     Dan   3     1994  0.4           Li
                        0.1           ----------------
            4     1993  0.4           Mo        BW
3     Li    5     1995  0.1           ----------------

It is easy to see that two discounted line items (with discount 0.1
and 0.2) are for the same order 1 of customer Joe.

Consider now the following factorised representation

(s1⟨Joe⟩+s2⟨Dan⟩+s3⟨Li⟩)(p1⟨LCD⟩+p2⟨LED⟩) + s4⟨Mo⟩p3⟨BW⟩

where s1 to s4 identify suppliers, and p1 to p3 identify
items. This representation encodes that Joe, Dan, and Li supply both
LCD and LED TV sets, and Mo supplies BW TV sets. A textual
presentation of this result could be the right one above. The blocks
between the horizontal lines encode tuples obtained by combining any
of the names with any of the items. This relational product is
suggested by the x symbol between the blocks. (We skip the
details on the mapping between the parse trees of factorised
expressions and their tabular presentations.)□

In the factorised representation ψ2 and in contrast to its
equivalent flat representation ψ1, each identifier only occurs
once. We seek good factorised representations of a query result in
which each identifier occurs a small number of times. The
readability of a representation is the smallest number k such that
some equivalent representation contains at most k occurrences of any
identifier. Readability implies bounds on the representation
size. In our example, the size of the factorised representation is at
most linear in the size of the input database, since its readability
is one.

Our study of readability is with respect to tuple identifiers and
aligns well with query evaluation under bag semantics. This is
different from readability with respect to values. For instance,
ψ2 has readability one, yet a value may occur several times in
the tuples of ψ2, e.g., the discount value of 0.1. Studying
readability with respect to values is especially relevant to query
evaluation under set semantics.

We introduce factorised representations, a succinct and complete
representation system for (results of queries in) relational
databases. In contrast to the standard tabular representation of a bag
of tuples, factorised representations can be exponentially more
succinct by factoring out commonalities across tuples. They also allow
for an intuitive presentation, whereby commonalities across tuples are
made explicit.

We give lower and upper bounds on the readability of basic queries
with equality or inequality joins.

The following holds for select-project-join queries with
equality joins.

We introduce factorisation trees that define generic classes of
factorised representations for query results. Such trees are
statically inferred from the query and are independent of the database
instance. A factorised representation Φ(T) modelled on T has
the nesting structure of T for any input database.

We give a tight characterisation of queries based on their
readability with respect to factorisation trees. For any query Q, we
can find a rational number f(Q) such that the readability of Q(D)
is at most $|Q|\cdot|D|^{f(Q)}$ for any database D, while for any
factorisation tree T there exist databases for which the
factorisation of Q(D) modelled on T has at least
$(|D|/|Q|)^{f(Q)}$ occurrences of some identifier.

For any query Q, we present an algorithm that iterates over the
factorisation trees of Q and finds an optimal one T. Given T,
we present a second algorithm that computes, in time $O(|Q|\cdot|D|^{f(Q)+1})$ for any database D, a factorised representation
Φ(T) of Q(D) with readability at most $|Q|\cdot|D|^{f(Q)}$
and at most $|D|^{f(Q)+1}$ occurrences of identifiers.

Our characterisation captures as a special case the known class of
hierarchical non-repeating queries [dalvi07efficient] that have
readability one [OH2008]. We also show that non-hierarchical
non-repeating queries have readability Ω(√|D|)
for arbitrarily large databases D.

Section 10 shows how to extend the above results to
selections that contain equalities with constants. Proofs are deferred
to the appendix.

Our study has strong connections to work on readability of Boolean
functions, provenance and probabilistic databases, streamed query
evaluation, syntactic characterisations of queries with polynomial
time combined complexity or polynomial output size, and selectivity
estimation in relational engines. The present work is nevertheless
unique in its use of succinct nested representations of query results.

The notion of readability is borrowed from earlier work on Boolean
functions, e.g., [Golumbic06a, Golumbic08, Elbassioni09]. Like in
our case, a formula Φ is read-m if each variable appears
at most m times in Φ, and the readability of a formula or a
function Φ is the smallest number m such that there is a
read-m formula equivalent to Φ. Checking whether a monotone
function in disjunctive normal form has readability m=1 can be done
in time linear in both the number of terms and number of
variables [Golumbic08]. This problem is open for m=2, and
already hard for m>2 or for m=2 and monotone nested
functions [Elbassioni09]. This strand of work differs from ours
in two key points. Firstly, we only consider algebraic, and not
Boolean, equivalence; in particular, idempotence (x⋅x=x) is
not considered since a reduction in the arity of any product in the
representation would violate the mapping between tuple fields and
schemas. Secondly, we only consider functions/formulas arising as
results of queries, and classify queries based on worst-case analysis
of the readability of their results.

The hierarchical property [dalvi07efficient] of queries plays a
central role in studies with seemingly disparate focus, including the
present one, probabilistic databases, and streamed query
evaluation. Our characterisation of query readability essentially
revolves around how far the query is from its hierarchical
subqueries. We show that, within the class of queries without
repeating relation symbols, the readability of any non-hierarchical
query is dependent on the size of the input database, while for any
hierarchical query, the readability is always one. This latter result
draws on earlier work in the context of probabilistic
databases [OH2008, OHK2009, FinkOlteanu:ICDT:2011], where read-once
polynomials over random variables are useful since their exact
probability can be computed in polynomial time. Read-m functions for
m>2 are of no use in probabilistic databases, since probability
computation for such functions over random variables is
#P-hard [Vadhan2001]. In our case, however, readability
polynomial in the sizes of the input database and query is acceptable,
since it means that the size of the result representation is
polynomial, too.

Mirroring the dichotomies in the probabilistic and query readability
contexts, it has been recently shown that the hierarchical property
divides queries that can be evaluated in one pass from those that
cannot in the finite cursor machine model of
computation [Grohe:TCS:2009]. In this model, queries are
evaluated by first sorting each relation, followed by one pass over
each relation. It would be interesting to investigate the relationship
between the readability of a query Q and the number of passes
necessary in this model to evaluate Q.

Our study fits naturally in the context of provenance
management [Green:PODS:2007]. Indeed, the polynomials over tuple
identifiers discussed in Example 1 are
provenance polynomials and nested representations are algebraic
factorisations of such polynomials. In this sense, our work
contributes a characterisation of queries by readability and size of
their provenance polynomials.

Earlier work in incomplete databases has introduced a representation
system called world-set decompositions [OKA08gWSD] to represent
succinctly sets of possible worlds. Such decompositions can be seen as
factorised representations whose structure is a product of sums of
products.

There exist characterisations of conjunctive queries with polynomial
time combined complexity [AHV95]. The bulk of such
characterisations is for various classes of Boolean queries under set
semantics. In this context, even simple non-Boolean conjunctive
queries such as a product of n relations would require evaluation
time exponential in n. Our approach exposes the simplicity of this
query, since its readability is one and the smallest factorised
representation of its result has linear size only and can be computed
in linear time. Factorised representations could thus lead to larger
classes of tractable queries.

Finally, there has been work on deriving bounds on the cardinality of
query results in terms of structural properties of
queries [Gottlob99, AGM08, Gottlob09a]. Our work uses the results
in [AGM08] and quantifies how much they can be improved due to
factorised representations.

Databases. We consider relational databases as
collections of annotated
relation instances, as in
Example 1. Each relation instance R
is a bag of tuples in which each tuple is annotated by an
identifier. We denote by I(R) the set of
identifiers in R, by S(R) the schema
of R, and call the pair
(I(R),S(R)) its
signature.

The size of a relation instance R is the number of
tuples in R, denoted by |R|. The number of
distinct tuples in R is denoted by ||R||. The
size |D| of a database D is the total number of tuples in all
relations of D.

Remark 1.

For the purpose of analysing the complexity of our algorithms, we
assume that the tuples in the input database are of constant size. In
many scenarios, this is however not realistic since even the encodings
of the tuple identifiers must have size at least logarithmic in
|D|. If the maximal size of a tuple in D is C(D), the time
complexity increases by an additional factor C(D) or similar,
depending on the exact computation model used.□

Queries. We consider conjunctive or select-project-join
queries written in relational algebra but with evaluation under bag
semantics. Such queries have the form
$\pi_{\bar A}(\sigma_\varphi(R_1\times\dots\times R_n))$, where
$R_1,\dots,R_n$ are relations, $\varphi$ is a conjunction of equalities
of the form $A_1=A_2$ with attributes $A_1$ and $A_2$, and $\bar A$
is a list of attributes of relations $R_1$ to $R_n$. The size |Q| of
the query Q is the total number of relations and attributes in Q.

Let $Q=\pi_{\bar A}(\sigma_\varphi(R_1\times\dots\times R_n))$
be a query and D be a database containing a relation instance
$\mathbf{R}_i$ of the correct schema for each relation $R_i$ in
Q. The result Q(D) of the query Q on the database D is a
relation instance whose tuples are exactly those
$\pi_{\bar A}(t_1\times\dots\times t_n)$ for which $t_i\in\mathbf{R}_i$ and $t_1\times\dots\times t_n\models\varphi$. The tuple
$\pi_{\bar A}(t_1\times\dots\times t_n)$ is annotated by
$id_1 id_2\dots id_n$, where $id_i$ is the identifier of $t_i$ in
$\mathbf{R}_i$.
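As an illustration of this definition, the following Python sketch evaluates the join query of Example 1 on the database of Figure 1 under bag semantics and annotates each result tuple with the product of the contributing identifiers. The dictionary encoding and the function name `annotated_join` are assumptions made for this sketch. A nested-loop join is used only for clarity.

```python
# Figure 1 data, encoded as identifier -> tuple (an assumed encoding).
cust = {"c1": (1, "Joe"), "c2": (2, "Dan"), "c3": (3, "Li"), "c4": (4, "Mo")}
ords = {"o1": (1, 1, 1995), "o2": (1, 2, 1996), "o3": (2, 3, 1994),
        "o4": (2, 4, 1993), "o5": (3, 5, 1995), "o6": (3, 6, 1996)}
item = {"i1": (1, 0.1), "i2": (1, 0.2), "i3": (3, 0.4),
        "i4": (3, 0.1), "i5": (4, 0.4), "i6": (5, 0.1)}

def annotated_join(cust, ords, item):
    """Join Cust, Ord, Item on ckey and okey; return (annotation, tuple) pairs."""
    result = []
    for c_id, (ckey, name) in cust.items():
        for o_id, (o_ckey, okey, date) in ords.items():
            if ckey != o_ckey:
                continue
            for i_id, (i_okey, disc) in item.items():
                if okey == i_okey:
                    # The annotation is the product of the three identifiers.
                    result.append((c_id + o_id + i_id,
                                   (ckey, name, okey, date, disc)))
    return result

flat = annotated_join(cust, ords, item)
psi1 = "+".join(annotation for annotation, _ in flat)
print(psi1)
```

The string `psi1` is the sum of products ψ1 from Example 1.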

Every query can be brought into an equivalent form where all relations
as well as all their attributes are distinct. To recover the original
query Q0 from the rewritten one Q, we keep a function μ that
maps the relations in Q to relations in Q0, and the attributes of
R in Q to those of μ(R) in Q0. For technical reasons, we
consider only the rewritten queries in the remainder of the text; the mapping
μ carries the information about different relation symbols
representing the same relation. If a query Q has two relations with
the same mapping μ(R), then Q is repeating; otherwise,
Q is non-repeating.

For any attribute A, let A∗ be its equivalence class, that is,
the set of all attributes that are transitively equal to A in
φ, and let r(A) be the set of relations that have attributes in
A∗.

A query is hierarchical if for
any two attributes A and B, either r(A)⊆r(B), or
r(A)⊃r(B), or r(A)∩r(B)=∅.

Example 3.

The query from Example 1 in the introduction
is non-repeating and not hierarchical.

Consider the relations R, S, and T over schemas $\{A_R\}$,
$\{A_S,B_S\}$, and $\{B_T,U\}$ respectively. The query
$\pi_{\bar A}[\sigma_{A_R=A_S,\,B_S=B_T}(R\times S\times T)]$ is not
hierarchical (independently of the set $\bar A$), since
$r(A_S)\not\subseteq r(B_S)$, $r(A_S)\not\supset r(B_S)$, but
$r(A_S)\cap r(B_S)=\{S\}$. The query
$\pi_{\bar A}[\sigma_{A_R=A_S,\,B_S=B_T,\,A_R=U}(R\times S\times T)]$,
equivalent to R(A),S(A,B),T(B,A), is hierarchical, since
$r(A_R)=r(A_S)=r(U)=\{R,S,T\}\supset r(B_S)=r(B_T)=\{S,T\}$.□
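The hierarchical property can be checked directly from the sets r(A). The following Python sketch does so for the two queries of this example and for the query of Example 1. The function name `is_hierarchical` and the dictionary encoding (one representative per attribute class, mapped to its set of relations) are assumptions made for illustration.

```python
from itertools import combinations

def is_hierarchical(r):
    """r maps one representative of each attribute class A* to the set r(A)."""
    for a, b in combinations(r, 2):
        ra, rb = r[a], r[b]
        # Each pair must be nested one way or the other, or disjoint.
        if not (ra <= rb or ra >= rb or ra.isdisjoint(rb)):
            return False
    return True

# Selection A_R = A_S, B_S = B_T over R, S, T (first query of Example 3).
chain = {"A": {"R", "S"}, "B": {"S", "T"}, "U": {"T"}}
# The same selection extended with A_R = U (second query of Example 3).
closed = {"A": {"R", "S", "T"}, "B": {"S", "T"}}
# The join query of Example 1.
example1 = {"ckey": {"Cust", "Ord"}, "okey": {"Ord", "Item"},
            "name": {"Cust"}, "date": {"Ord"}, "disc": {"Item"}}

print(is_hierarchical(chain), is_hierarchical(closed), is_hierarchical(example1))
```

The check is quadratic in the number of attribute classes, which is small compared to the database for the queries considered here.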

In this section we formalise the notion of factorised representations,
their algebraic equivalence, and readability. We also give tight
bounds on the readability of certain factorised representations that
are used in the next sections to derive bounds on the readability of
query results.

Definition 1.

A factorised representation, or f-representation for short, Φ
over a set of signatures Sign is

Φ1+⋯+Φn, where Φ1 to Φn are f-representations
over Sign, or

Φ1⋯Φn, where Φ1 to Φn are f-representations over Sign1
to Signn, respectively, and these signatures form a
disjoint cover of Sign, or

id⟨t⟩, where id∈Ri and t is a tuple over schema
Si, and
Sign={(Ri,Si)}.

The polynomial of Φ is Φ without tuples on
identifiers. The size of (the polynomial of) Φ is the
total number of occurrences of identifiers in Φ.□

Two examples of f-representations are given in
Section 1. A relational database can have several
algebraically equivalent f-representations, in the sense that these
f-representations represent the same tuples and
polynomials. Syntactically, we define equivalence of f-representations
as follows.

Definition 2.

Two f-representations are equivalent if one can be obtained
from the other using distributivity of product over sum and
commutativity of product and sum.□

Each f-representation has an equivalent flat f-representation,
which is a sum of products. A product i1⟨t1⟩⋯in⟨tn⟩ defines the tuple ⟨t1∘⋯∘tn⟩
over schema ⋃iSi, which is a concatenation of
tuples ⟨t1⟩ to ⟨tn⟩, and is annotated by the product
i1…in.

Definition 3.

The relation encoded by an f-representation Φ consists of all
tuples defined by the products in the flat f-representation equivalent
to Φ.□

Since flat f-representations are standard relational databases annotated
with identifiers, it means that any relational database can be encoded
as an f-representation. This property is called completeness.

Proposition 1.

Factorised representations form a complete representation system for
relational data.

In particular, this means that there are f-representations of the
result of any query in a relational database.

Definition 4.

Let Q=π¯A(σφ(R1×⋯×Rn))
be a query, and D be a database.
An f-representation Φ encodes the result Q(D) if its
equivalent flat f-representation contains exactly those products
id1⟨π¯A(t1)⟩⋅…⋅idn⟨π¯A(tn)⟩ for which
π¯A(t1×⋯×tn)∈Q(D), and idi
is the identifier of ti for all i.

The signature set of Φ consists of the signatures (Ii,Si) for each query relation Ri, such that Ii is the set of identifiers of the relation instance in D
corresponding to Ri, and Si is the schema of Ri in Q
restricted to the attributes in ¯A.□

Flat f-representations can be exponentially less succinct
than equivalent nested f-representations, where the exponent is the
size of the schema.

Proposition 2.

Any flat representation equivalent to the f-representation
$(x_1\alpha+y_1\beta)\cdot\ldots\cdot(x_n\alpha+y_n\beta)$ over the
signatures $(\{x_1,\dots,x_n\},A)$ and
$(\{y_1,\dots,y_n\},B)$ has size $2^n$.
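The blow-up can be observed by expanding the nested expression with distributivity. The Python sketch below works on identifiers only (tuples are omitted) and counts the products of the flat form. The function name `flat_products` is an assumption made for illustration.

```python
from itertools import product

def flat_products(n):
    """Expand (x1 + y1)...(xn + yn) into the list of its products."""
    factors = [(f"x{i}", f"y{i}") for i in range(1, n + 1)]
    # Distributivity picks one summand from each of the n factors.
    return ["".join(choice) for choice in product(*factors)]

n = 4
terms = flat_products(n)
nested_size = 2 * n         # every identifier occurs once in the nested form
num_products = len(terms)   # number of products in the flat form
print(nested_size, num_products)
```

For n = 4 the nested form has 8 identifier occurrences, whereas the flat form consists of 16 products of 4 identifiers each.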

In addition to completeness and succinctness, f-representations allow for efficient enumeration of their tuples.

Proposition 3.

The tuples of an f-representation Φ can be enumerated with O(|Φ|log|Φ|) delay and space.

Besides the size, a key measure of succinctness of f-representations is their
readability. We extend this notion to query results for any input
database in Section 7.

Definition 5.

An f-representation Φ is read-k if the maximum number of
occurrences of any identifier in Φ is k. The readability
of Φ is the smallest number k such that there is a read-k
f-representation equivalent to Φ.□

Since the readability of Φ is the same as of its polynomial, we
will use polynomials of f-representations when reasoning about their
readability.

Example 4.

In Example 1, the polynomial ψ1 is
read-3 and the polynomial ψ2 is read-1. They are equivalent and
hence both have readability one.□
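The read-k values in this example can be verified mechanically by counting identifier occurrences. The Python sketch below does this for the two polynomials written as strings. The function name `max_occurrences` is an assumption, and the function only reports the k for which the given expression is read-k. It does not search over equivalent expressions and therefore does not compute readability itself.

```python
import re
from collections import Counter

def max_occurrences(polynomial):
    """Largest number of occurrences of any identifier in the expression."""
    identifiers = re.findall(r"[a-z]\d+", polynomial)
    return max(Counter(identifiers).values())

psi1 = "c1o1i1+c1o1i2+c2o3i3+c2o3i4+c2o4i5+c3o5i6"
psi2 = "c1o1(i1+i2)+c2(o3(i3+i4)+o4i5)+c3o5i6"

print(max_occurrences(psi1), max_occurrences(psi2))
```

In ψ1 the identifier c2 occurs three times, whereas every identifier occurs once in ψ2.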

Given the readability ρ and the number n of distinct
identifiers of a polynomial, we can immediately derive an upper bound
nρ on its size. A better upper bound can be obtained by taking
into account the (possibly different) number of occurrences of each
identifier. However, for polynomials of query results, the bound
nρ is often dominated by the readability ρ.

In Section 7, we define classes of queries
that admit polynomials of low readability, such as constant
readability. We next give examples of polynomials with readability
depending polynomially on the number of identifiers.

Lemma 1.

Theorem 1.

The readability of the polynomial $p_{N,M}=\sum_{i=1}^{N}\sum_{j=1}^{M} r_i s_{ij} t_j$ is $\frac{NM}{N+M}+O(1)$.

If we drop the set of identifiers sij, the readability becomes
one. However, if we restrict the relationship between the remaining
identifiers, the readability increases again.

Theorem 2.

The readability of the polynomial $q_N=\sum_{i,j=1;\,i\neq j}^{N} r_i t_j$ is
$\Omega\left(\frac{\log N}{\log\log N}\right)$ and $O(\log N)$.

The polynomials $p_{N,M}$ and $q_N$ are relevant here due to their
connection to queries: $p_{N,M}$ is the polynomial of the query
$\sigma_\varphi(R\times S\times T)$, where $\varphi:=(A_R=A_S\wedge B_S=B_T)$ and the schemas of R, S, and T are $\{A_R\}$,
$\{A_S,B_S\}$, and $\{B_T\}$ respectively, on the database where
R, S and T are full relations with
|R|=N and |T|=M. Also, $q_N$ is the polynomial
of the disequality query $\sigma_{A_R\neq B_T}(R\times T)$. If $i\neq j$ is replaced by $i\leq j$ in $q_N$, the lower and upper bounds on
the readability of the new polynomial $q'_N$ still hold, and we obtain
the result of an inequality query.

A lower bound of √logNloglogN on the
readability of q′N is already known even in the case when Boolean
factorisation is allowed [Golumbic06a].

We next introduce a generic class of factorised representations for query results, constructed using so-called factorisation trees, whose nesting structure and readability properties can be described statically from the query only. We present an algorithm that, given a factorisation tree T of a query Q, and an input database D, computes a factorised representation of Q(D), whose nesting structure is that defined by T. Factorisation trees are used in Section 7 to obtain bounds on the readability of queries.

Definition 6.

A factorisation tree (f-tree) for a query Q is a rooted unordered forest T, where

there is a one-to-one mapping between inner nodes in T and equivalence classes of attributes of Q,

there is a one-to-one mapping between leaf nodes in T and relations in Q, and

the attributes of each relation only appear in the ancestors of
its leaf.□

Example 5.

Consider the relations R, S, T, and U over schemas
$\{A_R,B_R,C\}$, $\{A_S,B_S,D\}$, $\{A_T,E_T\}$, and $\{E_U,F\}$
respectively, and the query $Q=\sigma_\varphi(R\times S\times T\times U)$ with $\varphi=(A_R=A_S, A_R=A_T, B_R=B_S, E_T=E_U)$. Figure 2 depicts
two f-trees for Q.

Consider now the query $Q'=\sigma_\varphi(R\times S\times T)$ with
$\varphi=(A_R=A_S, A_R=A_T, B_R=B_S)$. Figure 7 shows
two f-trees for Q′ as well as a partial tree that cannot be extended
to an f-tree since the attributes $A_S$ and D of S lie in
different branches.□

Each f-tree for Q is a recipe for producing an f-representation of
the result Q(D) for any database D. For a given query Q and
database D, this f-representation is called the T-factorisation
of Q(D) and is denoted by Φ(T). Figure 3
gives a recursive function ⟦⋅⟧ that computes
the T-factorisation of Q(D). A more detailed implementation of
this function, including an analysis of its time and space complexity,
is given in Section 9.

Figure 3: The T-factorisation of a query result Q(D) is computed as Φ(T)=⟦T⟧(⊤), where ⊤ is the constant true (an empty conjunction). For a relation R in Q, $\mathbf{R}$ denotes the corresponding relation instance in the input database D.

The function ⟦⋅⟧ recurses on the structure of
T. The parameter γ is a conjunction of equality conditions
that are collected while traversing the f-tree top-down. Initially,
γ is an empty conjunction ⊤. In case T is a forest
{T1,…,Tn}, we return the f-representation defined by the
product of f-representations of each tree in T. If T is a single tree A∗(U) with root A∗ and children U, we return the f-representation of a sum over all possible domain values a of the attributes in A∗ of the f-representations of the children U. To compute these, for each possible value a we simply recurse on U, appending to γ the equality condition A∗=a. Finally, in case T is a leaf R, we return a sum of f-representations for result tuples in R, that is,
only those tuples that satisfy γ. (When evaluating the
selection with γ on R, we only consider the equalities on
attributes of R.) In the f-representation we only include attributes
from Q’s projection list, along with the tuple identifier.
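The recursion described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the function of Figure 3: attributes of the same equivalence class are assumed to have been renamed to a common name, the sum at an inner node ranges over the values occurring in the database, only the polynomial (identifiers without tuples) is produced, and the names `relation`, `domain`, `top_level_sum`, and `factorise` are hypothetical. The f-tree used has root ckey with children name (above Cust) and okey, the latter with children date (above Ord) and disc (above Item).

```python
def relation(prefix, attrs, rows):
    """Build a relation instance as a list of (identifier, {attribute: value})."""
    return [(f"{prefix}{k}", dict(zip(attrs, row))) for k, row in enumerate(rows, 1)]

def domain(cls, db):
    """Values taken by the attribute class cls anywhere in the database."""
    return sorted({t[cls] for rel in db.values() for _, t in rel if cls in t})

def top_level_sum(p):
    """True if the expression p is a sum at its outermost nesting level."""
    depth = 0
    for ch in p:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "+" and depth == 0:
            return True
    return False

def factorise(tree, db, gamma):
    """Polynomial of the T-factorisation as a string, or None if it is empty."""
    if isinstance(tree, list):                 # forest: product over its trees
        parts = [factorise(t, db, gamma) for t in tree]
        if any(p is None for p in parts):
            return None                        # an empty factor empties the product
        if len(parts) == 1:
            return parts[0]
        return "".join(f"({p})" if top_level_sum(p) else p for p in parts)
    if isinstance(tree, tuple):                # inner node A*(U): sum over values a
        cls, children = tree
        terms = [factorise(children, db, {**gamma, cls: a}) for a in domain(cls, db)]
        terms = [t for t in terms if t is not None]
        return "+".join(terms) if terms else None
    # Leaf: keep the tuples of the relation that satisfy gamma on its attributes.
    ids = [i for i, t in db[tree]
           if all(t[c] == v for c, v in gamma.items() if c in t)]
    return "+".join(ids) if ids else None

db = {
    "Cust": relation("c", ("ckey", "name"), [(1, "Joe"), (2, "Dan"), (3, "Li"), (4, "Mo")]),
    "Ord": relation("o", ("ckey", "okey", "date"),
                    [(1, 1, 1995), (1, 2, 1996), (2, 3, 1994),
                     (2, 4, 1993), (3, 5, 1995), (3, 6, 1996)]),
    "Item": relation("i", ("okey", "disc"),
                     [(1, 0.1), (1, 0.2), (3, 0.4), (3, 0.1), (4, 0.4), (5, 0.1)]),
}
f_tree = ("ckey", [("name", ["Cust"]),
                   ("okey", [("date", ["Ord"]), ("disc", ["Item"])])])
phi = factorise(f_tree, db, {})
print(phi)
```

On the database of Figure 1 the sketch yields a polynomial equal to ψ2 of Example 1 up to the order of summands, since the sums are generated in increasing order of attribute values.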

The symbolic products and sums in Figure 3 are of course
expanded out to produce a valid f-representation. However, we will
often keep the sums symbolic, abbreviate $\sum_{a\in\mathrm{Dom}_{A^*}}$
to $\sum_{A^*}$ and write R instead of $\sum_{t_j\in\sigma_\gamma(\mathbf{R})} id_j\langle\pi_{head(Q)}(t_j)\rangle$ for
the expression generated by the leaves. The condition γ can be
inferred from the position in the expression, so we can still recover
the original representation and write out the sums explicitly. Such an
abbreviated form is independent of the database D and conveniently reveals the structure of any T-factorisation.

Figure 5: A procedure for producing T1-factorisations in explicit form. The abbreviated form is . T1 is the left f-tree in Figure 2.

Remark 2.

For any query Q, consider the f-tree T in which the nodes
labelled by the attribute classes all lie on a single path, and the
leaves labelled by the relations are all attached to the lowest node
in that path. Such a tree T produces the T-factorisation in
which we sum over all values of all attributes and for each
combination of values we output the product over all relations of the
sums of tuples which have the given values. If all the tuples in the
input relations are distinct, the T-factorisation is just a sum of
products, that is, the flat f-representation of the result.

Thus, for a non-branching tree T we obtain a flat representation of
Q(D). The more branching the tree T has, the more factorised the
T-factorisation of Q(D) is.□

The correctness of our construction for a general query Q and
database D is established by the following result.

Proposition 4.

For any f-tree T of a query Q and any database D, Φ(T)
is an f-representation of Q(D).

We next introduce definitions concerning f-trees for later
use. Consider an f-tree T of a query Q. An inner node A∗
of T is relevant to a relation R if it contains an
attribute of R. For a relation R, let Path(R) be the set
of inner nodes appearing on the path from the leaf R to its root in
T, Relevant(R)⊆Path(R) be the set of
nodes relevant to R, and Non-relevant(R)=Path(R)∖Relevant(R). For example, in the left f-tree of
Figure 2, Non-relevant(R)=∅ and
Non-relevant(U)={A∗R}. In the right f-tree,
Non-relevant(U)=∅, yet Non-relevant(R)=Non-relevant(S)={E∗T}. In fact, there is no f-tree for
the query in Example 5 such that
Non-relevant(R)=∅ for each relation R. This is
because the query is not hierarchical.

Proposition 5.

A query is hierarchical iff it has an f-tree T such that Non-relevant(R)=∅ for each relation R.

The left two trees shown in Figure 7 are f-trees of
a hierarchical query. The first f-tree satisfies the condition in
Proposition 5, whereas the second does
not.

The readability of a query Q on a database D is the
readability of any f-representation of Q(D), that is, the minimal
possible k such that there exists a read-k representation of
Q(D).

In this section we give upper bounds on the readability of arbitrary
select-project-join queries with equality joins in terms of the
cardinality |D| of the database D. We then show that these
bounds are asymptotically tight with respect to statically chosen
f-trees. By this we mean that for any query Q, if we choose an
f-tree T, there exist arbitrarily large database instances D for
which the T-factorisation of Q(D) is read-k with k
asymptotically close to our upper bound. In the next section we give
algorithms to compute these bounds. We conclude the section with a
dichotomy: In the class of non-repeating queries, hierarchical queries
are the only queries whose readability for any database is 1 and hence
independent of the size of the database.

A key result for all subsequent estimates of readability is the
following lemma that states the exact number of occurrences of any
identifier of a tuple ⟨t⟩ in the T-factorisation of Q(D)
as a function of the f-tree T, the query Q=π¯A(σφ(R1×⋯×Rn)), and the database D.

Let R=Ri be a relation of Q, denote by the condition S(R)=⟨t⟩ the conjunction of equalities of the attributes of
R to corresponding values in ⟨t⟩, and denote NR=Non-relevant(R). In the T-factorisation of Q(D),
multiple occurrences of the same identifier from R arise from the
summations over the values of attributes from
NR. Lemma 2 quantifies how many
different choices of such values in the summations thus yield a given
identifier from R. Recall that the projection attributes ¯A
do not influence the cardinality of the query result and hence the
number of occurrences of its identifiers, since we consider bag
semantics.

Lemma 2.

The number of occurrences of the identifier r of a tuple ⟨t⟩
from R in the T-factorisation of Q(D) is

$\left\|\left(\pi_{N_R}\left(\sigma_{S(R)=\langle t\rangle}\,\sigma_\varphi(R_1\times\dots\times R_n)\right)\right)(D)\right\|$.

For example, for the left f-tree in Figure 2, all
identifiers in R, S, and T occur once, whereas any identifier of
U may occur as many times as distinct A∗ values in R, S, and
T. For the leftmost f-tree in Figure 7, all
identifiers in all relations occur once, since no relation has
non-relevant nodes.

Lemma 2 represents an effective tool to
further estimate the readability and size of T-factorisations. Our
results build upon existing bounds for query result sizes and yield
readability bounds which can be inferred statically from the
query. Lemma 2 can be potentially also
coupled with estimates on selectivities and various assumptions on
attribute-value
correlations [Muralikrishna:SIGMOD:1988, Poosala:VLDB:1997, Getoor:SIGMOD:2001, Re:Cardinality:2010]
to infer database-specific estimates on the readability.

7.1 Upper Bounds

Let D be a database, let Q=π¯A(σφ(R1×⋯×Rn)) be a query,
let T be an f-tree of Q, and let R be a relation in
Q. Denote by $N_R$ the set Non-relevant(R), by $\varphi_R$ the condition
$\varphi$ restricted to the attributes of $N_R$, by $Q_R$ the query
$\sigma_{\varphi_R}(\pi_{N_R}R_1\times\dots\times\pi_{N_R}R_n)$, and by
$D_R$ the database obtained by projecting each relation in
D onto the attributes of $N_R$.

Lemma 3.

The number of occurrences of any identifier r from R in the
T-factorisation of Q(D) is at most
||QR(DR)||.

The number of distinct tuples in an equi-join query such as QR can
be estimated in terms of the database size using the results in
[AGM08]. Intuitively, if we can cover all attributes of the query
$Q_R$ by some k of its relations, then $||Q_R(D_R)||$ is at most
the product of the sizes of these relations, which is in turn at most
$|D|^k$. This corresponds to an edge cover of size k in the
hypergraph of $Q_R$. The following result strengthens this idea by
lifting covers to a weighted version.

Definition 7.

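For an equi-join query $Q = \sigma_\varphi(R_1 \times \cdots \times R_n)$, the fractional edge cover number $\rho^*(Q)$ is the cost of an optimal solution to the linear program with variables $\{x_i\}_{i=1}^n$,

$$\begin{array}{lll}
\text{minimising} & \sum_i x_i & \\
\text{subject to} & \sum_{i \,:\, R_i \in r(A)} x_i \ge 1 & \text{for all attributes } A, \text{ and} \\
 & x_i \ge 0 & \text{for all } i.
\end{array}$$

□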
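For small queries, the linear program of Definition 7 can be solved exactly by enumerating the vertices of its feasible region. The following is a minimal sketch under stated assumptions: the names `solve_square` and `fractional_edge_cover_number` and the dictionary encoding of the query are hypothetical, and vertex enumeration is a simple stand-in for a proper LP solver rather than a method taken from the paper.

```python
from fractions import Fraction
from itertools import combinations

def solve_square(rows, rhs):
    """Solve rows * x = rhs exactly by Gauss-Jordan elimination.

    Returns None when the system is singular.
    """
    n = len(rows)
    m = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

def fractional_edge_cover_number(relations):
    """Optimal cost of the LP in Definition 7 for a small query.

    `relations` maps a relation name to its set of attribute classes.
    An optimum is attained at a vertex, so every choice of n tight
    constraints is tried and the cheapest feasible solution is kept.
    """
    names = sorted(relations)
    n = len(names)
    attributes = sorted(set().union(*relations.values()))
    # Each constraint is (coefficients, bound), read as coefficients . x >= bound.
    constraints = [([1 if a in relations[r] else 0 for r in names], 1)
                   for a in attributes]
    constraints += [([1 if i == j else 0 for j in range(n)], 0) for i in range(n)]
    best = None
    for tight in combinations(constraints, n):
        x = solve_square([c for c, _ in tight], [b for _, b in tight])
        if x is None:
            continue
        if all(sum(ci * xi for ci, xi in zip(c, x)) >= b for c, b in constraints):
            cost = sum(x)
            if best is None or cost < best:
                best = cost
    return best

# Hypothetical triangle-shaped query: each class is shared by two relations.
triangle = {"S": {"B", "C"}, "T": {"B", "D"}, "U": {"C", "D"}}
rho = fractional_edge_cover_number(triangle)   # Fraction(3, 2)
```

Exact rational arithmetic avoids rounding issues when the optimum is a fraction such as 3/2. The enumeration is exponential in the query size, so a dedicated LP solver would be preferable for larger queries.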

Lemma 4 ([AGM08]).

For any equi-join query Q and for any database D, we have
$\|Q(D)\| \le |D|^{\rho^*(Q)}$.

Corollary 1.

The number of occurrences of any identifier r from R in the
T-factorisation of Q(D) is at most $|D|^{\rho^*(Q_R)}$.

Proof.

By Lemma 3, the number of occurrences of
r in the T-factorisation of Q(D) is bounded above by
$\|Q_R(D_R)\|$. By Lemma 4, this is bounded
above by $|D_R|^{\rho^*(Q_R)}$, which is equal to
$|D|^{\rho^*(Q_R)}$.
∎

Corollary 1 gives an upper bound on the number of
occurrences of identifiers from each relation. Let M be the maximal
number of relations which can contain the same identifier, that is,
the maximal number of relations in Q mapping to the same relation
name by μ. Defining $f(T) = \max_R \rho^*(Q_R)$ to be
the maximum of $\rho^*(Q_R)$ over all relations R from Q, we
obtain an upper bound on the readability of the T-factorisation of
Q(D).

Corollary 2.

The T-factorisation of Q(D) is at most
read-$(M \cdot |D|^{f(T)})$.

By considering the T-factorisation with lowest readability, we obtain an upper bound on the readability of Q(D). Let $f(Q) = \min_T f(T)$ be the minimum of f(T) over all f-trees T for Q.

Corollary 3.

For any query Q and any database D, the readability of Q(D) is at most $M \cdot |D|^{f(Q)}$.

Since $M \le |Q|$, the readability of Q(D) is at most $|Q| \cdot |D|^{f(Q)}$.

Example 7.

For the query Q in Example 5 and the left f-tree in Figure 2, the relation U is the only one with a non-empty query $Q_U = \sigma_{\varphi_U}(\pi_{A_R} R \times \pi_{A_S} S \times \pi_{A_T} T)$, where the condition $\varphi_U$ is $A_R = A_S = A_T$. Since the other relations have empty covers (thus of cost zero), we conclude that their identifiers occur at most once in the factorisation of the query result. We can cover $Q_U$ with any non-empty subset of R, S, and T. A minimal edge cover consists of any one of these relations, and the number of occurrences of any identifier of U is thus at most linear in the size of that relation. The fractional edge cover number is also 1 and we obtain the same bound.

For the right f-tree in Figure 2, both R and S have non-empty queries $Q_R$ and $Q_S$ over their non-relevant attributes: $Q_R = Q_S = \sigma_\varphi(\pi_{E_T} T \times \pi_{E_U} U)$, where $\varphi$ is $E_T = E_U$. The attributes $E_T$ and $E_U$ can be covered by U or by T. A minimal cover thus has size 1. The minimal fractional edge cover also has cost 1.

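Now consider a different query over the relations $R(A_R, E_R)$,
$S(A_S, B_S, C_S)$, $T(A_T, B_T, D_T)$ and $U(C_U, D_U, E_U)$, given by
$\hat{Q} = \sigma_\varphi(R \times S \times T \times U)$, with $\varphi = (A_R = A_S = A_T,\ B_S = B_T,\ C_S = C_U,\ D_T = D_U,\ E_R = E_U)$.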

Consider the left f-tree $T_1$ shown in Figure 6. For the relation R, we have Non-relevant(R) $= \{B_S^*, C_S^*, D_T^*\}$, and hence the restricted query is $\hat{Q}_R = \sigma_{B_S = B_T,\, C_S = C_U,\, D_T = D_U}(\pi_{B_S, C_S} S \times \pi_{B_T, D_T} T \times \pi_{C_U, D_U} U)$. We need at least two of the relations S, T, U to cover all attributes of $\hat{Q}_R$, so the edge cover number is 2. However, in the fractional edge cover linear program, we can assign to each relation the value $x_S = x_T = x_U = 1/2$. The covering conditions at each attribute class are satisfied, since each attribute class belongs to two of the relations. The total cost of this solution is only 3/2. It is in fact the optimal solution, so $\rho^*(\hat{Q}_R) = 3/2$. It is easily seen that $\rho^*(\hat{Q}_T) = \rho^*(\hat{Q}_U) = 1$ (since $\hat{Q}_T$ can be covered by either S or U, and $\hat{Q}_U$ can be covered by either S or T) and $\rho^*(\hat{Q}_S) = 0$ (since $\hat{Q}_S$ has no attributes), so $f(T_1) = 3/2$. We obtain the upper bound $|D|^{3/2}$ on the number of occurrences of identifiers from R, and hence on the readability of any $T_1$-factorisation.
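To see why the cost 3/2 cannot be improved, the linear program for $\hat{Q}_R$ can be written out in full. The block below is a worked sketch in the notation of Definition 7; the variable names $x_S$, $x_T$, $x_U$ follow the text, and labelling each constraint by its attribute class is an assumption made for readability.

```latex
\begin{align*}
\text{minimise}\quad & x_S + x_T + x_U \\
\text{subject to}\quad
 & x_S + x_T \ge 1 && \text{(class } B_S^*\text{, shared by } S \text{ and } T\text{)} \\
 & x_S + x_U \ge 1 && \text{(class } C_S^*\text{, shared by } S \text{ and } U\text{)} \\
 & x_T + x_U \ge 1 && \text{(class } D_T^*\text{, shared by } T \text{ and } U\text{)} \\
 & x_S,\, x_T,\, x_U \ge 0 .
\end{align*}
% Adding the three covering constraints gives a matching lower bound:
\[
2\,(x_S + x_T + x_U) \ge 3
\quad\Longrightarrow\quad
x_S + x_T + x_U \ge \tfrac{3}{2},
\]
% which the assignment x_S = x_T = x_U = 1/2 attains.
```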

Note however that in the right f-tree $T_2$ in Figure 6, each of $\hat{Q}_R$, $\hat{Q}_S$, $\hat{Q}_T$ and $\hat{Q}_U$ is covered by only one of its relations, and hence $f(T_2) = 1$. Any $T_2$-factorisation will therefore have readability at most linear in |D|.

In fact, no f-tree T for $\hat{Q}$ has f(T) < 1, so $T_2$ is in this sense optimal and $f(\hat{Q}) = 1$.
□
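The bounds of Corollaries 2 and 3 can be evaluated numerically for the two f-trees of this example. This is a small sketch only: the function names, the database size of 10,000, and the choice M = 1 (assuming the four relations of the query have distinct relation names) are hypothetical, while the values of ρ∗ and f are the ones stated in Example 7.

```python
from fractions import Fraction

def f_of_tree(rho_by_relation):
    """f(T): the largest rho*(Q_R) over the relations R of the query."""
    return max(rho_by_relation.values(), default=Fraction(0))

def readability_upper_bound(db_size, f_value, max_shared=1):
    """Evaluate M * |D|^f, the form of the bound in Corollaries 2 and 3."""
    return max_shared * float(db_size) ** float(f_value)

# rho* values stated in Example 7 for the left f-tree T1 of Figure 6.
T1_rhos = {"R": Fraction(3, 2), "S": Fraction(0),
           "T": Fraction(1), "U": Fraction(1)}
f_T1 = f_of_tree(T1_rhos)
f_T2 = Fraction(1)            # stated in Example 7 for the right f-tree T2

db_size = 10_000              # hypothetical |D|
bound_T1 = readability_upper_bound(db_size, f_T1)   # grows like |D|^(3/2)
bound_T2 = readability_upper_bound(db_size, f_T2)   # grows like |D|
best_f = min(f_T1, f_T2)      # minimum over these two f-trees only
```

For this database size the two bounds differ by a factor of about 100, which is the gap between choosing the left and the right f-tree.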

7.2 Lower Bounds

We also show that the obtained bounds on the numbers of occurrences of identifiers are essentially tight. For any query Q and any f-tree T, we construct arbitrarily large databases for which the number of occurrences of some symbol is asymptotically as large as the upper bound.

The expression for the number of occurrences of an identifier, given
in Lemma 2, is the size of a specific
query result. As a first attempt to construct a small database D
with a large result for the query $Q_R$, we pick k attribute classes
of $Q_R$ and let each of them attain N different values. If each
relation has attributes from at most one of these classes, each
relation in D will have size at most N, while the result of $Q_R$
will have size $N^k$.

This corresponds to an independent set of k nodes in the hypergraph
of $Q_R$. We can again strengthen this result by lifting independent
sets to a weighted version. Since the edge cover and the independent
set problems are dual when written as linear programs,
this lower bound meets the upper bound from the previous
subsection. The following result, derived from results
in [AGM08], forms the basis of our argument.

Lemma 5.

For any equi-join query Q, there exist arbitrarily large databases D such that $\|Q(D)\| \ge (|D|/|Q|)^{\rho^*(Q)}$.
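The duality invoked before Lemma 5 can be made explicit. The block below states the linear programming dual of the program in Definition 7, which is the fractional relaxation of the maximum independent set problem; the variable names $y_A$ are an assumption, and $r(A)$ is used as in Definition 7.

```latex
\begin{align*}
\text{maximise}\quad & \sum_{A} y_A \\
\text{subject to}\quad
 & \sum_{A \,:\, R_i \in r(A)} y_A \le 1 && \text{for all relations } R_i, \text{ and} \\
 & y_A \ge 0 && \text{for all attributes } A.
\end{align*}
```

Both programs are feasible and bounded, so by strong duality their optimal costs coincide, and the optimal cost of this program is again $\rho^*(Q)$.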

Now let $Q = \pi_{\bar{A}}(\sigma_\varphi(R_1 \times \cdots \times R_n))$
be a query, let T be an f-tree of Q and let R be a
relation in Q. Define $N_R$, $\varphi_R$ and $Q_R$ as before. We can
apply Lemma 5 to the expression from
Lemma 2 to infer lower bounds for numbers
of occurrences of identifiers in the T-factorisation of Q(D).

Lemma 6.

There exist arbitrarily large databases D such that each identifier
from R occurs in the T-factorisation of Q(D) at least
$(|D|/|Q|)^{\rho^*(Q_R)}$ times.

We now lift the result of Lemma 6 from the identifiers
from R to all identifiers in the T-factorisation of Q(D).

Corollary 4.

There exist arbitrarily large databases D such that the
T-factorisation of Q(D) is at least read-$(|D|/|Q|)^{f(T)}$.

Finally, by minimising over all f-trees T, we find a lower bound on
readability with respect to statically chosen f-trees.

Corollary 5.

Let Q be a query. For any f-tree T of Q there exist arbitrarily
large databases D for which the T-factorisation of Q(D) is at
least read-$(|D|/|Q|)^{f(Q)}$.

Example 8.

Let us continue Example 7. For the left f-tree in Figure 2, an independent set of attribute classes covering the relations R, S, and T of the query $Q_U$ is $\{A_R^*\}$. Since $Q_U$ has only one attribute class, this is also the largest independent set, and the fractional relaxation of the maximum independent set problem also has optimal cost 1.

For the right f-tree in Figure 2 the situation is similar. A maximum independent set of attribute classes covering the relations T and U of the queries $Q_R$ and $Q_S$ is $\{E_T^*\}$ and has size 1.

The situation is more interesting for the query $\hat{Q}$. Recall that for the left f-tree $T_1$ in Figure 6, $\hat{Q}_R = \sigma_{B_S = B_T,\, C_S = C_U,\, D_T = D_U}(\pi_{B_S, C_S} S \times \pi_{B_T, D_T} T \times \pi_{C_U, D_U} U)$, its attribute classes being $N_R = \{B_S^*, C_S^*, D_T^*\}$. The maximum independent set for $\hat{Q}_R$ has size 1, since any two of its attribute classes are relevant to a common relation. However, the fractional relaxation of the maximum independent set problem allows the optimal cost to increase to 3/2. In this relaxation, we want to assign nonnegative rational values to the attribute classes, so that the sum of values in each relation is at most one. By assigning to each attribute class the value 1/2, the sum of values in each relation is equal to one, and the total cost of this solution is 3/2. This is used in the proof of Lemma 6 to construct databases D for which the identifiers from R appear at least $(|D|/3)^{3/2}$ times in the $T_1$-factorisation of $\hat{Q}(D)$, thus proving the upper bound from Example 7 asymptotically tight.

Since all f-trees T for $\hat{Q}$ have $f(T) \ge 1$, the results
in this subsection show that for any such f-tree T we can find
databases D for which the readability of the T-factorisation of
$\hat{Q}(D)$ is at least linear in |D|.
□
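The weighted construction for the restricted query of the left f-tree in Figure 6 can be checked on a small instance. The sketch below is one natural construction and may differ in detail from the proof of Lemma 6: each attribute class receives n values (corresponding to weight 1/2 for relations of size n²), only the three relations S, T, U of the restricted query are built, and |D| is taken to be their total number of tuples. The function names and the choice n = 6 are hypothetical.

```python
from itertools import product

def triangle_database(n):
    """Relations S(B,C), T(B,D), U(C,D), each the full product over n values."""
    values = range(n)
    S = set(product(values, values))
    T = set(product(values, values))
    U = set(product(values, values))
    return S, T, U

def join_size(S, T, U):
    """Number of distinct (b, c, d) with (b,c) in S, (b,d) in T, (c,d) in U."""
    return sum(1 for (b, c) in S for (b2, d) in T if b2 == b and (c, d) in U)

n = 6
S, T, U = triangle_database(n)
db_size = len(S) + len(T) + len(U)   # 3 * n^2 tuples in total
result = join_size(S, T, U)          # n^3 result tuples

# (|D|/3)^(3/2) = (n^2)^(3/2) = n^3, checked in integers to avoid rounding.
matches_bound = result ** 2 == (db_size // 3) ** 3
```

Each relation has n² tuples while the join has n³, so the result size equals (|D|/3) raised to the power 3/2, matching the exponent given by the fractional independent set of cost 3/2.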

7.3 Characterisation of Queries by Readability

Theorem 3.

Fix a query Q. For any database D, the readability of Q(D) is
$O(|D|^{f(Q)})$, while for any f-tree T of Q, there exist
arbitrarily large databases D for which the T-factorisation of
Q(D) is read-$\Theta(|D|^{f(Q)})$.

Corollary 6.

Fix a query Q. If Q is hierarchical, the readability of Q(D)
for any database D is bounded by a constant. If Q is
non-hierarchical, for any f-tree T of Q there exist arbitrarily large databases
D such that the T-factorisation of