OCTOBER 2017 | VOL. 60 | NO. 10 | COMMUNICATIONS OF THE ACM 87
query templates which are not fully specified: the predicates
defined on single tables are placeholders. The selectivity
of such a predicate, meaning the average fraction of tuples
satisfying the predicate, is unknown to our MPQ algorithm.
Hence, the selectivity of each predicate placeholder must be
represented by a parameter. Our algorithm finds all plans
realizing optimal cost trade-offs for each possible parameter value combination.

We generate the queries for our benchmark randomly.
We use the method described by Steinbrunn et al.
15 to
produce random queries that join a given number of
tables. The number of rows in each table and the selectivity of each predicate is chosen randomly according to
that method. We distinguish two classes of queries: chain
queries and star queries. For chain queries, the binary
join predicates connect query tables in a chain. For star
queries, the binary join predicates connect one table (the
middle of the “star”) to all other query tables. The number
of predicates is for both query classes one less than the
number of tables.

We describe the plan search space that our algorithm
considers. Our algorithm considers all possible orders in
which tables can be joined with only one restriction: whenever we have the choice between joining two relations that
are connected via a binary join predicate and joining two
relations where this is not the case then only joins of the first
category are considered. This restriction on the join order
is often used in query optimization.
14, 15 In addition to the
join orders, our algorithm considers different scan and join
operators. For scanning single tables on which a predicate
is defined, we consider a full scan and an index-based scan.
Which of the two operators is preferable depends on the
selectivity of the predicate. If the selectivity is low (few tuples
will satisfy the predicate) then the index scan is often preferable. If the predicate is satisfied for most tuples then the full
scan is more efficient. We model the selectivity of a predicate defined on a single table by a parameter. The optimal
choice for the scan operator therefore depends on the value
of that parameter. We consider two join operators: a distributed join and a single-node hash join. For sufficiently large
amounts of input data, the distributed join saves execution
time. On the other side, the distributed join requires to rent
more computational resources from the cloud provider and
is therefore more expensive. Hence, we can realize different tradeoffs between execution time and execution fees
by selecting between alternative join operators. We implemented our MPQ algorithm in Java 1. 7. We used Gurobi 5.6a
as linear solver. All experiments were executed on an iMac
equipped with an i5-3470S processor with 2. 9 GhZ and 16 GB
of RAM.

5. 2. Experimental results

Figure 5 shows our experimental results. Each data point inthat figure corresponds to the median value of 25 randomlygenerated test cases. We report optimization time, the num-ber of generated query plans (counting plans for the inputWe perform a similar analysis to determine the expectednumber of plans per subquery in MPQ. We consider linearcost functions. We denote the number of parameters by n.A linear function is therefore defined by a vector consist-ing of n + 1 components, specifying the function slope foreach parameter and a constant. We still denote the num-ber of considered plan cost metrics by m. Each query planis therefore associated with m linear functions. The multi-dimensional cost function of each query plan can thereforebe described by a matrix containing m × (n+ 1) components,specifying for each cost metric the cost slopes and a con-stant. Assume that we have two cost functions and that allconstants and slopes describing the first function are lowerthan the corresponding entries for the second cost function.Then the first cost functions has for each cost metric a lowerconstant cost component and a lower slope in each param-eter. In other words, the first cost function has lower valuesthan the second one for arbitrary parameter values and costmetrics. If both cost functions are associated with queryplans then the plan associated with the second function isclearly irrelevant.

We can exploit this fact as follows. Assume that we choose
an arbitrary number of D-dimensional vectors randomly
with independent identical distribution. Then the expected
number of vectors such that no other vector has a lower or
equivalent value in each component is bounded by 2D.
7 We
assume that vectors describing the cost functions of different query plans are chosen randomly with independent and
identical distribution. Setting D = m × (n + 1), we infer that
the expected number of vectors such that no other vector
has lower or equal values in all components is bounded by
2m×(n+ 1). As outlined before, this is at the same time an upper
bound on the expected number of relevant query plans per
subquery.

In order to obtain an upper bound on the asymptotic
space complexity, we multiply the aforementioned bound by
the number of subqueries. We generate new plans by combining two relevant plans. The number of generated plans
grows therefore as the square of the number of relevant
plans. All generated plans for the same subquery are compared pair-wise during pruning. The number of plan comparisons grows therefore as the fourth power of the number
of relevant plans. Multiplying by the number of subquery
splits yields the time complexity measured by the number of
plan comparisons.

5. EXPERIMENTS

5. 1. Experimental setup

We evaluate our MPQ algorithm experimentally. More precisely, we study how optimization time depends on the
input query size and on the number of considered parameters. Our experiments are based on an example scenario
in which SQL queries are processed in the cloud. Hence,
we compare alternative query plans according to two cost
metrics: execution time and monetary execution fees. We
consider a restricted class of SQL queries: each query is
described by a set of tables to join, by predicates defined
on single tables, and by binary join predicates defined on
table pairs. We assume that our MPQ algorithm is applied to a http://www.gurobi.com/.