Randomized Algorithms for Optimizing Large Join Queries

One-line summary:
Two randomized algorithms (iterative improvement and simulated annealing),
as well as a hybrid, are presented, analyzed, and measured; the hybrid
scheme seems to win.

Overview/Main Points

State space: each state in query optimization corresponds
to an access plan (strategy) for the query being optimized. Here,
plans are join processing trees, where leaves are base relations,
internal nodes are join operators, and edges indicate the flow of
data.

Neighbours: the neighbours of a state are determined by a set
of transformation rules (e.g., join commutativity and
associativity). Applying a rule usually changes the cost of the
tree only locally, so recomputing costs is cheap.
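As a concrete illustration (a sketch, not the paper's code), join trees can be represented as nested tuples and neighbours generated by commutativity and associativity rewrites; the representation and function name here are assumptions:

```python
def neighbours(tree):
    """Generate neighbour states of a join tree via two classic
    transformation rules: commutativity (A JOIN B -> B JOIN A) and
    associativity ((A JOIN B) JOIN C -> A JOIN (B JOIN C)).
    Leaves are relation names (strings); internal nodes are pairs."""
    if isinstance(tree, str):
        return                               # a base relation has no rewrites
    left, right = tree
    yield (right, left)                      # commutativity
    if isinstance(left, tuple):              # re-associate to the right
        a, b = left
        yield (a, (b, right))
    if isinstance(right, tuple):             # re-associate to the left
        b, c = right
        yield ((left, b), c)
    for sub in neighbours(left):             # a rule applied in a subtree
        yield (sub, right)                   # changes cost only locally
    for sub in neighbours(right):
        yield (left, sub)
```

Because each rewrite touches only one internal node and its children, the cost delta for a neighbour can be computed from that local spot alone.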

Iterative improvement (II): start at a random state and
repeatedly move to a lower-cost neighbour until a local
minimum is reached; restart from fresh random states and
keep the cheapest local minimum found.

Simulated annealing (SA): start at a random state and
repeatedly move to a random neighbour; if the neighbour
is cheaper, go there; if not, go there anyway with a
probability directly related to the
temperature and inversely related
to the cost increase

periodically reduce temperature

stop when frozen (almost no uphill moves are accepted)
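A minimal sketch of the two base algorithms over an abstract state space; the interfaces and parameter values are illustrative assumptions, not the paper's:

```python
import math
import random

def iterative_improvement(random_state, neighbours, cost, restarts=10):
    """II sketch: greedy descent from several random starts; keep the
    cheapest local minimum found."""
    best = None
    for _ in range(restarts):
        s = random_state()
        improved = True
        while improved:
            improved = False
            for n in neighbours(s):
                if cost(n) < cost(s):       # move to any cheaper neighbour
                    s, improved = n, True
                    break
        if best is None or cost(s) < cost(best):
            best = s
    return best

def simulated_annealing(start, neighbours, cost,
                        t=2.0, t_min=0.05, alpha=0.9, steps=30):
    """SA sketch: always accept downhill moves; accept an uphill move
    with probability exp(-delta / t), i.e. directly related to the
    temperature t and inversely related to the cost increase delta."""
    s = best = start
    while t > t_min:                        # stop when frozen
        for _ in range(steps):
            n = random.choice(list(neighbours(s)))
            delta = cost(n) - cost(s)
            if delta <= 0 or random.random() < math.exp(-delta / t):
                s = n
            if cost(s) < cost(best):
                best = s
        t *= alpha                          # reduce temperature
    return best
```

On a toy convex cost surface both converge to the minimum; the interesting differences only show up on rugged spaces like the join-tree space above.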

Hybrid - two-phase optimization (2PO):

Run II for some period of time to find a number of
local optima

take the best local optimum and use it as the starting
state of SA, but run SA with a low initial temperature
so it doesn't jump around too much
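A self-contained sketch of the two phases on a toy state space; all names and parameters here are illustrative assumptions, not the paper's:

```python
import math
import random

def greedy_descent(s, neighbours, cost):
    """Phase-1 primitive: II-style descent to a local minimum."""
    while True:
        better = [n for n in neighbours(s) if cost(n) < cost(s)]
        if not better:
            return s
        s = min(better, key=cost)

def two_phase_optimize(random_state, neighbours, cost, restarts=5, t0=0.1):
    # Phase 1: run II from several random starts; keep the best local optimum.
    best = min((greedy_descent(random_state(), neighbours, cost)
                for _ in range(restarts)), key=cost)
    # Phase 2: SA from that state with a LOW starting temperature (t0),
    # so the search only explores the neighbourhood of the optimum.
    s, t = best, t0
    while t > 0.001:
        for _ in range(20):
            n = random.choice(list(neighbours(s)))
            delta = cost(n) - cost(s)
            if delta <= 0 or random.random() < math.exp(-delta / t):
                s = n
            if cost(s) < cost(best):
                best = s
        t *= 0.9
    return best
```

The low starting temperature is the key design choice: it lets SA escape the shallow rim around II's local optimum without wandering far from it.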

Performance:

experiments were run with random queries (star and tree
queries, ranging between 5 and 100 joins). Three
different relation catalogs of varying relation
cardinalities, unique values, and selectivities were
used.

in nearly all cases, 2PO beats SA, and SA beats II.
Although II decreases cost quite quickly at first, it
converges to the lowest global cost very slowly, if at
all. SA does eventually converge to the lowest global
cost, but takes a while to get there. 2PO gets the best
of both worlds: II's fast initial descent and SA's
high-quality final plans.

State space analysis:

Some (rather cheesy) experiments were done to try to
quantify the shape of the very high-dimensional state
space.

Their conclusion was that the state space should be
visualized as a cup whose floor contains many local
minima close to one another.

Relevance

Randomized algorithms work well for query optimization; they produce query
plans that are competitive in quality with those from exhaustive search,
but in far less time.

Flaws

The assumptions made for the cost functions seem highly
unrealistic: no pipelining of joins, minimal buffering, no
duplicate elimination.

Scant support is given for the claim that the execution
time savings of using r-local minima outweigh the potential
misses of true local minima.

The state space analysis was just totally unbelievable. They
leap to conclusions about a potentially very hairy topology with
nearly no data at all.

Are these the only randomized algorithms that are applicable?
What others should be explored?

The workloads they ran the optimizations against were generated
with healthy doses of randomness, leading me to suspect that they
had little to do with the real world. How do these algorithms
perform with real-world workloads?