Query Evaluation Techniques for Large Databases (7)

One-line summary:
Section 7 of this paper compares the techniques, costs, and
implications of sort-based and hash-based query processing
algorithms.

Overview/Main Points

In-memory algorithms: if the data set fits in memory, quicksort
is the sorting algorithm of choice, and classic in-memory hashing
is the hashing technique of choice.
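
A toy illustration of the two in-memory choices, using duplicate
removal as the example (Python's built-in sort is Timsort rather
than quicksort, but the shape of the choice is the same):

    def dedup_by_sorting(records):
        # Sort in memory, then drop adjacent duplicates in one pass.
        records = sorted(records)
        return [r for i, r in enumerate(records)
                if i == 0 or r != records[i - 1]]

    def dedup_by_hashing(records):
        # Classic in-memory hashing: a table of keys seen so far.
        seen, out = set(), []
        for r in records:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out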

Divide-and-conquer paradigm: in sorting, a large data set is
divided physically into subsets (chunks as large as memory),
which are then combined by logical merging. In hashing, a large
data set is partitioned logically by hash value, and the
partitions are combined physically by simple concatenation (see
the sketch below).
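
A minimal sketch of this duality, with Python lists standing in
for disk-resident files and a hypothetical MEMORY record budget
(real systems work in pages, not records):

    import heapq

    MEMORY = 1000  # pretend at most this many records fit in memory

    def external_sort(records):
        # Divide physically: memory-sized chunks, each sorted in memory.
        runs = [sorted(records[i:i + MEMORY])
                for i in range(0, len(records), MEMORY)]
        # Combine logically: merge the sorted runs.
        return list(heapq.merge(*runs))

    def hash_partition(records, fanout=8):
        # Divide logically: assign each record to a partition by hash.
        partitions = [[] for _ in range(fanout)]
        for r in records:
            partitions[hash(r) % fanout].append(r)
        # Combine physically: the output is simple concatenation.
        return [r for p in partitions for r in p]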

I/O Patterns: writing runs after in-memory sorting is sequential,
but merging the runs requires random reads, because the merge
alternates among the run files. Hash partitioning writes
randomly, because output alternates among the partition files,
but reading a partition back in is a sequential read.

I/O Optimizations: sorting can take advantage of read-ahead,
controlled by forecasting, to reduce I/O delay; hashing can use
write-behind when writing partitions. Double-buffering and
striping can be used in both. The goal in both cases is to
balance the load on I/O and CPU so the system stays fully
utilized (a double-buffering sketch follows).
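
A toy sketch of double-buffering, assuming a hypothetical
read_chunk() callable that returns the next chunk or None at end
of input (a real implementation would issue asynchronous I/O
against forecasted pages):

    import threading

    def double_buffered(read_chunk):
        # Yield each chunk while the next one is read in the
        # background, overlapping I/O with the consumer's CPU work.
        buf = read_chunk()
        while buf is not None:
            slot = [None]
            t = threading.Thread(
                target=lambda: slot.__setitem__(0, read_chunk()))
            t.start()      # read-ahead starts immediately...
            yield buf      # ...while the consumer processes this chunk
            t.join()
            buf = slot[0]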

Very large inputs: for very large input sets, sorting uses
multi-level merging and hashing uses hierarchical/recursive
partitioning. There is some question about the fan-in of the
final merge in sorting, and about the size of the leaf partitions
in hashing; both should use memory maximally, but naive recursion
does not guarantee that the lowest-level runs or partitions will
do so. Merge optimizations (for sorting) and bucket tuning and
hybrid hash joins (for hashing) are techniques that ensure this.
Replacement selection (for sorting) is another optimization that
makes better use of memory; it is sketched below.
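
A sketch of replacement selection, assuming comparable records.
On randomly ordered input the runs average about twice the memory
size, because a newly read record that is no smaller than the one
just written can still join the current run:

    import heapq
    from itertools import islice

    def replacement_selection(records, memory=1000):
        it = iter(records)
        # Heap entries are (run number, record), so the heap always
        # yields the smallest record eligible for the current run.
        heap = [(0, r) for r in islice(it, memory)]
        heapq.heapify(heap)
        runs, run, current = [], [], 0
        while heap:
            run_no, r = heapq.heappop(heap)
            if run_no != current:      # the current run is exhausted
                runs.append(run)
                run, current = [], run_no
            run.append(r)
            nxt = next(it, None)
            if nxt is not None:
                # A record smaller than the one just written must
                # wait for the next run; anything else extends this one.
                heapq.heappush(heap, (current + (nxt < r), nxt))
        if run:
            runs.append(run)
        return runs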

Aggregation/duplicate removal: with hashing, if the hash is on
the aggregation key, aggregation and duplicate removal can be
done within the hash buckets, and since the operation's output
may fit in memory even when its input does not, classic in-memory
hashing can suffice. With sorting, a similar trick can be played
if replacement selection is used for run generation, aggregating
duplicates as the runs are formed. A sketch of the hashing case
follows.
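
A minimal in-memory sketch of hash aggregation (hypothetical
helper; a real operator would spill oversized partitions to
disk):

    def hash_aggregate(rows, key, init, update):
        # One accumulator per distinct key: only the *output* must
        # fit in memory, however large the input stream is.
        table = {}
        for row in rows:
            k = key(row)
            table[k] = update(table.get(k, init), row)
        return table

    # Example: count rows per department.
    # hash_aggregate(rows, key=lambda r: r["dept"],
    #                init=0, update=lambda acc, r: acc + 1)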

Algorithm phases: sorting proceeds in three phases: run
generation, intermediate merging, and final merging. Hashing
likewise has three: initial partitioning, intermediate
partitioning, and a final phase in which hybrid/in-memory hash
methods process the partitions to produce output. If iterator
interfaces are used for input and output, these phases may
execute in parallel, on demand, as sketched below.
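
A generator-based sketch of the sort phases behind an iterator
interface (names are hypothetical): run generation must consume
the whole child input, but the final merge yields rows on demand
and so overlaps with the consumer:

    import heapq

    def sort_iterator(child, memory=1000):
        runs, buf = [], []
        for row in child:              # phase 1: run generation
            buf.append(row)
            if len(buf) >= memory:
                runs.append(sorted(buf))
                buf = []
        if buf:
            runs.append(sorted(buf))
        # (Phase 2, intermediate merging, would go here when the
        # number of runs exceeds the merge fan-in.)
        yield from heapq.merge(*runs)  # final merge: on demand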

Resource sharing: depth-first partitioning in hashing (the dual
of eager merging in sorting) means the final, in-memory phase of
hashing may execute while the initial phases are not yet
finished; this leads to poor memory utilization. Breadth-first
partitioning (the dual of lazy merging in sorting) avoids this.

Partitioning skew and effectiveness: in hashing, a skewed hash
function produces partitions of different sizes, which hurts
performance. Similarly, merging run files of very different sizes
is bad for sorting.

Bit vector filtering: if a one-to-one match operation (such as a
join) is to be performed, a bit vector (a large hash-indexed bit
array) built from the first input can be used to detect early,
and reject, items in the second input that cannot possibly have a
match in the first. Bit vector filtering can be applied at each
level of multi-level partitioning or merging. It can also be used
in both directions: the first input can likewise be reduced using
a bit vector built from the second input. A sketch follows.
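
A minimal bit-vector filter sketch: a "no" answer is definitive,
while a "yes" may be a false positive, so surviving items still
go through the real match (sizes and hash function are
illustrative):

    BITS = 1 << 20  # bigger vectors give fewer false positives

    def build_filter(first_input, key):
        bitmap = bytearray(BITS // 8)
        for row in first_input:
            h = hash(key(row)) % BITS
            bitmap[h >> 3] |= 1 << (h & 7)
        return bitmap

    def survives(bitmap, row, key):
        # False => definitely no match in the first input: drop early.
        h = hash(key(row)) % BITS
        return bool(bitmap[h >> 3] & (1 << (h & 7)))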

Interesting orderings - multiple joins: merge joins based on
sorting produce output in sorted order, so a chain of merge joins
on the same key can be more efficient, each join's output
arriving pre-sorted for the next (see the sketch below). This is
more difficult with hashing, since hash algorithms produce output
in unpredictable order; instead, the N inputs of a multi-way join
are processed in parallel, effectively producing N-tuple
partitions, which is complex. A similar story holds for
aggregation followed by a join, since sort-based aggregation
again outputs data in sorted order.
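
A sketch of a merge join over inputs already sorted on the join
key; note the output emerges sorted on that same key, which is
exactly the "interesting ordering" a later merge join or
sort-based aggregation can reuse without re-sorting:

    def merge_join(left, right, key=lambda r: r[0]):
        # Both inputs must be sorted on key(); the output is too.
        i = j = 0
        while i < len(left) and j < len(right):
            lk, rk = key(left[i]), key(right[j])
            if lk < rk:
                i += 1
            elif lk > rk:
                j += 1
            else:
                # Emit all right rows with this key for this left
                # row; rescanning handles duplicates on both sides.
                j2 = j
                while j2 < len(right) and key(right[j2]) == lk:
                    yield (left[i], right[j2])
                    j2 += 1
                i += 1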

Relevance

The section points out a large number of similarities and
contrasts between sort-based and hash-based query processing
algorithms. It details the situations in which one is better than
the other, and the situations in which care must be taken for
either to work well, which is helpful.

Flaws

The logical organization of this section is meandering and
without clear direction, which makes it hard to glean the
high-level concepts and their implications.

It is hard to tell which of the observations in this section are
important, with real-world impact and implications, and which are
not.