Streaming Cache-Oblivious B-Trees

Introduction

The B-tree [3,12] is the classic external-memory
dictionary data structure, and it has held its preeminent position for
over three decades. Most B-tree implementations are, in fact, B$^+$-trees [12,14], in which the full keys are all stored in the
leaves, but for convenience we refer to all variations as ``B-trees.''
The B-tree is typically analyzed in a two-level memory model, called the
Disk Access Machine (DAM) model [1]. The DAM model
assumes an internal memory of size $M$ organized into blocks of size $B$ and an arbitrarily large external memory; the cost in the model is
the number of transfers of blocks between the internal and external memory.

An $N$-node B-tree supports searches, insertions, and deletions in
$O(\log_B N)$ transfers and supports scans of $L$ contiguous elements in $O(1 + L/B)$ transfers. An important characteristic of the B-tree is that
it is provably optimal for searching. A simple information-theoretic argument
shows that a search requires at least $\Omega(\log_{B+1} N)$
transfers.
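Concretely, the standard counting argument (our restatement, not the paper's wording) runs as follows: each block transfer brings at most $B$ keys into memory, so a comparison-based search learns at most $\log_2 (B+1)$ bits per transfer, while distinguishing among $N$ possible positions requires $\log_2 N$ bits. Hence the number of transfers $T$ satisfies

```latex
T \cdot \log_2 (B+1) \;\ge\; \log_2 N
\qquad\Longrightarrow\qquad
T \;\ge\; \log_{B+1} N \;=\; \Omega(\log_B N).
```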

In fact, there is a tradeoff between the cost of searching and inserting
in external-memory dictionaries [8], and B-trees represent only one
point on this trade-off space. Another point in the trade-off space is
represented by the buffered-repository tree (BRT) [10].
The BRT supports the same operations as a B-tree, but searches use $O(\log N)$ transfers and insertions use only $O((\log N)/B)$ amortized transfers. Thus, searches are slower in the BRT than in
the B-tree, whereas insertions are significantly faster.
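The mechanics behind the BRT's fast insertions can be illustrated with a minimal sketch: a binary search tree in which every node carries a buffer of pending insertions, so an insert only touches the root buffer, and an overflowing buffer is flushed one level down in bulk. All names here are ours, and the tiny `BUFFER_CAP` stands in for the $\Theta(B)$ buffer size of the real structure; this is an illustration of the idea, not the paper's implementation.

```python
BUFFER_CAP = 4  # stands in for the Theta(B) buffer size of a real BRT

class Node:
    def __init__(self, pivot=None):
        self.pivot = pivot        # keys < pivot route left, >= pivot route right
        self.left = self.right = None
        self.buffer = []          # pending inserts not yet pushed down

class BRT:
    def __init__(self):
        self.root = Node()

    def insert(self, key):
        # An insert touches only the root buffer; flushes move Theta(B)
        # elements one level down at a time, which is what yields the
        # O((log N)/B) amortized transfer bound.
        self.root.buffer.append(key)
        self._maybe_flush(self.root)

    def _maybe_flush(self, node):
        if len(node.buffer) <= BUFFER_CAP:
            return
        if node.pivot is None:
            # Bottom node: split its buffer around a median pivot.
            node.buffer.sort()
            node.pivot = node.buffer[len(node.buffer) // 2]
            node.left, node.right = Node(), Node()
        for key in node.buffer:
            child = node.left if key < node.pivot else node.right
            child.buffer.append(key)
        node.buffer = []
        self._maybe_flush(node.left)
        self._maybe_flush(node.right)

    def search(self, key):
        # A search must inspect the buffers on the whole root-to-leaf
        # path, hence O(log N) transfers rather than O(log_B N).
        node = self.root
        while node is not None:
            if key in node.buffer:
                return True
            if node.pivot is None:
                return False
            node = node.left if key < node.pivot else node.right
        return False
```

Note how the asymmetry of the tradeoff is visible directly in the code: `insert` does constant root work plus occasional bulk flushes, while `search` must stop and scan a buffer at every level on its path.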

More generally, Brodal and Fagerberg's data structure [8], which
we call the $B^{\varepsilon}$-tree, spans a large range of this tradeoff: the $B^{\varepsilon}$-tree is parameterized by $\varepsilon$ ($0 \le \varepsilon \le 1$) and supports insertions in
$O\!\left(\frac{\log_{1+B^{\varepsilon}} N}{B^{1-\varepsilon}}\right)$ amortized
transfers and searches in
$O(\log_{1+B^{\varepsilon}} N)$
transfers. Thus, when $\varepsilon = 1$ the $B^{\varepsilon}$-tree matches the performance of a B-tree, and when $\varepsilon = 0$, it matches the performance of a buffered repository tree.
An interesting intermediate point is $\varepsilon = 1/2$: searches are then slower by a factor of roughly 2, but insertions
are faster by a factor of roughly $\sqrt{B}/2$.
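The $\varepsilon = 1/2$ case can be checked directly from the bounds above, using $\log_{\sqrt{B}} N = 2 \log_B N$:

```latex
\text{search: } O\!\left(\log_{1+\sqrt{B}}\, N\right) = O\!\left(2 \log_B N\right),
\qquad
\text{insert: } O\!\left(\frac{\log_{1+\sqrt{B}}\, N}{\sqrt{B}}\right)
             = O\!\left(\frac{2 \log_B N}{\sqrt{B}}\right),
```

i.e., relative to a B-tree, searches lose a factor of roughly 2 while insertions gain a factor of roughly $\sqrt{B}/2$.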

This work explores this insert/search tradeoff in the cache-oblivious
(CO) model [13]. The cache-oblivious model is similar
to the DAM model, except that the block size $B$ is unknown to the coder or to the algorithm and therefore cannot
be used as a tuning parameter. The B-tree, buffered-repository tree, and
$B^{\varepsilon}$-tree are not cache oblivious; they are parameterized by the block size $B$.

There already exist several cache-oblivious dictionaries. The most well-studied
is the cache-oblivious B-tree [4,9,5]. The cache-oblivious
B-tree supports searches in
$O(\log_B N)$
transfers, insertions in
$O\!\left(\log_B N + \frac{\log^2 N}{B}\right)$ amortized
transfers, and range queries returning $L$ elements in
$O(\log_B N + L/B)$
transfers. Another cache-oblivious dictionary is a
cache-oblivious alternative to the buffered-repository tree, which we
call here the lazy-search BRT [2]. Although it is useful
in some contexts (such as cache-oblivious graph traversal), the lazy-search
BRT is unsatisfactory in one crucial way: searches are heavily amortized,
and the whole cost of searching is charged to the cost of previous insertions.
Indeed, any given search might involve scanning the entire data structure.

Results

This work introduces several cache-oblivious dictionaries that illustrate
different points in the insertion/search tradeoff. Specifically, we present
the following cache-oblivious data structures and results:

Shuttle Tree

The shuttle tree, our main result, retains the same asymptotic
search cost as the cache-oblivious B-tree while improving the insert cost.
Specifically, searches still take
$O(\log_B N)$
transfers, whereas insertions are reduced to
$O\!\left(\frac{\log_B N}{B^{\Theta(1/(\log\log B)^2)}} + \frac{\log^2 N}{B}\right)$
amortized transfers. This bound represents a speedup as long as
$B = \Omega(\log^{1+c} N)$, for any constant $c > 0$; this inequality typically holds for external-memory applications.
Range queries returning $L$ elements take
$O(\log_B N + L/B)$
transfers, which is asymptotically optimal.
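To see where such a condition on $B$ comes from, note that the insertion bound improves on the B-tree's $\Theta(\log_B N)$ exactly when the additive second term is low-order:

```latex
\frac{\log^2 N}{B} \;=\; o\!\left(\frac{\log N}{\log B}\right)
\quad\Longleftrightarrow\quad
\log N \cdot \log B \;=\; o(B),
```

which holds whenever $B \ge \log^{1+c} N$ for a constant $c > 0$, since then $\log N \le B^{1/(1+c)}$ and so $\log N \cdot \log B \le B^{1/(1+c)} \log B = o(B)$.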

This relatively complex expression for the cost of inserts can be understood
as follows: When the dominant term in the insertion cost is
$\frac{\log_B N}{B^{\Theta(1/(\log\log B)^2)}}$,
insertions run a factor of
$B^{\Theta(1/(\log\log B)^2)}$
faster in the shuttle
tree than in a B-tree or cache-oblivious B-tree. Observe that this speedup
of
$B^{\Theta(1/(\log\log B)^2)}$
is superpolylogarithmic and subpolynomial
in $B$. This speedup, while nontrivial, is not as large as the speedup in
the $B^{\varepsilon}$-tree.
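The claim that a factor of the form $B^{\Theta(1/(\log\log B)^2)}$ sits strictly between polylogarithmic and polynomial growth can be verified by comparing logarithms. For every constant $k$,

```latex
\log\!\left((\log B)^{k}\right) \;=\; k \log\log B
\;=\; o\!\left(\frac{\log B}{(\log\log B)^2}\right)
\;=\; o\!\left(\log\!\left(B^{\Theta(1/(\log\log B)^2)}\right)\right),
```

so the speedup dominates every polylog; conversely, $\log\!\left(B^{\Theta(1/(\log\log B)^2)}\right) = \Theta\!\left(\frac{\log B}{(\log\log B)^2}\right) = o(c \log B)$ for every constant $c > 0$, so it is dominated by every polynomial $B^{c}$.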

Lookahead Array

We give another data structure that we call a lookahead array.
The lookahead array is reminiscent of static-to-dynamic transformations [7]
and fractional cascading [11]. This data structure is parameterized
by a growth factor. If the growth factor is chosen to be
$B^{\varepsilon}$, then the lookahead array is cache aware, and achieves the same amortized
bounds as the $B^{\varepsilon}$-tree. If the growth factor is chosen to be a constant such
as 2, then the lookahead array is cache-oblivious and matches the performance
of the BRT. We call this version the cache-oblivious lookahead array
(COLA). Unlike the BRT, the COLA is amortized, and any given
insertion may trigger a rearrangement of the entire data structure.
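A minimal sketch of the COLA with growth factor 2 makes the amortized behavior above concrete. Level $k$ is either empty or a sorted array of exactly $2^k$ keys; an insertion merges full levels downward, much like carries in a binary counter, so an unlucky insertion can indeed rearrange every level. Names and the per-level search strategy here are our illustration (we search each level independently rather than using the fractional-cascading machinery):

```python
import bisect

class COLA:
    def __init__(self):
        self.levels = []  # levels[k] is a sorted list of 2**k keys, or []

    def insert(self, key):
        # Merge the carried run into successive full levels until an empty
        # level is found.  Each key is rewritten O(log N) times over its
        # lifetime, which is the source of the O((log N)/B) amortized bound.
        carry = [key]
        k = 0
        while k < len(self.levels) and self.levels[k]:
            carry = self._merge(carry, self.levels[k])
            self.levels[k] = []
            k += 1
        if k == len(self.levels):
            self.levels.append([])
        self.levels[k] = carry

    @staticmethod
    def _merge(a, b):
        # Standard two-way merge of sorted lists: sequential scans only,
        # which is what makes the structure disk-friendly.
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

    def search(self, key):
        # Binary-search each nonempty level: O(log N) block transfers.
        for level in self.levels:
            if level:
                pos = bisect.bisect_left(level, key)
                if pos < len(level) and level[pos] == key:
                    return True
        return False
```

For example, after seven insertions the keys occupy the levels whose sizes correspond to the binary representation of 7 (levels of size 1, 2, and 4), and a search probes each of those levels in turn.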

For disk-based storage systems, range queries are likely to be faster
for a lookahead array than for a BRT because the data is kept contiguously
in arrays, taking advantage of inter-block locality, rather than scattered
on blocks across disk. This is the same reason why the cache-oblivious
B-tree can support range queries nearly an order of magnitude faster than
a traditional B-tree; see, e.g., [6].

Deamortized Lookahead Array

We show how to deamortize the lookahead array and the cache-oblivious
lookahead array. Thus, we obtain the first cache-oblivious alternative
to the BRT in which searches are not amortized and the worst-case cost
for an insert is no more than the cost of a search.

Experiments

We next show how efficiently the COLA performs. We implemented a COLA
and compared it with a B-tree. For databases in external memory, the COLA
was 90 times faster than the B-tree for random inserts, 2.5 times slower
for sorted inserts, and 1.7 times slower for searches.

We also found that design decisions as simple as merging elements largest
first or smallest first had a greater impact on performance than, for
example, the difference in cost of inserting in random or sorted order.
We believe factors such as disk scheduling and prefetching are responsible.
We examined several other interesting factors that had an impact on performance.

Research Support

This research was supported in part by NSF Grants CCF-0621511 and CCF-0541209,
and a Google research award.

References:

[1] Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity
of sorting and related problems. Communications of the ACM,
31(9):1116--1127, September 1988.