Lecture 22: B-Trees

Suppose you have a database that is too large to be kept in memory, so it
is resident on disk. You want to support the operations SEARCH, ADD,
and DELETE.

Data on disk is organized in blocks of fixed size: typically
512 bytes, 1K Bytes, or 2K Bytes. The disk reader reads and writes one
block at a time. Reading a block from disk typically takes about 1 million
times as long as an in-memory operation. Therefore, in designing
algorithms for data structures kept on disk, the primary consideration
is minimizing the number of disk reads, and you are willing to trade off
an awful lot of in-memory computation (though there are limits) in order
to save a single disk read.

A B-tree is a version of a 2-3 tree, in which each node of the tree is a block.
The 2-3 tree structure is modified to pack the maximum amount of information
into each block and thus reduce the total number of block reads.

Since we're talking about fixed-size blocks, we need to be systematic about
memory size, for the only time in this course. All measurements are in bytes.
Let S be the size of a block; for our numbers, S = 1K Bytes.

The records in the database are of length R bytes, including the
keys. Let's say R = 16 Bytes: 4 bytes of key, 12 bytes of other data.

The address of a block on disk is A bytes. If you have a 4-Terabyte
disk, and 1K Byte blocks, then you have 4G blocks, so the size of
a block address A = 32 bits = 4 Bytes.

In a B-tree there are two kinds of nodes (each node is one block).

The leaves. These hold the actual database records. The maximum number
of records per leaf is S/R. Call this Q. For our numbers Q = 64.

The internal nodes. An internal node P with children C1 ... Ck
stores, for each child Ci: (a) the disk address of Ci (A bytes) and
(b) the value of the smallest key in the subtree under Ci (4 bytes),
except for the first subtree.
Therefore the maximum number of children is S/(A+4). Call this number B
(this is the B of B-tree). For our numbers, B = 128.

As we observed with 2-3 trees, the smallest key in each leaf appears as
a tag in exactly one ancestor node, except the very smallest key in the
tree, which is not a tag anywhere. This will be important below.

The rules for the tree are:

Each leaf node is at least half full (unless the entire database has
fewer than Q/2 records --- we will ignore that case henceforth).
That is, it has between ceil(Q/2) and Q records.

Each internal node other than the root is at least half full; that is,
it has between ceil(B/2) and B children.

The root has at least 2 children; that is, it has between 2 and B
children.
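
To make the layout concrete, here is a minimal sketch in Python of the
two node formats, using the numbers above. The class and field names are
invented for illustration; on disk each node is just a packed block of
S bytes.

    S = 1024          # bytes per block
    R = 16            # bytes per record (4-byte key + 12 bytes of data)
    A = 4             # bytes per disk address
    K = 4             # bytes per key

    Q = S // R        # max records per leaf: 64
    B = S // (A + K)  # max children per internal node: 128

    class Leaf:
        def __init__(self):
            self.records = []   # up to Q (key, data) pairs, sorted by key

    class Internal:
        def __init__(self):
            self.children = []  # up to B children (disk addresses on disk;
                                # node references in this in-memory sketch)
            self.tags = []      # tags[i] = smallest key under children[i+1]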

Height of the tree

In the bushiest tree of height H, every internal node, including the root,
has B children, so there are B^H leaves, each holding Q records. Then the
number of records N <= B^H * Q, so

H >= [log(N) - log(Q)]/log(B).

(In the skinniest tree of height H, the root has 2 children, every other
internal node has B/2 children, and there are H-1 levels of internal nodes;
the corresponding calculation gives H <= 1 + [log(N) - log(Q)]/log(B/2).)

With Q=64 and B=128, if N = 1 billion then H >= 4. If N = 1 trillion, then
H >= 5.
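
A quick numeric check of this bound, in Python:

    import math

    Q, B = 64, 128
    for N in (10**9, 10**12):
        # H >= log(N/Q)/log(B); round up, since H is an integer.
        print(N, math.ceil(math.log(N / Q) / math.log(B)))
    # prints 4 for a billion records, 5 for a trillion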

The root is held in memory. If the database is heavily used, the first and
second levels down are also held in memory. For our numbers, keeping two
levels requires about 10 MBytes. Three levels would require 2 GBytes --
feasible if the computer is dedicated to serving this database. Let L be
the number of levels held in memory.

You also probably cache some number of
recently used blocks below level L in memory,
in case you need them again soon. But this can only be a tiny fraction of all
the blocks below level L, because the branching factor B is so large.

Also worth noting: the ratio between the space required by the B-tree and
the space required by the raw data is somewhere between 1+1/B and 2+4/B,
depending almost entirely on how full the leaf nodes are. (With full leaves
there are N/Q leaf blocks and roughly N/(Q*B) internal blocks; with
half-full leaves, 2N/Q leaf blocks and roughly 4N/(Q*B) internal blocks.)
So the cost in extra space is small if the leaves are tightly packed, and
a factor of 2 in the worst case.

Special case: In a 2-3 tree with N leaves, there are between approximately
N/2 internal nodes (in the bushiest tree, with 3 children per node) and
N-1 internal nodes (in the skinniest tree, with 2 children per node).

Algorithms

The algorithms are essentially the same as for 2-3 trees.

Search

Use the keys at each node to find your way down the tree, and search for
the key in the leaf. Note that if you read in the node as an array, you
can do binary search among the keys in the node to find the proper
subtree.
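
For instance, with the invented node layout sketched earlier, the binary
search is one call to Python's bisect module:

    from bisect import bisect_right

    def child_for_key(node, key):
        # tags[i] is the smallest key under children[i+1], so the proper
        # subtree is the last child whose tag is <= key.
        return node.children[bisect_right(node.tags, key)]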

Number of disk reads: the height of the tree minus the number of levels
kept in memory, H-L. Machine operations: the number of levels is
log(N)/log(B), and at each level you are doing log(B) operations for a
binary search, so a total of O(log(N)).

Adding a new record

Find where the new record should go. If the leaf is overfull, split it into
two half-full leaves. Add these to the parent. If the parent is overfull,
split it into two. Continue upward. If you get to the root, split the root
into two, and create a new root with these two as children
(this is why the root is allowed to have as few as 2 children).
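
Here is a minimal sketch of the split step for a leaf, on the invented
in-memory layout from earlier (real code would also write both halves and
the parent back to disk):

    def split_leaf(leaf, parent):
        # Split an overfull leaf into two half-full leaves and hand the
        # new leaf's smallest key up to the parent as a tag.
        mid = len(leaf.records) // 2
        new = Leaf()
        new.records = leaf.records[mid:]
        leaf.records = leaf.records[:mid]
        i = parent.children.index(leaf)
        parent.children.insert(i + 1, new)
        # tags[j] tags children[j+1], so the new tag goes at index i.
        parent.tags.insert(i, new.records[0][0])
        # If the parent now has more than B children, split it the same
        # way, moving half the children and half the tags.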

Number of disk operations: Reads: H-L. Writes: in the worst case, 2*H.
(Caching doesn't help with writes; you have to write out a block as soon
as it is changed.) Average case: O(1). Machine operations: at each level
of the tree, O(B) for insertion into an array of length B; hence
O(B*height) = O(B*log(N)/log(B)) operations in total.

Note that B/log(B) is an increasing function of B, so the machine operations
increase for larger B. That is why, for an in-memory data structure, you use
2-3 trees rather than B-trees.

Also, unlike 2-3 trees, it is no longer important to be clever about
passing the keys up the tree; you can use the obvious dumb algorithm to
adjust the key labels at each node. Why?

Deleting

Find the leaf node N with the record to be deleted.
Delete the record, if it's there.
while (N is now less than half full) {
    if (either neighboring sibling is more than half full) {
        pick the fuller of the two neighboring sibs, if there are two;
        move enough records (or children) from that one to N that the
        two have equal numbers (both at least half full);
        break;
    }
    else {  // neither neighboring sibling has anything to spare
        P = N.parent;
        move all of N's records (or children) to a neighboring sibling;
        delete N;
        N = P;
    }
}
If the root is left with only one child, delete the root and make its
child the new root.

The details of the code are messy, as you can imagine. Depending on how the
B-tree is to be used there are a lot of optimizations and tuning you can do.
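
As one example of this tuning, here is a sketch of the borrowing step for
two neighboring leaves, again on the invented in-memory layout; note that
the parent tag separating the two leaves must be updated:

    def borrow(left, right, parent):
        # Redistribute records between two neighboring leaves so that
        # both end up at least half full.
        pool = left.records + right.records
        mid = len(pool) // 2
        left.records, right.records = pool[:mid], pool[mid:]
        # right is children[i]; tags[i-1] holds its smallest key.
        i = parent.children.index(right)
        parent.tags[i - 1] = right.records[0][0]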

Number of disk operations: Reads: either 3*(H-L) (the path to
the item to be deleted, plus both neighboring siblings at each level) or
2*(H-L) (if you keep more information at the parent node about the
children).
Writes: best case 1; worst case 2*H (each node on the path and one sibling).

Doing N adds

Suppose you do N adds. How many disk operations?

Reads: N*(H-L). The tree stays at height H much longer than it takes to go
through heights 0 ... H-1. So nearly all the N adds are done when the tree is
already at height H. But see below.

Writes: Here it pays to count a little carefully.

If you add an item to a leaf and the leaf doesn't split, then you do either
one or two writes. You certainly have to rewrite the leaf. If the item is
now the smallest item in the leaf, you have to rewrite the tag in the
internal node where the tag for the smallest item in this leaf appears.
So over the whole course of doing the N adds, these kinds of rewrites
involve somewhere between N and 2N writes; much closer to N if the data
arrives randomly, but 2N if the data arrives in backward sorted order.

If a node N splits and a new node N1 is created then there are three
writes involved: N, N1, and the common parent of N and N1 (which itself
may split, but we'll do that accounting separately). Each such event,
involving 3 writes, has the effect of creating one new node. Therefore
the total number of writes of this kind is at most 3 * the number of nodes
in the tree = about 6*N/Q. This holds best case, worst case, all cases.
So the total number of writes is somewhere between
(1+6/Q)*N and (2+6/Q)*N.

However if we consider the fact that some of the nodes below level L are
cached then that might change the number of reads. It can't change the number
of writes; when the algorithm says you have to write, you have to write, or
risk losing your work. But when it says you have to read, you may not have to
read; the block may be already cached. If the data arrives in random order,
then the probability that a block below level L is cached is tiny, so that
doesn't affect the calculation. But if there is a high degree of
locality --- that is, items close together tend to arrive together
--- then there may be a substantial probability that the blocks you need to
read are already in memory, and then the total number of reads may be much
less than the above calculation.

The extreme case is where the data arrives in sorted order. In that case,
the blocks you need are always in memory, and zero reads are required;
only writes.

Fun things to do with 2-3 trees

There are all kinds of things you can do with 2-3 trees, especially if
you're willing to supplement them with some additional data structures.

Search. O(log(N)).

Add. O(log(N)).

Delete. O(log(N)).

Enumerate the elements in order. Do a depth-first traversal. Time: O(N).

Sorting algorithm. Add the items, then enumerate. Time O(N * log(N)) in
the worst case.
As a sorting algorithm this has no advantages whatsoever: it has a much
higher constant factor than heapsort, quicksort, or mergesort; it is not
in place; and it is much nastier to code up.

Construct a 2-3 tree from an ordered list. Time O(N).

Previous and next element. Go up and down the tree. Time O(log N).
However, if you're going to be doing this a lot, it pays to connect
the leaves in a doubly linked list. This can be done without increasing the
running time of the other operations. If this is done, then previous and
next are O(1).

Minimum, maximum. Go to the leftmost/rightmost descendant of the root.
Time O(log(N)). If you need to do this often, keep a global pointer to
the min and max, and update when you add and delete. Time: O(1).

Indexing: Find the Ith largest element. Modify the tree so that
each node shows the size of each of its subtrees. Modify the add and
delete operations to maintain these. Then proceed as in problem
set 6, problem 2.C (a sketch appears at the end of this section).
Time: O(log(N)).

Anti-indexing. For an item X, find its rank in the set. Label the
tree as in the indexing task. As you search for X, sum up the sizes of
all the sibling subtrees you pass on the left. Time: O(log(N)).

Intersect sets U and V (similarly union, set difference).

Loop through the elements of U, look each one up in V. Time O(U log(V))

Loop through the elements of V, look each one up in U. Time O(V log(U))

Loop through the elements of U and V in parallel, using the two-fingered
method (sketched below). Time O(U + V).

If sizes are recorded, find which of those three is smallest and use that.
Time O(min(U log(V), V log(U), U+V)).
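
Here is a sketch of the two-fingered method, with plain sorted Python
lists standing in for the enumerations of the two trees:

    def intersect(u, v):
        # Walk one finger down each sorted list, advancing the smaller.
        out, i, j = [], 0, 0
        while i < len(u) and j < len(v):
            if u[i] < v[j]:
                i += 1
            elif u[i] > v[j]:
                j += 1
            else:           # common element
                out.append(u[i])
                i += 1
                j += 1
        return out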

All these (I think) can be adapted to any kind of balanced tree, not just
2-3 trees.
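
Here is the promised sketch of the indexing operation, assuming each
internal node additionally stores sizes[i] = the number of records under
children[i] (an invented field; the bookkeeping is as in the problem set):

    def select(node, i):
        # Return the record of rank i (1-based, counting from the left).
        while isinstance(node, Internal):
            for child, size in zip(node.children, node.sizes):
                if i <= size:
                    node = child
                    break
                i -= size
        return node.records[i - 1]

For the Ith largest element, count from the right instead.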

Splice and Split

Splice sets U and V. That is, U and V are each represented by a 2-3
tree, and you happen to know that the largest element in U is smaller
than the smallest element in V.
You want to destructively create the union of U and V.
Assume that nodes are tagged with the smallest value in the first
subtree, in addition to the smallest values in the other subtrees.

Let H = U.height (assume this is recorded at the root);
Let G = V.height;
if (H == G) {
    create a new node W;
    make the roots of U and V the two children of W;
    tag W with U.smallest and V.smallest;
}
if (H > G) {
    starting at the root of U, go down H-G-1 steps to the right;
    call this node P;
    make the root of V the last child of P;
    // this puts the leaves of U and the leaves of V at the same
    // level, in the right order.
    split upward, as necessary;
}
if (H < G) {
    starting at the root of V, go down G-H-1 steps to the left;
    call this node P;
    make the root of U the first child of P;
    split upward, as necessary;
}

Time requirement: O(|H-G|).

Split set U at X. That is, destructively construct the set of all the
elements less than or equal to X.

1. Find the path from the root to X, or to where X would be.
2. Prune all the siblings to the right of that path.
3. Working from bottom to top, delete all nodes with 1 child
(all of these are nodes on the path).
Comment: You now have a collection of two-three trees of increasing values
and strictly decreasing heights.
4. Splice them together one at a time.

If the heights of the trees are H1 > H2 > H3 > ... > Hk, then the time
for all the splices is
(H1-H2) + (H2-H3) + ... + (H{k-1}-Hk) = H1-Hk.

So the entire split algorithm runs in time proportional to the height of the
original tree, O(log(N)).