7.1 Random Binary Search Trees

Consider the two binary search trees shown in Figure 7.1, each of which
has $n=15$ nodes. The one on the left is a list and the other is a
perfectly balanced binary search tree. The one on the left has height
$n-1=14$ and the one on the right has height three.

Figure 7.1:
Two binary search trees containing the integers $0,\ldots,14$.

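Imagine how these two trees could have been constructed. The one on
the left occurs if we start with an empty BinarySearchTree and add
the sequence

\[ \langle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14 \rangle . \]

No other sequence of additions will create this tree (as you can prove
by induction on $n$). On the other hand, the tree on the right can be
created by the sequence

\[ \langle 7,3,11,1,5,9,13,0,2,4,6,8,10,12,14 \rangle . \]

Other sequences work as well, including

\[ \langle 7,3,1,5,0,2,4,6,11,9,13,8,10,12,14 \rangle \]

and

\[ \langle 7,3,1,11,5,0,2,4,6,9,13,8,10,12,14 \rangle . \]

In fact, there are 21,964,800 addition sequences that generate the
tree on the right and only one that generates the tree on the left.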
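
The number of addition sequences that generate a given tree can be checked with a short computation. The sketch below is not from the text: it relies on the standard counting fact that a tree with $n$ nodes is produced by exactly $n!$ divided by the product of all its subtree sizes addition sequences (each node must be added before everything in its subtree). The names `Node`, `build`, and `count_addition_sequences` are illustrative assumptions, not the book's BinarySearchTree interface.

```python
from math import factorial


class Node:
    """A plain binary search tree node (illustrative, not the book's class)."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def build(sequence):
    """Add the keys of `sequence`, in order, to an initially empty BST."""
    root = None
    for key in sequence:
        if root is None:
            root = Node(key)
            continue
        u = root
        while True:
            if key < u.key:
                if u.left is None:
                    u.left = Node(key)
                    break
                u = u.left
            else:
                if u.right is None:
                    u.right = Node(key)
                    break
                u = u.right
    return root


def count_addition_sequences(root):
    """Return n! divided by the product of the sizes of all subtrees."""
    product = 1

    def size(u):
        nonlocal product
        if u is None:
            return 0
        s = 1 + size(u.left) + size(u.right)
        product *= s  # every node contributes the size of its own subtree
        return s

    n = size(root)
    return factorial(n) // product


# The perfectly balanced tree and the list-shaped tree on the integers 0..14.
balanced = build([7, 3, 11, 1, 5, 9, 13, 0, 2, 4, 6, 8, 10, 12, 14])
path = build(range(15))
print(count_addition_sequences(balanced))
print(count_addition_sequences(path))
```

For the list-shaped tree the subtree sizes are $15, 14, \ldots, 1$, so the product is $15!$ and exactly one sequence remains.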

The above example gives some anecdotal evidence that, if we choose a
random permutation of $0,\ldots,14$ and add it into a binary search
tree, then we are more likely to get a very balanced tree (the right
side of Figure 7.1) than we are to get a very unbalanced tree
(the left side of Figure 7.1).

We can formalize this notion by studying random binary search trees.
A random binary search tree of size $n$ is obtained in the
following way: Take a random permutation, $x_0,\ldots,x_{n-1}$,
of the integers $0,\ldots,n-1$ and add its elements, one by one, into a
BinarySearchTree.
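
The construction just described can be sketched in a few lines. This is only an illustration under stated assumptions: the tree is kept in two plain dictionaries (`left` and `right`, mapping a key to its child key) rather than in the book's BinarySearchTree class, `random_bst_height` is a hypothetical helper name, and $n \ge 1$ is assumed.

```python
import random


def random_bst_height(n, rng):
    """Add a uniformly random permutation of 0..n-1 to an empty BST
    and return the height of the resulting tree (assumes n >= 1)."""
    keys = list(range(n))
    rng.shuffle(keys)          # uniformly random permutation
    left, right = {}, {}       # child pointers, indexed by key
    root = keys[0]
    height = 0
    for key in keys[1:]:
        u, depth = root, 0
        while True:
            child = left if key < u else right
            depth += 1
            if u not in child:  # free slot found: attach the new key here
                child[u] = key
                break
            u = child[u]
        height = max(height, depth)
    return height


rng = random.Random(1)         # fixed seed so that runs are repeatable
heights = [random_bst_height(15, rng) for _ in range(1000)]
print(min(heights), sum(heights) / len(heights), max(heights))
```

With $n = 15$ every height must lie between 3 (perfectly balanced) and 14 (a list); the printed average shows where typical random trees fall in that range.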

Note that the values $0,\ldots,n-1$ could be replaced by any ordered
set of $n$ elements without changing any of the properties of the
random binary search tree. The element $x \in \{0,\ldots,n-1\}$ is
simply standing in for the element of rank $x$ in an ordered set of
size $n$.

Before we can present our main result about random binary search trees,
we must take a short digression to discuss a type of number
that comes up frequently when studying randomized structures. For a
non-negative integer, $k$, the $k$-th harmonic number, denoted
$H_k$, is defined as

\[ H_k = 1 + 1/2 + 1/3 + \cdots + 1/k . \]

The harmonic number $H_k$ has no simple closed form, but it is very
closely related to the natural logarithm of $k$. In particular, for $k \ge 1$,

\[ \ln k < H_k \le \ln k + 1 . \]

Readers who have studied calculus might notice that this is because
the integral $\int_1^k (1/x)\,\mathrm{d}x = \ln k$. Keeping in mind that
an integral can be interpreted as the area between a curve and the
$x$-axis, the value of $H_k$ can be lower-bounded by the integral
$\int_1^k (1/x)\,\mathrm{d}x$ and upper-bounded by
$1 + \int_1^k (1/x)\,\mathrm{d}x$. (See Figure 7.2 for a graphical explanation.)
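
The relationship between the harmonic numbers and the natural logarithm is easy to check numerically. The following sketch computes $H_k$ exactly with rational arithmetic and prints it between the two bounds $\ln k$ and $\ln k + 1$; the function name `harmonic` is an illustrative choice, not something defined in the text.

```python
from fractions import Fraction
from math import log


def harmonic(k):
    """Return H_k = 1 + 1/2 + ... + 1/k exactly (H_0 is the empty sum, 0)."""
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


for k in (1, 2, 10, 100, 1000):
    # Columns: k, lower bound ln k, H_k, upper bound ln k + 1
    print(k, log(k), float(harmonic(k)), log(k) + 1)
```

Note that the upper bound is attained only at $k = 1$, where $H_1 = 1 = \ln 1 + 1$.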

Figure 7.2:
The $k$th harmonic number $H_k = \sum_{i=1}^{k} 1/i$
is upper-bounded by $1 + \int_1^k (1/x)\,\mathrm{d}x$
and lower-bounded by $\int_1^k (1/x)\,\mathrm{d}x$.

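Lemma 7.1. In a random binary search tree of size $n$, the following
statements hold:

1. For any $x \in \{0,\ldots,n-1\}$, the expected length of the search
path for $x$ is $H_{x+1} + H_{n-x} - O(1)$.
2. For any $x \in (-1,n) \setminus \{0,\ldots,n-1\}$, the expected length
of the search path for $x$ is $H_{\lceil x\rceil} + H_{n-\lceil x\rceil}$.

We will prove Lemma 7.1 in the next section. For now, consider what
the two parts of Lemma 7.1 tell us. The first part tells us that if
we search for an element in a tree of size $n$, then the expected length
of the search path is at most $2\ln n + O(1)$. The second part tells
us the same thing about searching for a value not stored in the tree.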
When we compare the two parts of the lemma, we see that it is only
slightly faster to search for something that is in a tree compared to
something that is not in a tree.

The key observation needed to prove Lemma 7.1 is the following: The
search path for a value $x$ in the open interval $(-1,n)$
in a random binary search tree, $T$, contains
the node with key $i < x$ if and only if, in the random permutation
used to create $T$, $i$ appears before any of
$\{i+1, i+2, \ldots, \lfloor x\rfloor\}$.

To see this, refer to Figure 7.3 and notice that, until
some value in $\{i, i+1, \ldots, \lfloor x\rfloor\}$ is added, the search
paths for each value in the open interval $(i-1, \lfloor x\rfloor+1)$
are identical. (Remember that for two search values to have
different search paths, there must be some element in the tree that
compares differently with them.) Let $j$ be the first element in
$\{i, i+1, \ldots, \lfloor x\rfloor\}$ to appear in the random permutation.
Notice that $j$ is now and will always be on the search path for $x$.
If $j \neq i$ then the node $u_j$ containing $j$ is created before the
node $u_i$ that contains $i$. Later, when $i$ is added, it will be
added to the subtree rooted at $u_j$.left, since $i < j$. On the other
hand, the search path for $x$ will never visit this subtree because it
will proceed to $u_j$.right after visiting $u_j$.

Figure 7.3:
The value $i < x$ is on the search path for $x$ if and only
if $i$ is the first element among $\{i, i+1, \ldots, \lfloor x\rfloor\}$
added to the tree.

Similarly, for $i > x$, $i$ appears in the search path for $x$
if and only if $i$ appears before any of
$\{\lceil x\rceil, \lceil x\rceil+1, \ldots, i-1\}$ in the random
permutation used to create $T$.
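
The observation can be verified exhaustively for small trees. The sketch below is an illustration, not part of the original argument: it builds the tree for every permutation of $\{0,\ldots,n-1\}$ and, for integer and half-integer search values $x \in (-1,n)$, checks that a key $i \neq x$ lies on the search path for $x$ exactly when $i$ is the first key of its range ($\{i,\ldots,\lfloor x\rfloor\}$ for $i < x$, $\{\lceil x\rceil,\ldots,i\}$ for $i > x$) to appear in the permutation. The helper names and the dictionary-based tree representation are assumptions made for brevity.

```python
from itertools import permutations
from math import ceil, floor


def search_path(perm, x):
    """Keys visited when searching for x in the BST built by adding
    the keys of perm in order."""
    left, right = {}, {}
    root = perm[0]
    for key in perm[1:]:
        u = root
        while True:
            child = left if key < u else right
            if u not in child:
                child[u] = key
                break
            u = child[u]
    path, u = [], root
    while u is not None:
        path.append(u)
        if x == u:
            break
        u = (left if x < u else right).get(u)
    return path


def first_of(perm, candidates):
    """The element of `candidates` that occurs earliest in perm."""
    return min(candidates, key=perm.index)


def observation_holds(n):
    """Check the observation for every permutation of 0..n-1."""
    xs = [k / 2 for k in range(-1, 2 * n)]  # -0.5, 0, 0.5, ..., n - 0.5
    for perm in permutations(range(n)):
        for x in xs:
            on_path = set(search_path(perm, x))
            for i in range(n):
                if i < x:
                    expected = first_of(perm, range(i, floor(x) + 1)) == i
                elif i > x:
                    expected = first_of(perm, range(ceil(x), i + 1)) == i
                else:
                    continue  # i == x: the observation says nothing here
                if (i in on_path) != expected:
                    return False
    return True


print(observation_holds(5))
```

An exhaustive check like this is only feasible for very small $n$, since it enumerates all $n!$ permutations.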

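Notice that, if we start with a random permutation of
$\{0,\ldots,n-1\}$,
then the subsequences containing only
$\{i, i+1, \ldots, \lfloor x\rfloor\}$ and
$\{\lceil x\rceil, \lceil x\rceil+1, \ldots, i\}$ are also random
permutations of their respective elements. Each element, then, in the
subsets $\{i, i+1, \ldots, \lfloor x\rfloor\}$ and
$\{\lceil x\rceil, \lceil x\rceil+1, \ldots, i\}$ is equally likely to
appear before any other in its subset in the random permutation used to
create $T$. So we have

\[
\Pr\{\text{$i$ is on the search path for $x$}\} =
\begin{cases}
1/(\lfloor x\rfloor - i + 1) & \text{if } i < x \\
1/(i - \lceil x\rceil + 1) & \text{if } i > x .
\end{cases}
\]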

With this observation, the proof of Lemma 7.1
involves some simple calculations with harmonic numbers:

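Proof of Lemma 7.1.
Let $I_i$ be the indicator random variable that is equal to one when $i$
appears on the search path for $x$ and zero otherwise. Then the length
of the search path is given by

\[ \sum_{i \in \{0,\ldots,n-1\} \setminus \{x\}} I_i \]

so, if $x \in \{0,\ldots,n-1\}$, the expected length of the search
path is given by (see Figure 7.4.a)

\[
\begin{aligned}
\mathrm{E}\left[\sum_{i=0}^{x-1} I_i + \sum_{i=x+1}^{n-1} I_i\right]
 &= \sum_{i=0}^{x-1} \mathrm{E}[I_i] + \sum_{i=x+1}^{n-1} \mathrm{E}[I_i] \\
 &= \sum_{i=0}^{x-1} \frac{1}{\lfloor x\rfloor - i + 1}
  + \sum_{i=x+1}^{n-1} \frac{1}{i - \lceil x\rceil + 1} \\
 &= \sum_{i=0}^{x-1} \frac{1}{x - i + 1}
  + \sum_{i=x+1}^{n-1} \frac{1}{i - x + 1} \\
 &= \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{x+1}
  + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n-x} \\
 &= H_{x+1} + H_{n-x} - 2 .
\end{aligned}
\]

The corresponding calculations for a search value
$x \in (-1,n) \setminus \{0,\ldots,n-1\}$ are almost identical (see
Figure 7.4.b).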

Figure 7.4:
The probabilities of an element being on the search path for $x$
when (a) $x$ is an integer and (b) when $x$ is not an integer.
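
The calculation in the proof of Lemma 7.1 can be cross-checked by brute force. The sketch below, an illustration rather than anything from the text, averages the search path length over all $n!$ permutations for a small $n$, using exact rational arithmetic, and compares the result with the closed forms that the sums of probabilities evaluate to: $H_{x+1} + H_{n-x} - 2$ when $x$ is stored in the tree and $H_{\lceil x\rceil} + H_{n-\lceil x\rceil}$ when it is not. As in the proof, the node containing $x$ itself is not counted. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import ceil


def harmonic(k):
    """Return H_k exactly (H_0 is the empty sum, 0)."""
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def path_length(perm, x):
    """Number of nodes with key != x visited when searching for x in
    the BST built by adding the keys of perm in order."""
    left, right = {}, {}
    root = perm[0]
    for key in perm[1:]:
        u = root
        while True:
            child = left if key < u else right
            if u not in child:
                child[u] = key
                break
            u = child[u]
    length, u = 0, root
    while u is not None and u != x:
        length += 1
        u = (left if x < u else right).get(u)
    return length


def expected_length(n, x):
    """Exact expectation over all n! equally likely permutations."""
    perms = list(permutations(range(n)))
    return Fraction(sum(path_length(p, x) for p in perms), len(perms))


n = 6
for x in range(n):  # values stored in the tree
    print(x, expected_length(n, x), harmonic(x + 1) + harmonic(n - x) - 2)
for k in range(-1, n):  # values strictly between stored keys
    x = k + 0.5
    c = ceil(x)
    print(x, expected_length(n, x), harmonic(c) + harmonic(n - c))
```

Each printed line shows the search value followed by two fractions, the enumerated expectation and the closed form, which should agree.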