Red-black tree (C)

A red-black tree is a type of self-balancing binary search tree typically used to implement associative arrays. It has O(log n) worst-case time for each operation and is quite efficient in practice. Unfortunately, it's also quite complex to implement, requiring a number of subtle cases for both insertion and deletion.

This article walks through a C implementation of red-black trees, organized in a way to make correctness and completeness easier to understand.


A red-black tree is a type of binary search tree, so each node in the tree has a parent (except the root node) and at most two children. The tree as a whole will be identified by its root node. For ease of implementation, we will have each node retain a pointer to both its children as well as its parent node (NULL for the root). Keeping parent pointers costs space and is not strictly necessary, but makes it easy to follow the tree in any direction without maintaining an auxiliary stack. Our red-black tree will implement an associative array, so we will store both the key and its associated value, each as a void*:

Each node also stores its color, either red or black, using an enumeration. The role of the color bit will be explained in the properties. There is some internal fragmentation due to using an integer type to store a single bit, but we avoid optimizing this here for simplicity. In the source file we will typedef abbreviated names for our types:
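As a rough sketch of the declarations described above: the text confirms only that a node holds a key, a value, child and parent pointers and an enumerated color, and that node is an abbreviated pointer type. The struct tags, field names and the new_node() helper below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* The color is an enumeration, so it occupies a whole int rather than one bit. */
enum rbtree_node_color { RED, BLACK };

/* Assumed node layout: a key/value pair, three links and a color. */
typedef struct rbtree_node_t {
    void *key;
    void *value;
    struct rbtree_node_t *left;
    struct rbtree_node_t *right;
    struct rbtree_node_t *parent;   /* NULL for the root */
    enum rbtree_node_color color;
} *rbtree_node;

/* The tree is identified by its root; a NULL root means an empty tree. */
typedef struct rbtree_t {
    rbtree_node root;
} *rbtree;

/* Abbreviated names for use inside the source file. */
typedef rbtree_node node;
typedef enum rbtree_node_color color;

/* Hypothetical helper: allocate a node that is not yet linked into a tree. */
static node new_node(void *key, void *value, color c) {
    node n = malloc(sizeof *n);
    assert(n != NULL);
    n->key = key;
    n->value = value;
    n->color = c;
    n->left = n->right = n->parent = NULL;
    return n;
}
```

Because node and rbtree are pointer typedefs, functions can take and return them by value without any explicit asterisks, which keeps the later case analysis compact.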

The uncle of a node is defined as the sibling of its parent. The uncle may also be NULL, if the grandparent has only one child.

<<private function prototypes>>=
static node uncle(node n);

<<node relationships>>=
node uncle(node n) {
    assert(n != NULL);
    assert(n->parent != NULL); /* Root node has no uncle */
    assert(n->parent->parent != NULL); /* Children of root have no uncle */
    return sibling(n->parent);
}
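The uncle() function above depends on sibling(), whose definition does not appear here. A plausible sketch follows, assuming child fields named left and right (the field names and the reduced node type are assumptions); uncle() is repeated only so that the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced node type for illustration: only the links matter here. */
typedef struct rbtree_node_t {
    struct rbtree_node_t *left, *right, *parent;
} *node;

/* The sibling is the parent's other child; it may be NULL. */
static node sibling(node n) {
    assert(n != NULL);
    assert(n->parent != NULL); /* Root node has no sibling */
    if (n == n->parent->left)
        return n->parent->right;
    else
        return n->parent->left;
}

/* Same logic as the chunk above: the uncle is the parent's sibling. */
static node uncle(node n) {
    assert(n != NULL);
    assert(n->parent != NULL);
    assert(n->parent->parent != NULL);
    return sibling(n->parent);
}
```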

Although in older C environments we might define these as macros for efficiency, we prefer here to rely on function inlining support in the compiler to keep the code simple. The use of assert() requires a header:

We will at all times enforce the following five properties, which provide a theoretical guarantee that the tree remains balanced. We will have a helper function verify_properties() that asserts all five properties in a debug build, to help verify the correctness of our implementation and formally demonstrate their meaning. Note that many of these tests walk the whole tree, making them very expensive; for this reason we require the symbol VERIFY_RBTREE to be defined to turn them on.

As shown, the tree terminates in NIL leaves, which we represent using NULL (we set the child pointers of their parents to NULL). In an empty tree, the root pointer is NULL. This saves substantial space compared to explicit representation of leaves.

3. All leaves (shown as NIL in the above diagram) are black and contain no data. Since we represent these empty leaves using NULL, this property is implicitly assured by always treating NULL as black. To this end we create a node_color() helper function:
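A minimal sketch of such a helper, assuming the color is stored in a field named color (the reduced node type is an assumption for illustration):

```c
#include <assert.h>
#include <stddef.h>

enum rbtree_node_color { RED, BLACK };

/* Reduced node type: only the color matters for this helper. */
typedef struct rbtree_node_t {
    enum rbtree_node_color color;
} *node;

typedef enum rbtree_node_color color;

/* NULL stands for an empty leaf, and empty leaves always count as black. */
static color node_color(node n) {
    return n == NULL ? BLACK : n->color;
}
```

Calling node_color(n) instead of reading n->color directly means that code inspecting a child, sibling or uncle never has to test for NULL first.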

5. All paths from any given node to its leaf nodes contain the same number of black nodes. This one is the trickiest to verify; we do it by traversing the tree, incrementing a black node count as we go. The first time we reach a leaf we save the count. When we subsequently reach other leaves, we compare the count to this saved count.
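The traversal just described might be sketched as follows. Unlike the assert-based check in the article, this variant returns a flag so that it can be exercised on deliberately broken trees; the function names, the reduced node type and the use of -1 as the "no count saved yet" marker are assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum rbtree_node_color { RED, BLACK };

typedef struct rbtree_node_t {
    struct rbtree_node_t *left, *right;
    enum rbtree_node_color color;
} *node;

static enum rbtree_node_color node_color(node n) {
    return n == NULL ? BLACK : n->color;
}

/* Walk every root-to-leaf path, carrying the black count seen so far.
   *saved holds the count of the first leaf reached, or -1 before that. */
static int check_black_count(node n, int black_count, int *saved) {
    assert(saved != NULL);
    if (node_color(n) == BLACK)
        black_count++;
    if (n == NULL) {                 /* reached a NIL leaf */
        if (*saved == -1) {
            *saved = black_count;    /* first leaf: remember its count */
            return 1;
        }
        return black_count == *saved;
    }
    return check_black_count(n->left, black_count, saved)
        && check_black_count(n->right, black_count, saved);
}

/* Returns 1 if all paths from root to its leaves have equal black counts. */
static int has_equal_black_counts(node root) {
    int saved = -1;
    return check_black_count(root, 0, &saved);
}
```

A black root with a single red child passes the check, while repainting that child black makes the paths through it one black node longer than the path through the missing child, so the check fails.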

Read-only operations on a red-black tree, such as searching for a key and getting the corresponding value, require no modification from those used for binary search trees, because every red-black tree is a specialization of a simple binary search tree.

We begin by creating a helper function that gets a pointer to the node with a given key. If the key is not found, it returns NULL. This will be useful later for deletion:

The client must pass in a comparison function to compare the keys, which have unknown type. Like the comparison function passed to the library function qsort(), it returns a negative value, zero, or a positive value, depending on whether the left value is less than, equal to, or greater than the right value, respectively.

Now looking up a value is straightforward, by finding the node and extracting the data if lookup succeeded. We return NULL if the key was not found (implying that NULL cannot be used as a value unless all lookups are expected to succeed).
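The search described above can be sketched as an ordinary binary search tree descent. The name lookup_node() comes from the text, but its signature, the compare_func typedef, the field names and the compare_int() example comparator (for keys that point to int) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct rbtree_node_t {
    void *key, *value;
    struct rbtree_node_t *left, *right;
} *node;

typedef struct rbtree_t { node root; } *rbtree;

/* qsort()-style comparison: negative, zero or positive result. */
typedef int (*compare_func)(void *left, void *right);

/* Example comparator, assuming both keys point to int. */
static int compare_int(void *left, void *right) {
    int l = *(int *)left, r = *(int *)right;
    return (l > r) - (l < r);   /* avoids overflow of l - r */
}

/* Returns the node holding key, or NULL if the key is absent. */
static node lookup_node(rbtree t, void *key, compare_func compare) {
    assert(t != NULL);
    node n = t->root;
    while (n != NULL) {
        int comp = compare(key, n->key);
        if (comp == 0)
            return n;
        n = (comp < 0) ? n->left : n->right;
    }
    return NULL;
}

/* Returns the value stored under key, or NULL if the key is absent. */
static void *lookup(rbtree t, void *key, compare_func compare) {
    node n = lookup_node(t, key, compare);
    return n == NULL ? NULL : n->value;
}
```

The colors play no part in the search; they matter only in that they keep the path followed by the loop logarithmic in length.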

Both insertion and deletion rely on a fundamental operation for reducing tree height called a rotation. A rotation locally changes the structure of the tree without changing the in-order sequence of the values that it stores.

We create two helper functions, one to perform a left rotation and one to perform a right rotation; each takes the highest node in the subtree as an argument:

Here, replace_node() is a helper function that cuts a node away from its parent, substituting a new node (or NULL) in its place. It simplifies consistent updating of parent and child pointers. It needs the tree passed in because it may change which node is the root.
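One way these helpers could look is sketched below. The name replace_node() and the convention of passing the highest node come from the text; the names rotate_left() and rotate_right(), the signatures and the field names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct rbtree_node_t {
    struct rbtree_node_t *left, *right, *parent;
} *node;

typedef struct rbtree_t { node root; } *rbtree;

/* Put newn (possibly NULL) where oldn hangs from its parent.
   If oldn is the root, the tree's root pointer is updated instead. */
static void replace_node(rbtree t, node oldn, node newn) {
    if (oldn->parent == NULL)
        t->root = newn;
    else if (oldn == oldn->parent->left)
        oldn->parent->left = newn;
    else
        oldn->parent->right = newn;
    if (newn != NULL)
        newn->parent = oldn->parent;
}

/* n's right child r moves up; n becomes r's left child. */
static void rotate_left(rbtree t, node n) {
    node r = n->right;
    assert(r != NULL);            /* cannot rotate without a right child */
    replace_node(t, n, r);
    n->right = r->left;           /* r's left subtree sits between n and r */
    if (r->left != NULL)
        r->left->parent = n;
    r->left = n;
    n->parent = r;
}

/* Mirror image: n's left child l moves up; n becomes l's right child. */
static void rotate_right(rbtree t, node n) {
    node l = n->left;
    assert(l != NULL);            /* cannot rotate without a left child */
    replace_node(t, n, l);
    n->left = l->right;
    if (l->right != NULL)
        l->right->parent = n;
    l->right = n;
    n->parent = l;
}
```

The two rotations are inverses: rotating left about a node and then right about the node that took its place restores the original shape.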

When inserting a new value, we first insert it into the tree as we would into an ordinary binary search tree. If the key already exists, we just replace the value (since we're implementing an associative array). Otherwise, we find the place in the tree where the new pair belongs, then attach a newly created red node containing the value:
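The ordinary binary search tree step can be sketched on its own as follows. This fragment deliberately omits the rebalancing call, so by itself it does not produce a valid red-black tree; the function name insert_unbalanced(), its return convention, the field names and the compare_int() comparator are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

enum rbtree_node_color { RED, BLACK };

typedef struct rbtree_node_t {
    void *key, *value;
    struct rbtree_node_t *left, *right, *parent;
    enum rbtree_node_color color;
} *node;

typedef struct rbtree_t { node root; } *rbtree;
typedef int (*compare_func)(void *left, void *right);

/* Example comparator, assuming both keys point to int. */
static int compare_int(void *left, void *right) {
    int l = *(int *)left, r = *(int *)right;
    return (l > r) - (l < r);
}

/* Returns the newly attached red node, or NULL if the key already
   existed and only its value was replaced. */
static node insert_unbalanced(rbtree t, void *key, void *value,
                              compare_func compare) {
    node parent = NULL;
    node *link = &t->root;        /* the pointer we may end up filling in */
    while (*link != NULL) {
        int comp = compare(key, (*link)->key);
        if (comp == 0) {
            (*link)->value = value;   /* associative array: overwrite */
            return NULL;
        }
        parent = *link;
        link = (comp < 0) ? &parent->left : &parent->right;
    }
    node n = malloc(sizeof *n);
    assert(n != NULL);
    n->key = key;
    n->value = value;
    n->left = n->right = NULL;
    n->parent = parent;
    n->color = RED;               /* new nodes always start out red */
    *link = n;
    return n;                     /* the caller would now rebalance from n */
}
```

Starting the new node as red leaves the black count of every path unchanged, so the only damage that can result is a red node with a red parent, or a red root.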

The problem is that the resulting tree may not satisfy our five red-black tree properties. The call to insert_case1() above begins the process of correcting the tree so that it satisfies the properties once more.

Case 1: In this case, the new node is now the root node of the tree. Since the root node must be black, and changing its color adds the same number of black nodes to every path, we simply recolor it black. Because only the root node has no parent, we can assume henceforth that the node has a parent.

Case 3: In this case, the uncle node is red. We recolor the parent and uncle black and the grandparent red. However, the red grandparent node may now violate the red-black tree properties; we recursively invoke this procedure on it from case 1 to deal with this.
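The recoloring cases can be sketched in isolation. This is a partial illustration only: it assumes the standard case 2 (a black parent means nothing needs fixing), leaves out the rotation cases entirely, and drops the tree argument because no rotation occurs here. The name insert_case1() comes from the text; the other function names, signatures and field names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum rbtree_node_color { RED, BLACK };

typedef struct rbtree_node_t {
    struct rbtree_node_t *left, *right, *parent;
    enum rbtree_node_color color;
} *node;

static enum rbtree_node_color node_color(node n) {
    return n == NULL ? BLACK : n->color;
}

static node grandparent(node n) {
    assert(n != NULL && n->parent != NULL && n->parent->parent != NULL);
    return n->parent->parent;
}

static node uncle(node n) {
    node g = grandparent(n);
    return (n->parent == g->left) ? g->right : g->left;
}

static void insert_case1(node n);
static void insert_case2(node n);
static void insert_case3(node n);

/* Case 1: n is the root, so paint it black. */
static void insert_case1(node n) {
    if (n->parent == NULL)
        n->color = BLACK;
    else
        insert_case2(n);
}

/* Case 2 (assumed): a black parent cannot be violated by a red child. */
static void insert_case2(node n) {
    if (node_color(n->parent) == BLACK)
        return;
    insert_case3(n);
}

/* Case 3: red parent and red uncle. Push the blackness down from the
   grandparent, then re-examine the grandparent starting at case 1. */
static void insert_case3(node n) {
    if (node_color(uncle(n)) == RED) {
        n->parent->color = BLACK;
        uncle(n)->color = BLACK;
        grandparent(n)->color = RED;
        insert_case1(grandparent(n));
    }
    /* A black uncle requires rotations, which this sketch does not cover. */
}
```

Each recursive call moves two levels up the tree, so the recoloring visits at most about half as many nodes as the tree is tall.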

We begin by finding the node to be deleted with lookup_node() and deleting it precisely as we would in a binary search tree. There are two cases for removal, depending on whether the node to be deleted has at most one non-leaf child or two. A node with at most one non-leaf child can simply be replaced with that child. When deleting a node with two non-leaf children, we copy the key and value from the in-order predecessor (the maximum or rightmost element in the left subtree) into the node to be deleted, and then delete the predecessor node instead, which has at most one non-leaf child. This same procedure also works in a red-black tree without affecting any properties.
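Finding the in-order predecessor amounts to walking right as far as possible from the left child. A small sketch, with the helper name maximum_node(), the int key and the field names all assumed for illustration:

```c
#include <assert.h>
#include <stddef.h>

typedef struct rbtree_node_t {
    int key;
    struct rbtree_node_t *left, *right;
} *node;

/* Rightmost node of the subtree rooted at n. Because it has no right
   child, it has at most one non-leaf child (its left one). */
static node maximum_node(node n) {
    assert(n != NULL);
    while (n->right != NULL)
        n = n->right;
    return n;
}
```

For a node n with two children, maximum_node(n->left) is the predecessor whose contents are copied into n before the predecessor itself is unlinked.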

However, before deleting the node, we must ensure that doing so does not violate the red-black tree properties. If the node we delete is black, and we cannot change its child from red to black to compensate, then we would have one less black node on every path through the child node. We must adjust the tree around the node being deleted to compensate.

Case 1: In this case, N has become the root node. The deletion removed one black node from every path, so no properties are violated.

Case 2: N has a red sibling. In this case we exchange the colors of the parent and sibling, then rotate about the parent so that the sibling becomes the parent of its former parent. This does not restore the tree properties, but reduces the problem to one of the remaining cases.

Case 3: In this case N's parent, sibling, and sibling's children are all black. We paint the sibling red. Now all paths passing through N's parent have one less black node than before the deletion, so we must recursively run this procedure from case 1 on N's parent.

As we went along, we have separated the public interface (to be used by the client) from the private (static) helper function prototypes. We place the type definitions and public interface in the header file and the helper function prototypes and function definitions in the source file:

To ensure that all the cases of the complex insert and delete operations are exercised, we will perform a large number of operations on some simple integer data. All properties are verified after each operation, providing strong evidence of correctness.

The TRACE preprocessor constant allows us to turn on code printing the tree after each operation. This allows for a sanity check that the tree looks as we expect, as well as providing a way to visualize the results of the operation. We print the right subtree before the left subtree so that the tree is displayed sideways.
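The sideways printing could be sketched as below, assuming integer keys as in the test data. The function names, the FILE* parameter, the indentation step, the angle brackets marking red nodes and the "<empty tree>" message are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define INDENT_STEP 4   /* assumed number of spaces per tree level */

enum rbtree_node_color { RED, BLACK };

typedef struct rbtree_node_t {
    int key;
    enum rbtree_node_color color;
    struct rbtree_node_t *left, *right;
} *node;

/* Reverse in-order walk: right subtree, then the node, then left subtree.
   Deeper nodes are indented further, so the root ends up at the far left. */
static void print_tree_helper(FILE *out, node n, int indent) {
    if (n == NULL)
        return;
    print_tree_helper(out, n->right, indent + INDENT_STEP);
    fprintf(out, "%*s", indent, "");      /* emit indent spaces */
    if (n->color == BLACK)
        fprintf(out, "%d\n", n->key);
    else
        fprintf(out, "<%d>\n", n->key);   /* red nodes in angle brackets */
    print_tree_helper(out, n->left, indent + INDENT_STEP);
}

static void print_tree(FILE *out, node root) {
    assert(out != NULL);
    if (root == NULL)
        fprintf(out, "<empty tree>\n");
    else
        print_tree_helper(out, root, 0);
}
```

Turning the output a quarter turn clockwise gives the usual top-down picture, with larger keys on the right.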

With property verification and tracing turned off, this code performs quite efficiently. When compiled with gcc with the -O3 flag on a 2.4 GHz CPU running Gentoo Linux, it was able to perform 500,000 consecutive insertions followed by 600,000 consecutive deletions in only 2.1 seconds. Its primary limitation is the same as that of the C library function qsort(): it must make frequent calls to the comparison function through a function pointer, incurring function call overhead. Red-black tree (C Plus Plus) overcomes this by using templates which, like std::sort, allow the comparison function to be inlined.