Friday, April 20, 2012

Red-Black Trees

This is a tutorial on coding red-black trees. I have tried to make it as easy to understand and detailed as possible. It covers both insertion and deletion from a red-black tree data structure. If you have any questions or any suggestions for improvement for future tutorials, please feel free to contact me at computerghost@hotmail.com or leave a comment. I don't check my email but about once a month, so don't ask me homework problems or questions that need an answer immediately.

What is a red-black tree?

A red-black tree is a type of binary search tree that aims to solve a problem that binary search trees (abbreviated BST) seem to have by using some... um... rather complicated algorithms.

So, what are these binary search trees and what problem do they have? A binary search tree is a data structure that stores data in a way that makes it very fast to work with. You can search for data quickly, and you can insert and delete data quickly. The term for measuring this quickliness is "asymptotic complexity", "big-O", "big-theta", et cetera et cetera. As long as it stays balanced, a BST does each of these operations in logarithmic time. You don't need to know what this means to use them other than that "logarithmic time" is rather fast, but I promise that it would be some interesting reading if you do want to look those terms up.

A BST is fast because of the way it stores its data. Say you want to find the 7 in the above BST (remember, BST is short for binary search tree, and I hate typing out that long name). You start at the 5, but 7 is greater than 5, so you go to the right. Then you're at the 8, but 7 is less than 8, so you go left and find it! You visit a total of three nodes (out of the seven in the tree) to find a value. It may not seem like much, but if you had two million values in a balanced tree, you would still only have to search through at most twenty-one to find one.
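Here is a minimal sketch of that walk down the tree. The bst_node layout, the bst_find name, and the visited counter are assumptions made for illustration only; they are not the rb_tree code built later in this tutorial.

```cpp
#include <cstddef>

// Hypothetical node layout for illustration; a red-black node would also
// carry a color and a parent pointer.
struct bst_node {
    int data;
    bst_node* left;
    bst_node* right;
};

// Walks down from root looking for value. Returns the node holding it, or
// nullptr on a dead end. If visited is non-null, it receives the number of
// nodes that were looked at along the way.
bst_node* bst_find(bst_node* root, int value, std::size_t* visited = nullptr) {
    std::size_t count = 0;
    bst_node* current = root;
    while (current != nullptr) {
        ++count;
        if (value < current->data)      current = current->left;   // go left
        else if (value > current->data) current = current->right;  // go right
        else break;                                                // found it
    }
    if (visited != nullptr) *visited = count;
    return current;
}
```

With a seven-node tree rooted at 5 whose right child is 8 and whose 8 has 7 on its left, looking up 7 touches exactly three nodes: 5, 8, and 7.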

Unfortunately, binary search trees have a major problem sometimes. Because of the way they insert things (exactly like finding them), inserting elements 1-2000000 in ascending order messes up the tree. In such a case, it would take two million steps to find the number two million. Trees like this are called "skewed". The first picture of the tree was what is called "balanced".
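To see the skewing in numbers, here is a small sketch of plain, unbalanced insertion. The names bst_insert and levels_if_balanced are hypothetical helpers written for this illustration. The first reports how deep each new node lands, and the second reports how many levels a perfectly packed tree of the same size would need.

```cpp
#include <cstddef>

struct bst_node {
    int data;
    bst_node* left;
    bst_node* right;
    explicit bst_node(int d) : data(d), left(nullptr), right(nullptr) {}
};

// Plain (unbalanced) BST insertion. Returns how many nodes lie on the path
// from the root down to the new node, which is what a later find will cost.
// Duplicates are ignored and report 0.
std::size_t bst_insert(bst_node*& root, int value) {
    bst_node** link = &root;
    std::size_t depth = 1;
    while (*link != nullptr) {
        if (value < (*link)->data)      link = &(*link)->left;
        else if (value > (*link)->data) link = &(*link)->right;
        else return 0;
        ++depth;
    }
    *link = new bst_node(value);
    return depth;
}

// Fewest levels that can hold count nodes: floor(log2(count)) + 1.
std::size_t levels_if_balanced(std::size_t count) {
    std::size_t levels = 0;
    while (count > 0) {
        ++levels;
        count /= 2;
    }
    return levels;
}
```

Inserting 1 through 1000 in ascending order puts the last node 1000 steps from the root, while levels_if_balanced(2000000) works out to 21.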

Balanced trees are better than skewed trees. Unfortunately, unless you're putting data into a BST in random order, you're likely to make it skewed.

Red-black trees solve the skewing problem of BSTs by making sure that the tree is still balanced after every insertion or deletion of data. This extra work that it does after these operations may sound like a bad idea at first, but the work pays off in the long run. The cost of doing the extra work looks like nothing when compared to the speed increase that comes with keeping the tree balanced.

The red-black tree has a set of rules for keeping the tree balanced. First, a node can be either red or black. Second, the root is black. Third, all leaves are the same color as the root. Fourth, both children of every red node must be black. Fifth, every path from a node straight down to any of its descendant leaves has to have the same number of black nodes. (Wikipedia1)

The fourth and fifth rules are the ones that we have to focus on mainly. If you notice in the balanced BST, the height on both sides of any node is the same. The red-black tree forces this setup by rule five. That is why it works.

Unfortunately, keeping the tree perfectly balanced is a real pain and requires too much computing power, and that's where the red nodes come in. The red nodes give us a little bit of leeway in how close to perfectly balanced the tree is. There is rule four, though, that keeps us from abusing this leeway too much.

In this tutorial, you will see many things!

First, I am pretending that all null pointers in nodes actually point to an imaginary node that has no value but is black. Those imaginary nodes are the leaves. This allows us to do things like what is in the picture below.
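One way to get the "null pointers are black leaves" behavior is a tiny color lookup, which this tutorial later refers to as _node_color. The sketch below is an assumption about its shape: the rb_node layout and the color_type enum are made up for illustration, and the leading underscore is dropped because the function is written here as a free function rather than a class member.

```cpp
enum color_type { red, black };

// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    color_type color;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// A null pointer stands for one of the imaginary black leaves, so asking
// for its color is safe and always answers black.
color_type node_color(const rb_node* node) {
    return node == nullptr ? black : node->color;
}
```

With this in place, code can ask for the color of node->left without first checking whether a left child exists.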

Second, you will notice that the examples are in C++. I seriously considered using C instead, but I eventually decided that C++ should be used since I'm better at it. The algorithms should be easily converted to any programming language, so have no worries about the language that I'm using.

Third, private and protected data and functions are prepended by an underscore. This is a red flag that something shouldn't be used outside of the code it's designed to work with.

Fourth, you will find an absence of const functions, templates, and other confusing stuff that takes focus away from the matter at hand.

Lastly, if you look really hard, you might find a grammatical error! Please let me know if you do.

Let us begin with the design.

Our red-black tree will have three useful member functions: find, insert, and erase. Also, we will need a testing function to make sure that we are doing everything correctly. To save space in our cpp file, they will all be coded inline.

The public interface functions will be at the bottom of the class declaration, the helper functions will be above those, and the data will be higher still.

The following snippet of code should do for us to start with (and, no, ending a sentence with a preposition does not count as a grammatical mistake).

You should scan through that and get an idea of how everything will be laid out and used.

Next, let's code the testing code.

Well, we simply want to insert and erase lots of random data, right? We won't worry about the find function since both insert and erase use it and will be testing it pretty thoroughly without our intervention. So, the test_insertion, test_erasing, and main function should look like the following code.

Also, since we are using the rand function, we should include the cstdlib header file. I won't show you that code since it's easy to put in and is only one line.

Next, we need to code rb_tree::check. We need to make sure that the root is black, that the black height on each side of a node is the same, that there are no two red nodes in a row, and that a node's parent points back up to its actual parent.
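Those four checks can be sketched as one recursive walk. This is a minimal illustration under assumptions, not the tutorial's rb_tree::check: the node layout and the helper name check_subtree are hypothetical, and the functions are free functions instead of class members.

```cpp
enum color_type { red, black };

struct rb_node {
    int data;
    color_type color;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Returns the black height of the subtree (a null leaf counts as 1), or -1
// if a rule is broken somewhere at or below node.
int check_subtree(const rb_node* node, const rb_node* expected_parent) {
    if (node == nullptr) return 1;
    if (node->parent != expected_parent) return -1;   // bad parent pointer
    if (node->color == red) {
        bool red_left = node->left != nullptr && node->left->color == red;
        bool red_right = node->right != nullptr && node->right->color == red;
        if (red_left || red_right) return -1;         // two reds in a row
    }
    int left_height = check_subtree(node->left, node);
    int right_height = check_subtree(node->right, node);
    if (left_height < 0 || left_height != right_height) return -1;
    return left_height + (node->color == black ? 1 : 0);
}

// True if the whole tree follows the rules that are checked here.
bool check(const rb_node* root) {
    if (root == nullptr) return true;                 // empty tree is fine
    if (root->color != black) return false;           // root must be black
    return check_subtree(root, nullptr) > 0;
}
```

The sketch does not verify that the values are in sorted order; it only covers the four conditions listed above.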

If the value we are looking for is less than the data in the node we are at, then go left; if it is greater, go right. Keep going until we hit a dead-end or we find what we are looking for. That is all that the find function is doing.

Now, we need the destructor coded.

Before we go adding data to the container, we need to code the destructor! It basically clears the entire tree, so why don't we also add a clear interface function just because it's easy? I will use a recursive algorithm for this one because the iterative way is kind of messy. Generally, I hate using recursive algorithms in destructors.
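A recursive clear can be as small as the sketch below. The node layout, the function names, and the returned count of freed nodes are assumptions added for illustration; the count is there only so that the result is easy to test.

```cpp
#include <cstddef>

// Hypothetical node layout; color and parent are left out because clearing
// does not need them.
struct rb_node {
    int data;
    rb_node* left;
    rb_node* right;
};

// Post-order: free both subtrees first, then the node itself.
// Returns how many nodes were freed.
std::size_t clear_subtree(rb_node* node) {
    if (node == nullptr) return 0;
    std::size_t freed = clear_subtree(node->left) + clear_subtree(node->right);
    delete node;
    return freed + 1;
}

// Clears the whole tree and resets root so the tree can be reused.
std::size_t clear(rb_node*& root) {
    std::size_t freed = clear_subtree(root);
    root = nullptr;
    return freed;
}
```

Because a red-black tree stays balanced, the recursion only goes about as deep as the logarithm of the number of nodes, so the stack depth stays small.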

What we have so far is the code below. Next, we will start work on the rotation functions and a few other helper functions. After that, we can jump into insertion and then into deletion! Anyways, here is what we have so far:

What are the helper functions that we want? Well, we know that we will need to do tree rotations. And tree rotations are much easier if you have a link to the node (those green lines in the pictures). So, we need the ability to represent those lines, and we need rotation functions.

Let's start with the links, aka the green lines. The implementation is kind of confusing, so we use some functions to deal with that stuff, and all that we need to know is that they are those little green lines! Whenever we set the destination of a link, whatever node the link is coming from now has its little green line pointing to somewhere else.
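One possible way to represent a link is as the address of the pointer that holds a node, so that writing through the link changes the parent's left or right member (or the root pointer) directly. The sketch below assumes exactly that. The link_type typedef and the make_link helper are hypothetical, and link_set_dest is written as a free function without the leading underscore.

```cpp
// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Assumption: a link is the address of whichever pointer leads to a node.
typedef rb_node** link_type;

// The link that currently leads to node: the matching child slot of its
// parent, or the root pointer if node has no parent.
link_type make_link(rb_node*& root, rb_node* node) {
    if (node->parent == nullptr) return &root;
    return node->parent->left == node ? &node->parent->left
                                      : &node->parent->right;
}

// Points the link at dest and keeps dest's parent pointer in step.
// origin is the node that owns the link (nullptr for the root link).
void link_set_dest(link_type link, rb_node* origin, rb_node* dest) {
    *link = dest;
    if (dest != nullptr) dest->parent = origin;
}
```

The appeal of this representation is that the root needs no special case: the root pointer is just another link.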

Now let's do the rotations. We will need clockwise rotations and counterclockwise rotations. Usually, they are called right and left rotations, respectively. I call them clockwise and counterclockwise.

Let's say that we want to rotate the root node counterclockwise. We want to do it, but we also want the nodes to stay in the correct order. Watch. This is how you do it.

And that is how a counterclockwise rotation works! A clockwise rotation is exactly the same idea... just hold the picture up to a mirror. Here is the code for the rotations:
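As a hedged sketch, the two rotations can be written with plain parent, left, and right pointers as shown below. The node layout and the replace_in_parent helper are assumptions made for this illustration, and the functions take the root pointer by reference so that a rotation about the root can update it.

```cpp
#include <cassert>

// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Hangs new_top where old_top used to hang (under its parent or at the root).
void replace_in_parent(rb_node*& root, rb_node* old_top, rb_node* new_top) {
    new_top->parent = old_top->parent;
    if (old_top->parent == nullptr) root = new_top;
    else if (old_top->parent->left == old_top) old_top->parent->left = new_top;
    else old_top->parent->right = new_top;
}

// Left rotation: node's right child rises, node sinks to the left.
void rotate_counterclockwise(rb_node*& root, rb_node* node) {
    rb_node* top = node->right;
    assert(top != nullptr);            // nothing to rotate up otherwise
    node->right = top->left;           // middle subtree changes sides
    if (top->left != nullptr) top->left->parent = node;
    replace_in_parent(root, node, top);
    top->left = node;
    node->parent = top;
}

// Right rotation: the mirror image of the function above.
void rotate_clockwise(rb_node*& root, rb_node* node) {
    rb_node* top = node->left;
    assert(top != nullptr);
    node->left = top->right;
    if (top->right != nullptr) top->right->parent = node;
    replace_in_parent(root, node, top);
    top->right = node;
    node->parent = top;
}
```

A counterclockwise rotation followed by a clockwise rotation about the new top puts the tree back exactly as it was, which makes for a handy sanity test.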

The _get_insert_link function returns a pair (if you're unfamiliar with std::pair, this means that it returns two things) with the first value being the link and the second being the origin of the link (the future parent of the node we're inserting). If the value already exists, then the link it returns is not valid, which can be tested with !where.first. We will need to include the utility header file to use std::pair.

The _link_set_dest function we have already covered. And the _insert_balance function just balances the tree after insertion. We pass the _insert_balance function the node we just created, since that's where we changed the tree and thus where we need to start balancing.

We set the node that we just inserted to red. While this isn't the only option that we have, it makes it easier if we can indeed insert a red node. Red nodes don't affect the black height, so if _insert_balance doesn't detect two reds in a row after this, then we don't even have to balance the tree. If we insert a black node, then we would more likely have to balance the tree afterwards.

The following is the code for the _get_insert_link function. We will cover the other undefined function, _insert_balance, in the next section. So, after this function is coded, we can move on to coding balancing.

The _get_insert_link function works just like the find function, except that, instead of keeping track of the current node as it works its way down, it keeps track of the link to the current node and that node's parent (aka the link's origin).
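A minimal sketch of that search follows, assuming a link is the address of the pointer that leads to a node (the link_type typedef is that assumption, and the node layout is hypothetical). It is written as a free function without the leading underscore.

```cpp
#include <utility>

// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Assumption: a link is the address of whichever pointer leads to a node.
typedef rb_node** link_type;

// Finds the empty link where value belongs. first is that link and second
// is its origin (the future parent). If value is already in the tree,
// first is null, so the result can be tested with !where.first.
std::pair<link_type, rb_node*> get_insert_link(rb_node*& root, int value) {
    link_type link = &root;
    rb_node* origin = nullptr;
    while (*link != nullptr) {
        origin = *link;
        if (value < origin->data)      link = &origin->left;
        else if (value > origin->data) link = &origin->right;
        else return std::make_pair(static_cast<link_type>(nullptr), origin);
    }
    return std::make_pair(link, origin);
}
```

For an empty tree the returned link is the root pointer itself and the origin is null, so inserting the first node needs no special handling.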

Easy balancing after insertion.

Balancing after an insertion is a pain, so it may be a good time for a quick break before continuing. Ah, it's not nearly as bad as erasing, but it can get messy.

We will start with the easiest cases: one, we just inserted a red node and its parent is black; and two, we just inserted into an empty tree.

The picture on the left is if we insert a red node and the parent is black. Since we do follow the rules for red-black trees, we know that the parent of the red node was balanced before we inserted the node. Thus, the right child must be a null leaf or a red node. In any case, the total black height of that node was two, and each of its child subtrees had a black height of one. After inserting the red node on the left, the total black height at the red node's parent is still two, and each of its child subtrees still has a black height of one. Therefore, inserting a red node with a black parent is a breeze! We need not balance.

The picture on the right is if we insert a node at the root. The only rule that we are violating is that the root must be black. Simply change the color to black, and we are done!

If neither of the two pictures above apply, then the red node that we just inserted has a red parent, and we need to take steps to fix the problem.

The far left is the situation that we have. We know that the parent has no other children since it must have been balanced before we inserted its child node. To solve the problem of two reds in a row, we can either change the node we just inserted to black (middle) or the parent to black (right).

I personally like the solution on the right. In either case, we are adding one to the black height and thus must go up the tree correcting the problem, but in the solution on the right, we can start balancing at the parent instead of at the node we just inserted. One step higher! It might make a difference in speed.

We will save the complicated "go up the tree fixing the black height problem" for a separate function. For the easy cases and for preparing to go up the tree balancing, this is our _insert_balance function:
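The easy cases can be sketched as follows. To keep the example self-contained, this version does not call the harder balancing function. Instead, as an assumption made only for this illustration, it returns the node at which that harder work would have to start, or nullptr when the tree is already fine. The node layout is hypothetical too.

```cpp
enum color_type { red, black };

// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    color_type color;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Easy cases after inserting the red node `node`. Returns nullptr when no
// more work is needed, otherwise the node at which the harder,
// go-up-the-tree balancing has to start.
rb_node* insert_balance(rb_node* node) {
    rb_node* parent = node->parent;
    if (parent == nullptr) {        // we inserted at the root
        node->color = black;
        return nullptr;
    }
    if (parent->color == black)     // red under black breaks no rule
        return nullptr;
    parent->color = black;          // two reds in a row: blacken the parent
    return parent;                  // black height grew here; start there
}
```

A red parent can never be the root, so in the last case the returned node always has a parent of its own to balance against.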

There is no super simple way to solve this problem that I can think of, so we need to make it easier by taking it one step at a time. We need to consider every possibility that makes sense.

We will first consider the possibilities when first calling the _insert_harder_balancing function. Also, get your gedit, notepad, or whatever you use handy. If we run into a place where we need to traverse on up the tree, we need to make a note of any new possibilities that we introduce.

When first calling the function, the node we passed is black and its parent is black (node was red in _insert_balance before it was changed to black, so node's parent must be black). Let's look at this.

The red node with the yellow dot is the node that we just inserted. The blue numbers are the black heights at each node. The red node's parent is what we passed to our function (henceforth known as "node"). It used to have a black height of 1 since it used to be red, so its parent (henceforth known as "parent") used to have a black height of 2. If parent used to have a black height of 2 and is black, then node's sibling (henceforth, "sibling") has a black height of 1. The tree is unbalanced.

Since sibling has a black height of 1, then it is either red or a null leaf. This makes things much simpler! Let's look at these two possibilities.

Initial case 1: node is black, parent is black, sibling is red.

Since sibling has a black height of 1, if it is red, then it can only have null leaves as children. Now, how can we do color changes or rotations to solve the black imbalance at parent?

What if we set parent to red and sibling to black? That balances the tree (each side has a black height of 2), and the total black height is 2. As long as parent's parent is not red (it is either black or nonexistent (parent is root)), we're done! But if parent's parent is red, then we have a problem.

Well, the solution is to either set parent (um, the second red node from the left) to black or to set parent's parent (the top node) to black. In either case, we'll have to traverse up the tree balancing, but if we set parent's parent to black, then we can jump up the tree further before we balance.

Let's look at what the tree looks like after we jump on up the tree. node is now parent->parent, parent is its parent, and sibling is the new node's parent. It sounds confusing, but we actually don't have to worry about those details. Just call _insert_harder_balancing with parent->parent, and all those variables are adjusted accordingly exactly as they were when called initially. Let's have a picture of when we step upwards.

Since the old parent->parent (now "node") was red, then the new "parent" must be black. That makes the new "sibling" able to be pretty much anything! All that we know for certain is that it must have a height of at least 2 (since that's the minimum after calling from the initial call of _insert_harder_balancing), and so it must exist. But its children may be red, black, or null leaves (considered black by the _node_color function). That's five extra cases right there! We must make a note of these new possibilities.

Hopefully, we will inadvertently solve each of the cases listed above. If not, then I guess we'll just have to deal with them later. Let's continue with the initial cases for now.

Initial case 2: node is black, parent is black, sibling is null leaf.

Oh, thank god! I immediately see a way to not only balance this tree but to also get the total black height back to what it was before insertion. We need not traverse up the tree and consider other cases.

Reminder: The red node with the yellow dot is what we just inserted. node is its parent, and parent is node's parent.

To solve this, set parent to red, then rotate clockwise about parent. The picture below is the outcome of these operations.

There is one minor error in the above solution. I will leave it as an exercise to figure out what it is. If you can't figure it out, don't worry. I will correct the error at the end of the insertion section.

We are done with the easy insertion cases.

The code below is what we have so far for the _insert_harder_balancing function. The next step is to address those five extra cases we introduced, but first let's just look at the code we have for the initial cases.

Currently, our code will balance the five other cases without us having to do anything else. Remember that the null leaf nodes in case 2 can also be considered black. Let's look at what happens in each of the five extra cases. I have color-coded node, parent, and sibling so that you can follow what happens.

One question that you may ask is, "what if node's right child is red?". Well, looking at the code where we eventually get to this case, node was red before it was changed to black and the function was called again for these cases. If node was red, then its children must be black. No problems!

By the way, my punctuating after quotation marks is technically a grammar mistake, but I have no intention of changing it. So, for the quest for the grammatical error, keep, looking. :)

We are done with insertion!

It is time to view our _insert_harder_balancing function. Look below to find the code.

One thing to note is that all of the examples above assume that node is on the left of its parent, and all of the drafts of _insert_harder_balancing make this assumption, too. If it is, instead, on its parent's right, then everything needs to be viewed in a mirror. In other words, where we rotate clockwise, we need to rotate counterclockwise, and vice versa.

And now, before we move on to erasing, let's run our program to make sure that everything works as planned... I'm compiling it... I'm running it... error! The root is red?

Why is the root red? Don't we set it to black if we just inserted at the root? Yes, we do. Looking further into it, in the _insert_harder_balancing function, we see that our case 1 handler might make the root red. For the cases above, we did not consider the rule about the root having to be black (to simplify things).

There are two solutions that we can choose from: one, add an extra "if" for case 1 that checks to see if parent is root before setting its color to red; or two, set the root to black after we return back to _insert_balance. The first solution will execute that "if" an unknown number of times, but the second solution will only execute the assignment once. Let's go with the second solution.

Let's try this again. Compiling... running... arg! Remember when I said that there was a problem with case 2? What if node's right child is red? We get two reds in a row after the rotation! The solution, if node's right child is red, is to swap the colors of node and its right child and then rotate counterclockwise about node before doing the rest of case 2. And no, this doesn't cause any more red violations after the rotation since node's right child's children must be black. An opposite method is used if node is on the right side of its parent.
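Here is a hedged sketch of case 2 with that fix applied, for node on the left of its parent only. The node layout, the replace_in_parent helper, and the insert_case_2 name are assumptions made for this illustration; the real _insert_harder_balancing function also has to handle case 1 and the mirrored side.

```cpp
#include <cassert>
#include <utility>

enum color_type { red, black };
struct rb_node { int data; color_type color; rb_node* parent; rb_node* left; rb_node* right; };

// Hangs new_top where old_top used to hang (under its parent or at the root).
void replace_in_parent(rb_node*& root, rb_node* old_top, rb_node* new_top) {
    new_top->parent = old_top->parent;
    if (old_top->parent == nullptr) root = new_top;
    else if (old_top->parent->left == old_top) old_top->parent->left = new_top;
    else old_top->parent->right = new_top;
}

void rotate_counterclockwise(rb_node*& root, rb_node* node) {
    rb_node* top = node->right;
    node->right = top->left;
    if (top->left != nullptr) top->left->parent = node;
    replace_in_parent(root, node, top);
    top->left = node;
    node->parent = top;
}

void rotate_clockwise(rb_node*& root, rb_node* node) {
    rb_node* top = node->left;
    node->left = top->right;
    if (top->right != nullptr) top->right->parent = node;
    replace_in_parent(root, node, top);
    top->right = node;
    node->parent = top;
}

// Case 2: node is black and is the LEFT child of a black parent, and node's
// sibling is black (possibly a null leaf). Returns the new top of the subtree.
rb_node* insert_case_2(rb_node*& root, rb_node* node) {
    rb_node* parent = node->parent;
    assert(parent != nullptr && parent->left == node);
    if (node->right != nullptr && node->right->color == red) {
        // The fix: move the red child to the outside before the main rotation.
        std::swap(node->color, node->right->color);
        rotate_counterclockwise(root, node);
        node = node->parent;   // the former right child now sits in node's place
    }
    parent->color = red;
    rotate_clockwise(root, parent);
    return node;
}
```

Either way the subtree ends up with a black top and two red children, so its total black height is back to what it was before the insertion.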

One more try. Compiling... running... success! The code we have so far is below. It works wonderfully!

To erase from a red-black tree, you first erase just like you do for a regular binary search tree, then you balance the tree and make sure it follows the rules for a red-black tree. If you don't remember or know how to erase from a regular binary search tree, no worries! I will go into more detail on that now. Consider the binary search tree below.

When you erase a node, there are four possibilities: 1) the node is the only node in the tree; 2) the node has no children; 3) the node has one child; 4) the node has two children.

In the first case, where you erase the only node in the tree, just erase it! In the second case, it's the same thing. You erase a node with no children, and everything is still dandy (apart from balancing afterwards for the second case).

In the third case, things get a little tricky. In the tree above, what happens if you just delete 8? Well, the root of the tree no longer has a path to 5, 6, or 7! This is, however, easy to fix. Just move the 6 up to where 8 is after you delete it. Oh, and notice that the values will still be in ascending order (left to right), too.

In the fourth case, what do you do? What if you want to delete 2? "Oh, just move the 3 up!" Yea, that works for that instance. Now what about 4? You can't just move the 8 up to take its place. Go ahead, try it. Neither can you move the 2 up to take its place. How can you remove the 4 from the tree? And here, it gets slightly tricky, but there is a solution!

If a node that you want to remove from the tree has two children, swap its value with either the rightmost node of its left subtree or the leftmost node of its right subtree. Either of those two will have at most one child. Then, use the case for no children or one child to remove the replacement node. Here, look (I'll use the leftmost node of the right subtree):

Notice that the values are still in ascending order after the 4 is swapped with the 5 and the 4 is removed.

We will use the solutions above to code a _replace_and_remove_node function that will help us out in erasing. What it will do is make the tree forget that the node ever existed. If it has one child, the node will be replaced by its child. If it has two children, the leftmost node of its right subtree will replace it. If it has no children, then it will be simply forgotten.

Since we are wanting to balance the tree after the modifications, we will make _replace_and_remove_node return the actual removed node (remember, we don't actually remove the node we wanted to remove if it has two children). What returning this will do is tell us where the tree was modified, and we will use this information to figure out where to start balancing. Since we will not delete the node yet, it still has its outgoing pointer to its parent (though its parent has forgotten it).

The code for the _replace_and_remove_node function is below. After this is done, we can more easily code the main erase function.
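As a minimal sketch of the behavior described above (and not necessarily the tutorial's own listing), the function below swaps values with the leftmost node of the right subtree when there are two children, unlinks the node that ends up holding the erased value, and returns it without deleting it. The node layout is hypothetical, the color is left out because unlinking does not need it, and the function is a free function without the leading underscore.

```cpp
#include <utility>

// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Makes the tree forget one node and returns the node that was actually
// unlinked. The returned node is not deleted and keeps its parent pointer,
// so the caller can tell where the tree was modified.
rb_node* replace_and_remove_node(rb_node*& root, rb_node* node) {
    if (node->left != nullptr && node->right != nullptr) {
        // Two children: trade values with the leftmost node of the right
        // subtree, which has at most one (right) child, and remove that one.
        rb_node* successor = node->right;
        while (successor->left != nullptr) successor = successor->left;
        std::swap(node->data, successor->data);
        node = successor;
    }
    // Zero or one child from here on: the child (or nothing) moves up.
    rb_node* child = node->left != nullptr ? node->left : node->right;
    if (child != nullptr) child->parent = node->parent;
    if (node->parent == nullptr) root = child;
    else if (node->parent->left == node) node->parent->left = child;
    else node->parent->right = child;
    return node;
}
```

Only values are swapped, so every node that stays in the tree keeps its color and position, which is what lets the balancing step reason about the one spot that changed.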

We passed the removed node to the balance function. We might want to know what color the node we removed was. Remember that the node still exists and still has a link to its parent (though its parent no longer has a link to it). And now we can begin the complicated task of balancing after a deletion.

Easy Balancing for Erasing

First, we will start with the easy cases, as we did for insertion. If we cannot solve the problem easily, we will call another function for the more complicated cases.

Remember that we must make sure that the tree follows the red-black tree rules. To make things simpler, we will only be concerned with the rules that two reds cannot be in a row and every path from a node to a descendant leaf must contain the same number of black nodes.

When we erase a node, it has at most one child (as is ensured after the call to _replace_and_remove_node). Also, it can be either red or black. Consider the picture below.

Continuing with our current color scheme, the blue is the black height at a node, the green lines are the links between the nodes, and the circles are the nodes (red and black). One added symbol is the yellow X which covers the node that we have removed from the tree.

The bottom-right situation is impossible because the tree is unbalanced before the removal of the red node. The other three cases, though, are possible. In the top-left case, we have no choice but to consider even more nodes, and we will do this in a different function. The top-right case can be solved right away by making the red node black. The bottom-left case has no effect on the black height of the tree, so we need not balance in that case.

See? In the second and third situations, the black height is the same as it was before, so we are okay. It is only in the first situation that we have a problem.

One more thing to consider, however, is what to do if we just removed the root. The first and second cases above could also represent that, so we will have the balancing down. In the first case, why pass the work on to another function that handles complicated stuff, though, when we can just return without balancing if we just deleted the root? We will do that.

We took a shortcut for the black node with a red child. If we just removed a black node, we must have a red child if there is any child. Also, we used _node_color to get the node's color. As explained way up above, _node_color works even for those imaginary leaf nodes, but we use it for everything just to stay consistent.

I put question marks after the _erase_harder_balancing function, which handles the first case in the picture above. Where should we start considering the harder cases?

Well, we know that the node is black and has no children, so do we need to pass the node? Not really. But we do need to know which side of the subtree now has one less black node, so we do need to pass this information somehow. Well, we can't pass node because its parent no longer knows it. Let's pass its sibling. This way, we know which side is heavier, and we can start balancing at the parent. But what if the sibling doesn't exist? It must, because if it didn't exist, the black height of the side we removed from was 2 and its sibling's side was 1, which would have been unbalanced. We can and shall pass the removed node's sibling.

We can't use the _node_sibling function since the parent no longer recognizes the removed node, but we know that the removed node has no child and was replaced with nullptr, and we can use that to figure out the sibling (the sibling is the only side of parent that exists).
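Putting the easy erasing cases and the sibling lookup together gives something like the sketch below. It rests on a few assumptions made for illustration: the node layout is hypothetical, the removed node still holds its old parent and child pointers after being unlinked, and instead of calling the harder balancing function the sketch returns the sibling that would be passed to it (or nullptr when nothing more needs doing).

```cpp
enum color_type { red, black };

// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    color_type color;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Easy cases after `removed` has been unlinked from the tree. Returns
// nullptr when the tree is fine, otherwise the removed node's sibling, which
// is where the harder balancing has to start.
rb_node* erase_balance(rb_node* removed) {
    rb_node* child = removed->left != nullptr ? removed->left : removed->right;
    if (removed->color == red)
        return nullptr;             // black height did not change
    if (child != nullptr) {         // a black node's only child must be red
        child->color = black;       // the child takes over the lost black
        return nullptr;
    }
    rb_node* parent = removed->parent;
    if (parent == nullptr)
        return nullptr;             // we removed a childless root
    // The removed node's old slot is null now, so the sibling is whichever
    // side of parent still exists.
    return parent->left == nullptr ? parent->right : parent->left;
}
```

The child check also covers a removed root with one child, since that child becomes the new root and must be black anyway.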

Now let's consider more difficult initial cases for erasing.

First, we will consider the initial cases. What does the tree look like right after we remove a node when we have gone to _erase_harder_balancing?

By looking at just the colors of the sibling and the parent, I cannot find a solution. All of the solutions I can think of depend on either the color of sibling's left child or the colors of both of its children.

We need to expand these cases to cover sibling's children. Since we are considering every possible combination of the colors of the parent, the sibling, and the children of sibling, we may as well just throw out the idea of only considering the simpler initial cases. Now we have the possibilities listed below.

Since null nodes are considered black, the above works also for the initial cases where sibling's children may be null. Also, we will still assume that node is black, but if we have traversed up the tree, it will exist (so I've left off the yellow X). We just have to make sure that node is black when we traverse up the tree.

One last thing to note about the picture is that I used "h" as a variable to hold the height of the parent. The other nodes have a height relative to this. This is because the height may be any number as we traverse up the tree.

Now we find solutions for the cases when erasing.

Our goal is to change colors and/or do rotations so that the total black height of the subtree is the same as it was before and the subtree is balanced. Sometimes, we can't get the height to be the same as it was before. In such a case, we need to traverse on up the tree.

One more thing to consider is whether we have made the root of the subtree red while its parent's color is red. We may have a problem if we have. We will need to fix that problem and then traverse on up the tree.

Now let us begin the tedious task of finding solutions to each of the nine cases. I hope you are fully rested and have a while to dedicate to this! It might get hard. Oh, and I'll skip the transitions from case to solution and leave those as practice for you. It will be good to work these out so that you understand the process better.

I will use P to represent parent, S to represent sibling, Sl to represent sibling's left child, and Sr to represent sibling's right child. I'll do this to make the typing in the case titles less. I will likely use the full names in the instructions, though.

Case 1: P is black, S is black, Sl is black, Sr is black

In this case, we can simply set the sibling's color to red. That fixes the black height problem. Unfortunately, the total black height of the subtree is now one less than it was before.

We will need to traverse on up the tree balancing. Since our _erase_harder_balancing function takes the sibling of the node that is on the smaller side of the tree as a parameter, we pass parent's sibling to _erase_harder_balancing to traverse upwards.
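Case 1 is small enough to sketch on its own. In the illustration below, erase_case_1 recolors the sibling and returns parent's sibling, which is the node to continue with, or nullptr if parent is the root. The node layout and the black_height helper (which reports -1 for an unbalanced subtree) are assumptions made for the example.

```cpp
enum color_type { red, black };

// Hypothetical node layout used for illustration.
struct rb_node {
    int data;
    color_type color;
    rb_node* parent;
    rb_node* left;
    rb_node* right;
};

// Black height of a subtree (a null leaf counts as 1), or -1 if its two
// sides disagree somewhere.
int black_height(const rb_node* node) {
    if (node == nullptr) return 1;
    int left_height = black_height(node->left);
    int right_height = black_height(node->right);
    if (left_height < 0 || left_height != right_height) return -1;
    return left_height + (node->color == black ? 1 : 0);
}

// Case 1: parent, sibling, and both of sibling's children are black.
// Returns parent's sibling to continue with, or nullptr at the root.
rb_node* erase_case_1(rb_node* sibling) {
    rb_node* parent = sibling->parent;
    sibling->color = red;            // both sides of parent now match...
    rb_node* grandparent = parent->parent;
    if (grandparent == nullptr)
        return nullptr;              // ...and at the root nothing is left to fix
    // ...but parent's subtree is one black short, so move up a level.
    return grandparent->left == parent ? grandparent->right
                                       : grandparent->left;
}
```

After the recolor, parent's own subtree is balanced again, while the level above it is not; that imbalance is the reason for the trip up the tree.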

Case 2: P is black, S is black, Sl is red, Sr is black

In this case, set sibling's left child to black, rotate clockwise about sibling, then rotate counterclockwise about parent. Check it out!

I color-coded the nodes so that you could follow them. I did not show each step — that is an exercise for you! And we are done for this case.

Case 3: P is black, S is black, Sl is black, Sr is red

In this case, simply set the sibling's right child to black and then rotate counterclockwise about parent. Then, we are done!

Case 4: P is black, S is black, Sl is red, Sr is red

The solution here is to do exactly as in case 2. Set sibling's left child to black, rotate clockwise about sibling, then rotate counterclockwise about parent. Then, we are done.

Case 5: P is black, S is red, Sl is black, Sr is black

While I am thinking about it, let me point out that setting node to red would solve the problem completely if we just consider the part of the subtree shown in case 5. The reason we cannot do this is that we do not know that node's children are not red. We just came from there, too, so we do not want to backtrack to check node's children.

Now, back to case 5... I cannot find a solution from what is shown. The closest that I can come is the picture shown below, but the left side has an imbalance afterwards. I set parent to red and sibling to black, then I rotate counterclockwise about parent.

Isn't it ironic how we must traverse back down the tree right after I said not to? We must traverse back down the tree and pass the green-centered node to our _erase_harder_balancing function (remember, we pass the sibling of the node that has lost one black height).

We don't want to parse through cases 1, 2, 3, 4, and 5 when all we need to do is parse through cases 6, 7, 8, and 9 for this. Why not put 6, 7, 8, and 9 in a separate function? We will do that!

Case 6: P is red, S is black, Sl is black, Sr is black

In this case, we can simply rotate counterclockwise about parent. I won't color-code the nodes since it's so simple. Oh, and we're done afterwards!

Case 7: P is red, S is black, Sl is red, Sr is black

In this case, set parent to black, rotate clockwise at sibling, then rotate counterclockwise at parent. Then, we are done.

Case 8: P is red, S is black, Sl is black, Sr is red

The solution here is the same as in case 6. Just rotate counterclockwise about parent! Since it's so simple, no color-coding. And we are done afterwards since the total black height before is the total black height afterwards.

Case 9: P is red, S is black, Sl is red, Sr is red

Set parent to black, rotate clockwise about sibling, then rotate counterclockwise about parent. Total height is what it was before, so we need not traverse on up the tree. We are done.

Case one more: we are at the root

If we have just balanced the tree and the new subtree's root is the root of the entire tree, then we cannot traverse upwards. Therefore, we need to check to make sure that parent->parent exists every time we want to traverse upwards. Problem solved!

Now that the cases and solutions are discovered, code it!

All of the cases that we considered above assume that the node (which is the side that is short one black height) is on the left of its parent. If it is on the right, then just hold the solution up to a mirror.

We could use extra if statements to determine which side of its parent that node is on, but that makes the code messier. Instead, I will use a little trick that turns the rotations around and swaps the variables we use for the sibling's children.
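One way such a trick could look, shown here only as an assumption about the idea and not as the tutorial's actual code, is to hold "left" and "right" in pointer-to-member variables and swap them when node is on the right. The sketch applies it to case 3 (set sibling's far child to black, then rotate about parent). The names side_type, rotate, short_side, tall_side, and erase_case_3 are hypothetical.

```cpp
#include <cassert>
#include <utility>

enum color_type { red, black };
struct rb_node { int data; color_type color; rb_node* parent; rb_node* left; rb_node* right; };

// Points at either the left or the right member of a node.
typedef rb_node* rb_node::*side_type;

// Rotates about node: its child on the `up` side rises and node sinks
// toward the `down` side. down = left, up = right is counterclockwise.
void rotate(rb_node*& root, rb_node* node, side_type down, side_type up) {
    rb_node* top = node->*up;
    assert(top != nullptr);
    node->*up = top->*down;                       // middle subtree changes sides
    if (top->*down != nullptr) (top->*down)->parent = node;
    top->parent = node->parent;
    if (node->parent == nullptr) root = top;
    else if (node->parent->left == node) node->parent->left = top;
    else node->parent->right = top;
    top->*down = node;
    node->parent = top;
}

// Case 3: parent and sibling are black, and sibling's child on the side
// away from the short side is red. Works for either orientation.
void erase_case_3(rb_node*& root, rb_node* sibling) {
    rb_node* parent = sibling->parent;
    side_type short_side = &rb_node::left;        // side that lost a black node
    side_type tall_side = &rb_node::right;        // side where sibling sits
    if (parent->left == sibling) std::swap(short_side, tall_side);  // mirror
    (sibling->*tall_side)->color = black;
    rotate(root, parent, short_side, tall_side);
}
```

The case logic is written once in terms of the short side and the tall side, and the single swap at the top is all that the mirrored situation needs.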

Now we just look at each case and its solution above and code it as we see it. Don't worry about optimization. We just want to make a tree that works! Later, we can optimize (not covered in this tutorial). The code below is the _erase_harder_balancing function.

Test the code. Compiling... running... success! And now, hopefully, you have a better grasp of red-black trees. At the very least, you now have some red-black tree code! Feel free to use it however you see fit. Oh, and you don't even have to give me credit (though a +1 would be nice). ;) Oh, and the code for the entire thing is below.
