Typedefs: Adventures in Data Structure

November 3, 2012

If there's one thing I can take away from this project, it's that Wikipedia is a great resource for lists of things and an absolutely miserable experience for actually learning about them. Trying to read the descriptions for most of the tree structures on Wikipedia is like attempting to decipher ancient Greek. And this is a shame, because trees--even special trees like Red-Black or AA trees--are really not that complicated.

It's no exaggeration to say that we are surrounded by data trees. HTML is a tree--elements contained in other elements, all the way up to the <html> node. Levels in Doom were a special kind of tree called a Binary Space Partition. Your hard drive, of course, is a tree of nested directories. But even outside of computer science, trees are often a natural conceptual framework for expressing certain kinds of collective relationships: the Dewey Decimal System, for example, or a business's org chart, or (near and dear to my heart) the sections and sub-sections of a piece of legislation.

So if we were implementing a very simple tree, one with no optimization whatsoever, it would be pretty easy. In fact, it's not far off from our linked list, but instead of each item containing a single reference to a descendent, it contains a collection of references to zero or more children, and (optionally) a single reference up to its parent.

For applications where semantic meaning is more important than raw speed, the number of children per node is pretty arbitrary. On your hard drive, for example, you probably have some folders that are many levels deep, and some that are very shallow but contain a lot of objects, because the priority is to organize them logically so that you and your programs can find them. But if you're building a tree for doing searches or sorting behind the scenes, you want the "branches" of your tree to be all the same length and spread pretty much evenly around the structure, so that navigating around the tree and down each branch takes a consistent amount of time.

In this post we're going to build one of these search-optimized trees--a so-called "self-balancing binary search tree." And the difference between those and a regular, ad-hoc tree is simply the addition of three constraints:

It has a maximum of two children per node (usually called left and right)

It has a set of rules for determining when a branch is too deep or too wide after insertion or deletion, and

It has a clearly-defined procedure for rearranging the tree when the rules are violated to bring it back into compliance.

To navigate a tree built this way, you take your search value, and you compare it to the node at the root (which is traditionally at the top: binary trees grow down. Imagine they're in Australia, if that helps). If it's bigger than the value assigned to the root, you move to the child on the right side. If it's smaller, you move to the left. Then you do the comparison again, until you either hit the value you want, or you hit the end of the tree (where you may insert the value so you can find it next time). Here's an example of a node from one of these trees, followed by an algorithm for traversal.
var node = {
left: null, //reference to left-hand child (usually lesser)
right: null, //reference to right-hand child (usually greater)
key: 0, //the node key, used for lookup
data: null //the actual value of the node
}
//walk() recursively traverses a tree made of these nodes
//Then it renders it out visually to the console
function walk(tree, depth, side) { //depth and side are used for internal record-keeping
if (!depth) depth = 0;
if (!side) side = "";
var padding = "";
for (var i = 0; i < depth; i++) padding += "| "; //add spacing to the output
console.log(padding, side, tree.key, tree.data); //display the tree//now we walk down the branches, if they exist
if (tree.left) walk(tree.left, depth + 1, 'left');
if (tree.right) walk(tree.right, depth + 1, 'right');
}
Recursion--creating functions that call themselves--is an essential strategy for being able to use trees, so if you're not very good with it, now would be a good time to study up.

Today we're going to implement an AA tree, which is named for its inventor, Arne Andersson (original paper available here), not for the support group ("Hi, I'm Thomas, and I'm a data structure." "Hi, Thomas!"). I picked the AA tree from all of the binary search trees because its rules for maintaining balance are extremely simple, and I didn't particularly want to spend a long time working through edge cases and chasing bugs. An AA tree is built by adding a level value to each node, and then following these five rules:

Leaf nodes (the null values that terminate each branch) are at level one.

Left children have to have a level value less than their parent.

Right children must have a level less than or equal to their parent.

Right grandchildren, however, must have a level less than their grandparent.

Any node with a level higher than one has to have two children.

If this already sounds like one of Professor Layton's more annoying word problems, rest assured: it's actually not that bad. Rule violations are dealt with using only two operations: a "skew" and a "split."

A skew is used to fix a violation of rule two, where the left child is at the same "level" as its parent. It rotates the tree so that the left child becomes the new parent, with the old parent as its right node and its old right node moved underneath. For once, Wikipedia's illustrations are actually pretty good:
And here's the code for skew():
function skew (branch) {
if (!branch || !branch.left) return branch; //basic input validation
if (branch.left.level >= branch.level) {
var swap = branch.left;
branch.left = swap.right;
swap.right = branch;
return swap;
}
return branch;
}

A split fixes violations of rule 4, where the rightmost grandchild of a node (so the right reference of its right reference) has the same level value. It makes the right child the new parent node, with its parent as the new left reference and its former left child takes its place as the right child node on the old parent. Again, from Wikipedia:
Here's our split() code:
function split(branch) {
//basic input validation
if (!branch || !branch.right || !branch.right.right) return branch;
if (branch.right.right.level >= branch.level) {
var swap = branch.right;
branch.right = swap.left;
swap.left = branch;
swap.level++;
return swap;
}
return branch;
}

It may help, when trying to figure out exactly what these functions are doing, to remember that a binary search tree is basically a number-line with a bunch of connections starting from an arbitrary middle node somewhere. Or, to put it another, way, if you squash a tree down onto a single "level", you get the following:
So all of these operations--skew, split, insertion, and deletion--are ways of changing the links between nodes without altering their horizontal order. The tree above, for example, could easily have 4 as its head node, linking from there to 2 and 7 (and from those to 1 and to 6 and 9, respectively), but the order of the nodes when flattened remains the same, and the same rules of traversal (go left if less, right if greater) still apply. This would make a pretty decent Professor Layton puzzle, actually.

Enough theory, let's start adding values. Whenever you do an insertion on the tree (deletions are more complicated), you start by traversing the structure to see if the value exists. If it doesn't, you add it at the end of a branch with a level of 1. Then work your way back up to the root on the back half of the recursion, rebalancing as you go with a skew and a split. You skew first, because the reverse makes it possible to end up in an invalid state and you'd have to perform the balancing operations all over again.
insert: function(key, data, branch) {
if (branch.key == key) {
branch.data = data;
return branch; //If we find a match for the key, just update in place
}
if (key < branch.key) {
if (branch.left) {
branch.left = insert(key, data, branch.left); //recurse to the left
} else {
branch.left = new AANode(key, data); //at the leaf, add the new node.
}
} else {
if (branch.right) {
branch.right = insert(key, data, branch.right); //recurse to the right
} else {
branch.right = new AANode(key, data); //at the leaf, add the new node.
}
}
branch = skew(branch);
branch = split(branch);
return branch;
}

Deletions, as I said, are more complicated, but they're not terribly hard to understand--the deletion part, at least. Once again, we walk the tree to find the value, but this time we hold onto it for a little bit while we go looking for its nearest key to swap into its place. We find that by walking down either the left or right branch (either will do) one step, then going the other direction until we reach a null (i.e., for the precursor value, we walk left once, and then repeatedly right). We swap the replacement node's data and key with those of the node to be deleted, and then prune the end of the branch.
remove: function(key, branch) {
if (!branch) return branch; //if we recurse past the end of the tree, return the null value.
if (key < branch.key) {
branch.left = remove(key, branch.left);
} else if (key > branch.key) {
branch.right = remove(key, branch.right);
} else {
//if this is a leaf node, we can just return "null" to the recursive assignment to remove it.
if (!branch.left && !branch.right) return null;
//we look for the "closest" key in value, located at the end of one of the child branches
var parent = branch;
if (branch.left) {
var replacement = branch.left;
while (replacement.right) {
parent = replacement;
replacement = replacement.right;
}
} else if (branch.right) {
var replacement = branch.right;
while (replacement.left) {
parent = replacement;
replacement = replacement.left;
}
}
//we swap the replacement key and data into the to-be-removed node
branch.key = replacement.key;
branch.data = replacement.data;
//then remove the replacement node from the tree, thus completing the "swap"//we can't do this in the while loop above, because technically it could//be either child, regardless of initial search direction.
if (parent.left == replacement) {
parent.left = null;
} else {
parent.right = null;
}
}
//decrease levels, in case they've gotten out of control
var minLevel = branch.left ? branch.left.level : branch.level;
minLevel = branch.right && branch.right.level + 1 < minLevel ?
branch.right.level + 1 :
minLevel;
if (minLevel < branch.level) {
branch.level = minLevel;
if (branch.right && minLevel < branch.right.level)
branch.right.level = minLevel;
}
//rebalance, using the sequence given in Andersson's paper.
branch = skew(branch);
branch.right = skew(branch.right);
if (branch.right) branch.right.right = skew(branch.right.right);
branch = split(branch);
branch.right = split(branch.right);
return branch;
}
It looks complicated, but it's really not that bad. The most confusing part is probably the swap in the middle, but if you remember the number-line version of the tree, all we're doing is switching the to-be-deleted node's position in the tree with it's neighbor, so the rules of left and right traversal remain valid. Then, because the neighbor was at the bottom of its branch, we can drop the now-swapped node from the tree without worrying about its children. The most conceptually-difficult part of the delete operation is the rebalancing at the end, but Andersson's paper provides a sequence of splits and skews that we can simply follow in order.

Implementing these trees in JavaScript was a challenge, not because they're hard, but because of a language quirk: JavaScript is strictly pass-by-value, except where the argument being passed in is an object, in which case what gets passed in is a reference to the object, still passed by value. In this, it's basically like Java. As a result, you can alter properties of the object because you have a reference to it, but you can't replace it or modify it wholesale, because that changes the value of the reference to something completely new, while leaving the old reference unchanged in the caller's scope. You can't, in other words, write a swap(a, b) function in JavaScript, which is a shame, since (you may have noticed) maintaining these trees requires a lot of swaps.

I started out trying to write my skew and split functions with reassignments done in the function scope, just as in Andersson's paper (which is written in Pascal-like syntax, and Pascal is a pass-by-reference language). But when the function exited, my node arrangements would be unchanged--or worse, would be altered in bizarre and unpredictable ways--because I was unknowingly operating on and assigning new references that only existed in the local scope. Changing my functions to expressly use return values eliminated the problem. It also made my code more verbose and difficult to read, in my opinion. But once I came to grips with correct JavaScript calling style, the rest fell into place pretty quickly. Deletion was the hardest part: I eventually went back to the original paper, re-read the prose description, and implemented a "clean room" version of it.

Maybe because you're not actually supposed to touch the inside of a tree as much as you do a list, this actually ended up being much simpler than the linked list from last post. There's more theory, but arguably less bookkeeping. As before, I wrote it as a functional library object, then wrapped it up in an object that maintains the references to the structure in working order. I like it, and could definitely see myself using these for something in JavaScript--as you're about to see.

Demonstration

This time, I want to demonstrate using this structure for a real purpose, not just a get/set interface. So here's what we're going to do: when you click the Start button below, a script will pull in the contents of this post and store each word and its frequency in an AA tree structure. Don't worry, it's surprisingly quick. At that point, you can start typing into the text box below. As you do, it'll search through the tree with the closest option (which returns the last non-null branch it finds) to autocomplete a word (shown in the "do you mean..." link). If it finds an exact match, or if you click the link to use the auto-completion, it'll tell you how many times I used that word. The tree structure should be fast enough to do autocompletion in realtime, which is pretty cool.

As always, the code for the example is here, and the underlying tree implementation is here, both with as many comments as my poor eyes could stand. If you're the adventuresome type, you can access the actual tree structure from your console via window.demos.aaTreeDemo (I recommend calling window.demos.aaTreeDemo.tree.render(), if you're curious about the resulting structure). Once you've prepped the tree, you can also see a simple ASCII-style rendering of its structure (with the direction, key, data, and level of each node) using the View Tree button, but be warned: it's long.

This post is dedicated to a student of mine. My first quarter teaching JavaScript, I had one person in class who was working on mapping, using Python to parse the data and JavaScript to present it. He was taking the class in order to better understand that presentation layer--in particular, he wanted to be able to cluster a large number of map points at a wide zoom levels.

My first idea was to create a quadtree, but although this student was fairly advanced, he was still learning to process lists--processing a tree using recursion would have been tough to swallow. We ended up using Underscore to build groups filtered by distance from the first ungrouped item, which is workable and fast, but the clusters it creates are dependent on order of the original list, and they can drift over time as items move in and out of the clustering groups. However, when I started reading Steven Skiena's Algorithm Design Manual this month and happened upon a mention that weighted graphs could be used for clustering, I leapt on the idea. Maybe now, finally, I could figure out a satisfactory answer to his problem?

Before we answer that question, let's talk a little bit about what a weighted graph is. If you're used to the idea of a tree, a graph is not a hard adjustment to make--instead of a tree with left/right branches (or some other scheme), a graph is composed of nodes connected by "edges," which may have weight (corresponding to distance, cost, change, etc.). For implementing a graph, we really care about the edges most of all--handling node values during/after traversal is mostly important at a higher abstraction level than the graph itself. We can think of each edge as looking similar to this:
var edge = {
to: 1 //index of the vertex at the end of the edge
weight: 0 //used for deciding traversal path
}
Edges only have one endpoint. If two vertexes are related to each other, the practice is usually to add two edges--one going each way. Since they're not real, it's not like they'll collide or anything. A graph with one-way linkages is called a "directed" graph, while one with all linkages going both ways is "undirected."

The graph structure that surrounds our edges also requires some explanation. Skiena's graph structure is built around tables of linked lists (which, in JavaScript, we can simply replace with arrays)--one table for vertexes, one for edges, and more tables for flags like "outdegree" or "search state"--with relationships between the tables expressed through common indexes. Consider the following simple graph:
var graph = {
vertexes: ["A", "B", "C"],
edges: [
[{to: 1}, {to: 2}], // A's edges
[{to: 0}], // B's edges
[] // C's edges (or lack thereof)
]
}
In this case, A has edges linking it to B and C. B is linked only to A, and C is an end vertex--it doesn't link out to anywhere. The link between A and B is undirected, while the links to C are directed (one-way). This may be easier to envision as a table, where the numbers listed for each edge express which column the edge links to:

Column Index

0

1

2

Vertex values

A

B

C

Edge to

1

0

2

Skiena sets up his structures this way because he's using C, and because he's trying to go fast, so it makes sense to keep data in flat arrays instead of objects or structs. In JavaScript, it feels a little bit odd to be using arrays linked by index instead of objects with properties, so after I implemented Skiena's version, I built a version of the graph and graph algorithms using Edge and Vertex constructors, as well as some functional techniques that felt more idiomatic for JavaScript. Unfortunately, these turned out to be much slower, so I'll be sticking with Skiena's versions. You can check out the object-oriented versions in the source below.

It's also true that since actual vertexes turn out to be mostly ignored during traversal, there are good reasons to separate your edges and your vertexes anyway. JavaScript's lack of a real dictionary type (where object references can be used as keys) means that array indexes are as good as anything else for linking the two. In the examples below, for example, I keep a copy of the pre-clustered edges and restore them after clustering, so that the search animations will run properly across the entire graph. We just have to be careful to delete the corresponding items from the edge list and other flag arrays whenever we remove a vertex (addition is simpler, since we don't have to worry about out-of-bounds errors in JavaScript if we handle undefined properly).

The constructor that I've provided, WeightedGraph, accepts arrays of vertexes, edges, and a flag for directionality, but most of the time it's seeded with the vertexes only. We add our edges one at a time, depending on what kind of graph we're planning to create, using the addEdge() function.
addEdge: function(from, to, weight, directed) {
// most of our graphs are undirected, so this will be false
directed = directed || this.directed;
// if there's not an edge list for the starting
// vertex, create one
if (!this.edges[from]) this.edges[from] = [];
// add an edge to the list with a default weight
this.edges[from].push({
to: to,
weight: weight || Infinity
});
// if undirected, add a second edge going the other way
if (!directed) {
this.addEdge(to, from, weight, true);
}
}

With our edges added and our vertexes in place, we can traverse our graph. There are two ways to travel across a graph: breadth-first, which radiates outward from the starting vertex, and depth-first, which recursively travels as far as it can go before backtracking and continuing from the last branch. Here's our breadth-first traversal, with arguments for a starting index and a callback function to be run on each vertex along the graph:
breadthFirst: function(start, f) {
// call f on our root node// we pass in the vertex, its edges, and its index
f(this.vertexes[start], this.edges[start], start);
// create a queue of edges to traverse from the root
var queue = this.edges[start].slice();
// visited keeps track of where we've been already
var visited = [];
visited[start] = true;
// now we pull edges from the queue and process them,
// adding their edges to be processed in turn
var edge;
while (edge = queue.shift()) {
// skip previously-visited nodes
if (visited[edge.to]) continue;
visited[edge.to] = true;
// process unvisited node
f(this.vertexes[edge.to], this.edges[edge.to], edge.to);
// add its edges to the queue for traversal
var edges = this.edges[edge.to];
queue = queue.concat(edges);
}
}

A breadth-first algorithm traverses all the vertexes at a given distance from the root before it moves outward. In depth-first, by contrast, we travel as far out as possible and sweep around the rest of the graph. At heart, this is just the difference between first-in-first-out and first-in-last-out order. But in implementation, depth-first has a nice, tidy recursive version that means we don't have to keep any lists of upcoming edges:
depthFirst: function(start, f) {
var visited = [];
var self = this;
// here's our recursive function to walk the graph
var walk = function(nodeIndex) {
// on entering a node, mark it as visited
visited[nodeIndex] = true;
var node = self.vertexes[nodeIndex];
var edges = self.edges[nodeIndex];
// call our traversal function
f(node, edges, nodeIndex);
// now walk through the edges
for (var i = 0; i < edges.length; i++) {
var edge = edges[i];
// if we haven't been to the end of this edge, we
// immediately walk down to it
if (!visited[edge.to]) {
walk(edge.to);
}
}
}
// trigger the recursive function
walk(start);
}
After getting used to recursion while working on trees, depth-first seems cleaner to me. If you keep track of a counter value on entry and exit for each node, you can also use depth-first search to identify cycles (places where the graph loops in on itself) and bottlenecks, which is pretty useful.

These functions, though, are not the hard part. Here's the hard part: to do our clustering, we want to build what's called a "minimal spanning tree"--that's the path through the graph with the lowest sum edge-weight. In our case, our edges are weighted by distance between two points, so that we can group together only points that are close. If we then remove any edges above a certain length, we're left with only the clustered "components" within the graph. Minimal spanning trees are also really useful for identifying fast or cheap paths through a network. In one of my favorite notes in the Algorithm Design Manual, Skiena suggests using them to identify line breaks in OCR'd text, by making lighter pixel edges cheaper than darker ones.

Given that we're dealing with a dense graph of points, where each point has a distance edge with every other point, I've implemented Prim's algorithm (whoever Prim is), because it's apparently better for dense graphs than Kruskal's (whoever that is). Prim's method works by creating a "tree" of one vertex, then repeatedly adding the closest non-tree vertex until the entire available graph has been traversed. As we add each point to the tree, we update a list of distance values corresponding to each non-tree point, which makes it easy to find the next candidate.
findMST: function(start) {
var v = start;
// we'll use two flag arrays in this algorithm:
// - inTree keeps track of what's been added
// - distance is the current distance of each non-tree vertex
var inTree = [];
var distance = this.edges.map(function() { return Infinity });
// here's where we'll accumulate minimal spanning edges
var tree = [];
tree[v] = null;
while (inTree[v] !== true) {
inTree[v] = true;
// for each node, we update the distance list if it moved
// us closer to any non-tree points
var edges = this.edges[v];
for (var i = 0; i < edges.length; i++) {
var edge = edges[i];
if (distance[edge.to] > edge.weight && !inTree[edge.to]) {
distance[edge.to] = edge.weight;
// we create edges inward to the current vertex
// they'll be overwritten if we find something closer
tree[edge.to] = {to: v, weight: edge.weight};
}
}
// find the closest non-tree vertex to process next
var d = Infinity;
for (var i = 0; i < distance.length; i++) {
if (!inTree[i] && distance[i] < d) {
d = distance[i];
v = i;
}
}
}
// once we have all our edges, let's construct a new graph
var newGraph = new WeightedGraph(this.vertexes);
for (var i = 0; i < tree.length; i++) {
if (tree[i]) {
newGraph.addEdge(i, tree[i].to, tree[i].weight);
}
}
return newGraph;
}

This function is very C-like and tricky--so much so that I actually wrote it two more times in increasingly readable style. Take a look at the mst3K function for a simpler (but not shorter) loop, and then Vertex.prototype.findMST for the object-oriented version (which is much slower, but much easier to understand).

But with that out of the way, the rest of our clustering algorithm is pretty simple. cut() just removes any edges above the treshold weight, and findComponents() takes the now-shattered graph and creates a map of which vertex belongs to which coherent group. Using the canvas toy below, you can play with the different graph traversals, the minimum spanning tree, and the cluster cut threshold. You can also see an animation of the two different search algorithms--I recommend creating the MST first, then trying both searches. It's a great illustration of the difference between the queue and recursive exploration patterns.

So what's our takeaway from this? Are graphs useful? Absolutely. Are they useful for this particular clustering problem? Eh, not so much. If you play with the clustering, you'll notice that it's possible to get some weird edge conditions that we probably wouldn't want for our map. For example, it's possible to have a chain of points that wraps around--but doesn't include--one or more points in the middle. Moreover, long chains of closely-positioned points can end up forming much larger clusters than we would really want on a map. I suspect that our hypothetical JavaScript map should probably still look at some kind of spatial or suffix tree to handle combining points at its widest zoom.

But graphs are a useful problem-solver elsewhere. Pathfinding comes to mind--if our map offered directions, or was a game of some kind, a graph traversal algorithm would definitely be key. And of course, graphs like these are invaluable for modeling networks (social or otherwise), traffic flows, and topology. Think of some other uses you'd like to explore? As always, the graph library and the example source code are yours to peruse.