In one of the lectures Tim Roughgarden points out that we’re doing the same calculations multiple times to work out the next smallest edge to include in our minimal spanning tree and could use a heap to speed things up.

A heap works well in this situation because it speeds up exactly this kind of repeated minimum computation: on each iteration we need to find the minimum weighted edge to add to our spanning tree.

The pseudocode for the heap-based version of Prim's algorithm reads like this:

    Let X = nodes covered so far, V = all the nodes in the graph, E = all the edges in the graph

    Pick an arbitrary initial node s and put that into X

    for v ∈ V – X:
        key[v] = cheapest edge (u, v) with u ∈ X

    while X ≠ V:
        v = extract-min(heap)    (i.e. v is the node with the minimal edge cost into X)
        add v to X
        for each edge (v, w) ∈ E:
            if w ∈ V – X         (i.e. w is a node which hasn't yet been covered)
                delete w from the heap
                recompute key[w] = min(key[w], weight(v, w))
                    (key[w] only changes if weight(v, w) is less than the current key)
                reinsert w into the heap

We store the uncovered nodes in the heap, setting each node's priority to the cheapest edge from that node into the set of nodes we've already covered.

I came across the PriorityQueue gem, which is actually more convenient than a raw heap because we can use the node as the key and set that key's priority to be the edge weight. When you extract the minimum value from the priority queue it uses this priority to return the minimum entry.
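To show the node => priority semantics I mean, here's a minimal Hash-backed stand-in with a gem-like interface. This is just an illustrative sketch: the class and its linear-time `delete_min` are my own, and the real PriorityQueue gem is a proper heap implemented as a C extension, so it will be much faster.

```ruby
# A toy priority queue: nodes map to priorities, and delete_min
# returns (and removes) the entry with the smallest priority.
class SimplePriorityQueue
  def initialize
    @priorities = {}
  end

  # Assigning to a node sets (or updates) its priority
  def []=(node, priority)
    @priorities[node] = priority
  end

  def [](node)
    @priorities[node]
  end

  # Remove and return the [node, priority] pair with the lowest priority
  def delete_min
    pair = @priorities.min_by { |_node, priority| priority }
    @priorities.delete(pair[0])
    pair
  end

  def has_key?(node)
    @priorities.key?(node)
  end

  def empty?
    @priorities.empty?
  end
end

q = SimplePriorityQueue.new
q[2] = 4
q[3] = 7
puts q.delete_min.inspect # => [2, 4]
```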

I couldn’t see a way to keep track of the edges that comprise the minimal spanning tree, so in this version I’ve created a variable which keeps track of the total edge weight as we go rather than computing it at the end.

We start off by initialising the priority queue to contain entries for each of the nodes in the graph.

We do this by finding the edges that go from each node to the nodes that we’ve already covered. In this case the only node we’ve covered is node 1 so the priorities for most nodes will be MAX_VALUE and for nodes which have an edge to node 1 it’ll be the weight of that edge.
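As a sketch of that initialisation step, assuming the graph is stored as a Hash of node => { neighbour => weight } (the actual representation in the github code may differ) and using Float::INFINITY in place of MAX_VALUE:

```ruby
# A small example graph as an adjacency Hash: node => { neighbour => weight }
graph = {
  1 => { 2 => 4, 3 => 1 },
  2 => { 1 => 4, 3 => 3 },
  3 => { 1 => 1, 2 => 3, 4 => 2 },
  4 => { 3 => 2 }
}

covered = [1]   # node 1 is the arbitrary start node
priorities = {} # stand-in for the priority queue

# Each uncovered node's priority is the weight of its edge to node 1
# if one exists, otherwise infinity
(graph.keys - covered).each do |node|
  priorities[node] = graph[node][1] || Float::INFINITY
end

puts priorities.inspect # nodes 2 and 3 get edge weights, node 4 gets Infinity
```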

While we still have nodes left to cover we take the next node with the cheapest weight from the priority queue and add it to the collection of nodes that we’ve covered. We then iterate through the nodes which have an edge to the node we just removed and update the priority queue if necessary.
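Putting the whole thing together, here's a self-contained sketch of that loop under the same assumptions as above: a plain Hash stands in for the priority queue, and a running total of the edge weights is kept as each node is covered.

```ruby
graph = {
  1 => { 2 => 4, 3 => 1 },
  2 => { 1 => 4, 3 => 3 },
  3 => { 1 => 1, 2 => 3, 4 => 2 },
  4 => { 3 => 2 }
}

covered = [1]
priorities = {}
(graph.keys - covered).each { |n| priorities[n] = graph[n][1] || Float::INFINITY }

total_weight = 0
until priorities.empty?
  # Take the uncovered node with the cheapest edge into the covered set
  node, weight = priorities.min_by { |_n, w| w }
  priorities.delete(node)
  covered << node
  total_weight += weight

  # Update the priorities of the uncovered neighbours of the new node
  graph[node].each do |neighbour, edge_weight|
    next unless priorities.key?(neighbour) # skip already-covered nodes
    if edge_weight < priorities[neighbour]
      priorities[neighbour] = edge_weight  # found a cheaper edge into X
    end
  end
end

puts total_weight # edges (1,3)=1, (3,4)=2, (3,2)=3 give a total of 6
```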

The time taken for this version of the algorithm to run against the data set was 0.3 seconds as compared to the 29 seconds of the naive implementation.

As usual the code is on github – I need to figure out how to keep track of the edges so if anyone has any suggestions that’d be cool.