09 October 2011

Loop counting: Google's multi-language-bench

C++ is not a cool language. I think it once was, back in the early 1990s; certainly there were a lot of books and magazine articles written about it at that time. Back then I wrote code mostly in C. I think it took quite a while before I really got object orientation, but eventually I became comfortable with C++, probably around the time that Java became the new cool language.

I mention this because it made me smile when a little while ago a friend showed me a paper (pdf) from Google that described how they implemented the same algorithm in several different languages and found the one written in C++ the fastest. You know how brand loyalty is apparently tied up with our self-esteem? Hey, C++ may not be cool, but when you got important things you gotta do fast you need a tool man-enough for the job, I thought. Of course, benchmarks are notoriously unreliable, and I didn't really believe this. But it still made me smile.

Maybe I shouldn't let my self-esteem get entangled with a programming language. But I think it's not unnatural that after spending considerable time with a language one may become attached to it. I'd like to state that I am aware of this proverb: if all you have is a hammer, everything looks like a nail; I subscribe to the notion that learning an unfamiliar programming language may broaden your mind. A little. Which is a good thing.

Reading the Google paper I came across this:

“We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.”

They have a high opinion of themselves don’t they? Doubtless it’s justified. But it sounds like a challenge...

One of the promises of object oriented design is that you may be able to change the implementation of an object without changing its interface. So I wondered if I could improve on the performance of the Google C++ implementation without changing the test driver. After about 30 seconds of reading the code I decided that that sounded too much like hard work: much more fun to continue in the great programming tradition and rewrite the code. (The Google code has objects with getters that return pointers to private member variables; this restricts the freedom to change the implementation. Breaking encapsulation may be justified for run-time performance gains. The penalty is that a change to the implementation may mean a change to the user code, or a run-time penalty to maintain the old interface.)

So anyway, I wrote the version shown below. To make execution time comparisons meaningful I tried to keep the functionality identical to the Google C++ implementation from Robert Hundt's paper. Russel Cox also implemented the algorithm. So when I talk about Robert's code or Russel's code these are what I'm referring to.

antloops.h

#ifndef ANTLOOPS_H_INCLUDED
#define ANTLOOPS_H_INCLUDED

/* This code is an implementation of the analyze_loops algorithm from
   "Nesting of Reducible and Irreducible Loops" by Paul Havlak, published
   in ACM TOPLAS, July 1997. (Referred to below as "Havlak's paper.")
   See http://howtowriteaprogram.blogspot.com/ */

/* a basic_block is a node or vertex in a control-flow graph (a basic block
   is a sequence of zero or more statements in a computer program that would
   be executed in-sequence with no jumps into or out of the sequence,
   described briefly in Havlak's paper; it doesn't matter, for our purposes
   it's just a node in a graph) */
class basic_block {
public:
    void add_out_edge(block_name to) { out_edges_.push_back(to); }

    // return the number of nodes in the graph
    int size() const { return node_vec_.size(); }

    // return a reference to the node with the given 'name';
    // reference guaranteed valid only until next call to any non-const function
    const basic_block & node(block_name name) const { return node_vec_[name]; }

    void calculate_nesting_level()
    {
        /* I'm not implementing this because it has no effect on the
           performance of either the Robert Hundt or Russel Cox C++
           implementations of the Havlak algorithm. Robert's
           LoopStructureGraph::CalculateNestingLevel() is never called.
           Russel's LoopGraph::CalculateNesting() is called, but as
           Loop::isRoot is not initialised (and is therefore mostly
           "true"), it does (almost) nothing. */
    }

private:
    int count_;
};

// find loops in given 'cfg' using the analyze_loops algorithm from Havlak's
// paper; return the number of loops found
int find_havlak_loops(const mao_cfg & cfg, loop_structure_graph & lsg);

}
#endif

antloops.cpp

/* This code is an implementation of the analyze_loops algorithm from
   "Nesting of Reducible and Irreducible Loops" by Paul Havlak, published
   in ACM TOPLAS, July 1997. (Referred to below as "Havlak's paper.")
   See http://howtowriteaprogram.blogspot.com/ */

private:
    const mao_cfg * cfg_;
    preorder current_preorder_;    // used in dfs() to assign a preorder number to each node visited
    std::vector<block_name> node_; // node_[a] is the block_name of node with preorder number a
    std::vector<preorder> number_; // number_[a] is the depth-first (or preorder) number of node named a
    std::vector<preorder> last_;   // last_[number_[a]] is the last descendant of node a

    // return the name of the root element of the element named 'p'
    // (beneficial side-effect: the path from 'p' to the root is compressed)
    int find_root(int p)
    {
        // I found this a good place to learn about the UNION/FIND algorithm:
        // http://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
        while (p != element_[p])
            p = element_[p] = element_[element_[p]];
        return p;
    }

// find loops in given 'cfg' using the analyze_loops algorithm from Havlak's
// paper; record loops in given 'lsg' and return the number of loops found
int find_havlak_loops(const mao_cfg & cfg, loop_structure_graph & lsg)
{
    if (cfg.size() == 0)
        return 0;

/* This code is an implementation of the test driver in the paper
   "Loop Recognition in C++/Java/Go/Scala" by Robert Hundt.
   See http://code.google.com/p/multi-language-bench/

   Copyright (c) 2011 Anthony C. Hay. Some Rights Reserved.
   Published under the BSD 2-clause license.
   See http://www.opensource.org/licenses/BSD-2-Clause
   See also http://howtowriteaprogram.blogspot.com/ */

int main()
{
    // build a control flow graph and count loops like the code
    // in Robert Hundt's LoopTesterApp.cpp so that execution time
    // comparisons with that code are meaningful
    std::cerr << "counting loops...";

(I found I needed to reserve about 10 MB stack space to get Robert's code to produce a result. You may need to do the same for mine.)

Probably the most interesting difference between the implementations is that Robert's and Russel's both use explicit news and deletes and mine does not. I don't mean to say that there is anything wrong with raw pointers - I use them all the time - I simply chose a different approach to the problem using vectors and indices instead. The fact that the nodes are named with successive integers and may be referenced by a 'preorder' number, an integer, suggested a vector would be an obvious data structure to use.

Using collections, such as vectors, can make the code simpler, less likely to leak memory and easier to make exception-safe.

I think the code is in the spirit of the Google paper, which was that it should be written using "the default, idiomatic data structures in each language, as well as default type systems, memory allocation schemes, and default iteration constructs", the basic facilities provided in the language.

Is it correct?

If it was not a requirement to be correct we could give an answer to any problem in an arbitrarily short time. So let's ask first if the code gives the right answer.

My code calculates there are 76,001 loops in the test graph. Robert's gives 76,002 and Russel's gives 76,000. Oh dear. Which is right?

Playing with Robert's code I found that it always exaggerates the number of loops by exactly one; for example it returns 1 for a graph containing no loops at all. So adjusting for this means mine and Robert's agree.

I corresponded briefly with Russel and he was kind enough to respond to my questions. It turns out there is a slight difference between Russel's graph and Robert's in the first few nodes: in Robert's there is an edge connecting nodes 1 -> 2, in Russel's there isn't. This difference means there is one less loop in Russel's graph, so his result is correct for his graph.

Looking at the code that builds the graph I would have said there were ((((3 × 25) + 1) × 100) + 1) × 10 = 76,010. But I'd be wrong, probably because overlapping loops are not counted more than once: the ten copies of that outermost '+ 1' loop count as just one, so the 'and 10' becomes 'and 1'.

In short, I believe there are 76,001 loops in the Robert Hundt C++ graph and my implementation counts them correctly.

I like to remember I have an amazing symbol manipulation tool at my command that I can use to help me check my own work. (Not my own brain, lol.) I mean I may be able to write some code to test my code. Since I have the two Google implementations to test against the obvious thing to do is to build some graphs and compare the loop counts returned by each implementation. I wrapped each in its own namespace and linked all three into one test program.

The first idea I had was to generate graphs containing a random number of nodes with a random number of edges between random nodes, like this

During several hours running this test there were no disagreements between any of the implementations (after adjusting Robert's by -1).

The second idea I had was to generate graphs with all possible combinations of connections between nodes. This is only going to be practical for small graphs: for one node there are two possible graphs: the node has no outgoing edges and the node has an outgoing edge to itself. For two nodes there are 16 possible graphs ranging from the graph where neither node has any outgoing edges to the graph where each node has an outgoing edge to itself and to the other node. I think for n nodes there will be 2^(n^2) possible graphs (I'm assuming no duplicate edges). For 5 nodes there are 33,554,432 possible graphs and for 6 there are 68,719,476,736. So 5 nodes is probably about the limit of what I could reasonably check. Here's what I wrote to test all possible graphs up to 5 nodes

// generate all possible graphs having 'n' nodes and test that
// all loop counting implementations return the same count
void test_all_possible_graphs(int n)
{
    assert(1 <= n && n <= 5);
    const int end = 1 << (n * n); // we will test this number of graphs
    for (int bitmap = 0; bitmap < end; ++bitmap) {
        /* think of bitmap as n groups of n bits; each node has its own
           group; each bit in that group determines whether there is an
           out edge from that node to the node corresponding to that bit */

It takes about 10 minutes to count the loops in all possible graphs having 5 nodes or fewer and confirm that in all cases my implementation returns the same loop count as Robert's. (I also tested against Russel's implementation, but that plotted a stairway to heaven on my memory usage monitor: Russel didn't free any of the memory he allocated. The memory usage for Hundt+Hay was a flat line.)

I'm sure there are other tests I could do. I could count loops in more 'hand-built' graphs like the original 76,001 looper, for example. But at this point I'm going to say that my implementation is an accurate analog of the original Robert Hundt code.

Is it efficient?

Actually, let's just ask if it's fast. Here are the timings for the three implementations running on my 2008 Mac with a 2.8 GHz Core 2 Duo processor:

(The above is an update to the original post after I upgraded to GCC 4.2)

So Russel's code wins, but mine is a creditable runner-up.

It's interesting watching the random graph test program run: occasionally a graph will be generated where it looks like Robert's code will count the loops quicker than either Russel's or mine. I'm wary of benchmarks. People say that, like statistics, you can find a benchmark to support any point of view. Hang on!... I wonder if... Voilà!...

I added code to the random graph test to time how long each implementation took to count loops and ran it for 100 graphs. The code and the results are shown here.

In this test Russel's implementation counted the loops in 90% of the time it took Robert's implementation to count the loops in the same graphs. But my implementation took only 38% of the time. My code wins. I think benchmarks can be a pretty accurate measure of performance, now I come to think about it.

I have my tongue in my cheek, in case you didn't know. I mean, these results are genuine, but I'm still wary of benchmarks. Just to confirm my code's superiority I changed max_graph_size to 50000 and reran the test. Here are the last few lines of output

In this test my code took twice as long as Robert's and Russel's code took twice as long as mine! Robert's code wins. I'm going off benchmarks again.

Three different benchmarks. Three different winners. It would be interesting to analyse the performance characteristics of these three implementations of the same algorithm to understand what's happening. But that's for another time.

Is it readable?

From the user's perspective code readability is neither here nor there. But in my experience from the programmer's perspective, you are more likely to achieve some of the things the user does care about, like being correct, if you can reason about the code. And to reason about the code you need to understand it. And code readability aids understanding. I believe the way the code looks is of lesser importance than other goals, but it isn't of no importance.

The core analyze_loops algorithm in the Havlak paper (reproduced in the Google paper) is written in a mathematical pseudocode. It is 30 lines long. I think there is a fairly clear resemblance to that pseudocode in my implementation, which is 53 lines long. One line of pseudocode to two of real code is a pretty good ratio, I think. (Note: I don't count blank lines, comments or lines containing only a curly brace.)

In general I think the resemblance between someone's description of an algorithm and the concrete realisation of that algorithm in a real programming language that actually works on a real computer is a good thing. Sometimes the algorithm description is clear but the computer language doesn't allow the realisation to reflect it.

It may be that computer scientists who publish algorithms often describe them in a way suited to implementation in C++ because they all use C or C++ to develop their algorithms. I don't know.

In what way is an algorithm "readable?" Some algorithms are simple and you can just see what they do without much difficulty. And some you can't keep enough in your head to just "see." In that case you resort to pencil and paper and boxes joined to other boxes with arrows. You work through simple concrete examples following each step in the algorithm to see how it transforms the data. That's what I do anyway.

Candid conclusion

The code is short, but I didn't find it easy to write. I bought the Havlak paper for $15 from the ACM, but I didn't find it easy to move from the abstract description of the complex algorithm in the paper to the concrete representation of a working program; without the two Google examples by Robert Hundt and Russel Cox I may not have succeeded. I do not intend anything I've said here to imply any criticism of those two guys' work; I don't know them personally but I have no doubt they have a far deeper understanding of the Havlak algorithm than I do. But I suspect they were not very interested in the C++ implementation of this algorithm, perhaps because C++ is not a cool language. (Update: I regret saying this last bit; how do I know how interested they were? It was a snide comment. Sorry.)

One of my difficulties was a certain amount of confusion about when I should be using the basic block name, an int, and when the preorder number, also an int. Using the two typedefs for these two concepts helped me keep them separate.

The most interesting thing I learned was the UNION/FIND algorithm, which I'm sad to say, I had never before seen (or if I had I had forgotten). That path compression is really neat.

In the above narrative I may have made it sound like I implemented the algorithm, checked it against the Google results and found it was correct. This is not how it happened. I fiddled around for some time getting wrong answers and trying to figure out what mistakes I had made. So what? I believe a poet might work on a poem for years before it's published. Right matters. Right first time is overrated. Since I'm not 100% sure this code is right I'm going to stop pontificating right here.

5 comments:

Our congratulations are particularly sincere for we have (almost) been through the same experiment: reimplementing Hundt's C++ version. Our approach was different; we are developing a new language (a new hammer ;-), so we wanted to produce an exact "cut and paste" of Hundt's version: "you push back an element? ok, we do it too, next move?", and so on... but with our own language/STL... We even copied the getter/setter functions. We were not really interested in the algorithm efficiency itself but in the comparison between C++ and Cawen. The results are here: http://melvenn.com/en/home/ We don't know if C++ is cool or not ;-), but we will be very happy to implement your own version as soon as we find some time to do it!

Hi TS & GC. Thank you for your kind compliment. I wish you well with your endeavour (I looked but couldn't actually find any code written in your new language). I think computer languages are fascinating and useful things. I'd love to invent one myself. I recently read that when Jerry Weinberg was asked what his greatest contribution to the field of computer science was he said "That's easy. I answered that some years ago, and my answer hasn't changed. My biggest contribution to the software world is that I never invented yet another programming language." (from http://jonjagger.blogspot.co.uk/2010/09/interview-with-jerry-weinberg.html)

He he he, the quote is excellent. It reminds me of this one: "The nice thing about standards is that there are so many of them to choose from." But please believe that we did not do too much harm to the software world, since 1) Cawen is not exactly a new language... It would better be considered as a way to template (structs and functions) good old C (which has never been cool but does not mind, for he knows he is the King ;-)... It's some kind of "back to the future VI" language... You can download the C99 sources the Cawen preprocessor has generated here: http://www.melvenn.com/data/havlak_cawen_benchmark.tgz

2) maybe no one will ever hear about it except you, R. Hundt, my friend Thomas and me.