Friday, January 18, 2013

When Haskell is faster than C

Conventional wisdom says that no programming language is faster than C, and all higher level languages (such as Haskell) are doomed to be much slower because of their distance from the real machine.

TL;DR: Conventional wisdom is wrong. Nothing can beat highly micro-optimised C, but real everyday C code is not like that, and is often several times slower than the micro-optimised version would be. Meanwhile the high level of Haskell means that the compiler has lots of scope for doing micro-optimisation of its own. As a result it is quite common for everyday Haskell code to run faster than everyday C. Not always, of course, but enough to make the speed difference moot unless you actually plan on doing lots of micro-optimisation.

This view seems to be borne out by the Computer Language Benchmarks Game (aka the Shootout), which collects implementations of various programs written in different languages and measures their execution speed and memory usage. The winningest programs are always written in highly optimised C. The Haskell versions run anything from slightly to several times slower than the C versions. Furthermore, if you read the Haskell it is also highly optimised, to the point where it no longer looks like Haskell. Instead it reads like C translated into Haskell. Based on this, you would be justified in concluding that everyday Haskell (that is, Haskell that plays to the strengths of the language in brevity, correctness and composability) must be irredeemably slow.

But then I read this presentation
by Tony Albrecht, in which he talks about how some seemingly innocent everyday
C++ code is actually very inefficient. Two things in particular caught
my eye:

A fetch from the main memory when there is a cache miss costs 400 cycles. Pointer indirection in particular tends to fill up cache lines, and hence increase the number of cache misses overall.

A wrong branch prediction costs over 20 cycles. A branch that skips a
simple calculation can actually run slower than the calculation.

To put it another way, C is no longer close to the real machine. The real machine has 3 levels of CPU cache (some of them shared between multiple cores), long instruction pipelines, branch prediction, multiple ALUs and FPUs, and hardware data flow analysis done while the program is being executed in order to schedule all this in a way that makes it look like a simple processor executing one instruction at a time. C doesn't expose all that to the programmer, so it seems to me that the only way to write highly optimized C is to have a sophisticated mental model of the processor and its memory architecture, decide what you want this machine to do, and then reverse-engineer the C which is going to make that happen. The result is difficult to understand and hence hard to maintain. Look at the C implementations of the Shootout problems for examples of what I mean.

But most code isn't written like that. Tony Albrecht is a games programmer, an expert at squeezing cycles out of the rendering loop. Most developers do not live in that world. For them the objective is to produce code that meets the requirements, which includes being fast enough. This is not laziness or waste, but practicality. First design the optimal algorithm, then implement it in idiomatic code. Only if that does not run fast enough should you start detailed profiling and micro-optimisation. Not only is the micro-optimisation process itself expensive, but it makes the code hard to understand for future maintenance.

So I wondered: the high level of Haskell gives the compiler many more opportunities for micro-optimisation than C. Rather than comparing micro-optimised programs therefore, it seemed sensible to compare everyday programs of the sort that might be produced when readability is more important than raw speed. I wanted to compare programs that solved a problem large enough to have a useful mix of work, but small enough that I could write Haskell and C versions fairly quickly. After poking around the Shootout website I settled on the reverse-complement problem.

A potential issue was that one of my programs might inadvertently use something highly non-optimal, so I decided I would profile the code and remove anything that turned out to be pointlessly slow, but not change the structure of the program or add complexity merely for the sake of speed. With that in mind I wrote Haskell and C versions. I also downloaded the Shootout winner to get some feel for how my programs compared. You can see my code at the bottom of this post.

The first version of the Haskell took 30 seconds (compared with the Shootout time of about 0.6 seconds). As I had feared, profiling did indeed reveal something pointlessly slow in it. In order to filter out carriage returns from the input I used "isLetter", but unlike the C char type the Haskell Char covers the whole of Unicode, and determining if one of those is a letter is not trivial. I put the filter after the complement operation and compared the result with zero, which in addition to being faster is also the Right Thing if the input contains invalid characters. Once I had this fixed it dropped down to a much more respectable 4.3 seconds. Interestingly, profiling suggests that about half the time is being spent writing out the 60 character lines; merely printing out the result with no line breaks cut execution down to around 2 seconds.
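The fix described above can be sketched in a few lines. This is a simplified illustration, not the post's actual code: `complement` and `revComplement` are made-up names, line wrapping at 60 characters is omitted, and I assume a `ByteString`-based pipeline. The key point is that mapping unknown bytes to `'\0'` lets one cheap comparison replace the expensive Unicode-aware `isLetter` test.

```haskell
import qualified Data.ByteString.Char8 as B

-- Map each nucleotide code to its complement. Anything else (newlines,
-- invalid bytes) maps to '\0', so a single cheap (/= '\0') filter after
-- the complement drops it -- no Unicode classification needed.
complement :: Char -> Char
complement c = case c of
  'A' -> 'T'; 'a' -> 'T'
  'C' -> 'G'; 'c' -> 'G'
  'G' -> 'C'; 'g' -> 'C'
  'T' -> 'A'; 't' -> 'A'
  _   -> '\0'

-- Complement first, then filter, then reverse: the filter both strips
-- line breaks and discards invalid input in one pass.
revComplement :: B.ByteString -> B.ByteString
revComplement = B.reverse . B.filter (/= '\0') . B.map complement
```

A full program would just wire this to stdin and stdout with `main = B.interact revComplement` (plus the 60-column output wrapping the benchmark requires).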

The C version, meanwhile, took 8.2 seconds. Profiling did not directly reveal the cause, but it seemed to imply that the processor was spending most of its time somewhere other than my code. I strongly suspect that this time is being spent in putc(3) and getc(3). The obvious optimisation would use fread(3) and fwrite(3) to read and write characters in blocks instead of one at a time, but that would require significant changes to the code: extra bookkeeping to deal with the start of the next sequence (signalled by a ">" character) when it is found halfway through a block, and to insert newlines similarly. Unlike the replacement of isLetter in Haskell, this would require new variables and control structures driven by the cost of the solution rather than a simple switch to a less expensive expression.

It might be argued that I have tilted the playing field against C by not making these changes, and that any halfway competent C programmer would do so when faced with code that runs too slowly. If isLetter is pointlessly slow, isn't the same thing true of putc(3) and getc(3)? But I think there is a clear difference. Both programs are written in a character-oriented way because the problem is described in terms of characters. I wrote the inner loop of the C to operate on a linked list of blocks because it looked like a faster and simpler choice than copying the whole string into a new buffer twice the size every time it overflowed (on average this algorithm copies each character once or twice; see Knuth for details). I might have considered reading or writing the characters in reverse order rather than doing the in-memory reverse in a separate function, but profiling didn't show that as a significant time sink. Overall, getting decent performance out of the C is going to take about the same amount of work as writing the code in the first place.

On the other hand the Haskell has decent performance out of the box because the compiler automates a lot of the micro-optimisation that C forces you to do manually. It may not do as good a job as a human with plenty of time might do, but it does it automatically and reasonably well.
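The "copies each character once or twice" claim about the doubling-buffer alternative can be sanity-checked in a couple of lines of Haskell (a sketch; `copies` is a name I have made up): a buffer that doubles from size 1 performs 1 + 2 + 4 + … copies as it grows, which sums to just under n for n characters, i.e. roughly one extra copy per character.

```haskell
-- Total character copies performed while a doubling buffer grows large
-- enough to hold n characters: each reallocation copies the current
-- contents, so the total is the sum of all buffer sizes below n.
copies :: Int -> Int
copies n = sum (takeWhile (< n) (iterate (* 2) 1))
```

For example, growing to hold 1024 characters costs 1 + 2 + … + 512 = 1023 copies, in line with the amortised analysis.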

This isn't principally about code size, but for the record the Haskell has 42 SLOC, of which 21 are executable. The C has 115 SLOC, of which 63 are executable. The Shootout C winner has 70 SLOC, of which 46 are executable (counting comma separated statements as one line per statement).

So here is the bottom line. If you really need your code to run as fast as possible, and you are planning to spend half your development time doing profiling and optimisation, then go ahead and write it in C because no other language (except possibly Assembler) will give you the level of control you need. But if you aren't planning to do a significant amount of micro-optimisation then you don't need C. Haskell will give you the same performance, or better, and cut your development and maintenance costs by 75 to 90 percent as well.

Instead of "TL;DR", why not consider "Summary" or "Abstract" in future? It will enhance the impression of competence and make your readers feel less like you're talking down to them.

I fixed the C code so it compiled by adding the angle brackets for the includes. I then compiled it with "gcc -O3" and ran it on the input. It was nearly instantaneous, so I made a new input file by repeating the original lots of times (it is now 127 MB instead of 11 KB).

Now when I run it as:

time ./a.out < in.txt > /dev/null

I get:

real 0m2.903s
user 0m2.864s
sys 0m0.032s

For the Haskell side, because I don't really know what I'm doing, I commented out the first line of the file and compiled with "ghc -O" (without commenting out that line, it wouldn't link). I then ran it:

time ./main < in4.txt > /dev/null

and get:

real 0m14.554s
user 0m14.433s
sys 0m0.088s

Even if I don't turn on the optimizer for the C code, I still get much faster times (a bit over 4 seconds).

I cannot reproduce your results either. Perhaps you made a mistake when you were reading the results. Also, jumping to the conclusion that Haskell will be close to as efficient as C every time, based on a single test with a single program, is a big issue. To really be able to say that with any level of confidence you should test several programs in both languages, and do that for several different types of problems.

I find it hard to take this post seriously, since failing to realize the impact I/O will have on the benchmark results seems like a pretty big methodological flaw. In fact microbenchmarks, pretty much by definition, try to avoid interacting with any system that is not part of the core concern of the benchmark (as much as possible).

In fact the benchmark does explicitly say "read line-by-line", so your program probably even fails to qualify as a solution.

The motivation for the original benchmark may have been measuring the performance of data structures, but that's certainly not what you are measuring. Which brings me to the question of what exactly is being measured here?

The I/O throughput of the standard libraries of language implementations? The ease with which these implementations can be understood and utilized? I think we can both agree that on both counts Haskell doesn't have a realistic chance of beating C, regardless of the amounts of magic that the compiler may be capable of utilizing.

This brings me to my final point, which is that I fail to see how you can claim in your conclusion that "conventional wisdom is wrong", given that

1. everyday C code will most certainly not be doing character-based I/O
2. out of a lineup of n programmers picked at random a sizable chunk wouldn't even be capable of writing I/O code in Haskell, while almost certainly all of them would still manage an "everyday" C solution that beats yours.

> 2. out of a lineup of n programmers picked at random a sizable chunk wouldn't even be capable of writing I/O code in Haskell, while almost certainly all of them would still manage an "everyday" C solution that beats yours.

I don't think that this argument holds. The fact that few programmers are able to comprehend and use Haskell only means that they don't yet have that skill; it is not something that can never change. 40 years ago, people capable of coding in C weren't legion either. Out of the same random chunk of programmers, how many more of them would be able to code up that programme in PHP rather than C? Does that make PHP more suitable?

It's nice to see that you stacked this test in Haskell's favour by choosing the benchmark test that it is best at. I think you should face the fact that Haskell will almost never be anywhere near as fast or efficient as well-written C.

In short, you're wrong. C is just as close to the machine now as it was before. The only difference is that "the machine" is now actually rather complex. The machine does not offer some kind of "faster_dereference" which is faster than C's de-reference but with slightly different semantics. C's de-reference is all the machine offers. The fact that loading from main memory can now be very expensive is irrelevant.

Also, a linked list of characters? hahahahahaha. The buffer algorithm will be much faster. No wonder your performance was rubbish.

IOW, your entire article is "I don't understand CPUs, and I'm going to write some terrible code that uses the CPU very badly, and this proves my point.".

Readability *does* matter. And being more readable *is* a big advantage. Performance *isn't* everything. But you have definitely failed to demonstrate that Haskell can be equally as performant as C, and secondly, it's just a pity that Haskell is *less* readable than C, in general.

The Haskell code here could be made more efficient. The most obvious thing is that break should be replaced with breakByte. The second most obvious thing is that you scan the header text twice, first for the greater-than symbol and then for the newline. The third most obvious thing (I think...) is that the "filter out nulls" operation doesn't seem to serve any obvious purpose. If you need it, you should probably indicate why.

My last post had a couple errors. I see now why the filter is required. I also see now that GHC has a rewrite rule to replace break with breakByte when appropriate. However, the double scanning is still unfortunate (it might even be good to use readLn after setting the handle buffer size explicitly—not sure), and I think I found a bigger problem: as written, I believe the table will be rebuilt for every block, unless full laziness is enabled (a bad idea, generally). Making the table a top-level variable seems the most sensible approach.
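The top-level-table fix suggested in this comment can be sketched as follows. This is a hypothetical illustration, not the post's actual code: `complementTable` and `complementByte` are invented names, and I assume a 256-entry byte table. Hoisted to a top-level binding (a constant applicative form, or CAF), the table is built once on first use and shared by every block, instead of being reconstructed inside a per-block worker.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Top-level CAF: GHC evaluates this once and shares the result, so the
-- per-block worker just indexes into an already-built table.
complementTable :: B.ByteString
complementTable = B.pack (map comp [0 .. 255])
  where
    comp :: Word8 -> Word8
    comp w = maybe 0 (fromIntegral . fromEnum)
                     (lookup (toEnum (fromIntegral w)) pairs)
    pairs = zip "ACGTacgt" "TGCATGCA" :: [(Char, Char)]

-- Complement a single byte by table lookup; unknown bytes map to 0.
complementByte :: Word8 -> Word8
complementByte w = B.index complementTable (fromIntegral w)
```

Defined this way, there is no local table for GHC to rebuild per invocation, so the fix does not depend on enabling full laziness.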