Rob Pike, one of the creators of the Go language, stated that he expected the language to be adopted by C++ programmers, a prediction that hasn’t been realised. Recently however at the HFT firm where I work, the success of a team’s move from Python to Go for some pieces of non-speed-critical infrastructure led to the decision to attempt a slimmed-down Go rewrite of a somewhat-throughput-critical 50k LOC C++ server. The old C++ server used the same techniques and libraries used in our latency-critical C++ trading software, where every microsecond matters, and this degree of performance simply wasn’t needed. It was hence thought that a rewrite in Go, using the language’s native scheduler rather than the hyper-optimised C++ framework used by the autotraders, would be easier to maintain. I was tasked with the rewrite.

The tl;dr

In business terms, the project was a success: completed ahead of schedule, performing acceptably, and under 10k LOC (this massive LOC reduction was of course partly due to the removal of features that were deprecated or not needed by the team behind the rewrite). In personal terms, however, I feel the outcome was suboptimal: I wrote two to three times as much code as would have been needed in a language with parametric polymorphism. Some of this was due to type-safety: Go forces a tradeoff between verbosity and type-safety, and I settled somewhere in the middle; it could have used less code and been less type-safe, or used more code and been more type-safe.

Now, the pros and cons, starting with the pros.

The pros

Emacs! With plugins for autocomplete, jump-to-definition, error checking upon save, intelligent refactoring, and GoTest integration, programming Go in Emacs offers practically everything one would expect from a good IDE, with the bonus of super-easy customisation and extensibility via Elisp. Since one of my reasons for getting into programming was the opportunity to get paid to use Emacs, this is definitely a huge plus.

Goroutines! Go makes message-passing concurrency, which I personally find the easiest form of concurrency to reason about, super simple to use. It also allows the same code to run either async on a single core (with GOMAXPROCS set to 1) or in parallel across multiple cores (with it set higher), with no changes to how the code is written. The only other languages I know with built-in lightweight thread schedulers are Erlang/Elixir and Haskell, with the former lacking static typing and the latter lacking management-willing-to-use-ability.

No inheritance. I’ve personally come to view inheritance-based OO as somewhat of an antipattern in many cases, bloating and obscuring code for little benefit, and Go makes this kind of code impossible to write. I suspect this was Rob Pike et al’s motivation for designing Go the way they did: there was a bunch of Java/C++ at Google that was written as if Enterprise Fizzbuzz was a positive role-model, and they wanted to spare themselves from having to deal with such code in future. That being said, in spite of being legacy code the use of inheritance in the old C++ server was pretty sane, and it could easily have been rewritten to use a more modern style.

Readability. I always found the Go code I encountered quite easy to read and understand, both our code and external code. Some of the C++ I encountered, in contrast, took hours to fully comprehend. Go also forced me to write readable code: the language makes it impossible to think something like “hey, the >8=3 operator in this obscure paper on ouroboromorphic sapphotriplets could save me 10 lines of code, I’d better include it. My coworkers won’t have trouble understanding it as the meaning is clearly expressed in the type signature: (PrimMonad W, PoshFunctor Y, ReichsLens S) => W Y S ((I -> W) -> Y) -> G -> Bool”.

Simple, regular syntax. When I wanted to add the name of the enclosing function to the start of every log string, an Emacs regexp find-replace was sufficient, whereas a more complex language would require a parser to achieve this. The simple syntax also makes code generation a breeze, whether by Emacs macros or Go templates. Emacs + Go == parametric polymorphism: not only can macros speed up generating the “copy-paste” code that Go’s lack of parametric polymorphism requires, but if the functions are written consistently, a regex can also update all the “copy-pasted” functions simultaneously, making updating fooInt, fooFloat and fooDouble almost as easy as updating foo<T> in a language with generics. The downside is that, while Emacs macros and regexes can write and modify Go code in a way that emulates parametric polymorphism, the result is still not as readable or concise as genuinely polymorphic code, and of course is not easily maintainable by someone unfamiliar with regexes or an extensible editor like Emacs.

Built-in, effective templating. Go’s text/template package can easily be used to generate new Go code. This allows IO to be used during code generation: we had, for instance, a library for interacting with a particular service that was generated from an XML schema, making the code perfectly type-safe, with different functions for each datatype. In C++, IO cannot be performed at compile time, so such schema-driven code generation would not be possible. Languages that do allow compile-time IO include F#, which has it via Type Providers; Idris, which also has Type Providers; Lisps, which can do IO in macros; Haskell, whose Template Haskell provides an IO a -> Q a function; D, which can use `import` to read files at compile time; Nimrod, which has functions for compile-time file IO; Elixir (and possibly Erlang?), which can do arbitrary IO via macros; and Rust, which can use libsyntax to perform arbitrary computations and IO at compile time.

The cons

Stockholm syndrome. I just argued above that generating Go code with templates is superior to compile time metaprogramming in C++ due to allowing IO, which of course is a stupid argument, since one could just as easily generate C++ code using a separate C++ program that does IO.

Lack of parametric polymorphism! I’ve read many people say this isn’t a problem in practice; well, in this particular case it was a huge problem. I’m confident that a C++ translation of the new Go code would be less than half the LOC of the Go version and more type-safe, thanks to C++’s polymorphic functions and types. A Haskell rewrite would need even fewer LOC, and if I’d been allowed to write it in Clojure, I suspect the whole thing could have been expressed in fewer than 1000 lines of macros (although I’m not sure how debuggable or maintainable that would have been…).

Sacrificing type safety. We use extension attributes on the various protobuf messages the server handles, and I originally intended to type these distinctly, so that, for instance, a FooExtensionAttribute could not be used on a Bar. Go’s lack of parametric polymorphism and generic types, however, meant that this would have involved a significant amount of code duplication, so I settled for a single ExtensionAttribute type, with the type system not checking that an attribute was used to extend the appropriate message.

Binary sizes. If one uses code generation to generate a type-safe API, with distinctly typed accessors and whatnot for each datatype, then one can easily wind up with over 100,000 lines of Go and 30MB+ binaries. Compile times also suffer; over 10 seconds in this case, although this wasn’t a significant issue, as the library could just be compiled once to a static library and then statically linked.

Kernel compatibility. I may well be the first to make this complaint, but when for Kafkaesque reasons you have to deploy to an old kernel, it’s somewhat disappointing when the newest Go version requires features not supported by the kernel and you’re forced to stick to an older, slower version of Go.

Conclusion

Go is a double-edged sword: it forbids complex abstraction, both bad and good. The worse the abstractions you and your colleagues are likely to write, the better a language Go is, and vice versa (ultimately depending on what one considers ‘good’ and ‘bad’ abstraction).

I recently upgraded my phone to a Galaxy S3, an ARMv7 device with 2GB of RAM and a quad-core 1.3GHz processor. After installing Arch Linux on it (only in a chroot; I’m not brave enough to boot directly into it), which seemed like the right thing to do, I thought it’d be interesting to do some benchmarking and see how various language implementations perform on ARM compared to on my x86-64 laptop (Intel dual-core i5 M430 @ 2.27GHz).

This benchmark involves calculating a large number of Fibonacci numbers repeatedly. The exact code used is available here. Note that the purpose of this benchmark is to compare the ARM and x86 implementations of each language, not to compare the different languages to each other, so no effort has been put into optimising the different implementations.

The benchmarks were run with an input of 500 million, with the speed for each language being expressed as a percentage of the speed of the C implementation. This allows for easy comparison, as each language being expressed in terms of C speed (as opposed to in absolute terms) minimises the effect of the different speeds of the processors used.

The results from my x86-64 laptop are as follows (input of 500 million):

Language       | Running time (ms) | % C speed
---------------|-------------------|----------
C (Clang)      | 3726              | 100
C# (Mono)      | 7625              | 48.8656
Go             | 8489              | 43.8921
Haskell        | 9839              | 37.8697
Java (OpenJDK) | 4237              | 87.9396
LuaJit         | 15135             | 24.6184
Ocaml          | 9668              | 38.5395
Racket         | 22785             | 16.3529
Rust           | 3910              | 95.3

The results from my ARMv7 Galaxy S3 (input of 500 million):

Language       | Running time (ms) | % C speed
---------------|-------------------|----------
C (Clang)      | 5051              | 100
C# (Mono)      | 52736             | 9.5779
Go             | 12246             | 41.2461
Haskell        | 22304             | 22.6462
Java (OpenJDK) | 50079             | 10.0861
LuaJit         | 21738             | 23.2358
Ocaml          | 28774             | 17.554
Racket         | 56519             | 8.93682
Rust           | 5741              | 87.9812

The performance of each language on ARMv7 expressed as a percentage of its x86-64 performance:

Language       | Performance (% x86-64 speed)
---------------|-----------------------------
C              | 100
LuaJit         | 94.3839
Go             | 93.9716
Rust           | 92.3190
Haskell        | 59.8003
Racket         | 54.6498
Ocaml          | 45.5481
C# (Mono)      | 19.6005
Java (OpenJDK) | 11.4693

The benchmarks can be run by `sh fib.sh 500000000 true`, replacing true with false if the executables have already been compiled. This will print an html table to stdout, and store the results in text form in `./sortedResults`. Rename this to something like armResults or x86Results, and two different such files may be compared using `sh fibdiff.sh armResults x86Results`, which will output the above html table to stdout.

Again, note the purpose of this benchmark is not to compare the languages with each other, but to see how well each one performs on ARM compared to on x86.

Interestingly, the Rust implementation actually runs faster on ARM. I suspect this indicates a bug of some sort in the compiler, as Rust generally performs at close to C speed. Both the ARM and x86 implementations are running on the latest Arch Linux build, but the Rust version on ARM appears to be “0.11.0-pre”, whereas on x86 it’s “0.11.0”, so it’s possible there was some regression between those two versions.

*Edit: the Rust version used the Rust int type, which is 64-bit on x86-64, whereas C’s int is 32-bit. This exaggerated Rust’s speed relative to C on ARM. The results tables have now been updated to show results from using Rust’s fixed-size i32 type, which is much faster on x86-64. It can now be seen that Rust manages to perform quite well on both platforms, which makes sense considering its use of the LLVM backend.

*Another edit: according to strncat on Reddit: “Rust is slower than C at this task on ARM (and x86_64, once the 64-bit overhead is gone) because it uses LLVM’s segmented stack support for stack overflow checking. The sane alternative is outputting a one byte write in multiples of the guard size, so 99.5% of functions won’t have any overhead, and the remaining ones will only have some writes rather than the branch that is there today.”

Java on ARM is hilariously slow, 1/10th of its x86 speed, which I suspect may be because the JVM I’m using, OpenJDK, doesn’t yet have a JIT compiler. Or, if it does, it’s not particularly well developed. The Oracle JVM is apparently faster on ARM.

C# (via Mono) is also much slower on ARM, 1/5 of its x86 speed. Presumably ARM code generation hasn’t received as much attention as x86.

LuaJit’s performance on ARM, almost equal to its x86 performance, proves once again that Mike Pall is a genius. Interestingly, as every number is a double in Lua, it may actually have an advantage in this case: there’s a modulus operation in the code, and ARMv7 (as far as I’m aware) has no hardware support for integer division, only floating-point division.

Go-compiled code also runs surprisingly well on ARM, nearly matching its x86 performance. I’m not sure whether this is due to lots of work having been put into ARM code generation or not so much work being put into x86 code generation, but it certainly bodes well for future plans to allow the use of Go in the Android NDK.

Racket, OCaml and Haskell all run at about half their x86 speed on ARM, which seems reasonable: I can’t imagine they’re often run on ARM, so ARM performance probably hasn’t received much attention.

As for the C implementation itself, it’s interesting to note that its absolute speed on the ARM device is about 74% of its x86 speed, in spite of the x86 device being a 16-inch i5 laptop and the ARM device being just a 5-inch phone.

Miscellanea

The Haskell implementation is named hsfib.hs instead of fib.hs, breaking from the naming conventions of the other implementations. Why is this? If I name it fib.hs, GHC will notice the fib.o left over from compiling the OCaml implementation and try to link that, with hilarity ensuing soon thereafter. This could be avoided if GHC checked whether the object files it was attempting to link were actually Haskell object files.

Getting an Integer from a Number in Typed Racket seems to take a lot of work: `(numerator (inexact->exact (real-part n)))`. There’s probably a simpler method I’m missing.

There’s a Common Lisp implementation in the repo, but the ARM support of SBCL is quite recent and I couldn’t get it to build.

At least 70% of the C# implementation is just copy pasted from the Java implementation. Syntactically those languages seem even more similar than Lisp and Scheme.

Recursion in Lua is interesting. Doing a comparison like `if n == 1` does not work out well, due to the nature of floating point equality, so `if n < 1` is necessary instead.

The correctness of the Rust and C implementations is actually optimisation-dependent. If compiled with -O, they use constant space (the tail calls are optimised away); if compiled without it, they use linear space and fail with a stack overflow. Never, ever write code like this for any project that’s even remotely important.

OCaml’s Emacs support is really awesome. I installed a couple of plugins, Tuareg and Merlin, and not only did they give me autocomplete but also automatic error checking on every save. I was particularly impressed when, combined with OCaml’s type-checking, they warned me that the float I was accidentally passing to printf didn’t match the type required by the %d in the format string; even Haskell’s default printf function doesn’t do compile-time checking of the format string.

*Edit: I was asked why I didn’t include Nimrod. Nimrod compiles to C, so it would be the same speed on both ARM and x86. Seems like it would be a good option for ARM development.

Bonus

I ran the Java implementation of the benchmark under Dalvik. The results are rather underwhelming, to say the least. Here are the results from running with an input of 500 million (the same as above), with the Dalvik results included.

Language       | Running time (ms) | % C speed
---------------|-------------------|----------
C              | 5051              | 100
C# (Mono)      | 52736             | 9.5779
Dalvik         | 75699             | 6.6725
Go             | 12246             | 41.2461
Haskell        | 22304             | 22.6462
Java (OpenJDK) | 50079             | 10.0861
LuaJit         | 21738             | 23.2358
Ocaml          | 28774             | 17.554
Racket         | 56519             | 8.93682
Rust           | 5741              | 87.9812

Now that’s why the new ahead-of-time-compiled Android Runtime (ART) is A Good Thing. It’s also possibly part of the reason the Android UI is less fluid than the iPhone’s, and I say this as someone who wouldn’t use an iPhone if you paid me. Where would Objective-C appear in that table? Right at the top: Objective-C is a superset of C, so valid C code is generally valid Objective-C code, making it extremely easy to write fast code in Objective-C (just write C).

Note however that I ran the Dalvik file (compiled with `dx --dex --output=fib.dex fib.class`) with `dalvikvm -cp fib.dex fib`, the simplest way to run it. It’s quite possible there’s a better way to compile/run it that makes it faster; I’m not particularly familiar with Android development. There apparently exists a tool called ‘dexopt’, which is run automatically when a new apk is installed, but I’m not sure whether it’s run when I execute a .dex file manually with `dalvikvm`. Running `dexopt` tells me to look at the dexopt source for usage information; I don’t have time to figure it out right now, but if someone wants to send me a script for running dexopt on a .dex file I’ll be happy to test it.

The Go language has some weaknesses compared to other modern languages. It also, however, has some advantages that let its users do things that users of many other languages cannot easily do: in particular, running a large number of green threads on a single machine, making it easy to structure a server with one such thread (goroutine) per client connection, and communicating between these goroutines with channels, making it easy to coordinate them while avoiding mutexes and other forms of manual synchronisation.

As an example, take the following program, a race relay race (it’s like a data race race). Imagine two chains of n threads: (a1, a2, …, an) and (b1, b2, …, bn). Along each chain, a message is sent from the first to the last and then back again (e.g. with n = 3, from a1 to a2 to a3, then back to a2 and a1). Whichever chain completes this first wins the race. Now, the average computer can have at most around 65k open TCP connections, so a server using one goroutine per connection would require around that many goroutines. Let’s therefore set n to 65k/2 ≈ 32k; 32k green threads per racer.

Here’s how we make the racetrack. The function is recursive: when numLeft > 0, it builds another segment of track to its right in a new goroutine, then spawns a second goroutine to forward messages from its left to its right, and itself forwards messages from its right to its left. This means each racer actually uses 2*numLeft goroutines, so we set the number per track to 32k/2 = 16k. When numLeft is less than one, it instead sets to work passing the input from the left channel straight back into the left channel.

Now, you may wonder: why not just use one channel per side, carrying ints both rightwards and leftwards? It’s bad practice for a single goroutine to both write to and read from the same channel: if it attempts to write while the channel still holds data waiting to be read, it will block (deadlock). We’d hence need four channels per segment if we used only one goroutine: left in, right in, left out, and right out. I found that alternative less subjectively pleasant than just having one goroutine for sending left and one for sending right.

We create numRunners ints, each numbered with its start position (first to run is 1, last is numRunners-1), and send them down the channel. Then we receive them all and stick them in an array, which is sent back through the appropriate channel, allowing us to examine the order in which they arrived.

That creates the racetracks, the channels to them and for the race results to be returned on, and runs the races. Note the select statement: this is an extremely useful feature of Go that receives from whichever of the target channels (in this case resultsA and resultsB) is ready first. Running this takes around 3 seconds, not bad for sending data across so many channels. What’s really impressive, however: quadruple that number to 64000, meaning a total of 256 thousand goroutines are created, and it still runs in around 3 seconds! Now that’s why Go is great for servers. I’m fairly certain that trying to do the same thing with 256 thousand kernel threads on this laptop would melt it.

My attempted Elixir implementation seemed to be waiting for foo(bar) to complete before running baz, preventing the program from working when foo is a function that loops until it receives a message from baz. I’m not sure if this was a bug, or just me doing something wrong through unfamiliarity with the language (probably the latter). Anyway, while it’s certainly possible to do in Erlang what Go can do, Go is five to ten times faster for CPU-intensive work. It’s also statically typed, preventing errors that wouldn’t be caught at compile time in Erlang/Elixir.

*Edit: it should be noted that Erlang has a tool called Dialyzer, which can do static checking of Erlang code. A redditor wrote an Erlang implementation of the code for comparison, available here, which performs quite well.

I considered an implementation in D, but while D has green threads in the form of fibres, it doesn’t seem to include a scheduler, so I’d have to write one myself. Also, while it has message passing between threads, it doesn’t seem to have message passing between fibres. It also has no equivalent of Go’s select statement, so the only way to poll multiple channels would be to wait on one with a very short timeout, then wait on the next, and so on in a loop, which is rather ugly.

*Edit: D does have a fibre scheduler as part of the Vibe.d webserver, but this isn’t yet integrated into the standard library. Integration is planned, so eventually message passing between fibres should be possible in D as it is in Go.

What other languages have lightweight threads? Rust still has M:N threading, even if it’s no longer the default, but I’m not sure whether its message-passing facilities work between green threads or just between OS threads. I attempted an implementation in Rust, but it fails with a generic ‘failed while receiving on a closed channel’ error that doesn’t point to where in the code it occurred, and I ain’t got time for debugging that. Better to wait until the library and its error messages are more developed.

C# and F# don’t seem to natively support message passing between lightweight threads. Clojure has a library for it, but it seems to be an order of magnitude slower than Go. The JVM has Akka, but it’s incredibly verbose in Java, and to use it with Scala would first require learning Scala, which is for some reason a language that has never appealed to me.

Anything else? Interestingly, Haskell actually has lightweight threads and message passing, and they’re quite performant (see the benchmark linked in the previous paragraph). I attempted a Haskell implementation and amazingly succeeded, thanks to judicious use of unsafePerformIO, which is all the fun of unsafe sex with none of the risks (being able to write an `IO a -> a` function after having spent hours battering your head against the type system is pretty much the best thing ever). Yes, I’m a bad person. Anyway, it managed to match the Go implementation, handling even 360,000 threads and 70 runners with no problem. What’s more, if I compiled with -O3 and hardcoded numRunners and numThreads (as opposed to reading them from argv), it would do all the calculation at compile time, inlining the result and always producing the same answer when run.

*Edit: turns out the unsafePerformIO was completely unnecessary. The same program without unsafePerformIO can be seen here, courtesy of a helpful redditor.

Okay, the Haskell implementation must have cheated. It works with numThreads set to a billion, and there’s no way it’s actually creating a billion threads, no matter how lightweight they are.

Anyway, so Haskell can also operate with multitudes of lightweight threads; why still use Go? One reason: Haskell doesn’t have select, and it can’t even be implemented on top of the standard channels, as they offer no means of checking whether they contain data. It could be done with synchronous channels, which offer more functionality, but those failed with a build error when I attempted to cabal install them. There are also Software Transactional Memory channels, but they operate in the STM monad rather than IO, losing the ability to stick print statements everywhere for debugging (well, you can if you really want to, you might just get things printed multiple times if the transaction is retried). Finally, there are MVars, which are like channels that hold only one item and which do allow checking for emptiness; these are what I used for checking who won the race.

Another reason to use Go: reasoning about both concurrency and lazy evaluation simultaneously is more difficult than reasoning about concurrency in a strict environment. Learning Go in general is also much easier: you can teach someone Go in a week, whereas Haskell takes a lot longer to learn. Finally: while it’s possible to write extremely fast code in Haskell, idiomatic code in Go is more performant than idiomatic Haskell.

In summary: Go is great for servers because it allows you to easily build efficient servers based on message passing between lightweight threads. Its advantage over Erlang/Elixir is that it’s much faster at calculation-heavy workloads and can do shared-memory concurrency. Its advantage over Haskell is that it’s much easier to learn, more idiomatically performant, and has a nifty select statement that as far as I’m aware Haskell hasn’t implemented. It’s also more ‘batteries-included’; no need to try and figure out which of the four+ different channel libraries is most appropriate.

Its advantage over Rust is that it’s stable, with a comprehensive standard library, while Rust’s standard library is still in development. It’s also easier to write, as there’s no need to fight the borrow checker or worry about lifetimes. While Rust’s advanced linear type system allows memory safety without a garbage collector, garbage collection isn’t usually a big problem for servers, so Rust’s type system represents a potentially unnecessary overhead. Although Rust does have easier-to-use garbage-collected allocation via managed pointers, its garbage collector is currently quite primitive compared to Go’s, so Go currently makes more sense to use if one desires a garbage-collected server language with lightweight message passing concurrency. Finally, the biggest advantage to me personally of Go over Rust: no macros, meaning no need to read other people’s macros. The only macros I’ve ever liked are lisp macros, and that’s largely because they can be expanded and examined dynamically in the REPL.

Go’s advantage over Java/Scala + Akka on the JVM? Concurrency in Go is much more concise than using Akka in Java. Go has the same advantages over Scala that it has over Haskell, i.e. its extremely shallow learning curve, plus the extra advantage of significantly faster compilation. Finally, Akka uses an actor model, like Erlang, which is subtly different from the lightweight-threads-plus-channels model used by Go and Haskell, and so is not a direct substitute.

The Go and Haskell implementations, along with the incomplete Rust and Elixir implementations, are available at this repo.

Any valid criticisms of this post (of which there are undoubtedly many) are welcome either here or at the relevant reddit discussion.

I recently came across an interesting blog post in which Java performed slower than I’d have thought reasonable, around 4 times slower than C++, so I decided to try my luck at making it faster. The problem? Imagine a forest of goats, wolves and lions. “Wolves can devour goats, and lions can devour wolves and goats. (‘The stronger animal eats the weaker one’.) As this is a magic forest, a wolve, after having devoured a goat, is transmuted into a lion; a lion, after having devoured a goat, is transmuted into a wolve; and a lion having devoured a wolve becomes a goat.” A new world is generated for each possible outcome (a forest (2,2,2) would generate (3,1,1), (1,3,1) and (1,1,3)), with the same then happening in each of these new forests recursively, and so on until a ‘stable’ forest is found: one with only one kind of animal remaining.

Note: some new results have been found since this post was originally made, and are listed at the bottom.

My first impulse was to rewrite the functional Java implementation in imperative code, assuming it would be faster. Big mistake! My implementation ran in around 1.5 seconds for an input of (100,100,100), compared to 65ms for C++; an order of magnitude slower, and much slower than the functional version. To try to understand what went wrong, I translated the Java implementation into C, and it was equally slow! C is generally not (barring a really insidious compiler) an order of magnitude slower than C++, so I realised I must have made a mistake in the translation from functional to imperative. I hence decided to translate the functional code into a less-verbose language in order to understand it better, so I grabbed the D language and got busy:

Unfortunately, the D was also noticeably slower: around 200ms for input (100,100,100), compared to 65ms for C++. Assuming, for no good reason, that creating and populating the array of new forests was what was consuming the most time, I next decided to vectorise it using D’s delightful inline assembly support:

Which, unfortunately, had absolutely no effect on performance whatsoever. I was considering rewriting it to use only aligned access, but it finally occurred to me that I should probably check that this allocation actually was a bottleneck before proceeding. I ran the program through Valgrind’s callgrind tool, and lo and behold, it turned out I’d been chasing waterfalls: the allocation actually represented an extremely insignificant proportion of the runtime. The sort was taking the most time, and upon a little much-needed reflection, I realised that made perfect sense, as sorting is O(n log n), while allocating a large array is just O(n). Thanks to a clever suggestion from someone on the D forums, I then replaced the .sort.uniq with sticking everything into a hashtable and taking it out again (which should be only O(n)), and got the runtime down to around 100ms, quite competitive with C++’s 65ms.

I figured that, as O(n) beats the O(n log n) sort+std::unique algorithm that C++ was using for removing duplicates, the D version might actually be faster at higher inputs. Unfortunately, this was not the case: at (500,500,500), while the C++ implementation ran in around 7 seconds, the D version took over 25. I assumed this was due to D’s hashmap implementation generating garbage, which gave me an oh-so-brilliant idea. Java’s got the best garbage collector in the world, right? So if I translated the D implementation into Java, the hashmap garbage shouldn’t cause trouble, and the O(n) uniquifying would allow it to beat C++ (oh, and around this time I also realised why my original C and Java implementations were so slow: they were using a naive O(n²) method of removing duplicates; yes, I’m an idiot). Anyway, I fired up Java and got an implementation going using java.util.HashSet.

How’d it go? Put it this way. Imagine the pain of giving birth while being repeatedly punched in the testicles, stung by extremely venomous spiders and debugging a Boost library, convert that pain into ineffectiveness, multiply it by a thousand, and it’d still be a few orders of magnitude more effective than using HashSet to remove duplicates in Java. It literally failed with a java.lang.OutOfMemoryError on an input of around (30,30,30), while the D implementation could handle (300, 300, 300). Needless to say, it was time to go back to the proverbial drawing board.

Okay, turns out I’m even more of a shenjingbing than first imagined; I didn’t realise it, but Java defaults to reference-based equality checks when equals() and hashCode() aren’t defined for a class. This means the HashSet wasn’t working as intended; it wasn’t removing duplicates, as it didn’t recognise duplicates as duplicates: they were separate objects, even if their goats, wolves and lions values were the same. Properly implementing equals() and hashCode() makes the Java HashSet implementation slightly faster than the C++ one for high inputs, and much faster than the original functional implementation, so in that sense yaay, I did it. Thanks to JingJing23 on reddit for pointing out my mistake.
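To spell the fix out, here's a minimal sketch (not the original benchmark code, class and field names are just illustrative) of a forest class with value-based equality, so that HashSet actually deduplicates:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Minimal sketch of the fix described above: without equals()/hashCode(),
// HashSet falls back to identity comparison and never merges duplicates.
final class Forest {
    final int goats, wolves, lions;

    Forest(int goats, int wolves, int lions) {
        this.goats = goats;
        this.wolves = wolves;
        this.lions = lions;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Forest)) return false;
        Forest f = (Forest) o;
        return goats == f.goats && wolves == f.wolves && lions == f.lions;
    }

    @Override
    public int hashCode() {
        return Objects.hash(goats, wolves, lions);
    }

    public static void main(String[] args) {
        Set<Forest> set = new HashSet<>();
        set.add(new Forest(1, 2, 3));
        set.add(new Forest(1, 2, 3)); // distinct object, same value: now deduplicated
        System.out.println(set.size()); // prints 1
    }
}
```

With the default identity-based equals(), that set would have ended up with size 2.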

They say when you don’t know what to do, do what you know. So I translated the original D implementation (the 200ms one using .sort.uniq) into C, albeit using a naive quicksort instead of whatever D uses, and it ran in around 90ms, quite competitive with the C++. What’s more, its performance didn’t degrade at higher inputs like the D hashmap implementation’s did: (500,500,500) took 7s in C++, and only 9s in C. On a hunch, I decided to try changing the D version to use quicksort too, via the following elegant implementation (don’t worry, I promise this is the last D I’ll show! It’s actually not even mine; I found it on the internet).

And voilà, it was as fast as the C, and around 80% of the C++. Presumably the sorting algorithm in D’s std.algorithm is just particularly unsuited to this case; sometimes sorting algorithms are designed to guarantee good performance on worst-case input (unlike quicksort, which has O(n*n) worst case), but this introduces a constant factor that can make them slower on friendly input than the appropriately named quicksort.

Armed with this knowledge, I was certain I’d finally be able to give the Java implementation the speedup I’d been searching for. I thus set about translating the C implementation into Java, and by translating I mean transliterating. I capitalised on the fact that an array of struct{int, int, int} is equivalent to an array of ints, in order to get past Java’s lack of value types, and changed various functions to take and reuse buffers, in the end managing to remove ‘new’ from the main loop completely. My miraculously ugly Java quicksort:
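(The original "miraculously ugly" listing doesn't survive here, so what follows is a hedged reconstruction of the idea rather than the exact code: each forest is three consecutive ints in one flat array, and the quicksort moves 3-int records around, comparing them lexicographically. In this simplified variant the pivot record simply stays in place at the end of the subrange, rather than being stashed in the array's last three slots as the original did.)

```java
import java.util.Arrays;

// Sketch of the flattened-struct approach: each forest occupies three
// consecutive ints, indices into the array play the role of pointers,
// and no objects are allocated anywhere in the sort.
final class FlatQuickSort {
    // Compare the 3-int record starting at index a with the one at index b.
    static boolean forLessThan(int[] fs, int a, int b) {
        if (fs[a] != fs[b]) return fs[a] < fs[b];
        if (fs[a + 1] != fs[b + 1]) return fs[a + 1] < fs[b + 1];
        return fs[a + 2] < fs[b + 2];
    }

    static void swap(int[] fs, int a, int b) {
        for (int k = 0; k < 3; k++) {
            int t = fs[a + k]; fs[a + k] = fs[b + k]; fs[b + k] = t;
        }
    }

    // Sort the records starting at lo..hi; indices are in units of ints
    // and must be multiples of 3 (Lomuto partition, last record as pivot).
    static void quickSort(int[] fs, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = hi;
        int store = lo;
        for (int i = lo; i < hi; i += 3) {
            if (forLessThan(fs, i, pivot)) {
                swap(fs, i, store);
                store += 3;
            }
        }
        swap(fs, store, pivot);
        quickSort(fs, lo, store - 3);
        quickSort(fs, store + 3, hi);
    }

    public static void main(String[] args) {
        int[] forests = {2,0,0, 1,1,1, 1,0,2, 0,3,3};
        quickSort(forests, 0, forests.length - 3);
        System.out.println(Arrays.toString(forests));
    }
}
```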

Note how it uses integers like pointers, as the array of forests is actually just an array of ints. Also note how it stores the pivot’s goat, wolf and lion values in the final 3 slots of the array, as the ‘forLessThan’ function can only compare two parts of the same array. If you’re thinking this could lead to problems if the final 3 slots contained a valid forest, you’re right, but at that point there’d be an extremely high probability of overflowing the array and getting an ArrayIndexOutOfBoundsException anyway.

How’d it go? In spite of having essentially written C in Java, it was actually slightly slower than the original functional implementation from the original blog! So a big congratulations to the writers of java.util.stream. I’ve no idea how distinct() works, but it certainly kicked my ass.

For fun, I also wrote a few other implementations. One in Haskell, which was surprisingly fast considering I just stuck everything into Data.Set and took it out again to remove duplicates. It ran in around 350ms for input (100,100,100), however it slowed down at higher inputs, presumably due to garbage or hash collisions.

I’d be interested to see how fast one using sort and removing consecutive duplicates would be; I attempted one myself, but it was rather slow, as I lacked the monad-foo to write an in-place partitioning function for removing duplicates.

Minor gripe: why doesn’t Haskell support generic functions on collections? Why do I have to use Set.fromList for a set but Vector.fromList for a vector? Why isn’t there a generic map function? D seems to handle this fine, via ranges, and even C++ manages it via std::algorithm.

I also wrote a Go version, but it’s slower than I’d have expected, presumably because the Go standard library’s sorting algorithm falls back to heapsort “if depth of 2*ceil(lg(n+1)) is reached”. It also generates more garbage than the D version, as D’s creator Walter Bright recently did some much-needed work on removing allocations from D’s std.algorithm. So a big thank you to Walter for ironing out that wart on the language!

I wrote a Common Lisp version too, but it’s rather slow as I was too lazy to write my own unique function, and the one in the standard library (‘remove-duplicates’) uses the naive O(n*n) algorithm for some reason. I’d be quite excited to see how fast it is if someone wants to write up an improved version!

Finally, I attempted a Rust version, and although I wasn’t able to make it functional as the iterator syntax got the better of me, I did get an imperative version working, and it ran at around the same speed as the C and D versions.

I had one more idea for optimising the Java version. If I could create an ideal hashset, with O(n) time for uniquifying and minimal garbage generation, I should be able to beat the C++ implementation’s O(n log n). A simple way to create a perfect hash (unique for any forest) is to use the first n bits of the goats, wolves and lions values of a forest, as follows:
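(The original snippet isn't preserved above, so here's a hedged Java reconstruction of the hash just described; BITS_FOR_HASH and the use of '<<=' are taken from the surrounding prose, the rest is illustrative.)

```java
// Hedged reconstruction of the perfect hash described in the text: pack the
// low BITS_FOR_HASH bits of each animal count into a single integer, so
// every distinct (goats, wolves, lions) triple gets a distinct hash.
final class ForestHash {
    static final int BITS_FOR_HASH = 10; // must be < 11 for the hash to fit in 32 bits

    static int hash(int goats, int wolves, int lions) {
        int h = goats;
        h <<= BITS_FOR_HASH;  // '<<=' shifts left in place, as the text notes
        h |= wolves;
        h <<= BITS_FOR_HASH;
        h |= lions;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash(1, 2, 3)); // (1<<20) | (2<<10) | 3 = 1050627
    }
}
```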

This produces a maximum value MAX of 1<<(BITS_FOR_HASH*3), meaning if BITS_FOR_HASH is less than eleven it will fit in a uint (just to clarify for any Haskell programmers reading this, that ‘<<=’ is to bitshift left as ‘+=’ is to ‘+’; it’s not a bind). To guarantee O(1) insertion and removal (no collisions), a simple if super space-inefficient solution is to use a bitset of size MAX, as follows:
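(Again, the original D and Java listings aren't reproduced here, so this is a hedged Java sketch of the bitset deduplication, using java.util.BitSet; the function shape is illustrative.)

```java
import java.util.BitSet;

// Sketch of the bitset deduplication described above: one bit per possible
// forest, so membership tests are O(1) with no collisions.
final class BitsetDedup {
    static final int BITS_FOR_HASH = 10;
    static final int MAX = 1 << (BITS_FOR_HASH * 3); // 2^30 possible hashes

    // Keep only the first occurrence of each (goats, wolves, lions) triple.
    // forests is a flat array of 3-int records; returns the new length in ints.
    static int dedup(int[] forests, int len, BitSet seen) {
        int out = 0;
        for (int i = 0; i < len; i += 3) {
            // Same packing as the perfect hash described in the text.
            int h = (((forests[i] << BITS_FOR_HASH) | forests[i + 1])
                     << BITS_FOR_HASH) | forests[i + 2];
            if (!seen.get(h)) {
                seen.set(h);
                forests[out] = forests[i];
                forests[out + 1] = forests[i + 1];
                forests[out + 2] = forests[i + 2];
                out += 3;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] fs = {1,2,3, 1,2,3, 0,0,1};
        // The benchmark sized the bitset to MAX up front (hundreds of
        // megabytes); here we let it grow on demand for the small demo.
        BitSet seen = new BitSet();
        int n = dedup(fs, fs.length, seen);
        System.out.println(n / 3); // prints 2 unique forests
    }
}
```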

While incredibly slow for small inputs (as it’s allocating and deallocating hundreds of megabytes every call), for an input of (500,500,500) it only took around 20 seconds, just slightly less than half the speed of the quickSort+uniq implementation. I suspect that with even higher inputs it could well prove to be faster, however inputs greater than (512,512,512) are not possible, as the maximum goat/wolf/lion value that can be hashed with BITS_FOR_HASH < 11 is 1024. Using a BITS_FOR_HASH of 11 or larger wouldn’t translate into the Java version, as the maximum hash would exceed 2^32 and there’s no way to index into a Java BitSet with a long. The only option would be to write my own Java implementation of a long-indexable bitset, and ain’t nobody got time for that, so I admit defeat. java.util.stream: 1. Logicchains: 0.

Moral of the story: profile before attempting to optimise, don’t assume functional programming necessarily performs poorly, and give the D language a try sometime; while it may have some wrinkles, it can nevertheless be quite powerful and elegant.

Also, sometimes it pays to implement your own sorting function if you know the input won’t be worst-case, as demonstrated here by quicksort being twice as fast as D’s standard sort. (Or, conversely, maybe D needs to improve its default sorting algorithm for this kind of input, as C++’s and Rust’s weren’t so slow.)

Edit: The Java HashSet implementation of the problem was buggy, in that I didn’t realise I had to manually implement hashCode() and equals() on the forest class. Fixing that made it actually faster than the C++ at higher inputs, so yay, optimisation complete!

Also, changing the D to use the default stable sort (like the C++) actually made it as fast as the C++ for higher inputs (such as 500,500,500). Stable sorting is faster in this case than unstable sorting, possibly because the input is already partially sorted (forests generate similar forests nearby to themselves in the array).

Generic functions on collections are possible in Haskell, via type classes; they’re just not shown in the tutorials, as they can produce error messages that are less comprehensible to beginners.

Finally, there’s a superior algorithmic implementation that runs in linear time, described here.

*Edit – For clarity, I’ve decided to just include the most up-to-date, fair and relevant results tables at the top here. Scroll down below to read more about the optimisation process and the effectiveness of various different implementations.

NumTrades = 150

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 0.0944 | 248 | 100.00 | 352224 | 757 | 60 | 1502 | 9 |
| Clojure | – | 0.0000 | 607 | 40.86 | 893908 | 976 | 79 | 2814 | 0 |
| Java3 | javac | 0.8451 | 624 | 39.74 | 826964 | 864 | 119 | 3386 | 2 |
| Scala | – | 0.0000 | 737 | 33.65 | 751168 | 787 | 57 | 1719 | 0 |
| Common Lisp | – | 0.0000 | 1152 | 21.53 | 902192 | 715 | 44 | 2103 | 0 |
| Python2 (Pypy)2 | – | 0.0000 | 1716 | 14.45 | 1382304 | 609 | 42 | 1382 | 0 |
| C#3 | mcs | 0.7248 | 2099 | 11.82 | 727712 | 893 | 114 | 3268 | 4 |
| F#3 | fsharpc | 4.2022 | 2126 | 11.67 | 731808 | 660 | 39 | 1541 | 12 |
| Racket | raco | 17.1353 | 2150 | 11.53 | 875424 | 691 | 38 | 1959 | 11142 |
| Go3 | go | 0.3930 | 2516 | 9.86 | 352952 | 656 | 62 | 1313 | 2799 |
| Ruby1 | – | 0.0000 | 6122 | 4.05 | 1070024 | 636 | 52 | 1266 | 0 |
| JuliaBigNum | – | 0.0000 | 8932 | 2.78 | 2526800 | 565 | 52 | 1227 | 0 |
| Tcl | – | 0.0000 | 18754 | 1.32 | 1702460 | 543 | 45 | 1130 | 0 |

NumTrades = 100

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 3.0596 | 210 | 100.00 | 235036 | 757 | 60 | 1502 | 9 |
| Clojure | – | 0.0000 | 295 | 71.19 | 706660 | 976 | 79 | 2814 | 0 |
| Java3 | javac | 2.8895 | 421 | 49.88 | 561296 | 864 | 119 | 3386 | 2 |
| Scala | – | 0.0000 | 478 | 43.93 | 479916 | 787 | 57 | 1719 | 0 |
| Common Lisp | – | 0.0000 | 644 | 32.61 | 866428 | 715 | 44 | 2103 | 0 |
| Python2 (Pypy)2 | – | 0.0000 | 1036 | 20.27 | 955904 | 609 | 42 | 1382 | 0 |
| F#3 | fsharpc | 7.6323 | 1082 | 19.41 | 492912 | 660 | 39 | 1541 | 12 |
| C#3 | mcs | 1.7726 | 1144 | 18.36 | 488812 | 893 | 114 | 3268 | 4 |
| Racket | raco | 26.5090 | 1208 | 17.38 | 639700 | 691 | 38 | 1959 | 11142 |
| Go3 | go | 2.8093 | 1921 | 10.93 | 235768 | 656 | 62 | 1313 | 2799 |
| Ruby1 | – | 0.0000 | 3045 | 6.90 | 675208 | 636 | 52 | 1266 | 0 |
| Lua | – | 0.0000 | 3780 | 5.56 | 2202448 | 576 | 53 | 1305 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 5545 | 3.79 | 611088 | 712 | 76 | 1641 | 0 |
| JuliaBigNum | – | 0.0000 | 7836 | 2.68 | 1012580 | 565 | 52 | 1227 | 0 |
| Tcl | – | 0.0000 | 9865 | 2.13 | 1135652 | 543 | 45 | 1130 | 0 |

NumTrades = 50

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 3.6418 | 81 | 100.00 | 117848 | 757 | 60 | 1502 | 8 |
| Common Lisp | – | 0.0000 | 84 | 96.43 | 324988 | 715 | 44 | 2103 | 0 |
| Clojure | – | 0.0000 | 117 | 69.23 | 403120 | 976 | 79 | 2814 | 0 |
| Racket | raco | 20.1401 | 132 | 61.36 | 373368 | 691 | 38 | 1959 | 11142 |
| Java3 | javac | 2.8739 | 176 | 46.02 | 318364 | 864 | 119 | 3386 | 2 |
| Scala | – | 0.0000 | 213 | 38.03 | 276620 | 787 | 57 | 1719 | 0 |
| LuaJit | – | 0.0000 | 230 | 35.22 | 649116 | 576 | 53 | 1305 | 0 |
| F#3 | fsharpc | 5.3502 | 381 | 21.26 | 254008 | 660 | 39 | 1541 | 12 |
| C#3 | mcs | 1.4964 | 389 | 20.82 | 249920 | 893 | 114 | 3268 | 4 |
| Python2 (Pypy)2 | – | 0.0000 | 428 | 18.93 | 482480 | 609 | 42 | 1382 | 0 |
| Go3 | go | 2.3946 | 814 | 9.95 | 118576 | 656 | 62 | 1313 | 2799 |
| Ruby1 | – | 0.0000 | 1435 | 5.64 | 301384 | 636 | 52 | 1266 | 0 |
| Lua | – | 0.0000 | 1720 | 4.71 | 1101764 | 576 | 53 | 1305 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 1825 | 4.44 | 311132 | 712 | 76 | 1641 | 0 |
| JuliaBigNum | – | 0.0000 | 2424 | 3.34 | 907624 | 565 | 52 | 1227 | 0 |
| Python3 | – | 0.0000 | 2425 | 3.34 | 1338756 | 497 | 32 | 1126 | 0 |
| Tcl | – | 0.0000 | 4102 | 1.97 | 569216 | 543 | 45 | 1130 | 0 |

NumTrades = 10

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| Common Lisp | – | 0.0000 | 16 | 100.00 | 111068 | 715 | 44 | 2103 | 0 |
| C3 | clang | 0.0966 | 16 | 100.00 | 24096 | 757 | 60 | 1502 | 9 |
| Racket | raco | 17.0957 | 26 | 61.54 | 190028 | 691 | 38 | 1959 | 11142 |
| Clojure | – | 0.0000 | 27 | 59.26 | 121268 | 976 | 79 | 2814 | 0 |
| Java3 | javac | 0.8250 | 39 | 41.03 | 96224 | 864 | 119 | 3386 | 2 |
| LuaJit | – | 0.0000 | 40 | 40.00 | 128036 | 576 | 53 | 1305 | 0 |
| Scala | – | 0.0000 | 41 | 39.02 | 148820 | 787 | 57 | 1719 | 0 |
| F#3 | fsharpc | 4.1270 | 63 | 25.40 | 62920 | 660 | 39 | 1541 | 12 |
| C#3 | mcs | 0.7194 | 64 | 25.00 | 58820 | 893 | 114 | 3268 | 4 |
| Python2 (Pypy)2 | – | 0.0000 | 85 | 18.82 | 113608 | 609 | 42 | 1382 | 0 |
| Go3 | go | 0.5028 | 158 | 10.13 | 24828 | 656 | 62 | 1313 | 2799 |
| JuliaBigNum | – | 0.0000 | 273 | 5.86 | 316656 | 565 | 52 | 1227 | 0 |
| Ruby1 | – | 0.0000 | 286 | 5.59 | 65244 | 636 | 52 | 1266 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 324 | 4.94 | 66276 | 712 | 76 | 1641 | 0 |
| Lua | – | 0.0000 | 370 | 4.32 | 216136 | 576 | 53 | 1305 | 0 |
| Python3 | – | 0.0000 | 485 | 3.30 | 273000 | 497 | 32 | 1126 | 0 |
| Tcl | – | 0.0000 | 813 | 1.97 | 115772 | 543 | 45 | 1130 | 0 |

I recently came across an interesting blog post that dealt with optimising a common situation in which the JVM performed surprisingly poorly, which prompted me to try the microbenchmark in a few other languages to see how they fared. The code for this benchmark can be found here.

The problem: iterating over an array of objects, modifying and reading its contents. What’s interesting about it? Firstly, as this is a relatively common task, the results give a tiny bit of insight into the ‘general’ speed of the languages tested. Secondly, it’s directly relevant to the effectiveness of object pooling, a technique I’m using in the other benchmark I’m working on. The idea of an object pool is to statically allocate the needed objects at initialisation and reuse them instead of creating new objects, so as to avoid generating garbage. The effectiveness of this technique would however be limited if it were impossible to access and modify objects in the pool without generating garbage.
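As a sketch of the pooling idea (the class and field names here are illustrative, not taken from the benchmark code):

```java
// Minimal sketch of an object pool: all Trade objects are allocated once,
// up front, and then reused by overwriting their fields in place.
final class TradePool {
    static final class Trade {
        long tradeId;
        int price;
        int quantity;
    }

    private final Trade[] pool;

    TradePool(int size) {
        pool = new Trade[size];
        for (int i = 0; i < size; i++) {
            pool[i] = new Trade(); // allocate everything at initialisation
        }
    }

    // Reuse slot i instead of allocating a new Trade; as long as reading
    // and writing these fields doesn't box or copy, no garbage is created.
    Trade fill(int i, long id, int price, int quantity) {
        Trade t = pool[i];
        t.tradeId = id;
        t.price = price;
        t.quantity = quantity;
        return t;
    }
}
```

The benchmark's question is essentially whether a language lets fill() and the subsequent reads run without hidden allocations.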

Why are some of the usual suspects (D, Rust, Nimrod, C++) missing? They use the same backend as C so they’d produce results of similar speed, and their metaprogramming facilities are powerful enough that it’d be easy to ‘cheat’ and do all the calculation at compile time. If somebody wants to submit an implementation in one of these languages they are however quite welcome, as long as it doesn’t ‘cheat’.

Now, the results (but don’t neglect to read the rest of the post, as there’s a twist at the end). The value ‘NumTrades’ multiplied by 50,000 is the number of objects in the array of trade objects. Note that if a particular language/implementation is missing from a certain NumTrades value’s results, this means that it ran out of memory or was otherwise killed by the OS. The tests were run on a dual-core 64-bit 1.6GHz Intel i5 with 4GB of RAM, running Linux Mint 15 64-bit.

*Edit: It should be noted that this set of graphs tests the form of allocation used in the blog post that inspired this post. It is however suboptimal, and the set of graphs further below involves implementations using an optimal method of allocation/object pooling. This first set of graphs is just provided for interesting comparison, to show how a small change can lead to a significant speedup. Also note that the Ruby, Smalltalk and Lua implementations in this graph use the faster form of allocation, as they ran out of memory and died too soon without it. It is hence fairer to compare them with the implementations suffixed with 2, in the later set of graphs. In fact, these graphs are partially just here as part of explaining the optimisation process; if you’re interested in the fairest numbers, scroll to the final set of graphs.

NumTrades = 150

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| x86-64 assembly | fasm | 0.0318 | 110 | 100.00 | 300524 | 2279 | 180 | 5840 | 8 |
| Go | go | 0.3103 | 124 | 88.71 | 352484 | 614 | 61 | 1206 | 2256 |
| C | clang | 0.0737 | 127 | 86.61 | 352044 | 712 | 56 | 1425 | 9 |
| JavaUnSafe | javac | 1.1290 | 203 | 54.19 | 326036 | 1182 | 156 | 5148 | 2 |
| CLispFixNum | – | 0.0000 | 228 | 48.24 | 810536 | 745 | 44 | 2152 | 0 |
| Julia | – | 0.0000 | 298 | 36.91 | 609564 | 552 | 52 | 1228 | 0 |
| Common Lisp | – | 0.0000 | 1120 | 9.82 | 902192 | 715 | 44 | 2103 | 0 |
| Java | javac | 2.1709 | 1575 | 6.98 | 918820 | 819 | 121 | 3329 | 1 |
| C# | mcs | 0.6872 | 1587 | 6.93 | 999892 | 878 | 115 | 3301 | 4 |
| Racket | raco | 16.8814 | 1955 | 5.63 | 877364 | 691 | 38 | 1959 | 11142 |
| F# | fsharpc | 4.2951 | 5761 | 1.91 | 1452868 | 646 | 40 | 1526 | 12 |
| Python2 (Pypy) | – | 0.0000 | 9489 | 1.16 | 1519092 | 561 | 38 | 1192 | 0 |
| Tcl | – | 0.0000 | 14486 | 0.76 | 1702460 | 544 | 46 | 1170 | 0 |

NumTrades = 100

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| x86-64 assembly | fasm | 0.3159 | 70 | 100.00 | 200468 | 2279 | 180 | 5840 | 8 |
| Go | go | 0.3185 | 82 | 85.37 | 235300 | 614 | 61 | 1206 | 2256 |
| C | clang | 1.8142 | 84 | 83.33 | 234856 | 722 | 56 | 1435 | 9 |
| JavaUnSafe | javac | 1.1487 | 135 | 51.85 | 223596 | 1182 | 156 | 5148 | 2 |
| CLispFixNum | – | 0.0000 | 152 | 46.05 | 594160 | 745 | 44 | 2152 | 0 |
| Julia | – | 0.0000 | 200 | 35.00 | 415212 | 552 | 52 | 1228 | 0 |
| Common Lisp | – | 0.0000 | 656 | 10.67 | 866428 | 715 | 44 | 2103 | 0 |
| C# | mcs | 0.7074 | 953 | 7.35 | 966208 | 890 | 116 | 3320 | 4 |
| Java | javac | 2.7260 | 1052 | 6.65 | 651904 | 819 | 121 | 3329 | 1 |
| Racket | raco | 20.3636 | 1237 | 5.66 | 639704 | 691 | 38 | 1959 | 11142 |
| F# | fsharpc | 6.2570 | 1393 | 5.03 | 976720 | 646 | 40 | 1526 | 12 |
| Python2 (Pypy) | – | 0.0000 | 2234 | 3.13 | 1183108 | 561 | 38 | 1192 | 0 |
| Lua | – | 0.0000 | 4580 | 1.53 | 2202468 | 554 | 52 | 1208 | 0 |
| Ruby | – | 0.0000 | 4753 | 1.47 | 2233092 | 605 | 56 | 1250 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 5313 | 1.32 | 611088 | 712 | 76 | 1641 | 0 |
| Tcl | – | 0.0000 | 9279 | 0.75 | 1135660 | 544 | 46 | 1170 | 0 |
| Python3 | – | 0.0000 | 10654 | 0.66 | 1582084 | 607 | 39 | 1268 | 0 |
| Clojure | – | 0.0000 | 86824 | 0.08 | 994700 | 960 | 53 | 2811 | 0 |

NumTrades = 50

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| Go | go | 1.8760 | 40 | 100.00 | 118112 | 614 | 61 | 1206 | 2256 |
| x86-64 assembly | fasm | 0.6947 | 40 | 100.00 | 100416 | 2279 | 180 | 5840 | 8 |
| C | clang | 1.6018 | 43 | 93.02 | 117668 | 722 | 56 | 1435 | 7 |
| JavaUnSafe | javac | 1.2099 | 73 | 54.79 | 121152 | 1182 | 156 | 5148 | 2 |
| Common Lisp | – | 0.0000 | 96 | 41.67 | 324988 | 715 | 44 | 2103 | 0 |
| Julia | – | 0.0000 | 99 | 40.40 | 271700 | 552 | 52 | 1228 | 0 |
| Racket | raco | 21.5142 | 130 | 30.77 | 373364 | 691 | 38 | 1959 | 11142 |
| LuaJit | – | 0.0000 | 230 | 17.39 | 649136 | 554 | 52 | 1208 | 0 |
| C# | mcs | 1.5198 | 460 | 8.70 | 488508 | 890 | 116 | 3320 | 4 |
| F# | fsharpc | 5.7907 | 480 | 8.33 | 455608 | 646 | 40 | 1526 | 12 |
| Java | javac | 3.0675 | 677 | 5.91 | 493744 | 819 | 121 | 3329 | 1 |
| Python2 (Pypy) | – | 0.0000 | 1010 | 3.96 | 620820 | 561 | 38 | 1192 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 1835 | 2.18 | 311136 | 712 | 76 | 1641 | 0 |
| Lua | – | 0.0000 | 2150 | 1.86 | 1101764 | 554 | 52 | 1208 | 0 |
| Ruby | – | 0.0000 | 2197 | 1.82 | 1046204 | 605 | 56 | 1250 | 0 |
| Python3_fst | – | 0.0000 | 2460 | 1.62 | 1338748 | 497 | 32 | 1127 | 0 |
| Tcl | – | 0.0000 | 4121 | 0.97 | 569212 | 544 | 46 | 1170 | 0 |
| Python3 | – | 0.0000 | 4951 | 0.81 | 794376 | 607 | 39 | 1268 | 0 |
| Clojure | – | 0.0000 | 10246 | 0.39 | 858024 | 960 | 53 | 2811 | 0 |

NumTrades = 10

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| Go | go | 1.6017 | 8 | 100.00 | 24360 | 614 | 61 | 1206 | 2256 |
| C | clang | 1.6518 | 8 | 100.00 | 23920 | 722 | 56 | 1435 | 9 |
| Java | javac | 1.1308 | 13 | 61.54 | 117792 | 819 | 121 | 3329 | 1 |
| JavaUnSafe | javac | 1.0829 | 14 | 57.14 | 39256 | 1182 | 156 | 5148 | 2 |
| Common Lisp | – | 0.0000 | 16 | 50.00 | 111068 | 715 | 44 | 2103 | 0 |
| Julia | – | 0.0000 | 20 | 40.00 | 164340 | 552 | 52 | 1228 | 0 |
| Racket | raco | 17.0663 | 26 | 30.77 | 190052 | 691 | 38 | 1959 | 11142 |
| LuaJit | – | 0.0000 | 50 | 16.00 | 128032 | 554 | 52 | 1208 | 0 |
| C# | mcs | 0.6989 | 67 | 11.94 | 106228 | 890 | 116 | 3320 | 4 |
| F# | fsharpc | 4.4070 | 83 | 9.64 | 116864 | 646 | 40 | 1526 | 12 |
| Python2 (Pypy) | – | 0.0000 | 203 | 3.94 | 142012 | 561 | 38 | 1192 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 324 | 2.47 | 66276 | 712 | 76 | 1641 | 0 |
| Ruby | – | 0.0000 | 448 | 1.79 | 212004 | 605 | 56 | 1250 | 0 |
| Lua | – | 0.0000 | 450 | 1.78 | 216132 | 554 | 52 | 1208 | 0 |
| FasterPython | – | 0.0000 | 484 | 1.65 | 273000 | 497 | 32 | 1127 | 0 |
| Tcl | – | 0.0000 | 821 | 0.97 | 115768 | 544 | 46 | 1170 | 0 |
| Python3 | – | 0.0000 | 1017 | 0.79 | 163864 | 607 | 39 | 1268 | 0 |
| Clojure | – | 0.0000 | 1121 | 0.71 | 322328 | 960 | 53 | 2811 | 0 |

Take the results for Python, Ruby and Julia with a grain of salt, as I’m less experienced with these languages and there are probably optimisations that I’ve missed. Suggestions are welcome! (Note also that the rightmost column of each table is the size of the resulting executable.)

*Edit: The Ruby and Julia versions have been improved since the benchmark was originally run, and now perform quite impressively. A version that’s faster in CPython (but not in Pypy) has also been added, as Python3_fst.

Points of interest:

Pypy – Significantly faster than standard Python 3 (CPython); quite impressive. For such an otherwise elegant language, however, writing constructors in Python is surprisingly awkward. Or perhaps it’d be more accurate to say that the fact one even has to write constructors in the first place seems odd; is there a default constructor I’m not aware of, a more concise way to write the following? *edit: there is indeed a better way to write this. I need to learn more Python.

C# – Faster than Java, and uses less memory. An interesting note: I originally ran the benchmark with the older Mono 2.1 runtime, before having to upgrade to 3.2 to get F# to compile, and the C# implementation actually ran faster there. Presumably the old garbage collector was better suited to this case, whereas maybe the new one is more like Java’s.

x86-64 assembly – Uses less memory, courtesy of struct packing. Normally languages pad structs, to align memory such that each memory access of a particular type is at an address that’s a multiple of that type. So an int32 can be accessed at 4, 8, 12, etc, whereas an int64 can only be accessed at 8, 16, 24 etc. The trade struct in this benchmark is long, long, int, int, long, long, byte; 8, 8, 4, 4, 8, 8, 1, or 41 bytes total. If sticking these in an array, 41 is not a multiple of 8, so 7 bytes of padding are necessary at the end of each struct so that the long at the start of the next one is aligned. As assembly doesn’t have structs, just byte arrays, it’s possible to avoid this padding. Note that while it’s faster in this case, unaligned access is generally thought to be slower, and on non-X86 architectures can cause the program to segfault. Also, you can do the same in C with compiler extensions, and that’s generally considered a much better way to achieve unaligned datastructures than writing the whole program in assembly.
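On the JVM you can't reorder or pack object fields, but the same unpadded 41-byte layout can be emulated by hand with a flat buffer, which is roughly the spirit of what the JavaUnSafe implementation does with raw memory. Here's a hedged sketch using java.nio.ByteBuffer; the field names are illustrative, and the offsets follow the long, long, int, int, long, long, byte layout from the text:

```java
import java.nio.ByteBuffer;

// Sketch: packed 41-byte trade records in a flat byte buffer, avoiding the
// 7 bytes of per-record padding an aligned struct array would need.
// Offsets: tradeId 0, clientId 8, price 16, quantity 20, longs at 24 and 32,
// side byte at 40.
final class PackedTrades {
    static final int RECORD_SIZE = 8 + 8 + 4 + 4 + 8 + 8 + 1; // 41 bytes, unpadded

    final ByteBuffer buf;

    PackedTrades(int count) {
        buf = ByteBuffer.allocate(count * RECORD_SIZE);
    }

    void setTradeId(int i, long id) { buf.putLong(i * RECORD_SIZE, id); }
    long getTradeId(int i)          { return buf.getLong(i * RECORD_SIZE); }
    void setPrice(int i, int p)     { buf.putInt(i * RECORD_SIZE + 16, p); }
    int  getPrice(int i)            { return buf.getInt(i * RECORD_SIZE + 16); }
    void setSide(int i, byte s)     { buf.put(i * RECORD_SIZE + 40, s); }
    byte getSide(int i)             { return buf.get(i * RECORD_SIZE + 40); }
}
```

As with the assembly version, the records here start at unaligned addresses, so the usual caveats about unaligned access apply.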

Note that I’m a pretty mediocre x86 assembly programmer; it might well be faster if written by someone more experienced. Pull requests are welcome. Also note that if you exclude the printing code (I couldn’t get printf to work without segfaulting, so I copy-pasted an integer printing routine; take a look, printing integers is actually ridiculously slow and complicated compared to printing strings), the assembly code is still more concise than its Java equivalent. Heck, going by plain linecount the entire assembly file, including the ridiculous integer printing routine, is 225 lines, only around 15% more than the unsafe Java code. To be fair, however, I should note that the Java code does include numerous setters and getters, which aren’t strictly necessary. For those who aren’t familiar with assembly, here’s an example of setting the values in a struct, where rcb is the address of the struct and rbx is the value you want to set tradeId, price and quantity to (please don’t hesitate to let me know if you spot any mistakes, as I’m quite confident the x86-64 implementation doesn’t function completely as intended):

Go – Almost matches C, quite impressive considering how many more man-hours LLVM has had put into it than the Go compiler. Why is it so much faster than C# and Java? Maybe because it allows me to pointer-chase, manipulating data in the array directly without generating any temporary objects, and hence minimising garbage creation. Something I learned: on x86-64, Go’s int is 64-bit; I initially made the mistake of assuming it was 32-bit like a C int.

Common Lisp (SBCL) – Incredibly impressive speed for a dynamic language. Note in particular that I didn’t even add type annotations, as my attempts at adding them didn’t seem to speed the implementation up at all. The speed drops as NumTrades increases, however, and profiling with time() reveals that garbage collection takes up a significant (60%+) portion of the running time. This means that either the compiler isn’t able to convert:

or that it’s unable to stack allocate trade-ref so as to avoid it becoming heap garbage. That, or its bignum implementation generates more garbage. The same applies to the Racket compiler and the JVM.

*Edit: Thanks to type annotations it’s possible to make the Common Lisp implementation perform even faster, via avoiding bignums and overflowing instead as C, Go and Java do. This result is titled CLispFixnum.

Pharo – Sure it’s slow, but it’s had far fewer development hours put into it than any of the other languages used. It was however one of the funnest implementations to write, and also the most concise. Wait, you say, Pharo’s not in the list? I converted the program to GNU Smalltalk, as that’s easier to run from the command line and around 50% faster, but it was developed in Pharo. It’s also not too slow compared to CPython. If I was given the resources to write my own language (tangent incoming; feel free to skip the next two paragraphs)…

it would be a statically typed Smalltalk (or maybe something closer to Self) with type inference, a focus on referential transparency, and incremental compilation. If the methods of an immutable object only messaged other immutable objects (i.e. were pure functions), it’d be possible to compile the whole object into an optimised block that was potentially as fast as the equivalent C code. The compiler would be a first-class citizen, so it’d also be possible to compile non-referentially-transparent blocks with Compiler compile: myBlock. Track would be kept of all the objects that such a block messaged and its assumed responses from them, and if any were changed then the compiled block would become invalid and revert to interpreted code. The file system would also be a first-class citizen; compiled blocks could be saved as executables that took their messages as flags (./myblock doLotsOfMathsWithInput: someNum) and outputted the result to stdout, which the runtime could then capture. Block data would be stored in metadata files accompanying the executables, and the filesystem could be used as if it were a Smalltalk image. Note that this would allow easy parallelism via the OS process scheduler, a bit like Erlang. It would also allow for runtime code specialisation, as described in this link, making it faster than C in some cases. For instance, the following function:

int myFun(int a, int b, int c) { return a*b + c; }

can be compiled to just return 12 + c if a and b are known at compile time to be 3 and 4, which is much faster. If these are runtime constants, however (Java final), while it should still be possible to perform the same optimisation, in practice the JVM doesn’t (and I don’t think the CLR does either). A language that took advantage of this kind of optimisation could be faster than C for problems where there’s a lot of information only known at runtime that doesn’t change once it’s loaded.

Racket – Very enjoyable to write. It was originally much faster, before I fixed a weird bug I was having. Can you spot the bug in the following code?

If not, neither did I. I couldn’t understand why my attempts to mutate the objects in the vector were producing weird results, especially when equivalent code in Common Lisp worked fine. Even though half the trades should have had a side value of ‘S’ and half ‘B’, for some reason after the vector was initialised they all had ‘S’, even though my initialisation code had been checked and double-checked. Then I realised: for every other value too, all the objects in the vector shared the same value, the value of the final object in the vector. Why was this? Answer: all the objects in the vector pointed to the same object. Changing the array creation code to the following:

fixed the problem. It also slowed the code down significantly. It does however suggest a solution to the problem of poor performance due to the lack of cache locality of *obj arrays compared to obj arrays: have all the *obj point to the same object, as then it’ll always be in the cache…
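The same trap exists in most languages: initialise an array with an expression that's only evaluated once, and every slot ends up aliasing one object. As a hedged Java analogue of the bug (not the Racket code, and the Trade class here is purely illustrative):

```java
import java.util.Arrays;

// Java analogue of the aliasing bug described above: Arrays.fill evaluates
// `new Trade()` once, so every element points at the same object, and
// mutating "one" trade mutates them all.
final class AliasingBug {
    static final class Trade { char side; }

    public static void main(String[] args) {
        Trade[] trades = new Trade[4];
        Arrays.fill(trades, new Trade());   // BUG: one shared object
        trades[0].side = 'S';
        System.out.println(trades[3].side); // prints S

        // The fix, like the corrected Racket code: allocate per slot.
        for (int i = 0; i < trades.length; i++) trades[i] = new Trade();
        trades[0].side = 'S';
        System.out.println(trades[3].side == 'S'); // prints false
    }
}
```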

Lua – Luajit really is quite impressive, achieving a 10x speedup over the standard Lua interpreter, however still falling behind Racket and Common Lisp. Interestingly, although Luajit appears to use less heap, plain Lua actually handles conditions of high memory use better: the reason there is no Luajit implementation listed for numTrades values of 100 and 150 is because it died with an out of memory error while running them (the benchmarking script will automatically exclude languages that don’t compile or run properly).

Self – What’s Self? It’s a research language that’s like a more-pure form of Smalltalk. It was extremely awesome to write, but I didn’t include it as it ran out of memory on NumTrades values of less than 10 (and took over a second to run at NumTrades=10). Nevertheless I think it’s worth checking out for fans of Smalltalk as it does an even better job of ‘pure OO’, and is extremely elegant about it. My original implementation in Self was only around 20 lines of code, although there’s a few more now as I wasn’t sure how to concatenate strings for printing so I printed them the ugly way.

F# – A neat language, and generally fun to write. Not quite as fast as C#, although it was originally the fastest language. Why? Firstly, I made the same mistake as in Racket, initialising an array with a function that ended up filling it with references to the same object rather than different objects. After I fixed that, however, it was still really fast. This was because I didn’t realise that parameter-less functions have to be called with a () at the end; a function with one parameter can be called as ‘myFun param’, but a function with no parameters must be called as ‘myFun()’. In the process of figuring out what I was doing wrong, I ended up looking through the CIL (CLR bytecode) disassembly of the program, and I must say it was one of the most nauseating things I’ve ever seen in my life. Imagine a language with twice the verbosity of Java and half the clarity of x86 assembly, and you’ve got a language that’s about four times as elegant as the CLR Common Intermediate Language. I really don’t think the following should belong in bytecode: *Edit: okay, I may be wrong here, and the CIL isn’t itself bytecode, but rather is translated into bytecode. In which case, ignore this complaint.

Even functions seemed to have multiple verbose constructors defined. The total assembly size was around 1800 lines, almost an order of magnitude more than the equivalent program written by hand or by the LLVM in an assembly language. If you ever plan on writing low-level code for a virtual machine by hand, my advice is to stick to the JVM, as its bytecode is much nicer (only 160 lines for the disassembly of Java.class, which is only 10 lines more than the original Java.java source file…). To be fair however, the disassembly of the C# executable (as opposed to the F# one) was only around 460 lines long, and didn’t have quite as much boilerplate.

Ruby – I don’t quite understand the fuss about Ruby; like Python, it seems to be a dynamic language that wants me to write constructors (and getters/setters), or else use ugly stuff like $trades[i].instance_variable_get(“@price”) . To be fair, it does seem to be slightly faster than CPython, but it’s still slower than Smalltalk, and if I was going to use a language that slow I’d definitely prefer Smalltalk. Maybe someone can enlighten me as to what’s so great about Ruby? (And Rails doesn’t count, as I’m not a webdev.)

*Edit: Okay, I just mentally compared Ruby to Javascript and PHP, and was overcome with love for the language. I think I now better understand its appeal.

*Edit2: Wow, I made a small change and now the Ruby implementation is twice as fast, faster than Smalltalk and beating CPython. Now I’m impressed.

Julia – The language is quite fast on many other tasks, comparable to C or Fortran, but it seems the optimiser doesn’t know what to do with this particular program. I suspect someone more familiar with the language than I am could significantly speed it up by changing the array of trades into a one-dimensional matrix of int64s and manipulating that; the language is built around numerical computing, and the compiler applies specialised optimisations when it detects idiomatic numerics code. Writing in the language was quite pleasant; I think the future looks bright for Julia. It was also the most robust of the ‘slower’ languages, managing to complete numTrades = 150 where GST, Ruby and Clojure dropped out. Note, if you want to test it yourself, that the Julia compiler takes quite a while to build, maybe at least half as long as Rust. Interestingly, this is the most concise language in terms of compressed source file size.

*Edit: Wow, a patch to make the global trades array local has sped it up massively, to near C speed. This shows the importance of being familiar with the gotchas of a language currently still in development in order to maximise its performance.

Clojure – Poor performance here can probably be blamed on the JVM itself, considering how poorly Java fares. The Clojure implementation was however actually the least enjoyable to write (although normally my experience with Clojure is quite pleasant), due to how the language goes out of its way to make declaring mutable types difficult. Where Racket only requires a #:mutable keyword to render a struct mutable, Clojure requires declaring an interface complete with getters and setters for every value, and then declaring a type that implements these getters and setters. Somebody please tell me there’s an alternative to the following monstrosity:

I wonder if Rich Hickey ever had any bad experiences with Godzilla, as the above code suggests that he must really hate mutation. Even Haskell makes declaring mutable datastructures easier. (I’d love it if someone were to submit a Haskell implementation, by the way. My type-foo isn’t up to scratch for dealing with the state monad.)

Finally, the twist: although the C, Go and C# implementations are faster, they’re also wrong! Run the Clojure version with a sufficiently large numTrades, changing +’ to +, and it’ll throw an ‘integer overflow’ exception, kindly letting us know (as the other implementations didn’t) that the buyCost and sellCost variables are overflowing. The other Lisps, Python and Smalltalk handle this by automatic promotion to bignum. The benchmark is hence not particularly fair, as bignum operations are slower than standard integer multiplication, and the dynamic languages are doing bignum calculations while the statically typed languages are simply overflowing. Look at the numbers for numTrades < 100 for a fairer comparison, as no languages are using bignums then. Also note that the Clojure version can use +’ instead of + to automatically promote to bignums and avoid the overflow, which the current version does.

I bet you thought that was all. But no, there’s more! Long story short, the Java code in the original blog post was poorly written from a performance perspective (unnecessary ‘new’s). I cleaned it up, and voila, it’s faster than the original unsafe Java implementation and quite close to the C. Moral of the story: don’t resort to sun.misc.Unsafe until you’ve removed all the ‘new’s from your performance-sensitive code. In the following table, each language suffixed with ‘2’ is an implementation where the following change (or its equivalent) was made:

It can be seen that this optimisation results in a huge speedup for the Java version, as it reduces object creation and hence garbage collection. The same goes for C#, F# and Python. Interestingly however it has an adverse effect on the performance of the Lisp implementations, suggesting that an object is created for each aref/vector-ref, so minimising the use of those functions produces better performance. This suggests, somewhat sadly, that object pooling may not be an effective option in Lisp, as there’s no way to mutate objects in an array without generating garbage (short of using unsafe FFI functions).

Finally, the most useful part of the benchmark: benchmarking the fastest implementation in each language, with the statically typed languages all using bignums so as to produce a fair result. The results are as follows:

NumTrades = 150

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 1.9361 | 252 | 100.00 | 352224 | 750 | 60 | 1492 | 9 |
| Java3 | javac | 2.6750 | 547 | 46.07 | 797152 | 864 | 119 | 3386 | 2 |
| Common Lisp | – | 0.0000 | 1116 | 22.58 | 902192 | 715 | 44 | 2103 | 0 |
| Racket | raco | 20.1457 | 1957 | 12.88 | 875388 | 691 | 38 | 1959 | 11142 |
| C#3 | mcs | 1.6699 | 2100 | 12.00 | 727712 | 893 | 114 | 3268 | 4 |
| F#3 | fsharpc | 5.8450 | 2125 | 11.86 | 731956 | 659 | 39 | 1541 | 12 |
| Go3 | go | 0.6288 | 2515 | 10.02 | 352956 | 656 | 62 | 1313 | 2799 |
| Python2 (Pypy) | – | 0.0000 | 3333 | 7.56 | 1519136 | 561 | 38 | 1192 | 0 |

NumTrades = 100

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 1.7614 | 162 | 100.00 | 235036 | 750 | 60 | 1492 | 9 |
| Julia | – | 0.0000 | 198 | 81.82 | 421180 | 552 | 52 | 1228 | 0 |
| Java3 | javac | 2.9222 | 369 | 43.90 | 561684 | 864 | 119 | 3386 | 2 |
| Common Lisp | – | 0.0000 | 644 | 25.16 | 866428 | 715 | 44 | 2103 | 0 |
| F#3 | fsharpc | 5.4417 | 1073 | 15.10 | 492908 | 659 | 39 | 1541 | 12 |
| C#3 | mcs | 1.3916 | 1106 | 14.65 | 488812 | 893 | 114 | 3268 | 4 |
| Racket | raco | 20.3282 | 1209 | 13.40 | 639704 | 691 | 38 | 1959 | 11142 |
| Go3 | go | 0.6998 | 1659 | 9.76 | 235768 | 656 | 62 | 1313 | 2799 |
| Python2 (Pypy) | – | 0.0000 | 2122 | 7.63 | 1183156 | 561 | 38 | 1192 | 0 |
| Ruby | – | 0.0000 | 4258 | 3.80 | 2233108 | 605 | 56 | 1250 | 0 |
| Lua | – | 0.0000 | 4600 | 3.52 | 2202356 | 554 | 52 | 1208 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 5295 | 3.06 | 611084 | 712 | 76 | 1641 | 0 |
| JuliaBigNum | – | 0.0000 | 7726 | 2.10 | 1001056 | 565 | 52 | 1227 | 0 |
| Python3 | – | 0.0000 | 10212 | 1.59 | 1582084 | 607 | 39 | 1268 | 0 |

C really shines here, courtesy of libGMP and stack allocation, but Go drops well behind, presumably due to its bignum library being slower and generating more garbage. Java performs relatively well, which I imagine is due either to its garbage collector being better or to its bignum implementation generating little garbage. Interestingly, SBCL beats Go, F# and C# (twice as fast as them at NumTrades=150) in spite of being a dynamic language compiled without any type annotations (they didn’t seem to make it any faster), which goes to show that dynamically typed languages needn’t always be slower than statically typed ones. Racket also appears to beat them slightly at NumTrades=150. If anybody knows a way to stop the SBCL implementation from consing I’d be interested to hear it, as garbage collection accounts for over 50% of the running time, so avoiding it would bring it close to Java in speed. I also made an SBCL implementation that just calls libGMP via the FFI, which results in no garbage collection occurring (which means aref isn’t creating garbage, so object pooling _is_ possible in Common Lisp, yay!), yet it’s about twice as slow. It’s in the repo as Lisp3.lisp if anyone wants to fix it (it’s inspired by the Lisp pidigits code from the Alioth language benchmarks).

Note that D, Rust or Nimrod could also perform as well as C here, simply by calling out to libGMP. As could Go, for that matter, but I thought for the sake of the benchmark it made more sense to use the native Bignum implementation.

In the Benchmarker.go source the output and input file can be changed; as input, BenchmarkData.dat produces the first set of tables, BenchmarkData2.dat the second, and BenchmarkData3.dat the third. The numtrades values to test are stored near the top of the file in an array of strings, and the time to pause between runs can also be set via the WaitTime constant.

Hope you enjoyed the benchmark! The moral of this post: new considered harmful. The tl;dr of it: Lisp and Racket beat Go at Bignums.

Last night I received a call to my landline, a heavily Indian-accented voice stating “I am from Microsoft. Your computer is sending many errors and blocking our servers. Today the amount of errors has intensified, so we decided to take the initiative and contact you directly”. I was intrigued; I run Linux Mint, and rarely boot into Windows (I certainly hadn’t that day). I told him this, and he helpfully corrected my English; “it’s Lih-nux, not Lie-nux”, and then told me I must have a Windows running somewhere. Maybe I had a rare Linux trojan that was somehow booting up the Windows partition in the background and using it to DDOS Microsoft by sending repeated error reports?

I dutifully restarted and booted into Windows. I asked him how he got my landline, and he replied that I had ‘registered’ it with Microsoft when setting up the computer. How prescient of me, considering that I had installed Windows well before moving to this house and getting the current number. I asked him my name, and he responded with the name attached to the number in the phonebook, one not used anywhere in my Windows registration; maybe he was just trying to be polite?

By this time Windows had finished booting. He told me to hit win+r (the run hotkey) and type eventvwr into the run dialogue. I followed his instructions, clicking first on Custom Views then on Administrative Events, and there were over 13,000 ‘errors’ logged there! I told him this, and he was quite shocked too, stating that this was a new ‘record’ for them. Which explains why the ‘Microsoft Servers’, despite being able to handle error reports coming in from over a billion computers running Windows, are unable to handle a few thousand error reports coming from a single PC over what looks like the course of weeks (I didn’t raise the possibility to him that these were internal errors, and not error reports sent to Microsoft, as I didn’t wish to hurt his feelings).

He told me to open up a command prompt and spelled out a command for me to type, assoc, and I dutifully obeyed. This spilled out a list of what appeared to be registry entries. He told me to take a look at the third last and pay attention to the line containing CLSID, something like “ZFSendToTarget=CLSID{SomeStringOfDigitsAndLetters}”. He then explained patiently how only my computer and Microsoft have access to that number, and proceeded to read his copy out to me, which was identical to mine, and assured me that this proved he was genuinely from Microsoft.

Figuring I was clearly hooked, he then moved on to the crux of his brilliant plan. He instructed me to type a URL into the run prompt, “freepcupdate.com”. I opened the website in a secure browser, to be greeted by a poorly formatted page with ‘MicroSoft’ as the heading and a poorly drawn arrow further down pointing to a large green ‘update’ button. He told me to click the button, and, figuring I had wasted enough of his time and completed my spambaiting duty, I promptly hung up the phone. He called again, and I politely told him to stop calling, and hung up. He called again, and told me in a threatening voice that “We are disconnecting your computer!”. Disconnected from what? He called yet again, and I remembered that, for reasons that now escape me, I happen to have a large singing bowl lying about the house. It’s basically a bowl-shaped gong, like a smaller version of the image at the bottom of this post. I fetched the bowl, answered the phone, and then placed the handset inside the bowl. I then proceeded to strike the bowl repeatedly until it was resonating a note (somewhat ironically, it was tuned to F#) loud enough to be just on the threshold of causing pain. I left the handset in there until the sound had died down, then hung up the phone. Needless to say, he didn’t call again.

I’m working on a modular random level generator, the essence of which uses similar logic to that of a roguelike. Last month I benchmarked the effectiveness of various programming languages at a simplified level generation algorithm that roughly mimics the demands of my actual algorithm, using single-threaded code. I’m now running the same benchmark with concurrent code, to examine how easy the languages tested are to parallelise for this task.

The code is available here. Any improvements to the code are most welcome. Most of the running time is spent on (writing in Haskell* as it’s easy to read):

Checking a newly generated room against a list of previous rooms to see whether they collide, discarding it if they do (it’s a brute-force level generation technique; our actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, with all languages using a similar PRNG algorithm to ensure fair comparison. Note that the program is embarrassingly parallel, a trait I aim for in algorithm selection, and so how easy the implementations are to parallelise for this simple task is not necessarily representative of how effective the languages are at more complex parallel tasks. I.e., these benchmarks are relevant to my goal, not necessarily yours. I enclose the generalisations made later in this post within the context of that statement.

Edit: A couple of months after the benchmark was originally run, somebody submitted a D implementation that takes advantage of LLVM optimisation particularly well and is significantly faster than the other entries (also quite concise, using the D parallel foreach). It’s included at the top of the table. An impressive achievement, although it’s certainly possible that someone quite familiar with optimising C or C++ could produce an equally fast result.

*Haskell was excluded from this version of the benchmark as there seems to be a space leak of some sort in the algorithm that neither I nor anyone who’s examined it so far has been able to overcome. Nimrod was added to the benchmark instead; since it has no single-threaded version to compare against, it has no ‘SLOC to Parallelise’ measure. Nimrod is a language with a whitespace-based syntax, like Python, but which compiles to C for optimal speed.

**I parallelised the D and Rust programs myself, hence the code is probably not idiomatic, and still has plenty of room for improvement. D for instance has a parallel for loop; I couldn’t get it working, but if someone got it working then it would significantly reduce the size of the code. Edit: the D version has now been made more idiomatic, and uses the parallel foreach.

***Somebody submitted a C++ version that runs twice as fast (in around 0.550 ms on the GCC), using an occlusion buffer for collision testing between rooms. I’m not including it in the benchmark numbers as the algorithm is different, but anyone who’s interested can view it here.

****It turns out the reason the C version runs slower than the C++ one is because the PRNG seeds for each thread are all stored in an array together, forcing the hardware threads to compete for access to them and slowing the program down. Having each thread use a copy of the original seed from the array brings the speed up to that of the C++ implementation.

*****A Redditor submitted a version in a language they’re working on called Neat, currently built on top of the LLVM and inspired by D; the compiler is here. I was impressed by how a new language can take advantage of the LLVM like that to achieve the same level of performance as much maturer languages.

******This is the D version from the time of the benchmarks, not the faster more recent submission.

Speed:

Generally, the relative speeds for the concurrent benchmark were the same as those for the single-threaded one, with LLVM D and C++ running fastest, along with the new entry, Nimrod. I was surprised however by how C was slower than C++ and D; I imagined this may have been due to my naive C implementation not giving the compiler sufficient hints, or something related to aliasing. (Edit: the reason C is slower than C++ is described at **** above.) C is definitely capable of reaching those speeds: the Nimrod compiler compiles to C, and impressively fast C at that. The gap between gccgo’s speed for this problem and 6g’s is surprising; it just goes to show that gccgo isn’t always the best choice for speed. Note that the GCC C implementation was slower than the LLVM one because GCC missed an optimisation turning a jump in the GenRands function into a conditional move, resulting in the GCC C version encountering more branch misses.

Memory Use:

Memory use was mostly as expected. I was impressed by how Go and particularly D didn’t use much more memory than the C and C++ versions in spite of being garbage collected, but Rust’s memory use was somewhat surprisingly high. This is likely just due to the immaturity of the language, as there’s no reason it should need so much memory for such a task. Scala used an expectedly large amount for a JVM language.

Concision:

I was quite impressed by Nimrod’s concision. Presumably it has such a low SLOC count because it uses whitespace rather than {}, like Python, and because it has a very concise parallel for loop courtesy of OpenMP:

for i in 0 || <NumThreads: makeLevs(i.int32)

Note however that Nimrod was written completely idiomatically. Go, Scala and C++ are similar, but for the D and Rust implementations only the single-threaded portion was written idiomatically (correction: the Rust one was, but the D one was written hackishly for speed); the parallelisation of that code was done naively by myself. Note also that the Scala version’s ‘SLOC to parallelise’ measure also includes optimisations made for speed; it was possible to parallelise the algorithm much more simply, but this had inferior speed characteristics.

Subjective experience:

Working on the C implementation I had one of those rare moments where it feels like the language is laughing at me. I was encountering weird bugs in my implementation of the for loop:

Can you spot the problem in that code? I was passing ‘i’ into the newly created thread by reference, to represent the thread number (pthread_create only takes a void* pointer as an argument, normally to a struct, but in this case a single integer was all that was needed). What was happening was that when the thread tried to access ‘i’, even if it was the first thing that thread did, often ‘i’ had already been incremented by the for loop in the original thread, so the thread would have the wrong number; there might be two threads with thread number 2, and they’d both do the exact same calculations, writing to the same part of the global array and corrupting the output. I fixed this using the rather cumbersome method of filling an array with the numbers 1 to NumThreads and passing a pointer to a value from there; it could have been done more concisely by just casting ‘i’ to void* and then back to an integer, but I feel it’s bad practice to cast values to pointers. It’s potentially unportable: if ‘i’ were a large 64-bit integer, converting it to void* and back would work on a 64-bit system but not on a 32-bit one, as pointers on the latter are only 32 bits, and it’s impossible to store a large 64-bit integer in a 32-bit pointer (although this problem would be unlikely to surface in this case unless one somehow had a machine with over 2^32 cores…).

The D implementation also surprised me, although I think this was largely because I was unfamiliar with D’s memory model. I originally tried having each thread write to part of a global array, like in the C implementation, but, after they had completed their task and the time came to read the data, the array was empty, possibly having been garbage collected. The solution was to mark the array as shared, which required propagating the chain of ‘shared’ typing to the objects being written to the array in order to ensure safety. This was an interesting difference from Rust’s memory safety model, where the type of pointers rather than objects determines how they can be shared, but I’m not familiar enough with D’s memory model to comment on their relative effectiveness. I liked the use of messages in D, which allow the sending of data between threads using Thread IDs for addressing, rather than channels, and I imagine this would be particularly useful for applications running on multiple processes across a network once messaging between processes is supported (currently it’s only supported between in-process threads). Note that D offers support for multiple methods of parallelising code in its standard library, so the method I used may not necessarily be the neatest or most idiomatic.

(Edit: It turns out that all memory in D is thread local unless specified otherwise, which seems like an effective way of ensuring memory safety.)

The Rust implementation proved to be relatively straightforward, apart from some slight difficulties I had with the syntax. Declaring a shared channel (one that can have multiple senders) required:

let (port, chan) = stream();

let chan = comm::SharedChan::new(chan);

I imagine this process will be more concise by the time version 1.0 is reached. Using ‘.clone()’ for explicit capture of a variable into a closure also took a bit of getting used to, but it makes sense in light of Rust’s memory model (no direct sharing of memory between tasks). I think there may be a more concise way to parallelise parts of the problem in Rust, using something like (from the Rust docs):

I wasn’t however familiar enough with the language and current iterator syntax to implement it myself. Using a future might also be more concise. Go felt the easiest to parallelise, although to a degree this was because it didn’t enforce the same kind of memory safety as Rust or D. This made it more enjoyable for a project like this, but I imagine on a much larger project the stricter nature of Rust and D’s memory models might come in useful, especially if one was working with programmers of dubious competence who couldn’t be trusted to write memory-safe code. I didn’t write the Scala, Nimrod or C++ versions, so I have no comments on the experience of doing so.

Notes for optimising:

To anyone wanting to optimise the code, note that the code must produce consistent results for any given seed, but needn’t produce the same results as the other implementations. This is because different seeding behaviours will produce different results, as the seed for each thread/task is calculated from the original random seed multiplied by the square of that thread/task’s number. This means a different number of threads may produce a different result. Currently D, C, C++ and Rust will all produce the same result for any given seed, but will produce different results for a given number of cores. Go on the other hand will produce the same result for any particular seed no matter how many cores are used. Also note that optimisations that change the fundamental logic, the number of calls to GenRands used, or the number of checks done are not allowed. Finally, remember to change the NumCores variable or its equivalent to however many cores you have in your machine.

Timing is done using bash’s ‘time’ function, as I’ve found it to be the simplest accurate method for timing threaded applications. The fastest result of 20 trials is taken, with all trials using the same seed. Resident memory use is obtained from running “command time -f ‘max resident:\t%M KiB’ filename seed” in Bash.

The moral of the story:

For a task like this, compiler choice has as significant an effect on speed as language choice.

I’m working on a random level generator for a game that, although written in C++, is modular such that the level generator can be written in something more high-level. As C++ isn’t always the most fun and productive language to program in, I set out to benchmark some potential alternatives, using a simple Roguelike level generation benchmark that relies heavily on iteration and conditionals and hence roughly mimics the actual generator. The code used is available here: https://github.com/logicchains/levgen-benchmarks. Any suggestions for improving the code used would be most welcome. The majority of the running time is spent on this (using the Haskell version as I think it’s the easiest to read):

Checking a newly generated room against a list of previous rooms to see if they collide or not, discarding it if it does (it’s a brute-force level generation technique; the actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, so take these benchmarks with a grain of salt, as to a degree they’re as much a benchmark of the respective languages’ random number generators as they are of their general speed (i.e. these benchmarks are relevant to my goal, not necessarily yours. I enclose the generalisations made later in this post within the context of that statement).

All implementations now use the XorShift PRNG, with Rust, Go, C and D using the exact same code (a 32bit xorshift function), Scala using a 128-bit xorshift function, and Haskell using a 128-bit xorshift library. It is hence now less unreasonable to make comparisons between the languages, compilers and implementations.

The results are as follows:

| Compiler | Speed (s) | % Fastest | SLOC |
|---|---|---|---|
| LDC | 0.256 | 100% | 107 |
| Clang | 0.279 | 92% | 140 |
| FCC* | 0.283 | 90% | 111 |
| Rustc | 0.303 | 85% | 103 |
| 6g | 0.323 | 79% | 131 |
| G++ | 0.330 | 78% | 127 |
| Scala | 0.344 | 74% | 79 |
| GCC | 0.347 | 74% | 140 |
| LLVM-GHC | 0.428 | 60% | 78 |
| GHC | 0.546 | 47% | 78 |
| DMD | 0.567 | 45% | 136 |
| GCCGO | 0.598 | 43% | 131 |

*A Redditor submitted a version in a language they’re working on called Neat, currently built on top of the LLVM and inspired by D; the compiler is here https://github.com/FeepingCreature/fcc. I was impressed by how a new language can take advantage of the LLVM like that to achieve the same level of performance as much maturer languages.

**Edit: Improvements to the D code make it now the fastest implementation. Most of the speedup came from rearranging the checkColl function to allow for better LLVM optimisation, but I’m not personally familiar enough with the language or the LLVM to explain the optimisations in question.

The LLVM version used by LDC, Rust and Clang is 3.2. GCC and GCCGo used GCC version 4.7.3, while GDC used GCC version 4.6.4. Rust was version “0.8-pre (9da42dc 2013-07-17 04:16:42 -0700)”, GHC was version 7.6.2, DMD was 2.036 and 6g (Go) was 1.1.1. They were all run with -O3 where available, --opt-level=3 for Rust, -release for DMD, and -funbox-strict-fields for Haskell. D now also runs with -inline and -noboundscheck.

D was the fastest non-C language tested, with Rust following close behind. It’s worth noting that for this task at least both D and Rust “beat C” at the same code, when comparing their results to the GCC C executable’s. The LLVM appears to do a much better job of optimising this kind of PRNG than GCC, showing how sometimes compiler effectiveness at optimising a given piece of code can be a bigger factor than language choice in determining speed. I will be excited to run these benchmarks again in a year’s time and see how Clang, LLVM D and LLVM Rust compare then.

Scala really surprised me; I didn’t realise a JVM language could run so fast for this kind of code. It even matched the speed of GCC C, giving the lie to the claim that languages running on a VM must necessarily be slower. Many thanks to markehammons for submitting the Scala code! Although I haven’t yet learned Scala, the code is surprisingly easy to read, and isn’t full of ugly optimisation code like I’d have expected a fast JVM program to be. Definitely a language worth considering even for speed-critical tasks.

Rust’s speed was impressive, for such a new language. Its flexibility with memory (optional GC) made it interesting to write in, however Rust’s flexibility (at least currently) comes at a slight price, as its syntax takes a bit of getting used to; to pass a heap-allocated vector by reference I need to use myFunc(& mut myVector) (even though it’s already a mutable vector), and the function receiving it needs myAlias:&mut~[myVector] in its type signature, like fn myFunc(myAlias:&mut ~[myVector]) {..}. Compared even to C ( void myFunc(struct MyType myAlias[arrayLength]){..} ), the Rust version looks a bit byzantine. For those who haven’t been following Rust, there are around seven different kinds of pointers: @ (garbage collected heap allocation), ~ (uniquely owned heap allocation), & (borrowed pointer to heap/stack allocation), * (raw C pointer, only useable in unsafe code), and mutable versions of the first three of those (actually, I’m not sure if there’s a mut & pointer, so maybe just six pointers in total). Note also that a pointer to a mutable value behaves differently than a pointer to a non-mutable value.

**Quoting kibwen on YCombinator:“We’re moving the garbage-collected pointers (@) out into a library, so by default we’ll have only a single pointer, ~ (and there’s no mutable version of this). We’ll still have & and &mut, but those aren’t pointers, they’re references (it’s our own fault for calling them “borrowed pointers” in our documentation). So depending on how you count, that’s either one or three pointer types, which I’m happy with. The “unsafe” pointers exist only for C interop and have no bearing on normal code, so it’s inaccurate to count them.” So there’s actually really only three relevant kinds of pointer.

I also benchmarked a version of the Rust code using stack-allocated vectors, but there was no noticeable speed difference. I did however learn that stack-allocating vectors in Rust is currently rather verbose, as since uninitialised values aren’t allowed, I had to create an instance of the object I wanted an array of that had all its values set to zero and fill the vector with that upon creation. Hopefully in future Rust will adopt something like Go’s automatic initialisation of all values to zero, or at least have this as an option (or document it better, if it already is an option). Currently it looks like this:

Which is a lot of unnecessary code, and would quickly become cumbersome in a large project that made heavy use of stack-allocated arrays (although there doesn’t seem to be a reason for doing so, as in this test at least they performed no faster than heap-allocated ones).

Go was impressive, performing well for a relatively new language albeit still falling behind D. The default PRNG was quite lacking in speed, so changing it to use XorShift resulted in a massive speedup. Interestingly, while GCCGo was faster with the default PRNG, the standard 6g runtime was faster with the bitshifted XorShift PRNG. The freedom from semicolons is a nice experience, and in terms of my subjective experience it was the most enjoyable language to write imperative code in. I’m quite looking forward to the development of LLVM Go.

Although it took me quite a while to optimise, I’m happy with Haskell’s performance, especially since it has to carry a vector of [initially] ten million random Ints around (random numbers can’t just be generated in the middle of functions, as that breaks purity). Writing recursively was a nice change, and the naive Haskell version was the most concise implementation of the languages tested (it grew significantly when I switched from lists to vectors and to the faster random number generator, as I couldn’t get pattern matching to work on vectors so had to stick [unsafe]Head and Tail everywhere).

**Updated the Haskell version to use MWC generator instead of Mersenne, and updated again to use more idiomatic code, resulting in a speedup to 0.770s. Updated yet again to use XorShift, now got the running time down to 0.546s. Running with -fllvm speeds it up to 0.428s.

Any more improvements to the Haskell code would be most welcome. Stay tuned for parallel level generation benchmarks! (Eventually..)

Finally, the moral of the story: there’s no such thing as a slow language, only an insufficiently optimising compiler.

Second moral of the story: if cryptographic-level randomness isn’t needed, then XorShift is the best PRNG algorithm available in terms of speed.