I’m working on a random level generator for a game that, although written in C++, is modular enough that the level generator itself can be written in something more high-level. As C++ isn’t always the most fun and productive language to program in, I set out to benchmark some potential alternatives, using a simple Roguelike level generation benchmark that relies heavily on iteration and conditionals and hence roughly mimics the actual generator. The code used is available here: https://github.com/logicchains/levgen-benchmarks. Any suggestions for improving the code would be most welcome. The majority of the running time is spent on the following (described in terms of the Haskell version, as I think it’s the easiest to read):

Checking a newly generated room against the list of previously generated rooms to see whether it collides with any of them, and discarding it if it does (it’s a brute-force level generation technique; the actual engine is a bit more sophisticated, but relies on the same principle). Much of the rest of the time is spent on random number generation, so take these benchmarks with a grain of salt: to a degree they’re as much a benchmark of the respective languages’ random number generators as of their general speed (i.e. these benchmarks are relevant to my goal, not necessarily yours; the generalisations made later in this post should be read in that context).
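In rough outline, the hot path looks something like this (a simplified sketch rather than the exact code from the repository; the `Room` fields and the names `checkColl` and `noCollisions` here are illustrative):

```haskell
data Room = Room { rx, ry, rw, rh :: !Int }

-- True if two rooms overlap (or touch), by simple axis-aligned
-- bounding-box rejection on both axes.
checkColl :: Room -> Room -> Bool
checkColl (Room x y w h) (Room x' y' w' h') =
  x <= x' + w' && x' <= x + w &&
  y <= y' + h' && y' <= y + h

-- A candidate room is kept only if it collides with no existing room;
-- this is checked against the whole list of rooms placed so far.
noCollisions :: Room -> [Room] -> Bool
noCollisions r = not . any (checkColl r)
```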

All implementations now use the XorShift PRNG, with Rust, Go, C and D using the exact same code (a 32bit xorshift function), Scala using a 128-bit xorshift function, and Haskell using a 128-bit xorshift library. It is hence now less unreasonable to make comparisons between the languages, compilers and implementations.

The results are as follows:

Compiler    Speed(s)   %Fastest   SLOC
LDC         0.256      100%       107
Clang       0.279      92%        140
FCC*        0.283      90%        111
Rustc       0.303      85%        103
6g          0.323      79%        131
G++         0.330      78%        127
Scala       0.344      74%        79
GCC         0.347      74%        140
LLVM-GHC    0.428      60%        78
GHC         0.546      47%        78
DMD         0.567      45%        136
GCCGO       0.598      43%        131

*A Redditor submitted a version in a language they’re working on called Neat, currently built on top of LLVM and inspired by D; the compiler is here: https://github.com/FeepingCreature/fcc. I was impressed by how a new language can take advantage of LLVM like that to achieve the same level of performance as much more mature languages.

**Edit: Improvements to the D code make it now the fastest implementation. Most of the speedup came from rearranging the checkColl function to allow for better LLVM optimisation, but I’m not personally familiar enough with the language or the LLVM to explain the optimisations in question.

The LLVM version used by LDC, Rust and Clang is 3.2. GCC and GCCGo used GCC version 4.7.3, while GDC used GCC version 4.6.4. Rust was version “0.8-pre (9da42dc 2013-07-17 04:16:42 -0700)”, GHC was version 7.6.2, DMD was 2.036 and 6g (Go) was 1.1.1. They were all run with -O3 where available, --opt-level=3 for Rust, -release for DMD, and -funbox-strict-fields for Haskell. D now also runs with -inline and -noboundscheck.

D was the fastest non-C language tested, with Rust following close behind. It’s worth noting that for this task at least both D and Rust “beat C” at the same code, when comparing their results to the GCC C executable’s. The LLVM appears to do a much better job of optimising this kind of PRNG than GCC, showing how sometimes compiler effectiveness at optimising a given piece of code can be a bigger factor than language choice in determining speed. I will be excited to run these benchmarks again in a year’s time and see how Clang, LLVM D and LLVM Rust compare then.

Scala really surprised me; I didn’t realise a JVM language could run so fast for this kind of code. It even matched the speed of GCC C, putting the lie to the claim that languages running on a VM must necessarily be slower. Many thanks to markehammons for submitting the Scala code! Although I haven’t yet learned Scala, the code is surprisingly easy to read, and isn’t full of ugly optimisation code like I’d have expected a fast JVM program to be. Definitely a language worth considering even for speed-critical tasks.

Rust’s speed was impressive for such a new language. Its flexibility with memory (optional GC) made it interesting to write in; however, Rust’s flexibility (at least currently) comes at a slight price, as its syntax takes a bit of getting used to. To pass a heap-allocated vector by reference I need to use myFunc(&mut myVector) (even though it’s already a mutable vector), and the function receiving it needs myAlias: &mut ~[myVector] in its type signature, like fn myFunc(myAlias: &mut ~[myVector]) {..}. Compared even to C ( void myFunc(struct MyType myAlias[arrayLength]){..} ), the Rust version looks a bit byzantine. For those who haven’t been following Rust, there are around seven different kinds of pointers: @ (garbage-collected heap allocation), ~ (uniquely owned heap allocation), & (borrowed pointer to heap/stack allocation), * (raw C pointer, only usable in unsafe code), and mutable versions of the first three of those (actually, I’m not sure if there’s a mut & pointer, so maybe just six pointers in total). Note also that a pointer to a mutable value behaves differently from a pointer to a non-mutable value.

**Quoting kibwen on YCombinator:“We’re moving the garbage-collected pointers (@) out into a library, so by default we’ll have only a single pointer, ~ (and there’s no mutable version of this). We’ll still have & and &mut, but those aren’t pointers, they’re references (it’s our own fault for calling them “borrowed pointers” in our documentation). So depending on how you count, that’s either one or three pointer types, which I’m happy with. The “unsafe” pointers exist only for C interop and have no bearing on normal code, so it’s inaccurate to count them.” So there’s actually really only three relevant kinds of pointer.

I also benchmarked a version of the Rust code using stack-allocated vectors, but there was no noticeable speed difference. I did however learn that stack-allocating vectors in Rust is currently rather verbose: since uninitialised values aren’t allowed, I had to create an instance of the object I wanted an array of with all its values set to zero, and fill the vector with that upon creation. Hopefully in future Rust will adopt something like Go’s automatic initialisation of all values to zero, or at least offer it as an option (or document it better, if it already is an option). Currently it looks like this:

Which is a lot of unnecessary code, and would quickly become cumbersome in a large project that made heavy use of stack-allocated arrays (although there doesn’t seem to be a reason for doing so, as in this test at least they performed no faster than heap-allocated ones).

Go was impressive, performing well for a relatively new language albeit still falling behind D. The default PRNG was quite lacking in speed, so changing it to use XorShift resulted in a massive speedup. Interestingly, while GCCGo was faster with the default PRNG, the standard 6g runtime was faster with the bitshifted XorShift PRNG. The freedom from semicolons is a nice experience, and in terms of my subjective experience it was the most enjoyable language to write imperative code in. I’m quite looking forward to the development of LLVM Go.

Although it took me quite a while to optimise, I’m happy with Haskell’s performance, especially since it has to carry a vector of [initially] ten million random Ints around (random numbers can’t just be generated in the middle of functions, as that breaks purity). Writing recursively was a nice change, and the naive Haskell version was the most concise implementation of the languages tested (it grew significantly when I switched from lists to vectors and to the faster random number generator, as I couldn’t get pattern matching to work on vectors so had to stick [unsafe]Head and Tail everywhere).

**Updated the Haskell version to use MWC generator instead of Mersenne, and updated again to use more idiomatic code, resulting in a speedup to 0.770s. Updated yet again to use XorShift, now got the running time down to 0.546s. Running with -fllvm speeds it up to 0.428s.

Any more improvements to the Haskell code would be most welcome. Stay tuned for parallel level generation benchmarks! (Eventually..)

Finally, the moral of the story: there’s no such thing as a slow language, only an insufficiently optimising compiler.

Second moral of the story: if cryptographic-level randomness isn’t needed, then XorShift is the best PRNG algorithm available in terms of speed.

The reason why you had to refactor your code in the example (the conditional) is not directly related to a lazy evaluation problem. The problem is that || evaluates its left operand before its right. Consider this:
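For instance (a reconstruction of the kind of example meant; `myOr` is a stand-in name):

```haskell
-- (||) can be defined in ordinary Haskell; it pattern-matches on (and thus
-- forces) only its left operand:
myOr :: Bool -> Bool -> Bool
myOr True  _ = True
myOr False b = b

case1 :: Bool
case1 = myOr True undefined         -- True: undefined is never evaluated
-- case2 = myOr undefined True      -- diverges: the first argument is forced
-- case3 = myOr undefined undefined -- diverges likewise
```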

In the first case, undefined is not evaluated, but it is in both other cases. We say that || is strict in its first argument.

Secondly, I think it would be more idiomatic to generate random numbers on the fly, especially to avoid storing this vector of ints. You can still have pure functions; it is just your main “loop” which must be “impure”. If this weren’t possible, an idiomatic solution would be to use a monad such as MonadRandom, which only allows the generation of random numbers.

Yes, it behaves the same as a short-circuiting or in C. Except that in C, you could not write this function yourself (except with macros), as arguments to a function are always evaluated before the function is even called. Lazy evaluation allows you to have this “short-circuiting” behaviour in any function.

To come back to the random numbers problem, you can actually get rid of your big vector of ints without changing much of your code.

I see you make a call to randoms from the System.Random.Mersenne module. This function actually returns an infinite list of random values, which you can pass around instead of the vector. Lazy evaluation ensures that this list is not generated beforehand, as your vector is, but on the fly, as elements are needed.

When you’re doing something inherently stateful, such as generating random numbers, it isn’t considered bad form to use a less pure function. Instead of passing the entire vector when you only need 2 numbers, pass those 2 numbers, and make the function that calls that function stateful by using the ST monad. Most of the more useful parallelism libraries in Haskell also internally use a StateT monad transformer so you can use a random number generator in those easily as well.

The key thing is to avoid using IO as much as you can as that’s really where most of the “bad” stuff in terms of impure code lives. ST or State is just a convenience to keep from having to pass around state arguments to all your functions.

You could also go the other direction and manually pass around the RNG, most of the random number libraries make this possible by returning not only a current value, but a next seed value when you use them, so you could just pass around the state of the RNG manually rather than using the ST or State monads.
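A minimal sketch of manually threading the generator (with a toy LCG standing in for whatever generator you actually use; `nextRand` and `genPoint` are made-up names):

```haskell
import Data.Word (Word32)

-- Any pure generator works; the point is the (value, nextGen) shape,
-- the same one System.Random's next returns.
nextRand :: Word32 -> (Word32, Word32)
nextRand g = let g' = 1664525 * g + 1013904223 in (g', g')

-- Each function takes the generator in and hands the advanced one back out,
-- so the caller can keep threading it without any monads.
genPoint :: Word32 -> ((Word32, Word32), Word32)
genPoint g0 =
  let (x, g1) = nextRand g0
      (y, g2) = nextRand g1
  in ((x `mod` 80, y `mod` 25), g2)
```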

Somebody attempted a version using the State monad, but for some reason it ran a lot slower. I’ll attempt passing the PRNG around manually when I have time, although I would like to figure out why the current version fails with four random numbers per room.

I replaced the unboxed vector of Ints with an infinite list and cut the running time down to 1.75s from 1.9s! In future I’ll know better than to think an unboxed vector is necessarily faster than a list just because it’s unboxed, haha.

Would an infinite list behave better than an unboxed vector? I was originally using an infinite list, but changed to an unboxed vector to avoid a space leak while I was optimising. The code now is much cleaner than it was at the time I switched though, so I’ll try using an infinite list again and see what effect it has on performance.

The thing is that an infinite list won’t consume nearly as much space, as elements are generated on the fly. If you use the head and only keep the tail of the list in your functions, then the elements already consumed can be garbage collected. The unevaluated tail just takes constant space.

Typo “GCCGo and DMD only being secondary implementations”, I bet you meant GDC.

In my experience the performance of Rust is very sensitive to the amount of hand-tuned inlining. Afaik Rust doesn’t do automatic inlining like C does, and Rust uses segmented stacks, so function calls aren’t very cheap. Take a look at a benchmark I was partly an author of: https://github.com/mozilla/rust/blob/master/src/test/bench/noise.rs. All these inlines were very crucial to the performance. Oh, and if you’re interested in my perlin noise benchmarks, you can find C, D, Go, Python, Tcl and Rust versions here: https://gist.github.com/nsf/1170424; however, it’s recommended to use the Rust version from their repo, as it has these inlines applied.

I’ll try the Rust version with inlining. Any guidance on how to choose between [inline(always)] and [inline], or should I just benchmark each and find which works best? Also, is it idiomatic to use for int::range(min, max) |i| {..} like in the benchmark you mention or to use a ‘while i< max {..i+=1}' loop like in the Rust Tutorial?

I’m not a Rust guru myself, so I have no idea what the difference between inline(always) and inline is, but I have a pretty good guess that inline(always) will force it as much as possible, while the other may or may not. Same about idiomatic code: I have no idea, ask the Rust guys. This perlin noise benchmark was the first and last program I wrote in Rust.😀

The default Go random number generator is wrapped with a big mutex which basically makes it useless for multithreaded (goroutine) code. I wrote a short replacement for it that only works on Linux as it relies on the kernel’s read() implementation being atomic for short reads, and relies on the kernel’s /dev/urandom driver; needless to say it is much much quicker.

I’d say it’s pretty evident from these results that if you need a critical speed advantage, use C or C++. The compilers are excellent and their lint like tools are improving constantly. I’d pick Rust or C any day over D or Go or Haskell.

If you pick Haskell, you can’t find cheap developers anywhere for your product. Go is beginning to look like vaporware. Rust is still too immature but will probably beat D once it’s finalized.

You assume someone wants to find cheap developers. You usually get what you pay for. I’d happily hire an Erlang, OCaml, or Haskell dev for pretty much anything, as odds are good they’re going to know what they’re doing; a PHP dev, not so much. Not to say that a PHP dev can’t be good, just that there are an awful lot of bad PHP devs, so you end up spending a ton of time screening out the bad ones to find the good ones.

There are other considerations than just execution time as well: metrics like development time, the size of the code base, and how common bugs are (or how easy it is to introduce different categories of bugs). Sure, in the end, if all you really care about is the absolute fastest speed you can get, a naive implementation in C is probably going to beat a naive implementation in pretty much anything else (except maybe assembly), but if done properly most languages can achieve something approaching C speed (or sometimes exceeding it).

The reason dealing with a pointer to a slice in Go is awkward is that it’s not idiomatic. It’s much more common to pass a slice to a function and have the function return the slice if it modified its length.
Awkward code in Go is generally a sign that you’re trying to do the wrong thing.

Since everyone’s jumping on the optimisation bandwagon, I’ll do it too: I adapted the Rust code to use a faster RNG (XorShift, in the standard library too) and to use iterators and vector-builders, which took Rust down to equal with GCC (which is faster than Clang on my computer):

The Haskell implementation seems odd to me in a few ways, and I think there are large performance gains to be had with some more idiomatic code.

Manipulating big vectors is the least idiomatic Haskell possible. The usual way to refactor a giant data structure sitting in memory would be to stream it lazily through a thread-safe queue. Okasaki’s cons-list queues are popular and you can probably find a fast implementation on Hackage, but even a lazy ByteString would be better than that gargantuan vector. The same goes for bounds-checking after the fact, but I don’t see a simple way to rearrange that.

Also, I’m confused by your trepidation towards generating PRNs on-the-fly. Yes, you break referential transparency, but the “Random” in the type signature is enough to protect your code purity in spirit. Idiomatic Haskell is about encapsulating side-effects, not contorting your code to hide them. Streaming random numbers into memory as they are consumed would keep all of the dirty impurity hidden from the numeric functions anyways.

Also, the bounds-checking can be done with unsafe casts into the Bits type for fast mass bitwise operations. If you were unboxing your numericals anyway, the overhead won’t be significant. This shatters purity, but not referential transparency, so it’s probably okay.

I don’t really have enough experience with Haskell right now to write good idiomatic code, but any improvements submitted to the source are most welcome. PRNG speed is the biggest factor in the overall speed right now, so for Haskell to compete with the other implementations it would really need an XorShift PRNG, which I’m not sure how to implement.

I think you are still not comparing like for like in the RNG stakes. D’s default Xorshift is the 128-bit version, i.e. the internal state consists of four uint32_t words. The C and Go versions use a single uint32_t for the RNG state, i.e. they are 32-bit Xorshift.

Try replacing Xorshift in the D code with Xorshift32 and see how that affects things.

Oh, and — be careful with Xorshift implementations that directly copy the paper by Marsaglia. It has typos🙂

It currently uses an unboxed vector of 10 million ints, which is slower than a lazy list. The only reason for this is that I’m unsure how to get an infinite list from the XorShift package; if you can change the following code:

gen <- newXorshift
let rands = U.unfoldrN 10000000 (Just . next) gen

so that rands is an infinite list, then that should speed it up a bit (U refers there to Data.Vector.Unboxed).
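I suspect the change is simply unfoldrN to Data.List.unfoldr, something like this (untested against the xorshift package itself; here a hand-rolled 13/17/5 step stands in for the package’s generator):

```haskell
import Data.Bits (shiftL, shiftR, xor)
import Data.List (unfoldr)
import Data.Word (Word32)

-- Stand-in for the package's generator: Marsaglia's 13/17/5 xorshift32 step,
-- returning the (value, newGen) pair that unfoldr expects from `Just . next`.
next :: Word32 -> (Word32, Word32)
next a = (d, d)
  where
    b = a `xor` (a `shiftL` 13)
    c = b `xor` (b `shiftR` 17)
    d = c `xor` (c `shiftL` 5)

-- unfoldrN 10000000 capped the vector at a fixed length; plain unfoldr has
-- no length bound, so rands is an infinite lazy list, generated only as it
-- is consumed (and the consumed prefix can be garbage collected).
rands :: [Word32]
rands = unfoldr (Just . next) 2463534242
```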

I'm not looking to use any concurrency yet so as to allow fair comparison between the different implementations, but within a month or so I'll be implementing concurrent versions in each language, so I'll try the prng channel then if I can't get the lazy list of rands working.

By the way, the performance goes straight to hell if you generate 4 random numbers in either H.hs or this version. I don't know why! Using 2 as H.hs does works well enough where the y and h are paired up with x and w….

I wrote a version that uses a State monad wrapper around the random number generator and runs every generation function in that monad. With -O2 it does not perform as well on my machine as the giant Vector one. Using Control.Monad.State.Strict made it a little faster, but not quite as fast.

According to the heap profiles I’ve generated, though, the State monad generator used 1/50th of the memory of H.hs.

Sure, if you post a copy to Pastebin I’ll test it and see how it compares to H.hs on my machine. When the code used the Mersenne twister prng, changing from an unboxed vector to a lazy infinite list sped it up by over 10%, so I’m surprised the same doesn’t apply to the XorShift prng. If the slowdown with the state monad generator saves that much memory then it would probably perform better when the number of rooms/iterations needed is scaled up, too, and would work better with parallel code.

That was surprisingly slow, less than half the speed of the unboxed vector one. I’ll profile it later to see why. I think the easiest way to speed it up would be to use the same XorShift function as the other implementations use:

int GenRand(uint32_t *gen) {
    *gen += *gen;
    *gen ^= 1;
    int32_t tgen = *gen;
    if (tgen < 0) {
        *gen ^= 0x88888eef;
    }
    int a = *gen;
    return a;
}
rather than the XorShift library. It might also be easier to get an infinite list out of it if you code the XorShift yourself.
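Something like this, perhaps (an untested translation of the C function above; Word32 arithmetic wraps around just like the C version’s uint32_t):

```haskell
import Data.Bits (testBit, xor)
import Data.Word (Word32)

-- Direct translation of GenRand, returning the value and the new state.
genRand :: Word32 -> (Word32, Word32)
genRand g0 =
  let g1 = (g0 + g0) `xor` 1            -- *gen += *gen; *gen ^= 1;
      g2 = if testBit g1 31             -- if ((int32_t)*gen < 0)
             then g1 `xor` 0x88888eef   --     *gen ^= 0x88888eef;
             else g1
  in (g2, g2)                           -- return *gen;
```

With the step written out like this, an infinite list falls out of Data.List.unfoldr (Just . genRand) applied to a seed.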

Yes, I was surprised how slow it was as well. I’m wondering if this is something for the ST Monad. When I profiled it it just looked like a ton of data being created and garbage collected over time. This makes me think we’d rather modify the same memory location a lot than create/destroy a ton of stuff.

I have another version that does away with the Vectors, and the strictness, still using the State Monad for the random number generation and it’s a little bit faster than the one with Vectors and bang patterns on my machine.

Your new version is a bit faster, but still doesn’t match the unboxed vector one. I’m not fully up to speed with the state monad so I can’t think of any way to improve it. I did try writing a new genRand function using Data.Bits, but it was incredibly slow, less than a third of the speed of the state monad one.

I’ve also found it doesn’t work well with 4 rands (20 million ints total); maybe it’s an addressing problem? 20 million ints (which are 64-bit by default in my program) is 160 million bytes, and maybe GHC has trouble properly addressing that much memory (entirely speculation, but I can’t think why else performance would decrease exponentially like that).

The Go code below may obtain a 10% speed-up. I’m using “go1.1.1 windows/386”.
(And it’s interesting that if the assignment of “x_m1,xw_p1,y_m1,yh_p1” is split into 4 lines, or combined into 1 line, there is a 1% speed loss. So there’s still more optimisation to be done in the Go compiler.)

Did it. Seems Haskell is the most ‘expressive’ of the implementations, followed by Scala and then Rust. Haskell is slower, but I think the Haskell code is much less optimised than the other implementations, as I haven’t found a Haskell programmer willing to improve the Haskell XorShift prng, and I’m not skilled enough at Haskell yet to do it myself.

1.
The Intel compiler deserves a place here, for its great CPU-specific optimisations.
C/C++ should get much benefit from it: “(x<<1)+1” compiles to something like “lea ebx, DWORD PTR [esi+esi+1]”, which is really great!

2.
Since many modifications were made during the comparison, the algorithm is no longer consistent across implementations.
It would be fairer to move the “for(…;.<50000;…” loop inside the “MakeRoom” function and change “Rand” to inline, especially for the benchmark baseline.

I didn’t realise there was a free version of the Intel compiler available, but apparently there’s one for Linux, so I’ll use it on the upcoming concurrent version of the benchmarks. I’ll also update the functions then to bring the for 50k loop into MakeRoom.

That it does, but it’s still possible to reduce the overall running time a bit by using a faster PRNG. I can’t think of any way to speed up genRooms that doesn’t involve changing the prng (or better, using a lazy list of random numbers, but I haven’t found an effective way to do that with the current prng).

I suppose if we wanted a faster PRNG, we could implement it as transformations done in the ST monad. It definitely feels a little like cheating, but it’s something Haskell programmers reach for on occasion for certain update-in-place situations. Given that we’re generating a LOT of data that seems to just get GC’d… it could be useful.

I, however, have never used ST for this kind of stuff, or much at all for that matter.
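From the documentation, though, I’d guess it would look something like this (an untested sketch; `genRandST` and `twoRands` are my own names, and the 13/17/5 xorshift step stands in for whichever PRNG we settle on):

```haskell
import Control.Monad.ST
import Data.Bits (shiftL, shiftR, xor)
import Data.STRef
import Data.Word (Word32)

-- 32-bit xorshift step, destructively updating an STRef, so the generator
-- state is mutated in place instead of being reallocated on every call.
genRandST :: STRef s Word32 -> ST s Word32
genRandST ref = do
  g <- readSTRef ref
  let b = g `xor` (g `shiftL` 13)
      c = b `xor` (b `shiftR` 17)
      d = c `xor` (c `shiftL` 5)
  writeSTRef ref d
  return d

-- Pure from the outside: runST seals the mutation in.
twoRands :: Word32 -> (Word32, Word32)
twoRands seed = runST $ do
  ref <- newSTRef seed
  a <- genRandST ref
  b <- genRandST ref
  return (a, b)
```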

Actually, I take it back. I think ST is not going to help much as the way it seems to work is to take a regular old haskell value from which you create an STRef, then do an update, and return a new Haskell value. Unless the intermediate shifts in the PRNG cause a lot of allocations and such, it may be of little value.

A simple solution would be to just take the source at http://hackage.haskell.org/packages/archive/xorshift/2.0.1/doc/html/src/Random-Xorshift-Int32.html#makeXorshift32 and modify it slightly. Looking at it now however I’m not sure whether that would help much, as it uses the code:
step32 :: Xorshift32 -> Xorshift32
step32 (Xorshift32 a) = Xorshift32 d
  where
    b = a `xor` (shiftL a 13)
    c = b `xor` (shiftR b 17)
    d = c `xor` (shiftL c 5)
which is the same as Scala’s prng, yet the Scala version performs faster than the Haskell one even though it’s using 4 random numbers per room and the Haskell one only uses two.

I’ll try changing the unboxed vector of random integers to a lazy vector and see if that helps at all.