I recently upgraded my phone to a Galaxy S3, an ARMv7 device with 2 GB of RAM and a quad-core 1.3 GHz processor. Installing Arch Linux on it (only in a chroot; I’m not brave enough to boot directly into it) seemed like the right thing to do, and once that was done I thought it’d be interesting to do some benchmarking and see how various language implementations perform on ARM compared to on my x86-64 laptop (Intel dual-core i5 M430 @ 2.27 GHz).

This benchmark involves repeatedly calculating a large number of Fibonacci numbers. The exact code used is available here. Note that the purpose of this benchmark is to compare the ARM and x86 implementations of each language, not to compare the different languages to each other, so no effort has been put into optimising the individual implementations.

The benchmarks were run with an input of 500 million, with the speed of each language expressed as a percentage of the speed of the C implementation. Expressing each language in terms of C speed, rather than in absolute terms, minimises the effect of the different processor speeds and makes the two machines easy to compare.

The results from my x86-64 laptop are as follows (input of 500 million):

| Language       | Running time (ms) | % C speed |
|----------------|-------------------|-----------|
| C (Clang)      | 3726              | 100       |
| C# (Mono)      | 7625              | 48.8656   |
| Go             | 8489              | 43.8921   |
| Haskell        | 9839              | 37.8697   |
| Java (OpenJDK) | 4237              | 87.9396   |
| LuaJit         | 15135             | 24.6184   |
| Ocaml          | 9668              | 38.5395   |
| Racket         | 22785             | 16.3529   |
| Rust           | 3910              | 95.3      |

The results from my ARMv7 Galaxy S3 (input of 500 million):

| Language       | Running time (ms) | % C speed |
|----------------|-------------------|-----------|
| C (Clang)      | 5051              | 100       |
| C# (Mono)      | 52736             | 9.5779    |
| Go             | 12246             | 41.2461   |
| Haskell        | 22304             | 22.6462   |
| Java (OpenJDK) | 50079             | 10.0861   |
| LuaJit         | 21738             | 23.2358   |
| Ocaml          | 28774             | 17.554    |
| Racket         | 56519             | 8.93682   |
| Rust           | 5741              | 87.9812   |

The performance of each language on ARMv7 expressed as a percentage of its x86-64 performance:

| Language       | Performance (% x86-64 speed) |
|----------------|------------------------------|
| C              | 100                          |
| LuaJit         | 94.3839                      |
| Go             | 93.9716                      |
| Rust           | 92.3190                      |
| Haskell        | 59.8003                      |
| Racket         | 54.6498                      |
| Ocaml          | 45.5481                      |
| C# (Mono)      | 19.6005                      |
| Java (OpenJDK) | 11.4693                      |

The benchmarks can be run with `sh fib.sh 500000000 true`, replacing true with false if the executables have already been compiled. This prints an HTML table to stdout and stores the results in text form in `./sortedResults`. Rename that file to something like armResults or x86Results; two such files can then be compared with `sh fibdiff.sh armResults x86Results`, which outputs the above HTML table to stdout.

Again, note the purpose of this benchmark is not to compare the languages with each other, but to see how well each one performs on ARM compared to on x86.

Interestingly, the Rust implementation actually runs faster on ARM. I suspect this indicates a bug of some sort in the compiler, as Rust generally performs at close to C speed. Both the ARM and x86 implementations are running on the latest Arch Linux build, but the Rust version on ARM appears to be “0.11.0-pre”, whereas on x86 it’s “0.11.0”, so it’s possible there was some regression between those two versions.

*Edit: the Rust implementation used Rust’s int type, which is 64-bit on x86-64, whereas C’s int is 32-bit. This exaggerated Rust’s speed relative to C on ARM. The results tables have now been updated to show results using Rust’s fixed-size i32 type, which is much faster on x86-64. It can now be seen that Rust performs quite well on both platforms, which makes sense considering its use of the LLVM backend.

*Another edit: according to strncat on Reddit: “Rust is slower than C at this task on ARM (and x86_64, once the 64-bit overhead is gone) because it uses LLVM’s segmented stack support for stack overflow checking. The sane alternative is outputting a one byte write in multiples of the guard size, so 99.5% of functions won’t have any overhead, and the remaining ones will only have some writes rather than the branch that is there today.”

Java on ARM is hilariously slow, running at a tenth of its x86 speed. I suspect this is because the JVM I’m using, OpenJDK, doesn’t yet have a JIT compiler on ARM, or, if it does, it’s not particularly well developed. The Oracle JVM is apparently faster on ARM.

C# (via Mono) is also much slower on ARM, at a fifth of its x86 speed. Presumably Mono’s ARM code generation hasn’t received as much attention as its x86 code generation.

LuaJIT’s performance on ARM, almost equal to its x86 performance, proves once again that Mike Pall is a genius. Interestingly, since every number in Lua is a double, it may actually have an advantage in this case: there’s a modulus operation in the code, and ARM (as far as I’m aware) doesn’t have hardware support for integer division, only float division.

Go also runs surprisingly well on ARM, nearly matching its x86 performance. I’m not sure whether this is due to a lot of work having gone into ARM code generation or not much work having gone into x86 code generation, but it certainly bodes well for the plans to allow the use of Go in the Android NDK.

Racket, OCaml and Haskell all run at about half their x86 speed on ARM, which seems reasonable: I can’t imagine they’re often run on ARM, so ARM performance probably hasn’t received much attention.

As for the C implementation, it’s interesting to note that its absolute speed on the ARM device is almost 74% of its x86 speed (3726 ms vs 5051 ms), in spite of the x86 device being a 16-inch i5 laptop and the ARM device being just a 5-inch phone.

Miscellanea

The Haskell implementation is named hsfib.hs instead of fib.hs, breaking from the naming convention of the other implementations. Why? If I name it fib.hs, GHC will notice the fib.o left over from compiling the OCaml implementation and try to link against it, with hilarity ensuing shortly thereafter. This could be avoided if GHC checked whether the object files it was attempting to link were actually Haskell object files.

Getting an Integer from a Number in Typed Racket seems to take a lot of work: `(numerator (inexact->exact (real-part n)))`. There’s probably a simpler method I’m missing, though.

There’s a Common Lisp implementation in the repo, but the ARM support of SBCL is quite recent and I couldn’t get it to build.

At least 70% of the C# implementation is just copy-pasted from the Java implementation. Syntactically, those languages seem even more similar than Lisp and Scheme.

Recursion in Lua is interesting: a comparison like `if n == 1` does not work out well, due to the nature of floating-point equality, so `if n < 1` is necessary instead.

The correctness of the Rust and C implementations is actually optimisation-dependent: compiled with -O they run in constant space, but compiled without it they use linear stack space and fail with a stack overflow. Never, ever write code like this for any project that’s even remotely important.

OCaml’s Emacs support is really awesome. I installed a couple of plugins, Tuareg and Merlin, and they gave me not only autocompletion but also automatic error checking on every save. I was particularly impressed when, combined with OCaml’s type-checking, this warned me that the float I was accidentally passing to printf didn’t match the type required by the %d in the format string; even Haskell’s default printf function doesn’t do compile-time checking of the format string.

*Edit: I was asked why I didn’t include Nimrod. Nimrod compiles to C, so it would be the same speed on both ARM and x86. Seems like it would be a good option for ARM development.

Bonus

I ran the Java implementation of the benchmark under Dalvik. The results are rather underwhelming, to say the least. Here are the results from running with an input of 500 million (the same as above), with the Dalvik results included.

| Language       | Running time (ms) | % C speed |
|----------------|-------------------|-----------|
| C              | 5051              | 100       |
| C# (Mono)      | 52736             | 9.5779    |
| Dalvik         | 75699             | 6.6725    |
| Go             | 12246             | 41.2461   |
| Haskell        | 22304             | 22.6462   |
| Java (OpenJDK) | 50079             | 10.0861   |
| LuaJit         | 21738             | 23.2358   |
| Ocaml          | 28774             | 17.554    |
| Racket         | 56519             | 8.93682   |
| Rust           | 5741              | 87.9812   |

Now that’s why the new ahead-of-time-compiled Android Runtime (ART) is A Good Thing. It’s also possibly part of the reason the Android UI is less fluid than the iPhone’s, and I say this as someone who wouldn’t use an iPhone if you paid me. Where would Objective-C appear in that table? Right at the top: Objective-C is a superset of C, so valid C code is generally valid Objective-C code, making it extremely easy to write fast Objective-C (just write C).

Note, however, that I ran the Dalvik file (compiled with `dx --dex --output=fib.dex fib.class`) with `dalvikvm -cp fib.dex fib`, the simplest way to run it. It’s quite possible there’s a better way to compile or run it that makes it faster; I’m not particularly familiar with Android development. There’s apparently a tool called ‘dexopt’, which runs automatically when a new apk is installed, but I’m not sure whether it runs when I execute a .dex file manually with `dalvikvm`. Running `dexopt` just tells me to look at the dexopt source for usage information; I don’t have time to figure it out right now, but if someone wants to send me a script for running dexopt on a .dex file, I’ll be happy to test it.

The point is not to compare the languages to each other, but only each language with itself on ARM. I imagine the common case when porting a program to ARM is that it primarily uses ints, not i32s; in my experience, plain ints are more commonly used than fixed-size types in other languages.

Then why include tables containing ratios relative to C at all? Or at least move them further down as annex data, instead of making them the first visible tables.

Your benchmarks ought to do the exact same thing in all languages. If Lua uses floats, it’s not doing the same thing. Neither would JS, but you’re not including it, even though it’s a ubiquitous language and more relevant on current ARM devices, and thus far more interesting.

You’re timing the number parsing in most languages except Lua and Rust; in some of them you’re also timing the cast (Go) or transformation (Racket) to integer; and finally, you’re timing the formatted output.

Simply take all of that out of the benchmarks; it introduces a constant overhead that is not interesting. I don’t care if the code isn’t as tidy, as long as it’s correct.

If you care about number parsing, integer casting, float-to-integer truncation, or text formatting, make separate benchmarks for each of those.

You ought to justify why you use accumulating recursion in some languages and not in others. Is it that those languages actually perform tail-call optimisation? Then they should support deep recursion without a stack overflow, unlike some of the other languages. You should either eliminate recursion where tail calls are not guaranteed, or test stack exhaustion for an advantage/disadvantage analysis.

You ought to justify why you didn’t declare fixnums or 32-bit integers in Common Lisp (I know you didn’t include it in the article, but it’s in the repo) and Racket, because you could make it much faster. On the other hand, if you don’t optimize them, they’ll support bignums, unlike a few of the other languages which will either fail (not that bad) or silently overflow (really bad). You should either optimize them, or include tests where languages with number restrictions either fail/overflow (32-bit integers) or yield imprecise results (floating-point numbers) for an advantage/disadvantage analysis.

Conclusion: your performance benchmarks are not testing just one thing, they’re not 1:1 comparable, some are not even correct, and you’re not providing an analysis of the advantages and disadvantages of each language, i.e. what you lose by using C, or conversely what you gain by using languages with bignums.

Overall, this is a good effort, but for this kind of post, you must level up.

You are comparing the performance degradation of various languages on ARM. For this comparison to have any value, you need to perform equivalent tasks, because the degree to which different tasks are optimised for ARM differs. Taking your claim to its logical extreme, you could compute a performance degradation ratio for matrix multiplication in one language and image processing in another, then present them in the same table, saying “that’s fine, I wasn’t comparing them to each other anyway”. Different workloads have different degrees of optimisability, which affects how much variation you’re going to see between x86_64 and ARM.

You’re correct, I’m using an i5 M, 2010 model. The only flag for the C was -O3.

I saw on Reddit that the speedup comes from OCaml using a magic constant instead of idiv for the mod operation, which is very impressive. I actually considered doing that manually in the implementations, but I could only find a formula for division, not for computing the modulus.

You’re probably not the right person to ask, but will 4.02 include the reentrant OCaml runtime? Real threading support in OCaml is probably one of the things I’m looking forward to most this year.

I experimented a little with your Haskell code and found that changing the `mod` operation to `rem` (valid since the second argument is positive) gave me a 20% boost in performance (you’ll need to rename the variable rem in `doWork` to be able to use this Prelude function). Could you try this as well?

I tried that, and it actually made the Haskell version around 40% faster than the C version. I suspect that in that case GHC is smart enough to figure out that the Fibonacci function doesn’t need to be called n times, whereas Clang isn’t.

The benchmark game compares the performance of different languages on particular platforms. This is meant only to compare the performance of each language on multiple platforms, not to compare languages with each other.

Nine tables of two rows doesn’t seem particularly ergonomic. I’m going to take the traditionalist approach and maintain that I’m comparing what I say I’m comparing, not what my tables apparently suggest I’m comparing.

Right, but that optimisation wouldn’t affect the relative performance difference between OCaml on x86 and ARM, would it?

I’ll admit it was probably a misnomer to call it a Fibonacci number benchmark, as the Fibonacci number bit is only there to stop the compiler from optimising out the whole loop. I just wanted an extremely simple program to compare ARM and x86 performance.