*Edit – For clarity, I’ve decided to just include the most up-to-date, fair and relevant results tables at the top here. Scroll down below to read more about the optimisation process and the effectiveness of various different implementations.

NumTrades = 150

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 0.0944 | 248 | 100.00 | 352224 | 757 | 60 | 1502 | 9 |
| Clojure | – | 0.0000 | 607 | 40.86 | 893908 | 976 | 79 | 2814 | 0 |
| Java3 | javac | 0.8451 | 624 | 39.74 | 826964 | 864 | 119 | 3386 | 2 |
| Scala | – | 0.0000 | 737 | 33.65 | 751168 | 787 | 57 | 1719 | 0 |
| Common Lisp | – | 0.0000 | 1152 | 21.53 | 902192 | 715 | 44 | 2103 | 0 |
| Python2 (Pypy)2 | – | 0.0000 | 1716 | 14.45 | 1382304 | 609 | 42 | 1382 | 0 |
| C#3 | mcs | 0.7248 | 2099 | 11.82 | 727712 | 893 | 114 | 3268 | 4 |
| F#3 | fsharpc | 4.2022 | 2126 | 11.67 | 731808 | 660 | 39 | 1541 | 12 |
| Racket | raco | 17.1353 | 2150 | 11.53 | 875424 | 691 | 38 | 1959 | 11142 |
| Go3 | go | 0.3930 | 2516 | 9.86 | 352952 | 656 | 62 | 1313 | 2799 |
| Ruby1 | – | 0.0000 | 6122 | 4.05 | 1070024 | 636 | 52 | 1266 | 0 |
| JuliaBigNum | – | 0.0000 | 8932 | 2.78 | 2526800 | 565 | 52 | 1227 | 0 |
| Tcl | – | 0.0000 | 18754 | 1.32 | 1702460 | 543 | 45 | 1130 | 0 |

NumTrades = 100

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 3.0596 | 210 | 100.00 | 235036 | 757 | 60 | 1502 | 9 |
| Clojure | – | 0.0000 | 295 | 71.19 | 706660 | 976 | 79 | 2814 | 0 |
| Java3 | javac | 2.8895 | 421 | 49.88 | 561296 | 864 | 119 | 3386 | 2 |
| Scala | – | 0.0000 | 478 | 43.93 | 479916 | 787 | 57 | 1719 | 0 |
| Common Lisp | – | 0.0000 | 644 | 32.61 | 866428 | 715 | 44 | 2103 | 0 |
| Python2 (Pypy)2 | – | 0.0000 | 1036 | 20.27 | 955904 | 609 | 42 | 1382 | 0 |
| F#3 | fsharpc | 7.6323 | 1082 | 19.41 | 492912 | 660 | 39 | 1541 | 12 |
| C#3 | mcs | 1.7726 | 1144 | 18.36 | 488812 | 893 | 114 | 3268 | 4 |
| Racket | raco | 26.5090 | 1208 | 17.38 | 639700 | 691 | 38 | 1959 | 11142 |
| Go3 | go | 2.8093 | 1921 | 10.93 | 235768 | 656 | 62 | 1313 | 2799 |
| Ruby1 | – | 0.0000 | 3045 | 6.90 | 675208 | 636 | 52 | 1266 | 0 |
| Lua | – | 0.0000 | 3780 | 5.56 | 2202448 | 576 | 53 | 1305 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 5545 | 3.79 | 611088 | 712 | 76 | 1641 | 0 |
| JuliaBigNum | – | 0.0000 | 7836 | 2.68 | 1012580 | 565 | 52 | 1227 | 0 |
| Tcl | – | 0.0000 | 9865 | 2.13 | 1135652 | 543 | 45 | 1130 | 0 |

NumTrades = 50

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 3.6418 | 81 | 100.00 | 117848 | 757 | 60 | 1502 | 8 |
| Common Lisp | – | 0.0000 | 84 | 96.43 | 324988 | 715 | 44 | 2103 | 0 |
| Clojure | – | 0.0000 | 117 | 69.23 | 403120 | 976 | 79 | 2814 | 0 |
| Racket | raco | 20.1401 | 132 | 61.36 | 373368 | 691 | 38 | 1959 | 11142 |
| Java3 | javac | 2.8739 | 176 | 46.02 | 318364 | 864 | 119 | 3386 | 2 |
| Scala | – | 0.0000 | 213 | 38.03 | 276620 | 787 | 57 | 1719 | 0 |
| LuaJit | – | 0.0000 | 230 | 35.22 | 649116 | 576 | 53 | 1305 | 0 |
| F#3 | fsharpc | 5.3502 | 381 | 21.26 | 254008 | 660 | 39 | 1541 | 12 |
| C#3 | mcs | 1.4964 | 389 | 20.82 | 249920 | 893 | 114 | 3268 | 4 |
| Python2 (Pypy)2 | – | 0.0000 | 428 | 18.93 | 482480 | 609 | 42 | 1382 | 0 |
| Go3 | go | 2.3946 | 814 | 9.95 | 118576 | 656 | 62 | 1313 | 2799 |
| Ruby1 | – | 0.0000 | 1435 | 5.64 | 301384 | 636 | 52 | 1266 | 0 |
| Lua | – | 0.0000 | 1720 | 4.71 | 1101764 | 576 | 53 | 1305 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 1825 | 4.44 | 311132 | 712 | 76 | 1641 | 0 |
| JuliaBigNum | – | 0.0000 | 2424 | 3.34 | 907624 | 565 | 52 | 1227 | 0 |
| Python3 | – | 0.0000 | 2425 | 3.34 | 1338756 | 497 | 32 | 1126 | 0 |
| Tcl | – | 0.0000 | 4102 | 1.97 | 569216 | 543 | 45 | 1130 | 0 |

NumTrades = 10

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| Common Lisp | – | 0.0000 | 16 | 100.00 | 111068 | 715 | 44 | 2103 | 0 |
| C3 | clang | 0.0966 | 16 | 100.00 | 24096 | 757 | 60 | 1502 | 9 |
| Racket | raco | 17.0957 | 26 | 61.54 | 190028 | 691 | 38 | 1959 | 11142 |
| Clojure | – | 0.0000 | 27 | 59.26 | 121268 | 976 | 79 | 2814 | 0 |
| Java3 | javac | 0.8250 | 39 | 41.03 | 96224 | 864 | 119 | 3386 | 2 |
| LuaJit | – | 0.0000 | 40 | 40.00 | 128036 | 576 | 53 | 1305 | 0 |
| Scala | – | 0.0000 | 41 | 39.02 | 148820 | 787 | 57 | 1719 | 0 |
| F#3 | fsharpc | 4.1270 | 63 | 25.40 | 62920 | 660 | 39 | 1541 | 12 |
| C#3 | mcs | 0.7194 | 64 | 25.00 | 58820 | 893 | 114 | 3268 | 4 |
| Python2 (Pypy)2 | – | 0.0000 | 85 | 18.82 | 113608 | 609 | 42 | 1382 | 0 |
| Go3 | go | 0.5028 | 158 | 10.13 | 24828 | 656 | 62 | 1313 | 2799 |
| JuliaBigNum | – | 0.0000 | 273 | 5.86 | 316656 | 565 | 52 | 1227 | 0 |
| Ruby1 | – | 0.0000 | 286 | 5.59 | 65244 | 636 | 52 | 1266 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 324 | 4.94 | 66276 | 712 | 76 | 1641 | 0 |
| Lua | – | 0.0000 | 370 | 4.32 | 216136 | 576 | 53 | 1305 | 0 |
| Python3 | – | 0.0000 | 485 | 3.30 | 273000 | 497 | 32 | 1126 | 0 |
| Tcl | – | 0.0000 | 813 | 1.97 | 115772 | 543 | 45 | 1130 | 0 |

I recently came across an interesting blog post that dealt with optimising a common situation in which the JVM performed surprisingly poorly, which prompted me to try the microbenchmark in a few other languages to see how they fared. The code for this benchmark can be found here.

The problem: iterating over an array of objects modifying and reading its contents. What’s interesting about it? Firstly, as this is a relatively common task, the results give a tiny bit of insight into the ‘general’ speed of the languages tested. Secondly, it’s directly relevant to the effectiveness of object pooling, a technique I’m using in the other benchmark I’m working on. The idea of an object pool is to statically allocate the needed objects at initialisation and reuse them instead of creating new objects, to avoid generating garbage. The effectiveness of this technique would however be limited if it’s impossible to access and modify objects in the pool without generating garbage.
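The pooling idea can be sketched in a few lines of Python (a minimal illustration only; the `Trade` fields and names here are hypothetical, not taken from the benchmark code):

```python
# Minimal object-pool sketch. Objects are allocated once at start-up and
# then mutated in place, so steady-state operation creates no garbage.
class Trade:
    def __init__(self):
        self.trade_id = 0
        self.price = 0
        self.quantity = 0

class TradePool:
    def __init__(self, size):
        # Allocate every object up front, at initialisation.
        self._objects = [Trade() for _ in range(size)]

    def get(self, i):
        # Hand back an existing object to be mutated in place;
        # no allocation happens here, so no garbage is generated.
        return self._objects[i]

pool = TradePool(1000)
t = pool.get(42)
t.price = 500  # reuse: mutate the pooled object, don't construct a new one
```

The technique only pays off if step `get` plus the field mutations really are allocation-free, which is exactly the property this benchmark probes.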

Why are some of the usual suspects (D, Rust, Nimrod, C++) missing? They use the same backend as C, so they’d produce results of similar speed, and their metaprogramming facilities are powerful enough that it’d be easy to ‘cheat’ and do all the calculation at compile time. Submissions in these languages are nevertheless quite welcome, as long as they don’t ‘cheat’.

Now, the results (but don’t neglect to read the rest of the post, as there’s a twist at the end). The value ‘NumTrades’ multiplied by 50,000 gives the number of objects in the array of trade objects. Note that if a particular language/implementation is missing from a certain NumTrades value’s results, it either ran out of memory or was otherwise killed by the OS. The tests were run on a dual-core 64-bit 1.6 GHz Intel i5 with 4 GB of RAM, running 64-bit Linux Mint 15.

*Edit: It should be noted that this set of graphs tests the form of allocation used in the blog post that inspired this one. That form is suboptimal, however, and the set of graphs further below involves implementations using an optimal method of allocation/object pooling. This first set is mainly here for comparison, to show how a small change can lead to a significant speedup. Also note that the Ruby, Smalltalk and Lua implementations in this set use the faster form of allocation, as without it they ran out of memory and died too soon. It is hence fairer to compare them with the implementations suffixed with 2, in the later set of graphs. In fact, these graphs are partly here just to explain the optimisation process; if you’re interested in the fairest numbers, scroll to the final set of graphs.

NumTrades = 150

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| x86-64 assembly | fasm | 0.0318 | 110 | 100.00 | 300524 | 2279 | 180 | 5840 | 8 |
| Go | go | 0.3103 | 124 | 88.71 | 352484 | 614 | 61 | 1206 | 2256 |
| C | clang | 0.0737 | 127 | 86.61 | 352044 | 712 | 56 | 1425 | 9 |
| JavaUnSafe | javac | 1.1290 | 203 | 54.19 | 326036 | 1182 | 156 | 5148 | 2 |
| CLispFixNum | – | 0.0000 | 228 | 48.24 | 810536 | 745 | 44 | 2152 | 0 |
| Julia | – | 0.0000 | 298 | 36.91 | 609564 | 552 | 52 | 1228 | 0 |
| Common Lisp | – | 0.0000 | 1120 | 9.82 | 902192 | 715 | 44 | 2103 | 0 |
| Java | javac | 2.1709 | 1575 | 6.98 | 918820 | 819 | 121 | 3329 | 1 |
| C# | mcs | 0.6872 | 1587 | 6.93 | 999892 | 878 | 115 | 3301 | 4 |
| Racket | raco | 16.8814 | 1955 | 5.63 | 877364 | 691 | 38 | 1959 | 11142 |
| F# | fsharpc | 4.2951 | 5761 | 1.91 | 1452868 | 646 | 40 | 1526 | 12 |
| Python2 (Pypy) | – | 0.0000 | 9489 | 1.16 | 1519092 | 561 | 38 | 1192 | 0 |
| Tcl | – | 0.0000 | 14486 | 0.76 | 1702460 | 544 | 46 | 1170 | 0 |

NumTrades = 100

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| x86-64 assembly | fasm | 0.3159 | 70 | 100.00 | 200468 | 2279 | 180 | 5840 | 8 |
| Go | go | 0.3185 | 82 | 85.37 | 235300 | 614 | 61 | 1206 | 2256 |
| C | clang | 1.8142 | 84 | 83.33 | 234856 | 722 | 56 | 1435 | 9 |
| JavaUnSafe | javac | 1.1487 | 135 | 51.85 | 223596 | 1182 | 156 | 5148 | 2 |
| CLispFixNum | – | 0.0000 | 152 | 46.05 | 594160 | 745 | 44 | 2152 | 0 |
| Julia | – | 0.0000 | 200 | 35.00 | 415212 | 552 | 52 | 1228 | 0 |
| Common Lisp | – | 0.0000 | 656 | 10.67 | 866428 | 715 | 44 | 2103 | 0 |
| C# | mcs | 0.7074 | 953 | 7.35 | 966208 | 890 | 116 | 3320 | 4 |
| Java | javac | 2.7260 | 1052 | 6.65 | 651904 | 819 | 121 | 3329 | 1 |
| Racket | raco | 20.3636 | 1237 | 5.66 | 639704 | 691 | 38 | 1959 | 11142 |
| F# | fsharpc | 6.2570 | 1393 | 5.03 | 976720 | 646 | 40 | 1526 | 12 |
| Python2 (Pypy) | – | 0.0000 | 2234 | 3.13 | 1183108 | 561 | 38 | 1192 | 0 |
| Lua | – | 0.0000 | 4580 | 1.53 | 2202468 | 554 | 52 | 1208 | 0 |
| Ruby | – | 0.0000 | 4753 | 1.47 | 2233092 | 605 | 56 | 1250 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 5313 | 1.32 | 611088 | 712 | 76 | 1641 | 0 |
| Tcl | – | 0.0000 | 9279 | 0.75 | 1135660 | 544 | 46 | 1170 | 0 |
| Python3 | – | 0.0000 | 10654 | 0.66 | 1582084 | 607 | 39 | 1268 | 0 |
| Clojure | – | 0.0000 | 86824 | 0.08 | 994700 | 960 | 53 | 2811 | 0 |

NumTrades = 50

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| Go | go | 1.8760 | 40 | 100.00 | 118112 | 614 | 61 | 1206 | 2256 |
| x86-64 assembly | fasm | 0.6947 | 40 | 100.00 | 100416 | 2279 | 180 | 5840 | 8 |
| C | clang | 1.6018 | 43 | 93.02 | 117668 | 722 | 56 | 1435 | 7 |
| JavaUnSafe | javac | 1.2099 | 73 | 54.79 | 121152 | 1182 | 156 | 5148 | 2 |
| Common Lisp | – | 0.0000 | 96 | 41.67 | 324988 | 715 | 44 | 2103 | 0 |
| Julia | – | 0.0000 | 99 | 40.40 | 271700 | 552 | 52 | 1228 | 0 |
| Racket | raco | 21.5142 | 130 | 30.77 | 373364 | 691 | 38 | 1959 | 11142 |
| LuaJit | – | 0.0000 | 230 | 17.39 | 649136 | 554 | 52 | 1208 | 0 |
| C# | mcs | 1.5198 | 460 | 8.70 | 488508 | 890 | 116 | 3320 | 4 |
| F# | fsharpc | 5.7907 | 480 | 8.33 | 455608 | 646 | 40 | 1526 | 12 |
| Java | javac | 3.0675 | 677 | 5.91 | 493744 | 819 | 121 | 3329 | 1 |
| Python2 (Pypy) | – | 0.0000 | 1010 | 3.96 | 620820 | 561 | 38 | 1192 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 1835 | 2.18 | 311136 | 712 | 76 | 1641 | 0 |
| Lua | – | 0.0000 | 2150 | 1.86 | 1101764 | 554 | 52 | 1208 | 0 |
| Ruby | – | 0.0000 | 2197 | 1.82 | 1046204 | 605 | 56 | 1250 | 0 |
| Python3_fst | – | 0.0000 | 2460 | 1.62 | 1338748 | 497 | 32 | 1127 | 0 |
| Tcl | – | 0.0000 | 4121 | 0.97 | 569212 | 544 | 46 | 1170 | 0 |
| Python3 | – | 0.0000 | 4951 | 0.81 | 794376 | 607 | 39 | 1268 | 0 |
| Clojure | – | 0.0000 | 10246 | 0.39 | 858024 | 960 | 53 | 2811 | 0 |

NumTrades = 10

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| Go | go | 1.6017 | 8 | 100.00 | 24360 | 614 | 61 | 1206 | 2256 |
| C | clang | 1.6518 | 8 | 100.00 | 23920 | 722 | 56 | 1435 | 9 |
| Java | javac | 1.1308 | 13 | 61.54 | 117792 | 819 | 121 | 3329 | 1 |
| JavaUnSafe | javac | 1.0829 | 14 | 57.14 | 39256 | 1182 | 156 | 5148 | 2 |
| Common Lisp | – | 0.0000 | 16 | 50.00 | 111068 | 715 | 44 | 2103 | 0 |
| Julia | – | 0.0000 | 20 | 40.00 | 164340 | 552 | 52 | 1228 | 0 |
| Racket | raco | 17.0663 | 26 | 30.77 | 190052 | 691 | 38 | 1959 | 11142 |
| LuaJit | – | 0.0000 | 50 | 16.00 | 128032 | 554 | 52 | 1208 | 0 |
| C# | mcs | 0.6989 | 67 | 11.94 | 106228 | 890 | 116 | 3320 | 4 |
| F# | fsharpc | 4.4070 | 83 | 9.64 | 116864 | 646 | 40 | 1526 | 12 |
| Python2 (Pypy) | – | 0.0000 | 203 | 3.94 | 142012 | 561 | 38 | 1192 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 324 | 2.47 | 66276 | 712 | 76 | 1641 | 0 |
| Ruby | – | 0.0000 | 448 | 1.79 | 212004 | 605 | 56 | 1250 | 0 |
| Lua | – | 0.0000 | 450 | 1.78 | 216132 | 554 | 52 | 1208 | 0 |
| FasterPython | – | 0.0000 | 484 | 1.65 | 273000 | 497 | 32 | 1127 | 0 |
| Tcl | – | 0.0000 | 821 | 0.97 | 115768 | 544 | 46 | 1170 | 0 |
| Python3 | – | 0.0000 | 1017 | 0.79 | 163864 | 607 | 39 | 1268 | 0 |
| Clojure | – | 0.0000 | 1121 | 0.71 | 322328 | 960 | 53 | 2811 | 0 |

Take the results for Python, Ruby and Julia with a grain of salt, as I’m less experienced with these languages and there are probably optimisations I’ve missed. Suggestions are welcome! (Note also that the rightmost column of each table represents the size of the resulting executable.)

*Edit: The Ruby and Julia versions have been improved since the benchmark was originally run, and now perform quite impressively. A version that’s faster in CPython (but not in Pypy) has also been added, as Python3_fst.

Points of interest:

Pypy – Significantly faster than standard Python 3 (CPython); quite impressive. For such an otherwise elegant language, however, writing constructors in Python is surprisingly awkward. Or perhaps it’d be more accurate to say that the fact one even has to write constructors in the first place seems odd; is there a default constructor I’m not aware of, or a more concise way to write the following? *edit: there is indeed a better way to write this. I need to learn more Python.
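For illustration (the original snippet isn’t shown here, and these field names are placeholders, not the benchmark’s), the verbose pattern looks something like the first class below; the `dataclasses` module, which postdates this post (Python 3.7), generates the constructor for you:

```python
from dataclasses import dataclass

# The verbose hand-written pattern: every field assigned one by one.
class Trade:
    def __init__(self, trade_id, price, quantity):
        self.trade_id = trade_id
        self.price = price
        self.quantity = quantity

# The concise modern alternative: __init__ is generated automatically,
# and instances remain mutable.
@dataclass
class TradeConcise:
    trade_id: int = 0
    price: int = 0
    quantity: int = 0
```

Both classes behave the same for the benchmark’s purposes: construct, then mutate fields freely.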

C# – Faster than Java, and uses less memory. An interesting note: I originally ran the benchmark with the older Mono 2.1 runtime, before having to upgrade to 3.2 to get F# to compile, and the C# implementation actually ran faster there. Presumably the old garbage collector was better suited to this case, whereas maybe the new one is more like Java’s.

x86-64 assembly – Uses less memory, courtesy of struct packing. Normally languages pad structs to align memory, so that each access of a particular type is at an address that’s a multiple of that type’s size. So an int32 can be accessed at addresses 4, 8, 12, etc., whereas an int64 can only be accessed at 8, 16, 24, etc. The trade struct in this benchmark is long, long, int, int, long, long, byte; that’s 8, 8, 4, 4, 8, 8, 1, or 41 bytes total. When sticking these in an array, since 41 is not a multiple of 8, 7 bytes of padding are needed at the end of each struct so that the long at the start of the next one is aligned. As assembly doesn’t have structs, just byte arrays, it’s possible to avoid this padding. Note that while unaligned access is faster in this case, it’s generally thought to be slower, and on non-x86 architectures it can cause the program to segfault. Also, you can do the same in C with compiler extensions, and that’s generally considered a much better way to achieve unaligned data structures than writing the whole program in assembly.
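The padding arithmetic above can be checked with Python’s `ctypes`, which follows C’s layout rules (field names are placeholders; only the type sequence matters):

```python
import ctypes

# The trade layout from the text: long, long, int, int, long, long, byte.
class PaddedTrade(ctypes.Structure):
    _fields_ = [("a", ctypes.c_int64), ("b", ctypes.c_int64),
                ("c", ctypes.c_int32), ("d", ctypes.c_int32),
                ("e", ctypes.c_int64), ("f", ctypes.c_int64),
                ("g", ctypes.c_int8)]

class PackedTrade(ctypes.Structure):
    _pack_ = 1  # suppress alignment padding, like the hand-packed assembly
    _fields_ = PaddedTrade._fields_

print(ctypes.sizeof(PaddedTrade))  # 48 on x86-64: 41 data bytes + 7 padding
print(ctypes.sizeof(PackedTrade))  # 41
```

So across 150 × 50,000 trades, packing saves 7 bytes per element, which is where the assembly version’s memory advantage comes from.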

Note that I’m a pretty mediocre x86 assembly programmer; it might well be faster if written by someone more experienced. Pull requests are welcome. Also note that if you exclude the printing code (I couldn’t get printf to work without segfaulting, so I copy-pasted an integer-printing routine. Take a look; printing integers is actually ridiculously slow and complicated compared to printing strings.), the assembly code is still more concise than its Java equivalent. Heck, going by plain line count the entire assembly file, including the ridiculous integer-printing routine, is 225 lines, only around 15% more than the unsafe Java code. To be fair, however, I should note that the Java code does include numerous setters and getters, which aren’t strictly necessary. For those who aren’t familiar with assembly, here’s an example of setting the values in a struct, where rcb is the address of the struct and rbx is the value you want to set tradeId, price and quantity to (please don’t hesitate to let me know if you spot any mistakes, as I’m quite confident the x86-64 implementation doesn’t function completely as intended):

Go – Almost matches C, which is quite impressive considering how many more man-hours have been put into LLVM than into the Go compiler. Why is it so much faster than C# and Java? Maybe because it lets me chase pointers, manipulating data in the array directly without generating any temporary objects, and hence minimising garbage creation. Something I learned: on x86-64, Go ints are 64-bit; I initially made the mistake of assuming they were 32-bit like C ints.

Common Lisp (SBCL) – Incredibly impressive speed for a dynamic language. Note in particular that I didn’t even add type annotations, as my attempts at adding them didn’t seem to speed the implementation up at all. The speed drops as NumTrades increases, however, and profiling with time() reveals that garbage collection takes up a significant share (60%+) of the running time. This means that either the compiler isn’t able to convert:

or that it’s unable to stack allocate trade-ref so as to avoid it becoming heap garbage. That, or its bignum implementation generates more garbage. The same applies to the Racket compiler and the JVM.

*Edit: Thanks to type annotations it’s possible to make the Common Lisp implementation perform even faster, by avoiding bignums and overflowing instead, as C, Go and Java do. This result is titled CLispFixNum.

Pharo – Sure it’s slow, but it’s had far fewer development hours put into it than any of the other languages used. It was, however, one of the most fun implementations to write, and also the most concise. Wait, you say, Pharo’s not in the list? I converted the program to GNU Smalltalk, as that’s easier to run from the command line and around 50% faster, but it was developed in Pharo. It’s also not too slow compared to CPython. If I were given the resources to write my own language (tangent incoming; feel free to skip the next two paragraphs)…

it would be a statically typed Smalltalk (or maybe something closer to Self) with type inference, a focus on referential transparency, and incremental compilation. If the methods of an immutable object only messaged other immutable objects (i.e. were pure functions), it’d be possible to compile the whole object into an optimised block that was potentially as fast as the equivalent C code. The compiler would be a first-class citizen, so it’d also be possible to compile non-referentially-transparent blocks with Compiler compile: myBlock. The runtime would keep track of all the objects such a block messaged and the responses it assumed from them, and if any of these changed then the compiled block would become invalid and revert to interpreted code. The file system would also be a first-class citizen; compiled blocks could be saved as executables that took their messages as flags (./myblock doLotsOfMathsWithInput: someNum) and output the result to stdout, which the runtime could then capture. Block data would be stored in metadata files accompanying the executables, and the filesystem could be used as if it were a Smalltalk image. Note that this would allow easy parallelism via the OS process scheduler, a bit like Erlang. It would also allow for runtime code specialisation, as described in this link, making it faster than C in some cases. For instance, the following function:

int myFun(int a, int b, int c) { return a*b + c; }

can be compiled to just return 12 + c if a and b are known at compile time to be 3 and 4, which is much faster. If these are instead runtime constants (like Java’s final), it should still be possible to do the same optimisation, but in practice the JVM doesn’t (and I don’t think the CLR does either). A language that took advantage of this kind of optimisation could be faster than C for problems where a lot of information is only known at runtime but doesn’t change once it’s loaded.
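A rough sketch of the specialisation idea in Python, using a closure to fold the constants in once they’re known (names are mine, chosen to mirror the C snippet above):

```python
# The general function: nothing is known until the call.
def my_fun(a, b, c):
    return a * b + c

# Runtime specialisation: once a and b are fixed, emit a residual
# function with the multiplication already performed.
def specialise(a, b):
    ab = a * b  # folded at specialisation time, not per call
    def specialised(c):
        return ab + c
    return specialised

f = specialise(3, 4)   # "compile" with a=3, b=4
print(f(5))            # 17, same as my_fun(3, 4, 5), minus the multiply
```

A real implementation would generate machine code for the residual function and invalidate it if the ‘constants’ ever changed, as described for the hypothetical language above.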

Racket – Very enjoyable to write. It was originally much faster, before I fixed a weird bug I was having. Can you spot the bug in the following code?

If not, neither did I. I couldn’t understand why my attempts to mutate the objects in the vector were producing weird results, especially when equivalent code in Common Lisp worked fine. Even though half the trades should have had a side value of ‘S’ and half ‘B’, for some reason after the vector was initialised they all had ‘S’, even though my initialisation code had been checked and double-checked. Then I realised: for every other field, all the objects in the vector also shared the same value, the value of the final object in the vector. Why was this? Answer: all the slots in the vector pointed to the same object. Changing the array creation code to the following:

fixed the problem. It also slowed the code down significantly. It does, however, hint at a solution to the poor cache locality of *obj arrays compared to obj arrays: have all the *obj point to the same object, as then it’ll always be in the cache…
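The same aliasing trap exists in Python, for readers who don’t know Racket (hypothetical `Trade` class, for illustration):

```python
class Trade:
    def __init__(self):
        self.side = 'S'

# Sequence repetition copies the reference, not the object:
shared = [Trade()] * 4           # four slots, ONE Trade
shared[0].side = 'B'
print([t.side for t in shared])  # ['B', 'B', 'B', 'B'] — all "mutated"

# A comprehension calls the constructor once per slot:
distinct = [Trade() for _ in range(4)]
distinct[0].side = 'B'
print([t.side for t in distinct])  # ['B', 'S', 'S', 'S']
```

The comprehension form is the correct one for the benchmark, and, as in Racket, it is slower precisely because it really does allocate one object per slot.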

Lua – LuaJit really is quite impressive, achieving a 10x speedup over the standard Lua interpreter, though still falling behind Racket and Common Lisp. Interestingly, although LuaJit appears to use less heap, plain Lua actually handles high memory use better: the reason there is no LuaJit implementation listed for NumTrades values of 100 and 150 is that it died with an out-of-memory error while running them (the benchmarking script automatically excludes languages that don’t compile or run properly).

Self – What’s Self? It’s a research language that’s like a purer form of Smalltalk. It was extremely awesome to write, but I didn’t include it as it ran out of memory on NumTrades values of more than 10 (and took over a second to run at NumTrades=10). Nevertheless, I think it’s worth checking out for fans of Smalltalk, as it does an even better job of ‘pure OO’, and is extremely elegant about it. My original implementation in Self was only around 20 lines of code, although there are a few more now, as I wasn’t sure how to concatenate strings for printing, so I printed them the ugly way.

F# – A neat language, and generally fun to write. Not quite as fast as C#, although it was originally the fastest language. Why? Firstly, I made the same mistake as in Racket, initialising an array with a function that ended up filling it with references to the same object rather than to different objects. After I fixed that, however, it was still really fast. This was because I didn’t realise that parameter-less functions have to be called with a () at the end; a function with one parameter can be called as ‘myFun param’, but a function with no parameters must be called as ‘myFun()’. In the process of figuring out what I was doing wrong, I ended up looking through the CIL (CLR bytecode) disassembly of the program, and I must say it was one of the most nauseating things I’ve ever seen in my life. Imagine a language with twice the verbosity of Java and half the clarity of x86 assembly, and you’ve got a language that’s about four times as elegant as the CLR Common Intermediate Language. I really don’t think the following should belong in bytecode: *Edit: okay, I may be wrong here, and the CIL isn’t itself bytecode, but rather is translated into bytecode. In which case, ignore this complaint.

Even functions seemed to have multiple verbose constructors defined. The total disassembly was around 1800 lines, almost an order of magnitude more than the equivalent program written by hand, or by LLVM, in an assembly language. If you ever plan on writing low-level code for a virtual machine by hand, my advice is to stick to the JVM, as its bytecode is much nicer (only 160 lines for the disassembly of Java.class, which is only 10 lines more than the original Java.java source file…). To be fair, however, the disassembly of the C# executable (as opposed to the F# one) was only around 460 lines long, and didn’t have quite as much boilerplate.

Ruby – I don’t quite understand the fuss about Ruby; like Python, it seems to be a dynamic language that wants me to write constructors (and getters/setters), or else use ugly stuff like $trades[i].instance_variable_get("@price"). To be fair, it does seem to be slightly faster than CPython, but it’s still slower than Smalltalk, and if I were going to use a language that slow I’d definitely prefer Smalltalk. Maybe someone can enlighten me as to what’s so great about Ruby? (And Rails doesn’t count, as I’m not a webdev.)

*Edit: Okay, I just mentally compared Ruby to Javascript and PHP, and was overcome with love for the language. I think I now better understand its appeal.

*Edit2: Wow, I made a small change and now the Ruby implementation is twice as fast, faster than Smalltalk and beating CPython. Now I’m impressed.

Julia – The language is quite fast on many other tasks, comparable to C or Fortran, but it seems the optimiser doesn’t know what to do with this particular program. I suspect someone more familiar with the language than I am could significantly speed it up by changing the array of trades into a one-dimensional matrix of int64s and manipulating that; the language is built around numerical computing, and the compiler has specialised optimisations it applies when it detects idiomatic numerics code. Writing in the language was quite pleasant; I think the future looks bright for Julia. It was also the most robust of the ‘slower’ languages, managing to complete NumTrades=150 where GST, Ruby and Clojure dropped out. Note, if you want to test it yourself, that the Julia compiler takes quite a while to build, maybe at least half as long as Rust. Interestingly, this is the most concise language in terms of compressed source file size.

*Edit: Wow, a patch to make the global trades array local has sped it up massively, to near C speed. This shows the importance of being familiar with the gotchas of a language currently still in development in order to maximise its performance.

Clojure – Poor performance here can probably be blamed on the JVM itself, considering how poorly Java fares. The Clojure implementation was however actually the least enjoyable to write (although normally my experience with Clojure is quite pleasant), due to how the language goes out of its way to make declaring mutable types difficult. Where Racket only requires a #mutable tag to render a struct mutable, Clojure requires declaring an interface complete with getters and setters for every value, and then declaring a type that implements these getters and setters. Somebody please tell me there’s an alternative to the following monstrosity:

I wonder if Rich Hickey ever had any bad experiences with Godzilla, as the above code suggests that he must really hate mutation. Even Haskell makes declaring mutable datastructures easier. (I’d love it if someone were to submit a Haskell implementation, by the way. My type-foo isn’t up to scratch for dealing with the state monad.)

Finally, the twist: although the C, Go and C# implementations are faster, they’re also wrong! Run the Clojure version with a sufficiently large NumTrades, changing +’ to +, and it’ll throw an ‘integer overflow’ exception, kindly letting us know (as the other implementations didn’t) that the buyCost and sellCost variables are overflowing. The other Lisps, Python and Smalltalk handle this by automatic promotion to bignum. The benchmark is hence not particularly fair: bignum operations are slower than standard integer arithmetic, and the dynamic languages are doing bignum calculations while the statically typed languages are simply overflowing. Look at the numbers for NumTrades < 100 for a fairer comparison, as no languages are using bignums there. Also note that the Clojure version can use +’ instead of + to automatically promote to bignums and avoid the overflow, which the current version does.
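The difference can be seen from Python, whose ints promote to bignums automatically; the `wrap64` helper below (my own, for illustration) simulates what a fixed 64-bit accumulator in the C/Go/Java implementations does instead:

```python
INT64_MAX = 2**63 - 1

def wrap64(x):
    """Reduce x to a signed 64-bit value, as two's-complement overflow would."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

buy_cost = INT64_MAX  # pretend the running total has hit the limit
print(buy_cost + 1)          # 9223372036854775808 — Python promotes to bignum
print(wrap64(buy_cost + 1))  # -9223372036854775808 — C-style silent wraparound
```

The statically typed implementations produce the second answer without complaint, which is why only overflow-checked or bignum-using languages noticed the bug.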

I bet you thought that was all. But no, there’s more! Long story short, the Java code in the original blog post was poorly written from a performance perspective (unnecessary ‘new’s). I cleaned it up, and voila: it’s faster than the original unsafe Java implementation and quite close to the C. Moral of the story: don’t resort to sun.misc.Unsafe until you’ve removed all the ‘new’s from your performance-sensitive code. In the following table, each language suffixed with ‘2’ is an implementation where the following change (or its equivalent) was applied:
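The change being described — mutating the preallocated objects in place instead of constructing new ones on every pass — can be sketched like this (illustrative Python with hypothetical field names, not the benchmark source):

```python
class Trade:
    def __init__(self):
        self.price = 0
        self.quantity = 0

trades = [Trade() for _ in range(1000)]  # allocated once, up front

def init_trades_slow(trades):
    # The original pattern: a fresh object per slot on every pass,
    # producing garbage for the collector to sweep each iteration.
    for i in range(len(trades)):
        t = Trade()
        t.price = i
        t.quantity = i * 2
        trades[i] = t

def init_trades_fast(trades):
    # The '2' variants: reuse the existing objects; no 'new' at all.
    for i, t in enumerate(trades):
        t.price = i
        t.quantity = i * 2
```

Both fill the array with identical values; only the fast version leaves the original objects (and the garbage collector) alone.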

It can be seen that this optimisation results in a huge speedup for the Java version, as it reduces object creation and hence garbage collection. The same goes for C#, F# and Python. Interestingly, however, it has an adverse effect on the performance of the Lisp implementations, suggesting that an object is created for each aref/vector-ref, so minimising the use of those functions produces better performance. This suggests, somewhat sadly, that object pooling may not be an effective option in Lisp, as there’s no way to mutate objects in an array without generating garbage (short of using unsafe FFI functions).

Finally, the most useful part of the benchmark: benchmarking the fastest implementation in each language, with the statically typed languages all using bignums so as to produce a fair result. The results are as follows:

NumTrades = 150

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 1.9361 | 252 | 100.00 | 352224 | 750 | 60 | 1492 | 9 |
| Java3 | javac | 2.6750 | 547 | 46.07 | 797152 | 864 | 119 | 3386 | 2 |
| Common Lisp | – | 0.0000 | 1116 | 22.58 | 902192 | 715 | 44 | 2103 | 0 |
| Racket | raco | 20.1457 | 1957 | 12.88 | 875388 | 691 | 38 | 1959 | 11142 |
| C#3 | mcs | 1.6699 | 2100 | 12.00 | 727712 | 893 | 114 | 3268 | 4 |
| F#3 | fsharpc | 5.8450 | 2125 | 11.86 | 731956 | 659 | 39 | 1541 | 12 |
| Go3 | go | 0.6288 | 2515 | 10.02 | 352956 | 656 | 62 | 1313 | 2799 |
| Python2 (Pypy) | – | 0.0000 | 3333 | 7.56 | 1519136 | 561 | 38 | 1192 | 0 |

NumTrades = 100

| Language | Compiler | Compile time (s) | Running time (ms) | % Fastest | Resident mem use (KiB) | Compressed source size | Lines of code | Number of characters | Executable size (KB) |
|---|---|---|---|---|---|---|---|---|---|
| C3 | clang | 1.7614 | 162 | 100.00 | 235036 | 750 | 60 | 1492 | 9 |
| Julia | – | 0.0000 | 198 | 81.82 | 421180 | 552 | 52 | 1228 | 0 |
| Java3 | javac | 2.9222 | 369 | 43.90 | 561684 | 864 | 119 | 3386 | 2 |
| Common Lisp | – | 0.0000 | 644 | 25.16 | 866428 | 715 | 44 | 2103 | 0 |
| F#3 | fsharpc | 5.4417 | 1073 | 15.10 | 492908 | 659 | 39 | 1541 | 12 |
| C#3 | mcs | 1.3916 | 1106 | 14.65 | 488812 | 893 | 114 | 3268 | 4 |
| Racket | raco | 20.3282 | 1209 | 13.40 | 639704 | 691 | 38 | 1959 | 11142 |
| Go3 | go | 0.6998 | 1659 | 9.76 | 235768 | 656 | 62 | 1313 | 2799 |
| Python2 (Pypy) | – | 0.0000 | 2122 | 7.63 | 1183156 | 561 | 38 | 1192 | 0 |
| Ruby | – | 0.0000 | 4258 | 3.80 | 2233108 | 605 | 56 | 1250 | 0 |
| Lua | – | 0.0000 | 4600 | 3.52 | 2202356 | 554 | 52 | 1208 | 0 |
| Smalltalk (GNU) | – | 0.0000 | 5295 | 3.06 | 611084 | 712 | 76 | 1641 | 0 |
| JuliaBigNum | – | 0.0000 | 7726 | 2.10 | 1001056 | 565 | 52 | 1227 | 0 |
| Python3 | – | 0.0000 | 10212 | 1.59 | 1582084 | 607 | 39 | 1268 | 0 |

C really shines here, courtesy of libGMP and stack allocation, but Go drops well behind, presumably because its bignum library is slower and generates more garbage. Java performs relatively well, which I imagine is due either to its garbage collector being better or to its bignum implementation generating little garbage. Interestingly, SBCL beats Go, F# and C# (it’s twice as fast as them at NumTrades=150) in spite of being a dynamic language and being compiled without any type annotations (they didn’t seem to make it any faster), which just goes to show that dynamically typed languages needn’t be slower than statically typed ones in all cases. Racket also appears to beat them slightly at NumTrades=150. If anybody knows a way to stop the SBCL implementation from consing I’d be interested to hear it, as garbage collection accounts for over 50% of the running time, so avoiding it would bring it close to Java in speed. I actually also made an SBCL implementation that just calls libGMP via the FFI, which results in no garbage collection occurring (which means aref isn’t creating garbage, so object pooling _is_ possible in Common Lisp, yay!), yet it’s about twice as slow. It’s in the repo as Lisp3.lisp if anyone wants to fix it (it’s inspired by the Lisp pidigits code from the Alioth language benchmarks).

Note that D, Rust or Nimrod could also perform as well as C here, simply by calling out to libGMP. As could Go, for that matter, but I thought for the sake of the benchmark it made more sense to use the native Bignum implementation.

In the Benchmarker.go source the output and input files can be changed; as input, BenchmarkData.dat produces the first set of tables, BenchmarkData2.dat the second, and BenchmarkData3.dat the third. The NumTrades values to test are stored near the top of the file in an array of strings, and the time to pause between runs can be set via the WaitTime constant.

Hope you enjoyed the benchmark! The moral of this post: new considered harmful. The tl;dr of it: Lisp and Racket beat Go at bignums.

1. Source code is converted to Common Intermediate Language (CIL), which is the CLI’s equivalent to assembly language for a CPU.
2. CIL is then assembled into a form of so-called bytecode and a CLI assembly is created.
3. Upon execution of a CLI assembly, its code is passed through the runtime’s JIT compiler to generate native code. Ahead-of-time compilation may also be used, which eliminates this step, but at the cost of executable-file portability.
4. The computer’s processor executes the native code.

This is, I’ll admit, lower-level than my comfort zone, and you’re clearly a more talented programmer, so my text isn’t meant to read as snide; I might have the wrong end of the stick. I’m just checking whether you might be misinterpreting what CIL is. I still thought it was MSIL and just found out it’s now called CIL; it was MSIL I went looking for.

I was just assuming that CIL was bytecode; I didn’t realise that there may be another intermediate step in which it’s transformed into bytecode. I’m more familiar with Java, where the intermediate language is the bytecode. Thanks for that; I’ll update the post to correct it.

Would Clojure’s agents let you write its implementation less hideously? Or maybe transients? Or hiphip array, if non-core libraries are allowed… in any case, I don’t see why a type definition has to be involved, unless it would definitely perform much worse without.

Can transients or agents be used with standard Java arrays? My reason for using a type definition is that its instances can be stored in an array, and there’s a good chance the JVM will store the objects sequentially in memory, improving cache performance.

I know that Ruby is not known for blazing execution speed relative to C, but the way this Ruby code is constructed sabotages it pretty badly. Adding a simple attr_accessor for each of the instance variables you want to access, which follows the Smalltalk convention of only exposing data you wish to have exposed rather than brute-forcing it with instance_variable_get, makes it perform twice as fast.

Obviously Ruby won’t come out on top, but it is a capable language that’s very fast to write in and has an unusually high degree of expressiveness. Compared to Perl, Python and PHP, it’s very easy to read and even easier to learn. Just don’t expect it to have the same philosophy as C; it comes at programming from a completely different angle, just as Haskell, Lisp and others have their own take.

Ah, thank you, I didn’t realise that would make such a difference. I’ve updated the Ruby version to reflect this, and made a note under the results table. I’m planning on running the benchmarks again in a few hours (it takes over an hour to run them all) and updating the tables.

The benchmark is a bit silly because the fastest languages (ASM, C, Go) seem to be limited only by memory bandwidth. I believe the difference between C and ASM comes mostly from the fact that the C struct has 7 bytes of padding, which causes more memory traffic. By using smaller integer sizes (int→short, long→int) and struct packing, the struct size can be reduced from 48 bytes to 21, and the code runs twice as fast. It’s also worth mentioning that “long” is 32 bits on some platforms; the correct type for 64-bit data would be “long long” or “int64_t”.

It actually mentions in the x86-64 section that the reason it’s faster is that it uses less memory due to the lack of padding.

I’m using that pattern of longs and ints because they are what were used in the original benchmark that inspired this post; presumably ints and shorts would have insufficient capacity for that specific case, hence the use of longs and ints.

Adding some type declarations to the Common Lisp version speeds it up five-fold (130 ms per iteration vs 80 ms for C), and makes it give the same incorrect result as the C version for NumTrades=150:
– in the defstruct: (price 0 :type fixnum) (quantity 0 :type fixnum)
– change the declared type of trades: (declaim ((simple-array lisp-memory-trade *) trades))
– in perf-run declare the type of the intermediate result to be a fixnum, similar for sell-cost: (incf buy-cost (the fixnum (* (lisp-memory-trade-price trade) (lisp-memory-trade-quantity trade))))

Sorry to be critical, but I’m afraid this is a bit of a waste of time for anyone who reads it if the results of each language version do not match. Having some versions silently overflow, benchmarking them against versions producing a “correct” result, and then saying “Look, this is 100 times faster” is nonsense.

Also, please detail what system you are benchmarking on for comparative/illustrative purposes. *Personally* I’m seeing numbers way higher than yours, and failures to complete, which gives me reason to doubt the benchmarks even further.

As it says near the top, the best versions for comparison are the final two graphs, in which all languages produce the same results. The first set are just a copy of the program described in the link in the first paragraph, which happened to overflow.

I’ll add data on my system. It’s a few years old, so I’m not surprised you’re seeing better numbers.

Considering C#3 is around four times slower than Java3 for me, it’s not surprising if a different JVM could be slower.

And, no worries. I think it would have worked better if I’d just focused on the bignum aspect of it. Essentially, it’s really just calculating n^2 + (n-1)^2 + … + 1^2, where n is NumTrades and is also < sqrt(LONG_MAX). That might make an interesting benchmark.

Most micro benchmarks are silly, but this one is particularly so. The Lua code is *very* far from optimal (almost everything is global) and no Lua programmer would write anything remotely resembling it.

Well, contributions are welcome. The idea of the benchmark is to have people more familiar with the languages used submit more optimal implementations, but maybe I didn’t make that clear enough. For comparison, the implementations from the final two tables are what you should be looking at, as they all do the same thing. Well, apart from the Lua, as that uses floats rather than bignums.

Right, but the whole thing appears to be a randomly chosen sequence of operations. If you actually came up with a feasible, realistic thing to implement, other people could get creative with the implementation and actually compete to improve them without mindless, verbatim translation.

You’re right; if I’d thought it through more I could probably have designed a much better benchmark. I think I focused too much on just verbatim translation of the program that inspired the benchmark, without giving enough thought to what was actually being benchmarked.

If you read the whole post, you’d see that all the implementations suffixed 2 do the same thing, and all those suffixed 3 do too, and the ones suffixed 3 produce the same results as the Python, Ruby and Lisp implementations (which automatically promote to bignum). But I realise now I could definitely have organised it better.

The Clojure code had several issues in it. I’ve posted a pull request that addresses them. I think the two biggest problems were using boxed fields instead of primitives and calling def inside the loop; a third was replacing “odd?” with a faster version.

Thank you! It is indeed comparable to the Java3 version. I feel like a complete idiot for not realising ^Long was boxed. I’m surprised that ‘def’ performs so poorly, but I suppose there’s no reason something should work the same way in Clojure as it does in Scheme and Common Lisp.