December 26th, 2015

I have seen some ways to improve the performance of the Rust version of the Base64 benchmark, and of the Rust and Python versions of the Brainfuck benchmark, but in this post I'll focus on the D version of the Brainfuck benchmark (currently my site is frozen, so for the moment I'll just inline all the code in this post; later I'll offer links to zips on my site).

The benchmark asks you to run this long Brainfuck program, "mandel.b", that generates a Mandelbrot set as ASCII art:

Note: on Rosettacode there is a simple D program that, with the LDC2 compiler, runs the mandel Brainfuck program in less than 1.1 seconds (and LDC2 compiles it in 2.3 seconds with full optimizations): http://rosettacode.org/wiki/Execute_Brain****/D But here I'll use an interpreter.

The keys of the bracket_map associative array are positions in the cleaned-up Brainfuck program. Usually Brainfuck programs aren't huge, so using an associative array is a waste of performance. This improved program uses a dynamic array (of ints) for the bracket_map:
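
The original code isn't reproduced here, but the idea is simple; this is my own sketch (not the benchmark program) of building such a map, assuming the program text is already cleaned and its brackets balanced:

uint[] makeBracketMap(in char[] program) pure @safe {
    auto map = new uint[program.length];
    uint[] stack;
    foreach (i, c; program) {
        if (c == '[') {
            stack ~= cast(uint)i;               // Remember where the loop opened.
        } else if (c == ']') {
            immutable open = stack[$ - 1];      // Pop the matching '['.
            stack = stack[0 .. $ - 1];
            map[open] = cast(uint)i;            // '[' jumps forward to its ']'.
            map[i] = open;                      // ']' jumps back to its '['.
        }
    }
    return map;
}

A lookup is then a plain array indexing, map[pos], instead of a hash lookup.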

Here I have reformatted the code a little: I have converted the classes to structs, converted 'code' from string to uint[], and converted bracket_map to uint[]. Most of these changes don't improve performance, and indeed they decrease it a little, but they lay the foundations for the successive improvements.

Unlike the GCC compiler, the D language doesn't have computed gotos, so you can't use them to speed up this Brainfuck interpreter. So I've added a sequence of if-gotos to the switch cases (using a string mixin to keep the code short), and thankfully the powerful LLVM back-end of the LDC2 compiler recognizes this pattern and turns it into a poor man's computed goto:
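
Again the original interpreter isn't shown here, but the dispatch trick can be demonstrated on a toy instruction set (this is only my own illustration of the pattern, not the benchmark code):

import std.stdio;

enum Op : uint { inc, dec, print, halt }

// Appended at the end of every case: test the next opcode and jump straight
// to its case instead of going back to the switch head.
enum string dispatchNext = q{
    op = ops[++ip];
    if (op == Op.inc)   goto case Op.inc;
    if (op == Op.dec)   goto case Op.dec;
    if (op == Op.print) goto case Op.print;
    goto case Op.halt;
};

void run(in Op[] ops) {
    int acc = 0;
    size_t ip = 0;
    Op op = ops[ip];

    final switch (op) {
        case Op.inc:   acc++;        mixin(dispatchNext);
        case Op.dec:   acc--;        mixin(dispatchNext);
        case Op.print: writeln(acc); mixin(dispatchNext);
        case Op.halt:  return;
    }
}

void main() {
    run([Op.inc, Op.inc, Op.inc, Op.dec, Op.print, Op.halt]); // Prints 2.
}

With the if/goto chains every opcode jumps directly to the handler of the next opcode, which is what gives the back-end the chance to thread the jumps.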

This gives efficient code, and the Pairs are stack-allocated, but the compiler allows only a limited amount of nesting. The foreach inside the switch generates the switch cases statically. This works up to a depth of 13.

To reach higher levels of nesting in the D language I use dynamic (runtime) polymorphism:
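
The actual Pair code isn't shown here; the following is only my own guess at the general shape of such a solution (the names are mine): an interface lets the pairs nest to a depth chosen at run time, at the cost of heap allocation and virtual dispatch.

interface IPair {
    string toString();
}

final class PairWrapper(T) : IPair {
    T head;
    IPair tail;                       // null marks the innermost pair.
    this(T head, IPair tail) { this.head = head; this.tail = tail; }
    override string toString() {
        import std.conv: text;
        return tail is null ? text("(", head, ")")
                            : text("(", head, " . ", tail.toString(), ")");
    }
}

void main() {
    import std.stdio: writeln;
    IPair pairs = null;
    foreach_reverse (immutable i; 0 .. 50)    // Nesting depth chosen at run time.
        pairs = new PairWrapper!int(i, pairs);
    writeln(pairs.toString());
}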

The toString() inside the interface and inside PairWrapper is used because D classes don't yet support sink-based toString. The sink is important for performance because it avoids allocating lots of strings. This version is also more efficient because it gives better coherence in the L1 code cache. This D solution is just a little slower than the Haskell solution (perhaps because the Haskell solution caches the formatting of the integer 42, I don't know; if I perform such caching the D code becomes faster than the Haskell code), and it's much faster than the Java version (despite the D garbage collector currently being less efficient than the OracleVM ones).

We're going to use Haskell, because this is all about functions (in the mathematical sense), and so Haskell, being a functional programming language, is especially well suited to the job.

Most of the code you see below is in the D language (with a C version at the end), because it's a good language for solving this kind of problem.

The problem statement defines a card shuffling technique:

1) Take one card from the top of the deck and discard it into a second pile.
2) Take another card from the top of the deck, and put it at the bottom of the deck.
3) Repeat these two steps, putting all discarded cards from step 1 into the same pile, until the original deck is all gone and the second pile has all the cards in it.

The problem is: how many shuffles does it take until a deck is in the same order as when you started, for a deck with an arbitrary number of cards? Write a function, f :: Int -> Int, such that, for a deck with n cards, f n is the minimum number of shuffles required to return it to its original order.

I have implemented this 'step' function in various ways in D, and in the end this is the version I have used (here I have removed the pre/post-conditions, see the files for the full code); it performs the whole shuffling (including the third step):

This function is (strongly) pure: it accepts a constant array of Cards and performs the shuffling using slices and concatenations, building a new output array B. Here a Deck is a dynamic array, but perhaps a more efficient data structure for a Deck in this algorithm is a (singly linked) list (roughly as in the Haskell code).
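
As a concrete illustration (my own sketch, not the perms3.d code, and using plain ints for the cards): with index 0 as the top of the deck, a 4-card deck [1, 2, 3, 4] becomes [4, 2, 3, 1] after one such shuffle and is back to [1, 2, 3, 4] after two, matching the pair "4 2" in the test table below.

int[] deckShuffle(in int[] deck0) pure @safe {
    auto deck = deck0.dup;                  // Work on a mutable copy; index 0 is the top.
    int[] pile;
    while (deck.length > 1) {
        pile = deck[0] ~ pile;              // Step 1: discard the top card onto the pile.
        deck = deck[2 .. $] ~ deck[1 .. 2]; // Step 2: move the next card to the bottom.
    }
    if (deck.length)
        pile = deck ~ pile;                 // The last remaining card goes onto the pile too.
    return pile;                            // The pile becomes the new deck, top first.
}

Every loop iteration allocates new arrays, which is what the later lower-level versions avoid.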

In perms3 I have commented out the range-based version of 'makeDeck' because currently the 'array' function is not @safe for that use case, so I've written a more imperative version of 'makeDeck' (a sketch of this kind of function is shown after the Card alias below).

I have used a Typedef to implement the 'newtype' of the Haskell code (even though Typedef currently has some bugs):

alias Card = Typedef!int;
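
For illustration, an imperative makeDeck of the kind described above might look like this (my own sketch, not the perms3.d code; it assumes a Deck is just a dynamic array of Cards, as described above):

import std.typecons: Typedef;   // For the Card alias shown above.

alias Deck = Card[];            // Assumed: a Deck is a dynamic array of Cards.

Deck makeDeck(in uint nCards) pure @safe {
    auto deck = new Card[nCards];
    foreach (i, ref c; deck)
        c = Card(cast(int)i);   // Cards numbered 0 .. nCards-1; the values don't matter.
    return deck;
}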

The function 'f1' uses a do-while because Phobos still lacks an 'iterate' range-based function, as used in the 'shuffle' Haskell function.
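
A sketch of f1's shape (again my own code, not the original), reusing the deckShuffle sketch above and plain ints for the cards:

int f1(in int nCards) pure @safe {
    import std.array: array;
    import std.range: iota;
    const int[] deck0 = iota(nCards).array;
    auto deck = deck0.dup;
    int count = 0;
    do {                                  // No 'iterate' in Phobos, so a do-while.
        deck = deckShuffle(deck);
        count++;
    } while (deck != deck0);
    return count;
}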

I have run both perms1.hs and perms3.d on a few test values; both produce results matching these (n, f n) pairs:

4 2
5 5
52 510
53 53
100 120
200 8460

perms1.hs produces this output in about 2.72 seconds:

[2,5,510,53,120,8460]

perms3.d produces this output (best of 3) in about 1.39 seconds:

4 2
5 5
52 510
53 53
100 120
200 8460

I have then created a perms3b.d program that is very similar to perms3.d, but represents a Deck with an SList (a singly linked list), hoping to improve performance by avoiding the O(n) copies of the slices of the array-based version. But perms3b.d runs in about 30.4 seconds. In most cases linked lists don't give any performance advantage.

- - - - - - - - - - - - - - - - - - -

perms4.d:

This version is similar to perms3.d, but here I have implemented 'deckShuffle' in a lower-level way. This deckShuffle is now @nogc (it doesn't allocate on the heap) and works in-place on two given arrays, using indexes (it could also be implemented using pointer arithmetic, but there's no need for that: this function is fast and @safe). deckShuffle is still annotated with 'pure', but now it's only weakly pure:
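
The perms4.d source isn't inlined here; the following is only my own sketch of an allocation-free, index-based shuffle of that kind ('deck' is used as a circular buffer and gets clobbered, 'pile' receives the result with the new top in pile[0]):

void deckShuffle(int[] deck, int[] pile) pure nothrow @safe @nogc {
    assert(deck.length == pile.length);
    immutable n = deck.length;
    size_t head = 0;      // Physical index of the current top of the deck.
    size_t len = n;       // How many cards are still in the deck.
    size_t outPos = n;    // The pile is filled backward, so its top ends up in pile[0].

    while (len > 0) {
        pile[--outPos] = deck[head % n];             // Step 1: discard the top card.
        head++;
        len--;
        if (len > 1) {                               // Step 2: move the new top card
            deck[(head + len) % n] = deck[head % n]; // to the bottom of the deck.
            head++;
        }
    }
}

It is weakly pure in the sense described above: it uses no globals or other hidden state, but it mutates data through its mutable array parameters.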

This kind of code is more fiddly and requires more care and testing compared to a more functional style, but it's also much faster and produces no heap garbage. The run-time of this function is also more predictable.

The run time of 'perms4.d' to produce the six output pairs is about 0.06 seconds, so this is about 23 times faster than 'perms3.d'. Unlike many other languages, in D it's easy to mix low level code with higher level code, and produce code that still looks natural and free of most accidental complexity.

- - - - - - - - - - - - - - - - - - -

perms5.d:

This implements the algorithm from perms2.hs. For simplicity I have used the slower but simpler 'deckShuffle' from perms3.d.

Instead of writing the various functions of the Haskell version, building the various intermediate data structures (like the list of pairs), I have written a single 'f2' function that computes the lcm of the cycle lengths without storing the actual cycles. D has less 'glue' than Haskell, so in D it's often more convenient to write slightly larger functions (with the side effect that they are often faster than the Haskell versions, but also less easy to combine in different ways):

This code is less general and more specialized than the Haskell functions, so it's less easy to see what it is doing. Here the 'graph' of the pairs is an associative array, to avoid the linear searches in the lists (but a hidden linear scan is still present: the one performed by byKey.front).
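
To make the idea concrete, here is my own simplified sketch of the cycle-length/LCM computation (it uses a plain array of 'seen' flags where the real f2 uses the 'graph' associative array and byKey):

ulong cyclesLcm(in size_t[] perm) pure @safe {
    import std.numeric: gcd;
    auto seen = new bool[perm.length];
    ulong result = 1;
    foreach (start; 0 .. perm.length) {
        if (seen[start])
            continue;
        ulong cycleLen = 0;
        for (size_t p = start; !seen[p]; p = perm[p]) {      // Walk one whole cycle.
            seen[p] = true;
            cycleLen++;
        }
        result = result / gcd(result, cycleLen) * cycleLen;  // result = lcm(result, cycleLen).
    }
    return result;
}

Here 'perm' is the permutation defined by a single shuffle of the deck [0, 1, ..., n-1]; the answer for n is the LCM of the cycle lengths of that permutation.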

The run-time of perms5.d is very small, so besides calling 'f2' on the six test cases, the program also computes 'f2' on the numbers n in [1, 1_000[. The total run-time is about 1.29 seconds.

If I modify perms2.hs to compute f2 in the same [1, 1_000[ interval, the run-time is very large (about half an hour; moreover it uses 32 bit integers, so some results for the [1, 1_000[ interval are wrong). So the 'f2' function in D has some advantages (and it's still strongly pure). The D 'f2' function returns a ulong because the output grows a lot when n grows large.

- - - - - - - - - - - - - - - - - - -

perms6.d:

This is similar to perms5.d, but uses the more efficient 'deckShuffle' function from perms4.d. The run-time of perms6.d is only 0.20 seconds (this includes the computations for n in [1, 1_000[), so it's more than six times faster than perms5.d.

- - - - - - - - - - - - - - - - - - -

perms7.d, perms7b.d, perms7c.d:

This is a lower-level version (that is easy to port to the C language). It's similar to perms6.d, but here in 'f2', instead of the associative array 'graph', I have re-used the array D2, filling it with 'emptyItem' when I remove a value. 'f2' is now also @nogc because it accepts fixed-size arrays as arguments and then slices them. So the program allocates the arrays only inside 'main' (and there they are stack-allocated, so even the main is @nogc):

perms7.d runs in about 0.05 seconds. So it's about 4 times faster than perms6.d.

To avoid overflow bugs I have created the perms7b.d and perms7c.d versions, which use core.checkedint (even though core.checkedint is not meant for user code). For both perms7b.d and perms7c.d the run-time is similar to the unchecked versions, and perms7b.d spots the overflow bugs:
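
For example (my own illustration, not the actual perms7b.d code), the ulong multiplication used in the LCM update can be checked like this:

import core.checkedint: mulu;

ulong checkedMul(in ulong a, in ulong b) pure nothrow @nogc {
    bool overflow = false;
    immutable r = mulu(a, b, overflow);   // Like a * b, but sets the flag on wrap-around.
    assert(!overflow, "ulong multiplication overflowed");
    return r;
}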

- - - - - - - - - - - - - - - - - - -

perms8.d:

This version is very similar to perms7.d, but I've used BigInt instead of ulong to compute all correct results (if there are no other bugs or design mistakes).

perms8.d is still quite fast: it runs in about 0.08 seconds (this includes the computations for n in [1, 1_000[). If I increase nMax to 10_000 it runs in about 6.6 seconds. D BigInts are not very fast, but I think this is not a significant problem for this program.

- - - - - - - - - - - - - - - - - - -

perms9.c:

This is C code derived from perms7.d.

perms9.c runs in about 0.03 seconds or less. perms9.c is a little faster than perms7.d, but the speed difference becomes really visible only when nMax is much larger (though then the program gives wrong outputs because of overflows). The difference between 0.03 and 0.05 seconds is mostly caused by the time taken to initialize the D runtime.

Notes to this first version:
- Inside LabelPixels the pixels are represented with an unsigned byte each. This reduces the RAM used to store the data and allows the CPU cache to be used more efficiently.
- "readData" is a lambda assigned to a global immutable value. This is not fully idiomatic D style, but I've used it because it looks more similar to the F# code. In practice I usually don't use such global lambdas, because a function's output type gives me useful information to understand its purpose, and lambdas are less easy to find in the assembly (you can't just search for "readData" in the asm listing to find this lambda).
- The D standard library contains a module to read CSV files (std.csv), but I don't like its API, so I've used File.byLine and I have processed the lines lazily.
- In readData I've used the eager function split instead of the lazy splitter (which uses less RAM) because for the data sets used by this program split is faster (readData loads both CSV files in about 0.8-0.9 seconds on my PC).
- In the "distance" lambda the nothrow and @nogc attributes are commented out because zip() doesn't yet respect such attributes (zip throws an exception in certain cases and the code to throw the exception allocates it on the heap). The "classify" lambda calls "distance", so it can't be nothrow and @nogc either (also because reduce doesn't have a default value in case its input range is empty, so it can throw, and because classify contains a heap-allocating lambda).
- The distance lambda contains code like "map!(ab => int(ab[0]) - ab[1])" because D is not yet able to de-structure tuples, like the 2-tuples generated lazily by zip.
- The classify lambda contains a map of tuples because the max/min functions of Phobos still lack an optional "key" function like the max/min functions of Python. So I've created tuples of the mapped value plus the label and I've used reduce to find the min. I have used "ubyte(s.label)" to produce a mutable tuple, because reduce doesn't accept an immutable one (unless you use a mutable seed).
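
For reference, the kind of functional distance those notes describe could look roughly like this (my own reconstruction, not the actual nearest1.d code):

immutable distance = (in ubyte[] a, in ubyte[] b) /*nothrow @nogc*/ {
    import std.algorithm: map, sum;
    import std.math: sqrt;
    import std.range: zip;
    immutable total = zip(a, b)
                      .map!(ab => (int(ab[0]) - ab[1]) ^^ 2)
                      .sum;
    return sqrt(double(total));
};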

Most of the run-time of this program is spent inside the distance function, so if you want to optimize this program, that's the function to improve.

In the zip you can find a "nearest1b.d" file that is very similar to this first version, but adapted to the current version of the LDC2 compiler, whose front-end is a little older (it doesn't support the @nogc attribute or the T(x) cast syntax, and its Phobos doesn't have a sum function).

The D compilers (even the well-optimizing LDC2 compiler) are not yet very good at optimizing that functional distance function. So in the "nearest2.d" version of the program (which you can find in the zip archive) I've written distance in an imperative way; this produces a program that, compiled with ldc, is more than 3 times faster:

Now this distance function can be pure nothrow @safe. If you look closely, this function also cheats a little by not computing the square root, which isn't needed when you only want to find the closest sample. This function also computes the total using integers.
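
A sketch of an imperative distance of that kind (my own reconstruction, not the nearest2.d code):

int distance(in ubyte[] a, in ubyte[] b) pure nothrow @safe {
    assert(a.length == b.length);
    int total = 0;
    foreach (i; 0 .. a.length) {
        immutable int d = a[i] - b[i];   // ubyte - ubyte promotes to int, so no wrap-around.
        total += d * d;                  // Squared Euclidean distance; the sqrt is skipped.
    }
    return total;
}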

I have not shown the asm generated by LDC2 for the nearest lambda in the "nearest1.d" program because it's too complex and messy, but I can show the asm for this second, simpler distance function:

As you can see, the loop gets compiled to quite simple asm, with nine x86 (32 bit) instructions.

To improve performance I have changed the data layout a little. With the "binary_generator1.d" program (you can find it in the zip archive) I have generated binary files of the training set and validation sample. So now the loading of the data is very simple and very fast (see the zip archive for the full source of this "nearest3.d" program):

Notes:
- The classify function is now imperative, but this improves the performance just a little.
- @nogc is still commented out because the current version of the LDC2 compiler doesn't support it yet.
- The main performance difference of this third program comes from the data layout. Now I have hard-coded (defined statically) the number of columns nCols, so the data sets are essentially a ubyte[N][]. In D this is represented as a dense array, which minimizes cache misses and removes one level of indirection (see the sketch after these notes). Thanks to the compile-time knowledge of the loop bounds inside the distance function, the LLVM back-end of the LDC2 compiler performs some loop unrolling (in theory it could do this even before, but in practice version 3.4.1 of LLVM is not performing loop unrolling on dynamic bounds).
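
A sketch of that layout (the column count is an assumption of mine, not a value taken from the benchmark data; this is not the actual nearest3.d code):

enum size_t nCols = 28 * 28;      // Assumed number of pixel columns, known at compile time.
alias Row = ubyte[nCols];         // One sample: a fixed-size array, stored inline.

int distance(ref const Row a, ref const Row b) pure nothrow @safe {
    int total = 0;
    foreach (i; 0 .. nCols) {     // Compile-time-known bound: the back-end can unroll it.
        immutable int d = a[i] - b[i];
        total += d * d;
    }
    return total;
}

The whole training set is then a Row[], i.e. one dense block of rows with a single level of indirection.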

So the asm of the distance function of the "nearest3.d" program is longer, because of a partial loop unrolling:

Thanks to the data layout changes and other smaller improvements, the "nearest3.d" program is almost three times faster than "nearest2.d".

To reduce the run time and better utilize one CPU core, we can use the SIMD registers of the CPU.

To perform a subtraction of whole arrays I change the data layout again: this time I represent the pixels with a short each (in D a signed 16-bit integer), so I can use a short8 to perform several subtractions in parallel. A problem comes from the label: if I slice it away as in the "nearest3.d" program, the array pointers are not aligned to 16 bytes, and this is not accepted by my CPU. To solve this problem I add a padding of 7 shorts after the label in every line of the matrix. This is performed by the "binary_generator2.d" program.

So the 4th version of the program contains (see in the zip archive for the full code of "nearest4.d"):

Notes:
- The classify function is not changed much; it performs the slicing to throw away the first eight items (the short label plus the 7 padding shorts).
- The distance function is a little more complex, but not too much. It uses the basic SIMD operations like subtraction, sum and product, plus the ".array" attribute that allows me to manage a short8 as a fixed-size array to sum all its contents (a sketch of this kind of function follows these notes).
- The code of distance is designed to be adaptable to other sizes of SIMD registers (but it's not able to adapt automatically).
- This "nearest4.d" program is only 1.6 times faster than "nearest3.d", probably because the SIMD code that manages shorts is not very clean.
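
As an approximation of the approach (my own sketch, not the nearest4.d code; it only vectorizes the subtraction and sums the squares in a scalar, and it assumes both rows are 16-byte aligned, which is what the padding shorts guarantee):

import core.simd;

int distance(in short[] a, in short[] b) pure nothrow @nogc {
    assert(a.length == b.length && a.length % 8 == 0);
    int total = 0;
    for (size_t i = 0; i < a.length; i += 8) {
        const short8 va = *cast(const(short8)*)&a[i];   // Aligned 8-lane loads.
        const short8 vb = *cast(const(short8)*)&b[i];
        const short8 diff = va - vb;                    // 8 subtractions at once.
        foreach (short d; diff.array)                   // .array: view the vector as short[8].
            total += d * d;                             // short * short promotes to int.
    }
    return total;
}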

The asm for the distance function of the "nearest4.d" program is rather long, despite being fast:

In the zip archive I have also added a "nearest5.d" program that shows a recent improvement of the D type system (not yet available in LDC2): fixed-size array length inference for slices passed to templated functions.

With a run-time of 1.28 seconds for the "nearest4.d" program there are still ways to reduce the run-time. I am using an old 2.3 GHz CPU, so if you use a modern 4 GHz Intel CPU like the Core i7-4790K, with faster memory, you should see a significant speedup. With a modern CPU you can also use the YMM registers with short16 SIMD, which should offer some speedup (LDC2 is already able to target such YMM registers: you need to replace short8 with short16 in the nearest4.d program and change the padding). And this program is very easy to parallelize on two or four cores: you can use the std.parallelism module for the computation of nCorrect, with map+reduce in the main function, and that should give a further speedup.
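
A self-contained illustration of that map+reduce pattern with std.parallelism (toy data and a toy classifier, not the actual nearest4.d code):

import std.algorithm: map;
import std.parallelism: taskPool;
import std.range: iota;
import std.stdio: writeln;

bool toyClassify(in int x) pure nothrow @safe @nogc {
    return x % 3 == 0;   // Stand-in for the real, expensive classify call.
}

void main() {
    // Each element is classified by the worker threads; the matches are summed.
    immutable nCorrect = taskPool.reduce!"a + b"(iota(1_000).map!(i => toyClassify(i) ? 1 : 0));
    writeln(nCorrect);   // 334
}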