The memory bandwidth problem is not much of a problem in the
single-threaded world, because the program still processes gigabytes
of data per second and is blazingly fast. But it is a serious problem
in the multi-threaded world, because the program does not run any
faster when additional threads are added.

There is not much one can do when a program is memory bandwidth bound
(except upgrading the hardware, of course). The only thing a software
developer can do about it is to reduce the amount of input and/or
output data. In this problem we are concerned only with output, and
fortunately its format is not fixed. So I tried to reduce the size of
the output data as much as possible.

The first natural thing to do is to encode each move as 2 bits
(instead of a whole byte). This reduces the output size by a factor
of 4.
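
As an illustration only, here is a minimal Python sketch of such a
2-bit packing. It assumes the three possible moves are numbered 0, 1
and 2 and that moves are stored lowest bits first; the names pack_2bit
and unpack_2bit are hypothetical and are not taken from the actual
program.

```python
def pack_2bit(moves):
    """Pack moves (each 0, 1 or 2; an assumed numbering) four per byte."""
    out = bytearray((len(moves) + 3) // 4)  # round up to whole bytes
    for i, move in enumerate(moves):
        if not 0 <= move <= 2:
            raise ValueError("move must be 0, 1 or 2")
        # Move i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= move << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(data, count):
    """Recover the first `count` moves from 2-bit packed data."""
    return [(data[i // 4] >> (2 * (i % 4))) & 3 for i in range(count)]
```

The reader has to know the number of moves, because unused bits in the
last byte are indistinguishable from moves with value 0.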

The next thing I did was to eliminate the delimiters between paths.
Since I generate paths by path class, and the length of the paths in
each class is constant and known, delimiters are unnecessary.
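
To show why a known, constant length makes delimiters redundant, here
is a small Python sketch of how a consumer could split a delimiter-free
stream of moves back into paths. The function name split_paths and the
flat list representation are assumptions made for the example, not the
program's actual data layout.

```python
def split_paths(moves, path_length):
    """Split a flat sequence of moves into paths of equal, known length."""
    if path_length <= 0:
        raise ValueError("path_length must be positive")
    if len(moves) % path_length != 0:
        raise ValueError("stream does not contain a whole number of paths")
    # Every path starts at a multiple of path_length, so no separator
    # is needed to find the boundaries.
    return [moves[i:i + path_length]
            for i in range(0, len(moves), path_length)]
```

For example, split_paths([0, 1, 2, 2, 1, 0], 3) yields the two paths
[0, 1, 2] and [2, 1, 0].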

Those measures did not help in fighting the memory bandwidth problem.
It's easy to observe that 2 bits per move is not an optimal packing (2
bits can represent 4 states, while we use only 3), and that it's
possible to pack 5 moves into a byte (3^5 = 243 <= 2^8 = 256). So I
switched to a “5 moves per byte” encoding:

byte = move0*3^0 + move1*3^1 + move2*3^2 + move3*3^3 + move4*3^4

This is close to optimal, because it uses 243 of the 256 possible
byte states (95% efficiency). For a problem of size (11,11) there are
45'046'719 possible paths, and my program uses 184'257'251 bytes for
output, which is 4.09 bytes per path on average.
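
The formula above can be sketched in Python as follows. This is an
illustration under assumptions, not the program's actual code: moves
are assumed to be numbered 0, 1 and 2, a final incomplete group is
assumed to be padded with zeros, and the names pack_base3 and
unpack_base3 are hypothetical. The two constants at the end simply
recompute the figures quoted in the text.

```python
MOVES_PER_BYTE = 5  # 3**5 = 243 fits into one byte (256 states)


def pack_base3(moves):
    """Pack moves (each 0, 1 or 2) five per byte as base-3 digits."""
    out = bytearray()
    for start in range(0, len(moves), MOVES_PER_BYTE):
        value = 0
        for k, move in enumerate(moves[start:start + MOVES_PER_BYTE]):
            if not 0 <= move <= 2:
                raise ValueError("move must be 0, 1 or 2")
            value += move * 3 ** k  # move_k * 3^k, as in the formula
        out.append(value)  # at most 242, so it always fits in a byte
    return bytes(out)


def unpack_base3(data, count):
    """Recover the first `count` moves from base-3 packed data."""
    moves = []
    for byte in data:
        for _ in range(MOVES_PER_BYTE):
            moves.append(byte % 3)  # lowest base-3 digit first
            byte //= 3
    return moves[:count]


STATE_USAGE = 3 ** 5 / 2 ** 8                  # 243 / 256, about 0.949
BYTES_PER_PATH = 184_257_251 / 45_046_719      # about 4.09
```

As with the 2-bit variant, the decoder needs the move count from
elsewhere (here, from the known path length), since padding digits
look like moves with value 0.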

The program does speed up in proportion to the decrease in output data
size. However, it is still memory bandwidth bound, i.e. execution time
depends on the system's total memory bandwidth rather than on the
processor/core count.