Benchmarking ND4J and Neanderthal – Part 2

He claims that in one of the benchmarks Neanderthal is 1000 times faster. This piqued my interest again since those kinds of differences don’t magically show up – especially not in a library as widely used as ND4J.

In this post I investigate the source of this difference, show why it happens and what to do about it as a user of the library.

Comparing Apples and Pomelos

ND4J does not have a dedicated mm method like Neanderthal since chaining matrix multiplications can be easily achieved by chaining multiple mmul calls. That is also what Dragan did while trying to simulate mm for his benchmark. The following code snippet shows how he implemented it for his benchmark.

This kind of invocation will result in the computation going backwards, as if you literally had set the parenthesis in the multiplication like this: M = A(B(C(D(EF)))).

Normally you wouldn’t do it like that. Instead it would probably look like this:

m1.mmul(m2).mmul(m3).mmul(m4).mmul(m5).mmul(m6)

This second way of doing things is like setting the parenthesis for the multiplication like this: M = ((((AB)C)D)E)F.

The big difference between those two ways is how the temporary results are computed. Both ways of setting parenthesis are allowed because matrix multiplication is associative. Both of them arrive at the same result, but the matrix sizes at each different step may differ widely.

1000x difference?

Dragan has found Neanderthal to be 80 times and 1000 times faster in his non-uniform size benchmarks. Given what we now know about his implementation, and how it would be normally done, lets take a look how the differences play out.

In the following examination, I am using these matrix shapes, as they are what Dragan used in his benchmarks:

During the calculation of M = A(B(C(D(EF)))), i.e. the way Dragan implemented it, the computation carries through the column size of the last matrix.

Computation Step

Result shape

Required operations

Size of result

EF

7700x100000

1246 GFLOP

2937 MB

D(EF)

125x100000

192 GFLOP

47 MB

C(D(EF))

2300x100000

57 GFLOP

877 MB

B(C(D(EF)))

13456x100000

6188 GFLOP

5133 MB

A(B(C(D(EF))))

13x100000

34 GFLOP

4,95 MB

Now, lets take a look at how the more common usage would look like, i.e. using M = ((((AB)C)D)E)F

Computation Step

Result shape

Required operations

Size of result

AB

13x2300

0,805 GFLOP

0,11 MB

(AB)C

13x125

0,007 GFLOP

0,006 MB

((AB)C)D

13x7700

0,025 GFLOP

0,38 MB

(((AB)C)D)E

13x810

0,162 GFLOP

0,04 MB

((((AB)C)D)E)F)

13x100000

2,105 GFLOP

4,95 MB

As you can see, this way of structuring the order of the computation results with the number of rows of the first matrix being carried through the computation. The end result is still the same.

The difference in the required number of floating point operations to perform the actual computation is enormous. The first way requires a total of 7719 GFLOP, while the second one requires only about 3,1 GFLOP, with pretty much all the time spent on just the last multiplication.

Just the order of those operations results already in a huge difference, but there is even more to it. Once you take into consideration that each matrix at each step is also not just large, but absolutely enormous when compared to each other, you see how caching efficiency and memory access costs will dwarf even the huge difference in required operations.

So why do we see only a 1000x difference in speed? Looking at the GFLOPs alone it should be closer to 2500x. The reason is that on large multiplications MKL actually uses multiple cores. As my machine has 4 cores, as Dragan’s probably does. However, this often doesn’t scale linearly, especially when memory access and synchronization between those cores has to be accounted for.

Comparing apples to apples again

Let’s use Clojure threading macro to both, make the benchmark easier to read, and turn into more of an apples to apples comparison:

As you can see: it is still pretty much the same 2x difference that he found for uniform sizes. Or in his own words:

Not something to write the blog post about.

Repeating the benchmarks

I have again added those benchmarks to my own JMH benchmark suite and included some micro optimizations for the case when all matrices are uniformly sized and I’ve added a chained gemm invocation for good measure.

It shows that, on the current ND4J SNAPSHOT, the performance of mmul and gemm is pretty much equal. Since I didn’t run into an odd behavior of criterium (the benchmarking library used in Dragan’s benchmark) this time and my JMH benchmark results were pretty close to those shown above, I’m not adding another table of numbers here.

Conclusion

Benchmarking is hard. Fair benchmarking is even harder. Again, the difference turns out to be not as large as originally claimed. As ND4J doesn’t have a specialized mm call that takes any number of matrices, you will, also again, have to understand what you are actually doing when composing matrix multiplications in order to get the most performance out of ND4J.

This is even more apparent when you consider other matrix size constellations. Change the benchmark to start with 100000 and end with 13 and you will get the large difference again – because now it would be better to use the way Dragan implemented it originally. Neanderthal’s mm call includes an order optimization step, which given any number of matrices will decide up front in which order those matrices should be multiplied.

The biggest user of ND4J is Deeplearning4J, and this kind of chain matrix multiplications aren’t required there. And as no one else has asked for this functionality yet, no one has cared to implement a specialized method like that. This results in a situation where it is easy to create a benchmark that will show huge differences.

I’m personally not a huge fan of the “you are holding it wrong” type of argument, so I’m pretty impressed by what Neanderthal does to make cases like this work out of the box. In addition to this I’d actually say that even a 2x difference is something to write a blog post about. Especially since this gives us yet another benchmark that we can use to improve ND4J’s speed up out of the box.

Just like the last time, I am glad that Dragan created his benchmarks and published his numbers, as they show us where additional functionality and documentation is missing, and how we can improve ND4J even further.