In my Expression Templates Library (ETL) project, I have a lot of template heavy
code that needs to run as fast as possible and that is quite intensive to
compile. In this post, I'm going to compare the performance of a few of the
kernels produced by different compilers. I've got GCC 5.4, GCC 6.20 and clang
3.9. I also included zapcc which is based on clang 4.0.

These tests have been run on an Haswell processor. The automatic parallelization
of ETL has been turned off for these tests.

Keep in mind that some of the diagrams are presented in logarithmic form.

Vector multiplication

The first kernel is a very simple one, simple element-wise multiplication of two
vectors. Nothing fancy here.

For small vectors, clang is significantly slower than gcc-5.4 and gcc6.2. On
vectors from 100'000 elements, the speed is comparable for each compiler,
depending on the memory bandwidth. Overall, gcc-6.2 produces the fastest code
here. clang-4.0 is slightly slower than clang-3.9, but nothing dramatic.

Vector exponentiation

The second kernel is computing the exponentials of each elements of a vector and
storing them in another vector.

Interestingly, this time, clang versions are significantly faster for medium to
large vectors, from 1000 elements and higher, by about 5%. There is no
significant differences between the different versions of each compiler.

Matrix-Matrix Multiplication

The next kernel I did benchmark with the matrix-matrix multiplication operation.
In that case, the kernel is hand-unrolled and vectorized.

There are few differences between the compilers. The first thing is that for
some sizes such as 80x80 and 100x100, clang is significantly faster than GCC, by
more than 10%. The other interesting fact is that for large matrices
zapcc-clang-4.0 is always slower than clang-3.9 which is itself on par with the
two GCC versions. In my opinion, it comes from a regression in clang trunk but
it could also come from zapcc itself.

The results are much more interesting here! First, there is a huge regression in
clang-4.0 (or in zapcc for that matter). Indeed, it is up to 6 times slower than
clang-3.9. Moreover, the clang-3.9 is always significantly faster than gcc-6.2.
Finally, there is a small improvement in gcc-6.2 compared to gcc 5.4.

Fast-Fourrier Transform

The following kernel is the performance of a hand-crafted Fast-Fourrier
transform implementation.

On this benchmark, gcc-6.2 is the clear winner. It is significantly faster
than clang-3.9 and clang-4.0. Moreover, gcc-6.2 is also faster than gcc-5.4.
On the contrary, clang-4.0 is significantly slower than clang-3.9 except on one
configuration (10000 elements).

1D Convolution

This kernel is about computing the 1D valid convolution of two vectors.

While clang-4.0 is faster than clang-3.9, it is still slightly slower than both
gcc versions. On the GCC side, there is not a lot of difference except on the
1000x500 on which gcc-6.2 is 25% faster.

And here are the results with the naive implementation:

Again, on the naive version, clang is much faster than GCC on the naive, by
about 65%. This is a really large speedup.

2D Convolution

This next kernel is computing the 2D valid convolution of two matrices

There is no clear difference between the compilers in this code. Every compiler
here has up and down.

Let's look at the naive implementation of the 2D convolution (units are
milliseconds here not microseconds):

This time the difference is very large! Indeed, clang versions are about 60%
faster than the GCC versions! This is really impressive. Even though this does
not comes close to the optimized. It seems the vectorizer of clang is much more
efficient than the one from GCC.

4D Convolution

The final kernel that I'm testing is the batched 4D convolutions that is used a
lot in Deep Learning. This is not really a 4D convolution, but a large number
of 2D convolutions applied on 4D tensors.

Again, there are very small differences between each version. The best versions
are the most recent versions of the compiler gcc-6.2 and clang-4.0 on a tie.

Conclusion

Overall, we can see two trends in these results. First, when working with
highly-optimized code, the choice of compiler will not make a huge difference.
On these kind of kernels, gcc-6.2 tend to perform faster than the other
compilers, but only by a very slight margin, except in some cases. On the other
hand, when working with naive implementations, clang versions really did perform
much better than GCC. The clang compiled versions of the 1D and 2D convolutions
are more than 60% faster than their GCC counter parts. This is really
impressive. Overall, clang-4.0 seems to have several performance regressions,
but since it's not still a work in progress, I would not be suprised if these
regressions are not present in the final version. Since the clang-4.0 version is
in fact the clang version used by zapcc, it's also possible that zapcc is
introducing new performance regressions.

Overall, my advice would be to use GCC-6.2 (or 5.4) on hand-optimized kernels
and clang when you have mostly naive implementations. However, keep in mind that
at least for the example shown here, the naive version optimized by the compiler
never comes close to the highly-optimized version.

As ever, takes this with a grain of salt, it's only been tested on one project
and one machine, you may obtain very different results on other projects and on
other processors.