In my Expression Templates Library (ETL) project, I have a lot of template-heavy code that needs to run as fast as possible and that is quite expensive to compile. In this post, I'm going to compare the performance of a few kernels as compiled by different compilers: GCC 5.4, GCC 6.2 and clang 3.9. I also included zapcc, which is based on clang 4.0.
These tests have been run on a Haswell processor. The automatic parallelization of ETL has been turned off for these tests.
Keep in mind that some of the diagrams are presented in logarithmic form.
The first kernel is a very simple one: the element-wise multiplication of two vectors. Nothing fancy here.
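To make the operation concrete, here is a minimal sketch of what such a kernel computes, written as a plain loop over `std::vector`. This is not ETL's code: the function name `elementwise_mul` is hypothetical, and ETL's expression templates generate their own (vectorized) evaluation of this operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ETL): c[i] = a[i] * b[i].
std::vector<float> elementwise_mul(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    assert(a.size() == b.size());

    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] * b[i];
    }
    return c;
}
```

A loop like this does one multiplication per two loads and one store, which is why large vectors end up limited by memory rather than by the generated code.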
For small vectors, clang is significantly slower than gcc-5.4 and gcc-6.2. From 100'000 elements upward, the speed is comparable for all compilers, as performance then depends mostly on the memory bandwidth. Overall, gcc-6.2 produces the fastest code here. clang-4.0 is slightly slower than clang-3.9, but nothing dramatic.
The second kernel computes the exponential of each element of a vector and stores the results in another vector.
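As a reference for what is being measured, the operation can be sketched as a scalar loop calling `std::exp`. The function name `elementwise_exp` is an assumption made for illustration, and this is not how ETL evaluates the expression.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ETL): out[i] = exp(in[i]).
void elementwise_exp(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());

    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = std::exp(in[i]);
    }
}
```

Unlike the multiplication kernel, each element here costs many arithmetic instructions, so the quality of the generated code matters more than memory bandwidth.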
Interestingly, this time, the clang versions are significantly faster for medium to large vectors, from 1000 elements and higher, by about 5%. There are no significant differences between the different versions of each compiler.
The next kernel I benchmarked is the matrix-matrix multiplication. In this case, the kernel is hand-unrolled and vectorized.
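For readers who want the operation spelled out, here is a sketch of the textbook triple loop on row-major matrices. It only shows what is computed: it is not the hand-unrolled, vectorized kernel benchmarked here, and the name `gemm_naive` and the flat row-major storage are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference implementation: C (m x n) = A (m x k) * B (k x n).
// All matrices are stored row-major in flat vectors.
std::vector<float> gemm_naive(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);

    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}
```

In this form the inner loop walks `b` with a stride of `n`, which is one reason a compiler's loop optimizations can make a large difference on naive code.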
There are a few differences between the compilers. First, for some sizes such as 80x80 and 100x100, clang is significantly faster than GCC, by more than 10%. The other interesting fact is that for large matrices, zapcc-clang-4.0 is always slower than clang-3.9, which is itself on par with the two GCC versions. In my opinion, this comes from a regression in clang trunk, but it could also come from zapcc itself.
The results are much more interesting here! First, there is a huge regression in clang-4.0 (or in zapcc, for that matter). Indeed, it is up to 6 times slower than clang-3.9. Moreover, clang-3.9 is always significantly faster than gcc-6.2. Finally, there is a small improvement in gcc-6.2 compared to gcc-5.4.
The following kernel is a hand-crafted Fast Fourier Transform implementation.
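To give an idea of the kind of code involved, here is a minimal sketch of a recursive radix-2 Cooley-Tukey FFT. It is an illustration only and not ETL's implementation: it is restricted to power-of-two sizes (the benchmark also covers sizes such as 10000, which this sketch cannot handle), and the name `fft_radix2` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical in-place forward FFT; x.size() must be a power of two.
void fft_radix2(std::vector<cplx>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    if (n == 1) {
        return;
    }

    // Split into even-indexed and odd-indexed halves.
    std::vector<cplx> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    fft_radix2(even);
    fft_radix2(odd);

    // Combine the two half-size transforms with the twiddle factors.
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const cplx t = std::polar(1.0, angle) * odd[k];
        x[k]         = even[k] + t;
        x[k + n / 2] = even[k] - t;
    }
}
```

A tuned implementation would avoid the temporary allocations and precompute the twiddle factors, which is where hand-crafting usually pays off.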
On this benchmark, gcc-6.2 is the clear winner. It is significantly faster than clang-3.9 and clang-4.0. Moreover, gcc-6.2 is also faster than gcc-5.4. On the clang side, clang-4.0 is significantly slower than clang-3.9, except on one configuration (10000 elements).
This kernel computes the 1D valid convolution of two vectors.
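As a reminder, a "valid" convolution only produces the outputs where the kernel fully overlaps the input, so an input of size n and a kernel of size m give n - m + 1 values. A naive version can be sketched as follows; the name `conv1d_valid_naive` is hypothetical and this is not ETL's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive 1D valid convolution.
// out[i] = sum over j of in[i + j] * kernel[m - 1 - j] (the kernel is flipped).
std::vector<float> conv1d_valid_naive(const std::vector<float>& in,
                                      const std::vector<float>& kernel) {
    assert(!kernel.empty());
    assert(in.size() >= kernel.size());

    const std::size_t m = kernel.size();
    std::vector<float> out(in.size() - m + 1, 0.0f);
    for (std::size_t i = 0; i < out.size(); ++i) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < m; ++j) {
            sum += in[i + j] * kernel[m - 1 - j];
        }
        out[i] = sum;
    }
    return out;
}
```

The inner loop is a simple reduction over contiguous memory, which is the kind of pattern an auto-vectorizer can pick up.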
While clang-4.0 is faster than clang-3.9, it is still slightly slower than both GCC versions. On the GCC side, there is not a lot of difference, except on the 1000x500 configuration, where gcc-6.2 is 25% faster.
And here are the results with the naive implementation:
Again, on the naive version, clang is much faster than GCC, by about 65%. This is a really large speedup.
This next kernel computes the 2D valid convolution of two matrices.
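The 2D case follows the same idea as the 1D one: the kernel slides over every position where it fits entirely inside the input. Here is a minimal sketch on row-major matrices; the name `conv2d_valid_naive` and the flat storage are assumptions made for illustration, not ETL's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive 2D valid convolution on row-major matrices.
// Input is h x w, kernel is kh x kw, output is (h - kh + 1) x (w - kw + 1).
std::vector<float> conv2d_valid_naive(const std::vector<float>& in,
                                      std::size_t h, std::size_t w,
                                      const std::vector<float>& kernel,
                                      std::size_t kh, std::size_t kw) {
    assert(kh >= 1 && kw >= 1 && h >= kh && w >= kw);
    assert(in.size() == h * w);
    assert(kernel.size() == kh * kw);

    const std::size_t oh = h - kh + 1;
    const std::size_t ow = w - kw + 1;
    std::vector<float> out(oh * ow, 0.0f);
    for (std::size_t i = 0; i < oh; ++i) {
        for (std::size_t j = 0; j < ow; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < kh; ++p) {
                for (std::size_t q = 0; q < kw; ++q) {
                    // The kernel is flipped in both dimensions.
                    sum += in[(i + p) * w + (j + q)]
                         * kernel[(kh - 1 - p) * kw + (kw - 1 - q)];
                }
            }
            out[i * ow + j] = sum;
        }
    }
    return out;
}
```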
There is no clear difference between the compilers on this code. Every compiler here has its ups and downs.
Let's look at the naive implementation of the 2D convolution (units are milliseconds here not microseconds):
This time the difference is very large! Indeed, the clang versions are about 60% faster than the GCC versions! This is really impressive, even though this does not come close to the optimized version. It seems the vectorizer of clang is much more efficient than the one from GCC.
The final kernel that I'm testing is the batched 4D convolution, which is used a lot in Deep Learning. This is not really a 4D convolution, but a large number of 2D convolutions applied on 4D tensors.
Again, there are very small differences between the versions. The best versions are the most recent versions of each compiler, gcc-6.2 and clang-4.0, which are tied.
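To illustrate the "many 2D convolutions" structure, here is a naive sketch. The tensor layouts are assumptions, not something stated above: inputs as [N][C][H][W], kernels as [K][C][KH][KW], and each output image being the sum over the channels of the 2D valid convolutions. The names `Conv4dShape` and `conv4d_valid_naive` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed row-major layouts (for illustration only):
//   input   [n][c][h][w]
//   kernels [k][c][kh][kw]
//   output  [n][k][h - kh + 1][w - kw + 1]
struct Conv4dShape {
    std::size_t n, c, h, w, k, kh, kw;
};

std::vector<float> conv4d_valid_naive(const std::vector<float>& input,
                                      const std::vector<float>& kernels,
                                      const Conv4dShape& s) {
    assert(s.kh >= 1 && s.kw >= 1 && s.h >= s.kh && s.w >= s.kw);
    assert(input.size() == s.n * s.c * s.h * s.w);
    assert(kernels.size() == s.k * s.c * s.kh * s.kw);

    const std::size_t oh = s.h - s.kh + 1;
    const std::size_t ow = s.w - s.kw + 1;
    std::vector<float> output(s.n * s.k * oh * ow, 0.0f);

    for (std::size_t img = 0; img < s.n; ++img) {
        for (std::size_t f = 0; f < s.k; ++f) {
            float* out = output.data() + (img * s.k + f) * oh * ow;
            for (std::size_t ch = 0; ch < s.c; ++ch) {
                const float* in  = input.data() + (img * s.c + ch) * s.h * s.w;
                const float* ker = kernels.data() + (f * s.c + ch) * s.kh * s.kw;
                // One 2D valid convolution, accumulated over the channels.
                for (std::size_t i = 0; i < oh; ++i) {
                    for (std::size_t j = 0; j < ow; ++j) {
                        for (std::size_t p = 0; p < s.kh; ++p) {
                            for (std::size_t q = 0; q < s.kw; ++q) {
                                out[i * ow + j] += in[(i + p) * s.w + (j + q)]
                                    * ker[(s.kh - 1 - p) * s.kw + (s.kw - 1 - q)];
                            }
                        }
                    }
                }
            }
        }
    }
    return output;
}
```

The three outer loops are independent 2D problems, so an optimized implementation has a lot of freedom in how it batches and reorders them.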
Overall, we can see two trends in these results. First, when working with highly-optimized code, the choice of compiler will not make a huge difference. On these kinds of kernels, gcc-6.2 tends to perform faster than the other compilers, but only by a very slight margin, except in some cases. On the other hand, when working with naive implementations, the clang versions really did perform much better than GCC. The clang-compiled versions of the 1D and 2D convolutions are more than 60% faster than their GCC counterparts. This is really impressive. Overall, clang-4.0 seems to have several performance regressions, but since it's still a work in progress, I would not be surprised if these regressions are not present in the final version. Since the clang-4.0 version is in fact the clang version used by zapcc, it's also possible that zapcc is introducing new performance regressions.
Overall, my advice would be to use GCC-6.2 (or 5.4) on hand-optimized kernels and clang when you have mostly naive implementations. However, keep in mind that, at least for the examples shown here, the naive version optimized by the compiler never comes close to the highly-optimized version.
As ever, take this with a grain of salt. It's only been tested on one project and one machine, and you may obtain very different results on other projects and on other processors.