
Compiler benchmark: GCC and Clang on a C++ library (ETL)

It's been a while since I've done a benchmark of different compilers on C++ code. Since I've recently released version 1.1 of my ETL project (an optimized matrix/vector computation library with expression templates), I've decided to use it as the basis of my benchmark. It's a C++14 library with a lot of templates. I'm going to compile the full test suite (124 test cases), directly on the code of the latest release (1.1). I'm going to compile once in debug mode and once in release_debug mode (release plus debug symbols and assertions) and record the times for each compiler. The tests were compiled with support for every option in ETL, to account for the maximal compilation time. Each compilation was done using four threads (make -j4). I'm also going to run a few of the benchmarks to see the difference in runtime performance between the code generated by each compiler. The benchmark will be compiled in release mode, and its compilation time recorded as well.

I'm going to test the following compilers:

  • GCC-4.9.4
  • GCC-5.4.0
  • GCC-6.3.0
  • GCC-7.1.0
  • clang-3.9.1
  • clang-4.0.1
  • zapcc-1.0 (commercial, based on clang-5.0 trunk)

All were installed directly using Portage (the Gentoo package manager), except for clang-4.0.1, which was installed from source, and zapcc, which does not have a Gentoo package. Since the clang package on Gentoo does not support installing multiple slots, I had to install one version from source and the other from the package manager. This is also the reason I'm testing fewer versions of clang: it's simply less practical.

For the purpose of these tests, the exact same options have been used for all the compilers. Normally, I use different options for clang than for GCC (mainly more aggressive vectorization options on clang). This may not lead to the best performance for each compiler, but it allows for a comparison of the results at the default optimization levels. Here are the main options used:

  • In debug mode: -g
  • In release_debug mode: -g -O2
  • In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer

In each case, a lot of warnings are enabled and the ETL options are the same.

All the results have been gathered on a Gentoo machine running an Intel Core i7-2600 (Sandy Bridge) @ 3.4GHz with 4 cores and 8 threads, 12GB of RAM and an SSD. Although I did my best to isolate the benchmark from perturbations as much as possible, and I believe my benchmark code is quite sound, it may well be that some results are not totally accurate. Moreover, some of the benchmarks use multithreading, which may add some noise and unpredictability. When I was not sure about the results, I ran the benchmarks several times to confirm them, and overall I'm confident in the results.

Compilation Time

Let's start with the compilation performance of the compilers themselves:

Compiler Debug Release_Debug Benchmark
g++-4.9.4 402s 616s 100s
g++-5.4.0 403s 642s 95s
g++-6.3.0 399s 683s 102s
g++-7.1.0 371s 650s 105s
clang++-3.9.1 380s 807s 106s
clang++-4.0.1 260s 718s 92s
zapcc++-1.0 221s 649s 108s

Note: For Release_Debug and Benchmark, I only used three threads with zapcc, because 12GB of RAM is not enough memory for four threads.

There are some very significant differences between the compilers. Overall, clang-4.0.1 is by far the fastest free compiler in Debug mode. When the tests are compiled with optimizations, however, clang falls behind. It's quite impressive how much faster clang-4.0.1 is than clang-3.9.1, both in debug mode and in release mode; really great work by the clang team here! With these improvements, clang-4.0.1 is almost on par with gcc-7.1 in release mode. For GCC, the cost of optimization seems to have been going up quite significantly over the versions, although GCC-7.1 has made optimized compilation faster and standard compilation much faster again. If we take zapcc into account, it's the fastest compiler in debug mode, but it's slower than several GCC versions in release mode.

Overall, I'm quite impressed by the performance of clang-4.0.1, which seems really fast! I'll definitely run more tests with this new version of the compiler in the near future. It's also good to see that g++-7.1 made the build faster than gcc-6.3. However, the fastest GCC version for optimized builds is still gcc-4.9.4, which is already an old branch with limited C++ standard support.

Runtime Performance

Let's now take a look at the quality of the generated code. For some of the benchmarks, I've included two versions of the algorithm: std is the simplest (naive) implementation, while vec is the hand-crafted vectorized and optimized implementation. All the tests were done on single-precision floating points.

Dot product

The first benchmark computes the dot product of two vectors. Let's look first at the naive version.
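As a rough illustration, the std variant boils down to a plain accumulation loop like the following (a minimal sketch, not ETL's actual implementation):

```cpp
#include <cstddef>

// Minimal sketch of the naive (std) dot product: a plain accumulation
// loop that the compiler is free to optimize on its own. Illustrative
// only, not ETL's actual implementation.
float dot_std(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Here are the timings: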

dot (std) 100 500 1000 10000 100000 1000000 2000000 3000000 4000000 5000000 10000000
g++-4.9.4 64.96ns 97.12ns 126.07ns 1.89us 25.91us 326.49us 1.24ms 1.92ms 2.55ms 3.22ms 6.36ms
g++-5.4.0 72.96ns 101.62ns 127.89ns 1.90us 23.39us 357.63us 1.23ms 1.91ms 2.57ms 3.20ms 6.32ms
g++-6.3.0 73.31ns 102.88ns 130.16ns 1.89us 24.314us 339.13us 1.47ms 2.16ms 2.95ms 3.70ms 6.69ms
g++-7.1.0 70.20ns 104.09ns 130.98ns 1.90us 23.96us 281.47us 1.24ms 1.93ms 2.58ms 3.19ms 6.33ms
clang++-3.9.1 64.69ns 98.69ns 128.60ns 1.89us 23.33us 272.71us 1.24ms 1.91ms 2.56ms 3.19ms 6.37ms
clang++-4.0.1 60.31ns 96.34ns 128.90ns 1.89us 22.87us 270.21us 1.23ms 1.91ms 2.55ms 3.18ms 6.35ms
zapcc++-1.0 61.14ns 96.92ns 125.95ns 1.89us 23.84us 285.80us 1.24ms 1.92ms 2.55ms 3.16ms 6.34ms

The differences between the compilers are not very significant. The clang-based compilers seem to produce the fastest code. Interestingly, there seems to have been a big regression in gcc-6.3 for large containers, but it has been fixed in gcc-7.1.

dot (vec) 100 500 1000 10000 100000 1000000 2000000 3000000 4000000 5000000 10000000
g++-4.9.4 48.34ns 80.53ns 114.97ns 1.72us 22.79us 354.20us 1.24ms 1.89ms 2.52ms 3.19ms 6.55ms
g++-5.4.0 47.16ns 77.70ns 113.66ns 1.72us 22.71us 363.86us 1.24ms 1.89ms 2.52ms 3.19ms 6.56ms
g++-6.3.0 46.39ns 77.67ns 116.28ns 1.74us 23.39us 452.44us 1.45ms 2.26ms 2.87ms 3.49ms 7.52ms
g++-7.1.0 49.70ns 80.40ns 115.77ns 1.71us 22.46us 355.16us 1.21ms 1.85ms 2.49ms 3.14ms 6.47ms
clang++-3.9.1 46.13ns 78.01ns 114.70ns 1.66us 22.82us 359.42us 1.24ms 1.88ms 2.53ms 3.16ms 6.50ms
clang++-4.0.1 45.59ns 74.90ns 111.29ns 1.57us 22.47us 351.31us 1.23ms 1.85ms 2.49ms 3.12ms 6.45ms
zapcc++-1.0 45.11ns 75.04ns 111.28ns 1.59us 22.46us 357.32us 1.25ms 1.89ms 2.53ms 3.15ms 6.47ms

If we look at the optimized version, the differences are even smaller. Again, the clang-based compilers produce the fastest executables, but they are closely followed by gcc, except for gcc-6.3, in which we can still see the same regression as before.
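For context, a hand-vectorized kernel in the spirit of the vec variant could look like the following AVX sketch (a minimal illustration only, assuming AVX support as on the Sandy Bridge test machine; ETL's actual kernels are more involved):

```cpp
#include <immintrin.h> // AVX intrinsics, compile with -mavx
#include <cstddef>

// Hypothetical hand-vectorized dot product in the spirit of the vec
// variant. ETL's real kernels also unroll and handle alignment.
float dot_vec(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    // Process 8 single-precision floats per iteration
    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(a + i);
        __m256 y = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(x, y));
    }
    // Horizontal sum of the vector accumulator
    alignas(32) float buf[8];
    _mm256_store_ps(buf, acc);
    float sum = buf[0] + buf[1] + buf[2] + buf[3]
              + buf[4] + buf[5] + buf[6] + buf[7];
    // Scalar tail for the remaining elements
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```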

Logistic Sigmoid

The next test checks the performance of the logistic sigmoid operation. In this case, the evaluator of the library will try to use both parallelization and vectorization to compute it.
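Conceptually, the computation is just an elementwise map; a scalar sketch (illustrative only, since ETL's evaluator parallelizes and vectorizes such loops) could look like this:

```cpp
#include <cmath>
#include <cstddef>

// Scalar sketch of the elementwise logistic sigmoid. In ETL, the
// evaluator splits such loops across threads and vectorizes them;
// this version only illustrates the computation itself.
void sigmoid(const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = 1.0f / (1.0f + std::exp(-x[i]));
    }
}
```

Let's see how the different compilers fare: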

sigmoid 10 100 1000 10000 100000 1000000
g++-4.9.4 8.16us 5.23us 6.33us 29.56us 259.72us 2.78ms
g++-5.4.0 7.07us 5.08us 6.39us 29.44us 266.27us 2.96ms
g++-6.3.0 7.13us 5.32us 6.45us 28.99us 261.81us 2.86ms
g++-7.1.0 7.03us 5.09us 6.24us 28.61us 252.78us 2.71ms
clang++-3.9.1 7.30us 5.25us 6.57us 30.24us 256.75us 1.99ms
clang++-4.0.1 7.47us 5.14us 5.77us 26.03us 235.87us 1.81ms
zapcc++-1.0 7.51us 5.26us 6.48us 28.86us 258.31us 1.95ms

Interestingly, we can see that gcc-7.1 is the fastest for small vectors, while clang-4.0 produces the best code for larger vectors. However, except for the biggest vector size, the difference is not really significant. Apparently, there is a regression in zapcc (or clang-5.0 trunk), since it's slower than clang-4.0, at about the same level as clang-3.9.

y = alpha * x + y (axpy)

The third benchmark is the well-known axpy (y = alpha * x + y). This is entirely resolved by expression templates in the library; no specific algorithm is used.
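In other words, the whole expression collapses into a single loop, roughly like this minimal sketch (a standalone illustration, not ETL's expression-template machinery):

```cpp
#include <cstddef>

// Minimal sketch of saxpy (y = alpha * x + y). In ETL, this is written
// as `y += alpha * x` and the expression templates collapse it into a
// single loop of this shape.
void saxpy(float alpha, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Let's see the results: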

saxpy 10 100 1000 10000 100000 1000000
g++-4.9.4 38.1ns 61.6ns 374ns 3.65us 40.8us 518us
g++-5.4.0 35.0ns 58.1ns 383ns 3.87us 43.2us 479us
g++-6.3.0 34.3ns 59.4ns 371ns 3.57us 40.4us 452us
g++-7.1.0 34.8ns 59.7ns 399ns 3.78us 43.1us 547us
clang++-3.9.1 32.3ns 53.8ns 297ns 3.21us 38.3us 466us
clang++-4.0.1 32.4ns 59.8ns 296ns 3.31us 38.2us 475us
zapcc++-1.0 32.0ns 54.0ns 333ns 3.32us 38.7us 447us

Even on the biggest vector, this is a very fast operation once vectorized and parallelized. At this speed, some of the observed differences may not be highly significant. Again, the clang-based compilers are the fastest on this code, but by a small margin. There also seems to be a slight regression in gcc-7.1, but again a quite small one.

Matrix Matrix multiplication (GEMM)

The next benchmark tests the performance of a matrix-matrix multiplication, an operation known as GEMM in the BLAS nomenclature. In this case, we test both the naive and the optimized vectorized implementations. To save some horizontal space, I've split the tables in two. As a reference point, the naive version is sketched just below.
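The std variant is essentially the classic triple loop (a minimal sketch over square row-major matrices; the actual ETL code is generic over dimensions):

```cpp
#include <cstddef>

// Minimal sketch of the naive (std) GEMM over square row-major
// matrices: C = A * B with the classic triple loop, no blocking and
// no explicit vectorization. Illustrative only.
void sgemm_std(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

Here are the results: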

sgemm (std) 10 20 40 60 80 100
g++-4.9.4 7.04us 50.15us 356.42us 1.18ms 3.41ms 5.56ms
g++-5.4.0 8.14us 74.77us 513.64us 1.72ms 4.05ms 7.92ms
g++-6.3.0 8.03us 64.78us 504.41us 1.69ms 4.02ms 7.87ms
g++-7.1.0 7.95us 65.00us 508.84us 1.69ms 4.02ms 7.84ms
clang++-3.9.1 3.58us 28.59us 222.36us 0.73ms 1.77ms 3.41ms
clang++-4.0.1 4.00us 25.47us 190.56us 0.61ms 1.45ms 2.80ms
zapcc++-1.0 4.00us 25.38us 189.98us 0.60ms 1.43ms 2.81ms
sgemm (std) 200 300 400 500 600 700 800 900 1000 1200
g++-4.9.4 44.16ms 148.88ms 455.81ms 687.96ms 1.47s 1.98s 2.81s 4.00s 5.91s 9.52s
g++-5.4.0 63.17ms 213.01ms 504.83ms 984.90ms 1.70s 2.70s 4.03s 5.74s 7.87s 14.90s
g++-6.3.0 64.04ms 212.12ms 502.95ms 981.74ms 1.69s 2.69s 4.13s 5.85s 8.10s 14.08s
g++-7.1.0 62.57ms 210.72ms 499.68ms 974.94ms 1.68s 2.67s 3.99s 5.68s 7.85s 13.49s
clang++-3.9.1 27.48ms 90.85ms 219.34ms 419.53ms 0.72s 1.18s 1.90s 2.44s 3.36s 5.84s
clang++-4.0.1 22.01ms 73.90ms 175.02ms 340.70ms 0.58s 0.93s 1.40s 1.98s 2.79s 4.69s
zapcc++-1.0 22.33ms 75.80ms 181.27ms 359.13ms 0.63s 1.02s 1.52s 2.24s 3.21s 5.62s

This time, the differences between the compilers are very significant. The clang compilers are leading the way by a large margin here, with clang-4.0 being the fastest of them (by another nice margin). Indeed, clang-4.0.1 is producing code that is, on average, about twice as fast as the code generated by the best GCC compiler. Very interestingly as well, we can see a huge regression starting from GCC-5.4 that is still present in GCC-7.1. Indeed, the best of the tested GCC versions is again GCC-4.9.4. Clang is really doing an excellent job of compiling the GEMM code.

sgemm (vec) 10 20 40 60 80 100
g++-4.9.4 264.27ns 0.95us 3.28us 14.77us 23.50us 60.37us
g++-5.4.0 271.41ns 0.99us 3.31us 14.811us 24.116us 61.00us
g++-6.3.0 279.72ns 1.02us 3.27us 15.39us 24.29us 61.99us
g++-7.1.0 273.74ns 0.96us 3.81us 15.55us 31.35us 71.11us
clang++-3.9.1 296.67ns 1.34us 4.18us 19.93us 33.15us 82.60us
clang++-4.0.1 322.68ns 1.38us 4.17us 20.19us 34.17us 83.64us
zapcc++-1.0 307.49ns 1.41us 4.10us 19.72us 33.72us 84.80us
sgemm (vec) 200 300 400 500 600 700 800 900 1000 1200
g++-4.9.4 369.52us 1.62ms 2.91ms 7.17ms 11.74ms 22.91ms 34.82ms 51.67ms 64.36ms 111.15ms
g++-5.4.0 387.54us 1.60ms 2.97ms 7.36ms 12.11ms 24.37ms 35.37ms 52.27ms 65.72ms 112.74ms
g++-6.3.0 384.43us 1.74ms 3.12ms 7.16ms 12.44ms 24.15ms 34.87ms 52.59ms 70.074ms 119.22ms
g++-7.1.0 458.05us 1.81ms 3.44ms 7.86ms 13.43ms 24.70ms 36.54ms 53.47ms 66.87ms 117.25ms
clang++-3.9.1 494.52us 1.96ms 4.80ms 8.88ms 18.20ms 29.37ms 41.24ms 60.72ms 72.28ms 123.75ms
clang++-4.0.1 511.24us 2.04ms 4.11ms 9.46ms 15.34ms 27.23ms 38.27ms 58.14ms 72.78ms 128.60ms
zapcc++-1.0 492.28us 2.03ms 3.90ms 9.00ms 14.31ms 25.72ms 37.09ms 55.79ms 67.88ms 119.92ms

As for the optimized version, it seems that the two families are reversed. Indeed, GCC is doing a better job than clang here, and although the margin is not as big as before, it's still significant. We can still observe a small regression in the GCC versions, since the 4.9 version is again the fastest. As for the clang versions, it seems that clang-5.0 (used in zapcc) has had some performance improvements for this case.

For this case of matrix-matrix multiplication, it's very impressive that the differences in the non-optimized code are so significant. It's also impressive that each family of compilers has its own strength: clang seems much better at handling unoptimized code, while GCC is better at handling the hand-vectorized code.

Convolution (2D)

The last benchmark I considered is the valid convolution of 2D images. The code is quite similar to the GEMM code, but more complicated to optimize due to cache locality.
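For reference, here is a minimal sketch of a naive valid 2D convolution with a flipped kernel (an illustration only; the signature is hypothetical and ETL's actual code is generic):

```cpp
#include <cstddef>

// Minimal sketch of a naive valid 2D convolution (flipped kernel).
// The output has (in_h - k_h + 1) x (in_w - k_w + 1) points.
void conv2_valid_std(const float* in, std::size_t in_h, std::size_t in_w,
                     const float* k, std::size_t k_h, std::size_t k_w,
                     float* out) {
    const std::size_t out_h = in_h - k_h + 1;
    const std::size_t out_w = in_w - k_w + 1;
    for (std::size_t i = 0; i < out_h; ++i) {
        for (std::size_t j = 0; j < out_w; ++j) {
            float sum = 0.0f;
            for (std::size_t ki = 0; ki < k_h; ++ki) {
                for (std::size_t kj = 0; kj < k_w; ++kj) {
                    sum += in[(i + ki) * in_w + (j + kj)]
                         * k[(k_h - 1 - ki) * k_w + (k_w - 1 - kj)];
                }
            }
            out[i * out_w + j] = sum;
        }
    }
}
```

Let's look first at the naive version: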

sconv2_valid (std) 100x50 105x50 110x55 115x55 120x60 125x60 130x65 135x65 140x70
g++-4.9.4 27.93ms 33.68ms 40.62ms 48.23ms 57.27ms 67.02ms 78.45ms 92.53ms 105.08ms
g++-5.4.0 37.60ms 44.94ms 54.24ms 64.45ms 76.63ms 89.75ms 105.08ms 121.66ms 140.95ms
g++-6.3.0 37.10ms 44.99ms 54.34ms 64.54ms 76.54ms 89.87ms 105.35ms 121.94ms 141.20ms
g++-7.1.0 37.55ms 45.08ms 54.39ms 64.48ms 76.51ms 92.02ms 106.16ms 125.67ms 143.57ms
clang++-3.9.1 15.42ms 18.59ms 22.21ms 26.40ms 31.03ms 36.26ms 42.35ms 48.87ms 56.29ms
clang++-4.0.1 15.48ms 18.67ms 22.34ms 26.50ms 31.27ms 36.58ms 42.61ms 49.33ms 56.80ms
zapcc++-1.0 15.29ms 18.37ms 22.00ms 26.10ms 30.75ms 35.95ms 41.85ms 48.42ms 55.74ms

In this case, we can observe the same pattern as for the GEMM: the clang-based compilers are producing significantly faster code than the GCC versions. Moreover, we can also observe the same large regression starting from GCC-5.4.

sconv2_valid (vec) 100x50 105x50 110x55 115x55 120x60 125x60 130x65 135x65 140x70
g++-4.9.4 878.32us 1.07ms 1.20ms 1.68ms 2.04ms 2.06ms 2.54ms 3.20ms 4.14ms
g++-5.4.0 853.73us 1.03ms 1.15ms 1.36ms 1.76ms 2.05ms 2.44ms 2.91ms 3.13ms
g++-6.3.0 847.95us 1.02ms 1.14ms 1.35ms 1.74ms 1.98ms 2.43ms 2.90ms 3.12ms
g++-7.1.0 795.82us 0.93ms 1.05ms 1.24ms 1.60ms 1.77ms 2.20ms 2.69ms 2.81ms
clang++-3.9.1 782.46us 0.93ms 1.05ms 1.26ms 1.60ms 1.84ms 2.21ms 2.65ms 2.84ms
clang++-4.0.1 767.58us 0.92ms 1.04ms 1.25ms 1.59ms 1.83ms 2.20ms 2.62ms 2.83ms
zapcc++-1.0 782.49us 0.94ms 1.06ms 1.27ms 1.62ms 1.83ms 2.24ms 2.65ms 2.85ms

This time, clang manages to produce excellent results even on the optimized version. Indeed, all the clang-produced executables are significantly faster than those produced by GCC, except for GCC-7.1, which produces similar results; the other GCC versions are falling behind. It seems that it was only for the GEMM that clang had a lot of trouble handling the optimized code.

Conclusion

Clang seems to have recently done a lot of work on compilation time: clang-4.0.1 compiles much faster than clang-3.9. Although GCC-7.1 is faster than GCC-6.3, all the newer GCC versions are slower than GCC-4.9.4, which is still the fastest at compiling code with optimizations enabled. GCC-7.1 is the fastest GCC version for compiling code in debug mode.

In some cases, there is almost no difference between the code generated by the different compilers. However, on more complex algorithms such as the matrix-matrix multiplication or the two-dimensional convolution, the differences can be quite significant. In my tests, clang has shown itself to be much better at compiling unoptimized code. However, especially in the GEMM case, it seems to be worse than GCC at handling hand-optimized code. I will investigate that case and try to tailor the code so that clang has an easier time with it.

To me, it's really strange that the GCC runtime regression, apparently introduced in GCC-5.4, has still not been fixed in GCC-7.1. I was thinking of dropping support for GCC-4.9 in order to go to full C++14 support, but now I may have to reconsider my position. However, seeing that GCC is generally the best at handling the optimized code (especially for GEMM), I may still be able to make the transition, since the optimized code will be used in most cases.

As for zapcc, although it is still the fastest compiler in debug mode, its margin over the new clang-4.0.1 is quite small. Moreover, on optimized builds, it's not as fast as GCC. If you use clang and have access to zapcc, it's still quite a good option for saving some time.

Overall, I have been quite pleased by clang-4.0.1 and GCC-7.1, the most recent versions I tested. It seems that both teams have done some very good work. I will definitely run more tests with them and try to adapt the code. I'm still considering whether I will drop support for some older compilers.

I hope this comparison was interesting :) My next post will probably be about the difference in performance between my machine learning framework and other frameworks for training neural networks.
