Compiler benchmark: GCC and Clang on a C++ library (ETL)

It's been a while since I've benchmarked different compilers on C++ code. Since I've recently released version 1.1 of my ETL project (an optimized matrix/vector computation library with expression templates), I've decided to use it as the basis of this benchmark. It's a C++14 library with a lot of templates. I'm going to compile the full test suite (124 test cases), directly on the released 1.1 code, once in debug mode and once in release_debug mode (release plus debug symbols and assertions), and record the times for each compiler. The tests were compiled with support for every ETL option enabled, to account for the maximal compilation time. Each compilation was done using four threads (make -j4). I'm also going to run a few of the benchmarks to see the difference in runtime performance between the code generated by each compiler. The benchmark itself is compiled in release mode, and its compilation time is recorded as well.

I'm going to test the following compilers:

  • GCC-4.9.4

  • GCC-5.4.0

  • GCC-6.3.0

  • GCC-7.1.0

  • clang-3.9.1

  • clang-4.0.1

  • zapcc-1.0 (commercial, based on clang-5.0 trunk)

All of them have been installed directly using Portage (the Gentoo package manager), except for clang-4.0.1, which was installed from sources, and zapcc, which does not have a Gentoo package. Since the clang package on Gentoo does not support installing multiple slots, I had to install one version from source and the other from the package manager. This is also the reason I'm testing fewer versions of clang: it's simply less practical.

For these tests, the exact same options have been used for all the compilers. Normally, I use different options for clang than for GCC (mainly more aggressive vectorization options on clang). This may not lead to the best performance for each compiler, but it allows comparing the results at the default optimization levels. Here are the main options used:

  • In debug mode: -g

  • In release_debug mode: -g -O2

  • In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer

In each case, a lot of warnings are enabled and the ETL options are the same.

All the results have been gathered on a Gentoo machine with an Intel Core i7-2600 (Sandy Bridge) @ 3.4GHz, with 4 cores and 8 threads, 12GB of RAM and an SSD. Although I did my best to isolate the benchmarks from perturbations and I believe my benchmark code is quite sound, it may well be that some results are not totally accurate. Moreover, some of the benchmarks use multithreading, which may add some noise and unpredictability. When I was not sure about the results, I ran the benchmarks several times to confirm them, and overall I'm confident in the results.
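To give an idea of the methodology, here is a minimal sketch of the kind of measurement loop behind the runtime numbers below. This is an illustration only, not the actual ETL benchmark harness: warm up first, then keep the best of several timed runs to reduce the impact of noise.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>

// Minimal measurement loop (illustration only, not the actual ETL
// benchmark harness): run a few warmup iterations, then keep the best
// time over several runs to reduce the impact of external noise.
template <typename Functor>
double measure_us(Functor functor, std::size_t warmup = 10, std::size_t repeat = 100) {
    using clock = std::chrono::steady_clock;

    for (std::size_t i = 0; i < warmup; ++i) {
        functor();
    }

    double best = std::numeric_limits<double>::max();

    for (std::size_t i = 0; i < repeat; ++i) {
        auto start = clock::now();
        functor();
        auto end = clock::now();

        std::chrono::duration<double, std::micro> elapsed = end - start;
        best = std::min(best, elapsed.count());
    }

    return best;
}
```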

Compilation Time

Let's start with the performance of the compilers themselves:

| Compiler | Debug | Release_Debug | Benchmark |
|---|---|---|---|
| g++-4.9.4 | 402s | 616s | 100s |
| g++-5.4.0 | 403s | 642s | 95s |
| g++-6.3.0 | 399s | 683s | 102s |
| g++-7.1.0 | 371s | 650s | 105s |
| clang++-3.9.1 | 380s | 807s | 106s |
| clang++-4.0.1 | 260s | 718s | 92s |
| zapcc++-1.0 | 221s | 649s | 108s |

Note: For Release_Debug and Benchmark, I only used three threads with zapcc, because 12GB of RAM is not enough memory for four threads.

There are some very significant differences between the compilers. Overall, clang-4.0.1 is by far the fastest free compiler in Debug mode. When the tests are compiled with optimizations, however, clang falls behind. It's quite impressive how much faster clang-4.0.1 is than clang-3.9.1, both in debug mode and in release mode. Really great work by the clang team here! With these improvements, clang-4.0.1 is almost on par with gcc-7.1 in release mode. For GCC, it seems that the cost of optimization has been going up quite significantly; however, GCC 7.1 made optimization faster and standard compilation much faster as well. If we take zapcc into account, it's the fastest compiler in debug mode, but it's slower than several gcc versions in release mode.

Overall, I'm quite impressed by the performance of clang-4.0.1, which seems really fast! I'll definitely run more tests with this new version of the compiler in the near future. It's also good to see that g++-7.1 made the build faster than gcc-6.3. However, the fastest gcc version for optimized compilation is still gcc-4.9.4, which is already an old branch with limited C++ standard support.

Runtime Performance

Let's now take a look at the quality of the generated code. For some of the benchmarks, I've included two versions of the algorithm: std is the simplest (naive) implementation, while vec is the hand-crafted vectorized and optimized implementation. All the tests were done on single-precision floating-point numbers.

Dot product

The first benchmark computes the dot product between two vectors.
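To make clear what is being measured, here is roughly what the naive (std) kernel boils down to. This is a simplified sketch, not ETL's exact implementation; the compiler is left free to vectorize the loop on its own.

```cpp
#include <cstddef>

// Simplified sketch of the naive (std) dot product kernel:
// a plain loop that the compiler is free to auto-vectorize.
float dot_std(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Let's look first at the naive version: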

| dot (std) | 100 | 500 | 1000 | 10000 | 100000 | 1000000 | 2000000 | 3000000 | 4000000 | 5000000 | 10000000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| g++-4.9.4 | 64.96ns | 97.12ns | 126.07ns | 1.89us | 25.91us | 326.49us | 1.24ms | 1.92ms | 2.55ms | 3.22ms | 6.36ms |
| g++-5.4.0 | 72.96ns | 101.62ns | 127.89ns | 1.90us | 23.39us | 357.63us | 1.23ms | 1.91ms | 2.57ms | 3.20ms | 6.32ms |
| g++-6.3.0 | 73.31ns | 102.88ns | 130.16ns | 1.89us | 24.31us | 339.13us | 1.47ms | 2.16ms | 2.95ms | 3.70ms | 6.69ms |
| g++-7.1.0 | 70.20ns | 104.09ns | 130.98ns | 1.90us | 23.96us | 281.47us | 1.24ms | 1.93ms | 2.58ms | 3.19ms | 6.33ms |
| clang++-3.9.1 | 64.69ns | 98.69ns | 128.60ns | 1.89us | 23.33us | 272.71us | 1.24ms | 1.91ms | 2.56ms | 3.19ms | 6.37ms |
| clang++-4.0.1 | 60.31ns | 96.34ns | 128.90ns | 1.89us | 22.87us | 270.21us | 1.23ms | 1.91ms | 2.55ms | 3.18ms | 6.35ms |
| zapcc++-1.0 | 61.14ns | 96.92ns | 125.95ns | 1.89us | 23.84us | 285.80us | 1.24ms | 1.92ms | 2.55ms | 3.16ms | 6.34ms |

The differences between the compilers are not very significant. The clang-based compilers seem to produce the fastest code. Interestingly, there seems to have been a big regression in gcc-6.3 for large containers, which has been fixed in gcc-7.1.
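The vec version replaces the plain loop with explicit SIMD. Purely as an illustration (ETL goes through its own vectorization abstraction; this sketch assumes AVX support), the core idea is the following:

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustration of the idea behind a hand-vectorized dot product,
// assuming AVX: accumulate 8 floats per iteration, then reduce.
// ETL itself goes through its own vectorization abstraction.
float dot_vec(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(a + i);
        __m256 y = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(x, y));
    }

    // Horizontal reduction of the vector accumulator
    alignas(32) float buffer[8];
    _mm256_store_ps(buffer, acc);
    float result = buffer[0] + buffer[1] + buffer[2] + buffer[3]
                 + buffer[4] + buffer[5] + buffer[6] + buffer[7];

    // Remainder loop for sizes that are not a multiple of 8
    for (; i < n; ++i) {
        result += a[i] * b[i];
    }

    return result;
}
```

Here are the results for the optimized version: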

| dot (vec) | 100 | 500 | 1000 | 10000 | 100000 | 1000000 | 2000000 | 3000000 | 4000000 | 5000000 | 10000000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| g++-4.9.4 | 48.34ns | 80.53ns | 114.97ns | 1.72us | 22.79us | 354.20us | 1.24ms | 1.89ms | 2.52ms | 3.19ms | 6.55ms |
| g++-5.4.0 | 47.16ns | 77.70ns | 113.66ns | 1.72us | 22.71us | 363.86us | 1.24ms | 1.89ms | 2.52ms | 3.19ms | 6.56ms |
| g++-6.3.0 | 46.39ns | 77.67ns | 116.28ns | 1.74us | 23.39us | 452.44us | 1.45ms | 2.26ms | 2.87ms | 3.49ms | 7.52ms |
| g++-7.1.0 | 49.70ns | 80.40ns | 115.77ns | 1.71us | 22.46us | 355.16us | 1.21ms | 1.85ms | 2.49ms | 3.14ms | 6.47ms |
| clang++-3.9.1 | 46.13ns | 78.01ns | 114.70ns | 1.66us | 22.82us | 359.42us | 1.24ms | 1.88ms | 2.53ms | 3.16ms | 6.50ms |
| clang++-4.0.1 | 45.59ns | 74.90ns | 111.29ns | 1.57us | 22.47us | 351.31us | 1.23ms | 1.85ms | 2.49ms | 3.12ms | 6.45ms |
| zapcc++-1.0 | 45.11ns | 75.04ns | 111.28ns | 1.59us | 22.46us | 357.32us | 1.25ms | 1.89ms | 2.53ms | 3.15ms | 6.47ms |

If we look at the optimized version, the differences are even smaller. Again, the clang-based compilers produce the fastest executables, but they are closely followed by gcc, except for gcc-6.3, in which we can still see the same regression as before.

Logistic Sigmoid

The next test checks the performance of the logistic sigmoid operation. In that case, the evaluator of the library will try to use both parallelization and vectorization to compute it.
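As a rough sketch (ignoring the parallelization and the vectorized exp that the evaluator adds on top), the computed operation is simply:

```cpp
#include <cmath>
#include <cstddef>

// Elementwise logistic sigmoid (simplified sketch; ETL's evaluator
// additionally vectorizes exp and splits the work across threads
// for large containers).
void sigmoid(const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = 1.0f / (1.0f + std::exp(-x[i]));
    }
}
```

Let's see how the different compilers fare: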

| sigmoid | 10 | 100 | 1000 | 10000 | 100000 | 1000000 |
|---|---|---|---|---|---|---|
| g++-4.9.4 | 8.16us | 5.23us | 6.33us | 29.56us | 259.72us | 2.78ms |
| g++-5.4.0 | 7.07us | 5.08us | 6.39us | 29.44us | 266.27us | 2.96ms |
| g++-6.3.0 | 7.13us | 5.32us | 6.45us | 28.99us | 261.81us | 2.86ms |
| g++-7.1.0 | 7.03us | 5.09us | 6.24us | 28.61us | 252.78us | 2.71ms |
| clang++-3.9.1 | 7.30us | 5.25us | 6.57us | 30.24us | 256.75us | 1.99ms |
| clang++-4.0.1 | 7.47us | 5.14us | 5.77us | 26.03us | 235.87us | 1.81ms |
| zapcc++-1.0 | 7.51us | 5.26us | 6.48us | 28.86us | 258.31us | 1.95ms |

Interestingly, we can see that gcc-7.1 is the fastest for small vectors, while clang-4.0 produces the best code for larger vectors. However, except for the biggest vector size, the differences are not really significant. Apparently, there is a regression in zapcc (or in clang-5.0 trunk), since it's slower than clang-4.0, back at the level of clang-3.9.

y = alpha * x + y (axpy)

The third benchmark is the well-known axpy (y = alpha * x + y). This is entirely resolved by expression templates in the library; no specific algorithm is used.
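To illustrate what "resolved by expression templates" means, here is a deliberately minimal sketch of the mechanism (my own simplification, far from ETL's actual machinery): the right-hand side is built as a lightweight expression object and evaluated in a single loop, with no temporary vector.

```cpp
#include <cstddef>
#include <vector>

// Expression for the elementwise sum of two operands
template <typename L, typename R>
struct add_expr {
    L lhs; // held by value: may itself be an expression
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// Expression for an operand scaled by a constant
template <typename R>
struct scale_expr {
    float alpha;
    const R& rhs;
    float operator[](std::size_t i) const { return alpha * rhs[i]; }
};

struct vec {
    std::vector<float> data;

    float operator[](std::size_t i) const { return data[i]; }

    // Assigning an expression evaluates it element by element:
    // one loop, no temporaries
    template <typename E>
    vec& operator=(const E& expr) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = expr[i];
        }
        return *this;
    }
};

// y = alpha * x + y, built and evaluated as a single expression
void axpy(float alpha, const vec& x, vec& y) {
    y = add_expr<scale_expr<vec>, vec>{{alpha, x}, y};
}
```

Let's see the results: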

| saxpy | 10 | 100 | 1000 | 10000 | 100000 | 1000000 |
|---|---|---|---|---|---|---|
| g++-4.9.4 | 38.1ns | 61.6ns | 374ns | 3.65us | 40.8us | 518us |
| g++-5.4.0 | 35.0ns | 58.1ns | 383ns | 3.87us | 43.2us | 479us |
| g++-6.3.0 | 34.3ns | 59.4ns | 371ns | 3.57us | 40.4us | 452us |
| g++-7.1.0 | 34.8ns | 59.7ns | 399ns | 3.78us | 43.1us | 547us |
| clang++-3.9.1 | 32.3ns | 53.8ns | 297ns | 3.21us | 38.3us | 466us |
| clang++-4.0.1 | 32.4ns | 59.8ns | 296ns | 3.31us | 38.2us | 475us |
| zapcc++-1.0 | 32.0ns | 54.0ns | 333ns | 3.32us | 38.7us | 447us |

Even on the biggest vector, this is a very fast operation once vectorized and parallelized. At this speed, some of the observed differences may not be highly significant. Again, the clang-based versions are the fastest on this code, but by a small margin. There also seems to be a slight regression in gcc-7.1, but again a quite small one.

Matrix-Matrix Multiplication (GEMM)

The next benchmark tests the performance of a matrix-matrix multiplication, an operation known as GEMM in the BLAS nomenclature. In that case, we test both the naive and the optimized vectorized implementations. To save some horizontal space, I've split each table in two.
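For reference, the naive (std) version is essentially the textbook triple loop. This is a simplified sketch of it, not ETL's exact code:

```cpp
#include <cstddef>

// Simplified sketch of the naive (std) GEMM kernel for row-major
// square matrices: the textbook triple loop, C = A * B.
void gemm_std(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}
```

Let's look first at the naive version on small matrices: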

| sgemm (std) | 10 | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|---|
| g++-4.9.4 | 7.04us | 50.15us | 356.42us | 1.18ms | 3.41ms | 5.56ms |
| g++-5.4.0 | 8.14us | 74.77us | 513.64us | 1.72ms | 4.05ms | 7.92ms |
| g++-6.3.0 | 8.03us | 64.78us | 504.41us | 1.69ms | 4.02ms | 7.87ms |
| g++-7.1.0 | 7.95us | 65.00us | 508.84us | 1.69ms | 4.02ms | 7.84ms |
| clang++-3.9.1 | 3.58us | 28.59us | 222.36us | 0.73ms | 1.77ms | 3.41ms |
| clang++-4.0.1 | 4.00us | 25.47us | 190.56us | 0.61ms | 1.45ms | 2.80ms |
| zapcc++-1.0 | 4.00us | 25.38us | 189.98us | 0.60ms | 1.43ms | 2.81ms |

| sgemm (std) | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 | 1200 |
|---|---|---|---|---|---|---|---|---|---|---|
| g++-4.9.4 | 44.16ms | 148.88ms | 455.81ms | 687.96ms | 1.47s | 1.98s | 2.81s | 4.00s | 5.91s | 9.52s |
| g++-5.4.0 | 63.17ms | 213.01ms | 504.83ms | 984.90ms | 1.70s | 2.70s | 4.03s | 5.74s | 7.87s | 14.90s |
| g++-6.3.0 | 64.04ms | 212.12ms | 502.95ms | 981.74ms | 1.69s | 2.69s | 4.13s | 5.85s | 8.10s | 14.08s |
| g++-7.1.0 | 62.57ms | 210.72ms | 499.68ms | 974.94ms | 1.68s | 2.67s | 3.99s | 5.68s | 7.85s | 13.49s |
| clang++-3.9.1 | 27.48ms | 90.85ms | 219.34ms | 419.53ms | 0.72s | 1.18s | 1.90s | 2.44s | 3.36s | 5.84s |
| clang++-4.0.1 | 22.01ms | 73.90ms | 175.02ms | 340.70ms | 0.58s | 0.93s | 1.40s | 1.98s | 2.79s | 4.69s |
| zapcc++-1.0 | 22.33ms | 75.80ms | 181.27ms | 359.13ms | 0.63s | 1.02s | 1.52s | 2.24s | 3.21s | 5.62s |

This time, the differences between the compilers are very significant. The clang compilers are leading the way by a large margin here, with clang-4.0 the fastest of them (by another nice margin). Indeed, clang-4.0.1 produces code that is, on average, about twice as fast as the code generated by the best GCC version. Very interestingly, we can also see a huge regression starting from GCC-5.4 that is still present in GCC-7.1. Indeed, the best GCC version among those tested is again GCC-4.9.4. Clang is really doing an excellent job of compiling the GEMM code.
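The hand-optimized vec kernel is a different beast. As a rough illustration of the kind of transformation involved (this is not ETL's actual kernel, which also uses explicit SIMD and blocking), simply reordering the loops already makes the innermost loop stream over contiguous memory, where it can be vectorized efficiently:

```cpp
#include <cstddef>
#include <cstring>

// Rough illustration of one transformation behind an optimized GEMM
// (not ETL's actual kernel): with the k and j loops swapped, the
// innermost loop walks both B and C contiguously, which is friendly
// to both the cache and the vector units.
void gemm_vec_idea(const float* a, const float* b, float* c, std::size_t n) {
    std::memset(c, 0, n * n * sizeof(float));

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float a_ik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
}
```

Let's now see how the compilers handle the hand-optimized version, first on small matrices: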

| sgemm (vec) | 10 | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|---|
| g++-4.9.4 | 264.27ns | 0.95us | 3.28us | 14.77us | 23.50us | 60.37us |
| g++-5.4.0 | 271.41ns | 0.99us | 3.31us | 14.81us | 24.12us | 61.00us |
| g++-6.3.0 | 279.72ns | 1.02us | 3.27us | 15.39us | 24.29us | 61.99us |
| g++-7.1.0 | 273.74ns | 0.96us | 3.81us | 15.55us | 31.35us | 71.11us |
| clang++-3.9.1 | 296.67ns | 1.34us | 4.18us | 19.93us | 33.15us | 82.60us |
| clang++-4.0.1 | 322.68ns | 1.38us | 4.17us | 20.19us | 34.17us | 83.64us |
| zapcc++-1.0 | 307.49ns | 1.41us | 4.10us | 19.72us | 33.72us | 84.80us |

| sgemm (vec) | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 | 1200 |
|---|---|---|---|---|---|---|---|---|---|---|
| g++-4.9.4 | 369.52us | 1.62ms | 2.91ms | 7.17ms | 11.74ms | 22.91ms | 34.82ms | 51.67ms | 64.36ms | 111.15ms |
| g++-5.4.0 | 387.54us | 1.60ms | 2.97ms | 7.36ms | 12.11ms | 24.37ms | 35.37ms | 52.27ms | 65.72ms | 112.74ms |
| g++-6.3.0 | 384.43us | 1.74ms | 3.12ms | 7.16ms | 12.44ms | 24.15ms | 34.87ms | 52.59ms | 70.07ms | 119.22ms |
| g++-7.1.0 | 458.05us | 1.81ms | 3.44ms | 7.86ms | 13.43ms | 24.70ms | 36.54ms | 53.47ms | 66.87ms | 117.25ms |
| clang++-3.9.1 | 494.52us | 1.96ms | 4.80ms | 8.88ms | 18.20ms | 29.37ms | 41.24ms | 60.72ms | 72.28ms | 123.75ms |
| clang++-4.0.1 | 511.24us | 2.04ms | 4.11ms | 9.46ms | 15.34ms | 27.23ms | 38.27ms | 58.14ms | 72.78ms | 128.60ms |
| zapcc++-1.0 | 492.28us | 2.03ms | 3.90ms | 9.00ms | 14.31ms | 25.72ms | 37.09ms | 55.79ms | 67.88ms | 119.92ms |

As for the optimized version, it seems that the two families are reversed. Indeed, GCC is doing a better job than clang here, and although the margin is not as big as before, it's still significant. We can still observe a small regression in the GCC versions, since 4.9 is again the fastest. As for the clang versions, it seems that clang-5.0 (used in zapcc) has had some performance improvements for this case.

For this matrix-matrix multiplication, it's very impressive that the differences on the non-optimized code are so significant. It's also impressive that each compiler family has its own strength: clang is seemingly much better at handling unoptimized code, while GCC is better at handling hand-vectorized code.

Convolution (2D)

The last benchmark I considered is the valid convolution of 2D images. The code is quite similar to the GEMM code, but more complicated to optimize due to cache locality.
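As a simplified sketch (not ETL's exact implementation), the naive (std) valid convolution looks like this:

```cpp
#include <cstddef>

// Naive 2D valid convolution (simplified sketch, row-major storage):
// each output pixel accumulates the products of the flipped kernel
// with the corresponding input window.
void conv2_valid(const float* in, std::size_t m1, std::size_t m2,
                 const float* kernel, std::size_t k1, std::size_t k2,
                 float* out) {
    const std::size_t o1 = m1 - k1 + 1; // output height
    const std::size_t o2 = m2 - k2 + 1; // output width

    for (std::size_t i = 0; i < o1; ++i) {
        for (std::size_t j = 0; j < o2; ++j) {
            float acc = 0.0f;

            for (std::size_t ki = 0; ki < k1; ++ki) {
                for (std::size_t kj = 0; kj < k2; ++kj) {
                    // The kernel is flipped, as for a true convolution
                    acc += in[(i + ki) * m2 + (j + kj)]
                         * kernel[(k1 - 1 - ki) * k2 + (k2 - 1 - kj)];
                }
            }

            out[i * o2 + j] = acc;
        }
    }
}
```

Let's look first at the naive version: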

| sconv2_valid (std) | 100x50 | 105x50 | 110x55 | 115x55 | 120x60 | 125x60 | 130x65 | 135x65 | 140x70 |
|---|---|---|---|---|---|---|---|---|---|
| g++-4.9.4 | 27.93ms | 33.68ms | 40.62ms | 48.23ms | 57.27ms | 67.02ms | 78.45ms | 92.53ms | 105.08ms |
| g++-5.4.0 | 37.60ms | 44.94ms | 54.24ms | 64.45ms | 76.63ms | 89.75ms | 105.08ms | 121.66ms | 140.95ms |
| g++-6.3.0 | 37.10ms | 44.99ms | 54.34ms | 64.54ms | 76.54ms | 89.87ms | 105.35ms | 121.94ms | 141.20ms |
| g++-7.1.0 | 37.55ms | 45.08ms | 54.39ms | 64.48ms | 76.51ms | 92.02ms | 106.16ms | 125.67ms | 143.57ms |
| clang++-3.9.1 | 15.42ms | 18.59ms | 22.21ms | 26.40ms | 31.03ms | 36.26ms | 42.35ms | 48.87ms | 56.29ms |
| clang++-4.0.1 | 15.48ms | 18.67ms | 22.34ms | 26.50ms | 31.27ms | 36.58ms | 42.61ms | 49.33ms | 56.80ms |
| zapcc++-1.0 | 15.29ms | 18.37ms | 22.00ms | 26.10ms | 30.75ms | 35.95ms | 41.85ms | 48.42ms | 55.74ms |

In that case, we can observe the same pattern as for the GEMM: the clang-based versions produce significantly faster code than the GCC versions. Moreover, we can also observe the same large regression starting from GCC-5.4.

| sconv2_valid (vec) | 100x50 | 105x50 | 110x55 | 115x55 | 120x60 | 125x60 | 130x65 | 135x65 | 140x70 |
|---|---|---|---|---|---|---|---|---|---|
| g++-4.9.4 | 878.32us | 1.07ms | 1.20ms | 1.68ms | 2.04ms | 2.06ms | 2.54ms | 3.20ms | 4.14ms |
| g++-5.4.0 | 853.73us | 1.03ms | 1.15ms | 1.36ms | 1.76ms | 2.05ms | 2.44ms | 2.91ms | 3.13ms |
| g++-6.3.0 | 847.95us | 1.02ms | 1.14ms | 1.35ms | 1.74ms | 1.98ms | 2.43ms | 2.90ms | 3.12ms |
| g++-7.1.0 | 795.82us | 0.93ms | 1.05ms | 1.24ms | 1.60ms | 1.77ms | 2.20ms | 2.69ms | 2.81ms |
| clang++-3.9.1 | 782.46us | 0.93ms | 1.05ms | 1.26ms | 1.60ms | 1.84ms | 2.21ms | 2.65ms | 2.84ms |
| clang++-4.0.1 | 767.58us | 0.92ms | 1.04ms | 1.25ms | 1.59ms | 1.83ms | 2.20ms | 2.62ms | 2.83ms |
| zapcc++-1.0 | 782.49us | 0.94ms | 1.06ms | 1.27ms | 1.62ms | 1.83ms | 2.24ms | 2.65ms | 2.85ms |

This time, clang manages to produce excellent results. All the clang-produced executables are significantly faster than those produced by GCC, except for GCC-7.1, which produces similar results; the other GCC versions fall behind. It seems it was only on the GEMM that clang had a lot of trouble handling the optimized code.

Conclusion

Clang seems to have recently done a lot of work on compilation time. Indeed, clang-4.0.1 compiles much faster than clang-3.9. Although GCC-7.1 is faster than GCC-6.3, all the newer GCC versions are slower than GCC-4.9.4, which is the fastest at compiling code with optimizations. GCC-7.1 is the fastest GCC version for compiling code in debug mode.

In some cases, there is almost no difference in the generated code between the compilers. However, for more complex algorithms such as the matrix-matrix multiplication or the two-dimensional convolution, the differences can be quite significant. In my tests, Clang has shown itself to be much better at compiling unoptimized code. However, especially in the GEMM case, it seems to be worse than GCC at handling hand-optimized code. I will investigate that case and try to tailor the code so that clang handles it better.

For me, it's really weird that the GCC regression, apparently introduced in GCC-5.4, has still not been fixed in GCC-7.1. I was thinking of dropping support for GCC-4.9 in order to go full C++14, but now I may have to reconsider my position. However, seeing that GCC is generally the best at handling optimized code (especially for GEMM), I may still be able to make the transition, since the optimized code will be used in most cases.

As for zapcc, although it is still the fastest compiler in debug mode, its margin over the new clang-4.0.1 is quite small. Moreover, on optimized builds, it's not as fast as GCC. If you use clang and have access to zapcc, it's still quite a good option to save some time.

Overall, I have been quite pleased by clang-4.0.1 and GCC-7.1, the most recent versions I tested; it seems they have done some good work. I will definitely run more tests with them and try to adapt the code. I'm still considering whether to drop support for some older compilers.

I hope this comparison was interesting :) My next post will probably be about the performance difference between my machine learning framework and other frameworks when training neural networks.

Related articles

  • Decrease DLL neural network compilation time with C++17
  • Partial type erasing in Deep Learning Library (DLL) to improve compilation time
  • zapcc - a faster C++ compiler
  • How I made my Deep Learning Library 38% faster to compile (Optimization and C++17 if constexpr)
  • Release of zapcc 1.0 - Fast C++ compiler
  • zapcc C++ compilation speed against gcc 5.4 and clang 3.9