Blog blog("Baptiste Wicht"); (Posts about gcc)

Decrease DLL neural network compilation time with C++17

Baptiste Wicht — Wed, 07 Feb 2018 10:39:02 GMT

Just last week, I migrated my Expression Templates Library (ETL) to C++17; the same is now done for my Deep Learning Library (DLL). In ETL, this resulted in much nicer code overall, but no real improvement in compilation time.

The objective of the migration of DLL was two-fold. First, I wanted to simplify some code, especially with if constexpr. Second, I especially wanted to try to reduce the compilation time. In the past, I had already tried a few changes with C++17, with good results on the compilation of the entire test suite. While this is very good, it is not very representative of how users use the library. Indeed, normally you'll have only one network in your source file, not several. The new changes will especially help in the case of many networks, but less in the case of a single network per source file.

This time, I decided to test the compilation on the examples. I've tested the nine official examples from the DLL library:

mnist_dbn: A fully-connected Deep Belief Network (DBN) on the MNIST data set with three layers
char_cnn: A special CNN with embeddings and merge and group layers for text recognition
imagenet_cnn: A 12 layers Convolutional Neural Network (CNN) for Imagenet
mnist_ae: A simple two-layers auto-encoder for MNIST
mnist_cnn: A simple 6 layers CNN for MNIST
mnist_deep_ae: A deep auto-encoder for MNIST, only fully-connected
mnist_lstm: A Recurrent Neural Network (RNN) with Long Short Term Memory (LSTM) cells
mnist_mlp: A simple fully-connected network for MNIST, with dropout
mnist_rnn: A simple RNN with simple cells for MNIST

These are really representative of what users can do with the library, and I think they make a much better benchmark for compilation time.

For reference, you can find the source code of all the examples online.

Results

Let's start with the results. I've tested this at different stages of the migration with clang 5 and GCC 7.2. I tested the following steps:

The original C++14 version
Simply compiling in c++17 mode (-std=c++17)
Using the C++17 version of the ETL library
Upgrading DLL to C++17 (without ETL)
ETL and DLL in C++17 versions

I've compiled each example independently in release_debug mode. Here are the results for G++ 7.2:

Example	0	1	2	3	4	5	6	7	8
C++14	37.818	32.944	33.511	15.403	29.998	16.911	24.745	18.974	19.006
-std=c++17	38.358	32.409	32.707	15.810	30.042	16.896	24.635	19.134	19.027
ETL C++17	36.045	31.000	30.942	15.322	28.840	16.747	24.151	18.208	18.939
DLL C++17	35.251	32.577	32.854	15.653	29.758	16.851	24.606	19.098	19.146
Final C++17	32.289	31.133	30.939	15.232	28.753	16.526	24.326	18.116	17.819
Final Improvement	14.62%	5.49%	7.67%	1.11%	4.15%	2.27%	1.69%	4.52%	6.24%

The difference from just enabling C++17 is not significant. On the other hand, some significant gains can be obtained by using the C++17 version of ETL, especially for the DBN version and for the CNN versions. Except for the DBN case, the migration of DLL to C++17 did not bring any significant advantage. When everything is combined, the gains are larger :) In the best case, the example is 14.6% faster to compile.
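For reference, the "Final Improvement" row can be reproduced from the first and last timing rows. The snippet below is only a small sketch of that arithmetic; the function name is made up for illustration and the two timings are copied from column 0 of the G++ table.

```cpp
// Relative improvement, in percent, between a baseline timing and a final timing.
// The function name is illustrative, it is not part of DLL or ETL.
constexpr double improvement_percent(double baseline_s, double final_s) {
    return (baseline_s - final_s) / baseline_s * 100.0;
}

// Column 0 of the G++ 7.2 table: C++14 row and Final C++17 row.
constexpr double example0_cpp14 = 37.818;
constexpr double example0_final = 32.289;

// (37.818 - 32.289) / 37.818 is roughly 14.62%.
static_assert(improvement_percent(example0_cpp14, example0_final) > 14.61, "matches the table");
static_assert(improvement_percent(example0_cpp14, example0_final) < 14.63, "matches the table");
```

The same formula applies to every column of both tables.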

Let's see if it's the same with clang++ 5.0:

Example	0	1	2	3	4	5	6	7	8
C++14	40.690	34.753	35.488	16.146	31.926	17.708	29.806	19.207	20.858
-std=c++17	40.502	34.664	34.990	16.027	31.510	17.630	29.465	19.161	20.860
ETL C++17	37.386	33.008	33.896	15.519	30.269	16.995	28.897	18.383	19.809
DLL C++17	37.252	34.592	35.250	16.131	31.782	17.606	29.595	19.126	20.782
Final C++17	34.470	33.154	33.881	15.415	30.279	17.078	28.808	18.497	19.761
Final Improvement	15.28%	4.60%	4.52%	4.52%	5.15%	3.55%	3.34%	3.69%	5.25%

First of all, as I have seen time after time, clang is still slower than GCC. It's not a big difference, but it is still significant. Overall, the gains are a bit higher on clang than on GCC, but not by much. Interestingly, the migration of DLL to C++17 pays off less in terms of compilation time on clang. It even seems to slow down compilation on some examples. On the other hand, the migration of ETL matters more than on GCC.

Overall, every example is faster to compile using both libraries in C++17, but we don't have spectacular speed-ups. With clang, we have speedups from 3.3% to 15.3%. With GCC, we have speedups from 1.1% to 14.6%. This is not very high, but I'm already satisfied with these results.

C++17 in DLL

Overall, the migration of DLL to C++17 was quite similar to that of ETL. You can take a look at my previous article if you want more details on C++17 features I've used.

I've replaced a lot of SFINAE functions with if constexpr. I've also replaced a lot of static_if with if constexpr. There was a large number of these in DLL's code. I also enabled all the constexpr that had been commented out while waiting for this exact moment :)
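To give an idea of what this kind of replacement looks like, here is a minimal sketch. It is not code from DLL: the function names are invented for the illustration, and the trait used to select the branch is just an example.

```cpp
#include <array>
#include <type_traits>

// C++14 style: two overloads, selected through SFINAE.
template <typename T, std::enable_if_t<std::is_arithmetic<T>::value, int> = 0>
constexpr T first_value_sfinae(const T& x) {
    return x;
}

template <typename T, std::enable_if_t<!std::is_arithmetic<T>::value, int> = 0>
constexpr auto first_value_sfinae(const T& x) {
    return x[0];
}

// C++17 style: a single function, the unused branch is discarded at compile time.
template <typename T>
constexpr auto first_value(const T& x) {
    if constexpr (std::is_arithmetic_v<T>) {
        return x;
    } else {
        return x[0]; // only instantiated when T is not an arithmetic type
    }
}

constexpr std::array<int, 2> sample{4, 5};

static_assert(first_value(3) == first_value_sfinae(3), "same result for scalars");
static_assert(first_value(sample) == first_value_sfinae(sample), "same result for containers");
```

Both versions behave the same, but the second one keeps the logic in one place and needs no dummy template parameter.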

I was also thinking that I could replace a lot of meta-programming stuff with fold expressions. While I was able to replace a few of them, most were harder to replace. Indeed, the variadic pack is often hidden behind another class and therefore the pack is not directly usable from the network class or the group and merge layer classes. I didn't want to start a big refactoring just to use a C++17 feature; the current state of this code is fine.
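As an illustration of the easy case, where the pack is directly available, here is a small sketch comparing the recursive C++14 approach with a C++17 fold expression. The function names are hypothetical and do not come from DLL.

```cpp
#include <cstddef>

// C++14 style: recursion over the pack, with an overload that stops the recursion.
constexpr std::size_t total_recursive() {
    return 0;
}

template <typename First, typename... Rest>
constexpr std::size_t total_recursive(First first, Rest... rest) {
    return static_cast<std::size_t>(first) + total_recursive(rest...);
}

// C++17 style: a binary left fold over the pack, with 0 as the initial value.
template <typename... Sizes>
constexpr std::size_t total_fold(Sizes... sizes) {
    return (std::size_t{0} + ... + static_cast<std::size_t>(sizes));
}

// 784 + 500 + 400 + 10 = 1694
static_assert(total_recursive(784, 500, 400, 10) == 1694, "recursive version");
static_assert(total_fold(784, 500, 400, 10) == 1694, "fold version");
```

When the pack lives inside another class template, the fold cannot be written at the point of use without first exposing the pack, which is the refactoring mentioned above.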

I made some use of structured bindings as well, but again not as much as I was thinking. In fact, a lot of the time, I'm assigning the elements of a pair or tuple to existing variables rather than declaring new ones and, unfortunately, structured bindings can only be used in an auto declaration.
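The limitation can be sketched as follows. The helper and its values are hypothetical; the point is only that structured bindings always declare new names, while std::tie is still needed to assign to variables that already exist.

```cpp
#include <tuple>
#include <utility>

// Hypothetical helper returning two values at once.
constexpr std::pair<double, double> error_and_loss() {
    return {0.25, 1.5};
}

// New variables: structured bindings fit, since they always declare fresh names.
constexpr double sum_with_bindings() {
    auto [error, loss] = error_and_loss();
    return error + loss;
}

// Existing variables: structured bindings cannot assign to them, std::tie can.
inline void refresh(double& error, double& loss) {
    std::tie(error, loss) = error_and_loss();
}

static_assert(sum_with_bindings() == 1.75, "0.25 + 1.5");
```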

Overall, the code is significantly better now, but there was less impact than there was on ETL. It's also a smaller code base, so maybe this is normal and my expectations were too high ;)

Conclusion

The trunk of DLL is now a C++17 library :) I think this improves the quality of the code by a nice margin! Even though there is still some work to be done to improve the code, especially for the DBN pretraining code, the quality is quite good now. Moreover, the switch to C++17 made neural networks using the DLL library faster to compile, from 1.1% in the worst case to 15.3% in the best case! I don't know when I will release the next version of DLL, but it will take some time. I'll especially have to polish the RNN support and add a sequence-to-sequence loss before I release the 1.1 version of DLL.

I'm quite satisfied with C++17 even if I would have liked a few more features to play with! I'm already a big fan of if constexpr, which can make the code much nicer, and fold expressions are much more intuitive than their previous recursive template counterparts.

I may also consider migrating some parts of the cpp-utils library, but if I do, it will only be through the use of conditionals in order not to break the other projects that are based on the library.

How I made my Deep Learning Library 38% faster to compile (Optimization and C++17 if constexpr)

Baptiste Wicht — Thu, 21 Sep 2017 17:44:34 GMT

My Deep Learning Library (DLL) project is a C++ library for training and using artificial neural networks (you can take a look at this post about DLL if you want more information).

While I made a lot of effort to make it as fast as possible to train and run neural networks, the compilation time has been steadily going up and is becoming quite annoying. This library is heavily templated and all the matrix operations are done using my Expression Templates Library (ETL) which is more than template-heavy itself.

In this post, I'll present two techniques with which I've been able to reduce the total compilation time of the DLL unit tests by up to 38%.

Compiler benchmark GCC and Clang on C++ library (ETL)

Baptiste Wicht — Mon, 07 Aug 2017 07:16:21 GMT

It's been a while since I've done a benchmark of different compilers on C++ code. Since I recently released version 1.1 of my ETL project (an optimized matrix/vector computation library with expression templates), I've decided to use it as the base of my benchmark. It's a C++14 library with a lot of templates. I'm going to compile the full test suite (124 test cases). This is done directly on the code of the last release (1.1). I'm going to compile once in debug mode and once in release_debug (release plus debug symbols and assertions) and record the times for each compiler. The tests were compiled with support for every option in ETL to account for the maximal compilation time. Each compilation was made using four threads (make -j4). I'm also going to run a few of the benchmarks to see the difference in runtime performance between the code generated by each compiler. The benchmark will be compiled in release mode and its compilation time recorded as well.

I'm going to test the following compilers:

GCC-4.9.4
GCC-5.4.0
GCC-6.3.0
GCC-7.1.0
clang-3.9.1
clang-4.0.1
zapcc-1.0 (commercial, based on clang-5.0 trunk)

All have been installed directly using Portage (the Gentoo package manager), except for clang-4.0.1, which has been installed from sources, and zapcc, which does not have a Gentoo package. Since the clang package on Gentoo does not support multislotting, I had to install one version from source and the other from the package manager. This is also the reason I'm testing fewer versions of clang: it is simply less practical.

For the purpose of these tests, the exact same options have been used for all the compilers. Normally, I use different options for clang than for GCC (mainly more aggressive vectorization options on clang). This may not lead to the best performance for each compiler, but it allows for a comparison between the results at the default optimization levels. Here are the main options used:

In debug mode: -g
In release_debug mode: -g -O2
In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer

In each case, a lot of warnings are enabled and the ETL options are the same.

All the results have been gathered on a Gentoo machine running on an Intel Core i7-2600 (Sandy Bridge...) @3.4GHz with 4 cores and 8 threads, 12GB of RAM and an SSD. Although I do my best to isolate the benchmark from perturbations as much as possible, and my benchmark code is quite sound, it may well be that some results are not totally accurate. Moreover, some of the benchmarks are using multithreading, which may add some noise and unpredictability. When I was not sure about the results, I ran the benchmarks several times to confirm them, and overall I'm confident in the results.

Compilation Time

Let's start with the results of the performance of the compilers themselves:

Compiler	Debug	Release_Debug	Benchmark
g++-4.9.4	402s	616s	100s
g++-5.4.0	403s	642s	95s
g++-6.3.0	399s	683s	102s
g++-7.1.0	371s	650s	105s
clang++-3.9.1	380s	807s	106s
clang++-4.0.1	260s	718s	92s
zapcc++-1.0	221s	649s	108s

Note: For Release_Debug and Benchmark, I only use three threads with zapcc, because 12GB of RAM is not enough memory for four threads.

There are some very significant differences between the different compilers. Overall, clang-4.0.1 is by far the fastest free compiler in debug mode. When the tests are compiled with optimizations, however, clang falls behind. It's quite impressive how clang-4.0.1 manages to be so much faster than clang-3.9.1 both in debug mode and release mode. Really great work by the clang team here! With these improvements, clang-4.0.1 is almost on par with gcc-7.1 in release mode. For GCC, it seems that the cost of optimization has been going up quite significantly. However, GCC 7.1 seems to have made optimization faster and standard compilation much faster as well. If we take zapcc into account, it's the fastest compiler in debug mode, but it's slower than several gcc versions in release mode.

Overall, I'm quite impressed by the performance of clang-4.0.1, which seems really fast! I'll definitely run more tests with this new version of the compiler in the near future. It's also good to see that g++-7.1 made the build faster than gcc-6.3. However, the fastest gcc version for optimized builds is still gcc-4.9.4, which is already an old branch with limited C++ standard support.

Runtime Performance

Let's now take a look at the quality of the generated code. For some of the benchmarks, I've included two versions of the algorithm. std is the simplest algorithm (the naive one) and vec is the hand-crafted vectorized and optimized implementation. All the tests were done with single-precision floating-point numbers.

Dot product

The first benchmark that is run is to compute the dot product between two vectors. Let's look first at the naive version:

dot (std)	100	500	1000	10000	100000	1000000	2000000	3000000	4000000	5000000	10000000
g++-4.9.4	64.96ns	97.12ns	126.07ns	1.89us	25.91us	326.49us	1.24ms	1.92ms	2.55ms	3.22ms	6.36ms
g++-5.4.0	72.96ns	101.62ns	127.89ns	1.90us	23.39us	357.63us	1.23ms	1.91ms	2.57ms	3.20ms	6.32ms
g++-6.3.0	73.31ns	102.88ns	130.16ns	1.89us	24.314us	339.13us	1.47ms	2.16ms	2.95ms	3.70ms	6.69ms
g++-7.1.0	70.20ns	104.09ns	130.98ns	1.90us	23.96us	281.47us	1.24ms	1.93ms	2.58ms	3.19ms	6.33ms
clang++-3.9.1	64.69ns	98.69ns	128.60ns	1.89us	23.33us	272.71us	1.24ms	1.91ms	2.56ms	3.19ms	6.37ms
clang++-4.0.1	60.31ns	96.34ns	128.90ns	1.89us	22.87us	270.21us	1.23ms	1.91ms	2.55ms	3.18ms	6.35ms
zapcc++-1.0	61.14ns	96.92ns	125.95ns	1.89us	23.84us	285.80us	1.24ms	1.92ms	2.55ms	3.16ms	6.34ms

The differences between the compilers are not very significant. The clang-based compilers seem to produce the fastest code. Interestingly, there seems to have been a big regression in gcc-6.3 for large containers, but it has been fixed in gcc-7.1.

dot (vec)	100	500	1000	10000	100000	1000000	2000000	3000000	4000000	5000000	10000000
g++-4.9.4	48.34ns	80.53ns	114.97ns	1.72us	22.79us	354.20us	1.24ms	1.89ms	2.52ms	3.19ms	6.55ms
g++-5.4.0	47.16ns	77.70ns	113.66ns	1.72us	22.71us	363.86us	1.24ms	1.89ms	2.52ms	3.19ms	6.56ms
g++-6.3.0	46.39ns	77.67ns	116.28ns	1.74us	23.39us	452.44us	1.45ms	2.26ms	2.87ms	3.49ms	7.52ms
g++-7.1.0	49.70ns	80.40ns	115.77ns	1.71us	22.46us	355.16us	1.21ms	1.85ms	2.49ms	3.14ms	6.47ms
clang++-3.9.1	46.13ns	78.01ns	114.70ns	1.66us	22.82us	359.42us	1.24ms	1.88ms	2.53ms	3.16ms	6.50ms
clang++-4.0.1	45.59ns	74.90ns	111.29ns	1.57us	22.47us	351.31us	1.23ms	1.85ms	2.49ms	3.12ms	6.45ms
zapcc++-1.0	45.11ns	75.04ns	111.28ns	1.59us	22.46us	357.32us	1.25ms	1.89ms	2.53ms	3.15ms	6.47ms

If we look at the optimized version, the differences are even smaller. Again, the clang-based compilers are producing the fastest executables, but are closely followed by gcc, except for gcc-6.3 in which we can still see the same regression as before.

Logistic Sigmoid

The next test is to check the performance of the sigmoid operation. In that case, the evaluator of the library will try to use parallelization and vectorization to compute it. Let's see how the different compilers fare:

sigmoid	10	100	1000	10000	100000	1000000
g++-4.9.4	8.16us	5.23us	6.33us	29.56us	259.72us	2.78ms
g++-5.4.0	7.07us	5.08us	6.39us	29.44us	266.27us	2.96ms
g++-6.3.0	7.13us	5.32us	6.45us	28.99us	261.81us	2.86ms
g++-7.1.0	7.03us	5.09us	6.24us	28.61us	252.78us	2.71ms
clang++-3.9.1	7.30us	5.25us	6.57us	30.24us	256.75us	1.99ms
clang++-4.0.1	7.47us	5.14us	5.77us	26.03us	235.87us	1.81ms
zapcc++-1.0	7.51us	5.26us	6.48us	28.86us	258.31us	1.95ms

Interestingly, we can see that gcc-7.1 is the fastest for small vectors while clang-4.0 produces the best code for larger vectors. However, except for the biggest vector size, the difference is not really significant. Apparently, there is a regression in zapcc (or clang-5.0), since it's slower than clang-4.0 and at the same level as clang-3.9.

y = alpha * x + y (axpy)

The third benchmark is the well-known axpy (y = alpha * x + y). This is entirely resolved by expression templates in the library; no specific algorithm is used. Let's see the results:

saxpy	10	100	1000	10000	100000	1000000
g++-4.9.4	38.1ns	61.6ns	374ns	3.65us	40.8us	518us
g++-5.4.0	35.0ns	58.1ns	383ns	3.87us	43.2us	479us
g++-6.3.0	34.3ns	59.4ns	371ns	3.57us	40.4us	452us
g++-7.1.0	34.8ns	59.7ns	399ns	3.78us	43.1us	547us
clang++-3.9.1	32.3ns	53.8ns	297ns	3.21us	38.3us	466us
clang++-4.0.1	32.4ns	59.8ns	296ns	3.31us	38.2us	475us
zapcc++-1.0	32.0ns	54.0ns	333ns	3.32us	38.7us	447us

Even on the biggest vector, this is a very fast operation, once vectorized and parallelized. At this speed, some of the differences observed may not be highly significant. Again clang-based versions are the fastest versions on this code, but by a small margin. There also seems to be a slight regression in gcc-7.1, but again quite small.
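To illustrate what "resolved by expression templates" means for this benchmark, here is a deliberately tiny sketch. It is not ETL code: the vec, scaled and sum types and the assign function are invented for the example, and there is no vectorization or parallelization. It only shows how y = alpha * x + y can be built as a lazy expression tree and evaluated in one loop, without temporaries.

```cpp
#include <array>
#include <cstddef>

// Fixed-size vector of floats (illustrative container, not an ETL type).
template <std::size_t N>
struct vec {
    std::array<float, N> data;
    constexpr float operator[](std::size_t i) const { return data[i]; }
};

// Lazy node representing alpha * x; nothing is computed until operator[] is called.
template <typename X>
struct scaled {
    float alpha;
    const X& x;
    constexpr float operator[](std::size_t i) const { return alpha * x[i]; }
};

// Lazy node representing lhs + rhs.
template <typename L, typename R>
struct sum {
    const L& lhs;
    const R& rhs;
    constexpr float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <std::size_t N>
constexpr scaled<vec<N>> operator*(float alpha, const vec<N>& x) {
    return {alpha, x};
}

template <typename X, std::size_t N>
constexpr sum<scaled<X>, vec<N>> operator+(const scaled<X>& lhs, const vec<N>& rhs) {
    return {lhs, rhs};
}

// Evaluate the whole expression tree in a single loop, one element at a time.
template <std::size_t N, typename Expr>
constexpr void assign(vec<N>& dst, const Expr& expr) {
    for (std::size_t i = 0; i < N; ++i) {
        dst.data[i] = expr[i];
    }
}

// y = 2 * x + y on three elements; returns the sum of the updated y.
constexpr float axpy_demo() {
    vec<3> x{{1.0f, 2.0f, 3.0f}};
    vec<3> y{{10.0f, 20.0f, 30.0f}};
    assign(y, 2.0f * x + y);
    return y[0] + y[1] + y[2];
}

// y becomes {12, 24, 36}, which sums to 72.
static_assert(axpy_demo() == 72.0f, "axpy evaluated through the expression tree");
```

With this design, the quality of the final loop depends entirely on how well the compiler inlines the nested operator[] calls, which is presumably why such a benchmark is sensitive to the compiler used.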

Matrix Matrix multiplication (GEMM)

The next benchmark is testing the performance of a Matrix-Matrix Multiplication, an operation known as GEMM in the BLAS nomenclature. In that case, we test both the naive and the optimized vectorized implementation. To save some horizontal space, I've split the tables in two.

sgemm (std)	10	20	40	60	80	100
g++-4.9.4	7.04us	50.15us	356.42us	1.18ms	3.41ms	5.56ms
g++-5.4.0	8.14us	74.77us	513.64us	1.72ms	4.05ms	7.92ms
g++-6.3.0	8.03us	64.78us	504.41us	1.69ms	4.02ms	7.87ms
g++-7.1.0	7.95us	65.00us	508.84us	1.69ms	4.02ms	7.84ms
clang++-3.9.1	3.58us	28.59us	222.36us	0.73ms	1.77ms	3.41ms
clang++-4.0.1	4.00us	25.47us	190.56us	0.61ms	1.45ms	2.80ms
zapcc++-1.0	4.00us	25.38us	189.98us	0.60ms	1.43ms	2.81ms

sgemm (std)	200	300	400	500	600	700	800	900	1000	1200
g++-4.9.4	44.16ms	148.88ms	455.81ms	687.96ms	1.47s	1.98s	2.81s	4.00s	5.91s	9.52s
g++-5.4.0	63.17ms	213.01ms	504.83ms	984.90ms	1.70s	2.70s	4.03s	5.74s	7.87s	14.905s
g++-6.3.0	64.04ms	212.12ms	502.95ms	981.74ms	1.69s	2.69s	4.13s	5.85s	8.10s	14.08s
g++-7.1.0	62.57ms	210.72ms	499.68ms	974.94ms	1.68s	2.67s	3.99s	5.68s	7.85s	13.49s
clang++-3.9.1	27.48ms	90.85ms	219.34ms	419.53ms	0.72s	1.18s	1.90s	2.44s	3.36s	5.84s
clang++-4.0.1	22.01ms	73.90ms	175.02ms	340.70ms	0.58s	0.93s	1.40s	1.98s	2.79s	4.69s
zapcc++-1.0	22.33ms	75.80ms	181.27ms	359.13ms	0.63s	1.02s	1.52s	2.24s	3.21s	5.62s

This time, the differences between the different compilers are very significant. The clang compilers are leading the way by a large margin here, with clang-4.0 being the fastest of them (by another nice margin). Indeed, clang-4.0.1 is producing code that is, on average, about twice as fast as the code generated by the best GCC compiler. Very interestingly as well, we can see a huge regression starting from GCC-5.4 that is still here in GCC-7.1. Indeed, the best GCC version, among the tested versions, is again GCC-4.9.4. Clang is really doing an excellent job of compiling the GEMM code.
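For readers unfamiliar with the naive variant, the kind of code being compiled here looks roughly like the classic triple loop below. This is only a sketch under my own assumptions (row-major storage, fixed sizes, invented function name); the actual benchmark code in the library may be organized differently.

```cpp
#include <array>
#include <cstddef>

// Naive row-major C = A * B, with A of size M x K and B of size K x N.
template <std::size_t M, std::size_t K, std::size_t N>
constexpr std::array<float, M * N> naive_sgemm(const std::array<float, M * K>& a,
                                               const std::array<float, K * N>& b) {
    std::array<float, M * N> c{};
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[i * K + k] * b[k * N + j];
            }
            c[i * N + j] = acc;
        }
    }
    return c;
}

// [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50]
constexpr std::array<float, 4> gemm_a{1.0f, 2.0f, 3.0f, 4.0f};
constexpr std::array<float, 4> gemm_b{5.0f, 6.0f, 7.0f, 8.0f};
constexpr auto gemm_c = naive_sgemm<2, 2, 2>(gemm_a, gemm_b);

static_assert(gemm_c[0] == 19.0f && gemm_c[3] == 50.0f, "2x2 product");
```

In a loop nest like this, the speed of the result depends a lot on what the compiler manages to do on its own (unrolling, auto-vectorization, loop interchange), which may be part of why the gap between compilers is so large here.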

sgemm (vec)	10	20	40	60	80	100
g++-4.9.4	264.27ns	0.95us	3.28us	14.77us	23.50us	60.37us
g++-5.4.0	271.41ns	0.99us	3.31us	14.811us	24.116us	61.00us
g++-6.3.0	279.72ns	1.02us	3.27us	15.39us	24.29us	61.99us
g++-7.1.0	273.74ns	0.96us	3.81us	15.55us	31.35us	71.11us
clang++-3.9.1	296.67ns	1.34us	4.18us	19.93us	33.15us	82.60us
clang++-4.0.1	322.68ns	1.38us	4.17us	20.19us	34.17us	83.64us
zapcc++-1.0	307.49ns	1.41us	4.10us	19.72us	33.72us	84.80us

sgemm (vec)	200	300	400	500	600	700	800	900	1000	1200
g++-4.9.4	369.52us	1.62ms	2.91ms	7.17ms	11.74ms	22.91ms	34.82ms	51.67ms	64.36ms	111.15ms
g++-5.4.0	387.54us	1.60ms	2.97ms	7.36ms	12.11ms	24.37ms	35.37ms	52.27ms	65.72ms	112.74ms
g++-6.3.0	384.43us	1.74ms	3.12ms	7.16ms	12.44ms	24.15ms	34.87ms	52.59ms	70.074ms	119.22ms
g++-7.1.0	458.05us	1.81ms	3.44ms	7.86ms	13.43ms	24.70ms	36.54ms	53.47ms	66.87ms	117.25ms
clang++-3.9.1	494.52us	1.96ms	4.80ms	8.88ms	18.20ms	29.37ms	41.24ms	60.72ms	72.28ms	123.75ms
clang++-4.0.1	511.24us	2.04ms	4.11ms	9.46ms	15.34ms	27.23ms	38.27ms	58.14ms	72.78ms	128.60ms
zapcc++-1.0	492.28us	2.03ms	3.90ms	9.00ms	14.31ms	25.72ms	37.09ms	55.79ms	67.88ms	119.92ms

As for the optimized version, it seems that the two families are reversed. Indeed, GCC is doing a better job than clang here, and although the margin is not as big as before, it's still significant. We can still observe a small regression in GCC versions because the 4.9 version is again the fastest. As for clang versions, it seems that clang-5.0 (used in zapcc) has had some performance improvements for this case.

For this case of matrix-matrix multiplication, it's very impressive that the differences in the non-optimized code are so significant. And it's also impressive that each family of compilers has its own strength, clang being seemingly much better at handling unoptimized code while GCC is better at handling vectorized code.

Convolution (2D)

The last benchmark that I considered is the case of the valid convolution on 2D images. The code is quite similar to the GEMM code but more complicated to optimize due to cache locality.

sconv2_valid (std)	100x50	105x50	110x55	115x55	120x60	125x60	130x65	135x65	140x70
g++-4.9.4	27.93ms	33.68ms	40.62ms	48.23ms	57.27ms	67.02ms	78.45ms	92.53ms	105.08ms
g++-5.4.0	37.60ms	44.94ms	54.24ms	64.45ms	76.63ms	89.75ms	105.08ms	121.66ms	140.95ms
g++-6.3.0	37.10ms	44.99ms	54.34ms	64.54ms	76.54ms	89.87ms	105.35ms	121.94ms	141.20ms
g++-7.1.0	37.55ms	45.08ms	54.39ms	64.48ms	76.51ms	92.02ms	106.16ms	125.67ms	143.57ms
clang++-3.9.1	15.42ms	18.59ms	22.21ms	26.40ms	31.03ms	36.26ms	42.35ms	48.87ms	56.29ms
clang++-4.0.1	15.48ms	18.67ms	22.34ms	26.50ms	31.27ms	36.58ms	42.61ms	49.33ms	56.80ms
zapcc++-1.0	15.29ms	18.37ms	22.00ms	26.10ms	30.75ms	35.95ms	41.85ms	48.42ms	55.74ms

In that case, we can observe the same as for the GEMM. The clang-based versions are producing significantly faster code than the GCC versions. Moreover, we can also observe the same large regression starting from GCC-5.4.

sconv2_valid (vec)	100x50	105x50	110x55	115x55	120x60	125x60	130x65	135x65	140x70
g++-4.9.4	878.32us	1.07ms	1.20ms	1.68ms	2.04ms	2.06ms	2.54ms	3.20ms	4.14ms
g++-5.4.0	853.73us	1.03ms	1.15ms	1.36ms	1.76ms	2.05ms	2.44ms	2.91ms	3.13ms
g++-6.3.0	847.95us	1.02ms	1.14ms	1.35ms	1.74ms	1.98ms	2.43ms	2.90ms	3.12ms
g++-7.1.0	795.82us	0.93ms	1.05ms	1.24ms	1.60ms	1.77ms	2.20ms	2.69ms	2.81ms
clang++-3.9.1	782.46us	0.93ms	1.05ms	1.26ms	1.60ms	1.84ms	2.21ms	2.65ms	2.84ms
clang++-4.0.1	767.58us	0.92ms	1.04ms	1.25ms	1.59ms	1.83ms	2.20ms	2.62ms	2.83ms
zapcc++-1.0	782.49us	0.94ms	1.06ms	1.27ms	1.62ms	1.83ms	2.24ms	2.65ms	2.85ms

This time, clang manages to produce excellent results. Indeed, all the produced executables are significantly faster than the versions produced by GCC, except for GCC-7.1 which is producing similar results. The other versions of GCC are falling behind, it seems. It seems that it was only for the GEMM that clang was having a lot of trouble handling the optimized code.

Conclusion

Clang seems to have recently done a lot of optimizations regarding compilation time. Indeed, clang-4.0.1 is much faster for compilation than clang-3.9. Although GCC-7.1 is faster than GCC-6.3, all the GCC versions are slower than GCC-4.9.4 which is the fastest at compiling code with optimizations. GCC-7.1 is the fastest GCC version for compiling code in debug mode.

In some cases, there is almost no difference between different compilers in the generated code. However, in more complex algorithms such as the matrix-matrix multiplication or the two-dimensional convolution, the differences can be quite significant. In my tests, Clang has proven to be much better at compiling unoptimized code. However, and especially in the GEMM case, it seems to be worse than GCC at handling hand-optimized code. I will investigate that case and try to tailor the code so that clang has a better time with it.

For me, it's really weird that the GCC regression, apparently starting from GCC-5.4, has still not been fixed in GCC 7.1. I was thinking of dropping support for GCC-4.9 in order to go full C++14 support, but now I may have to reconsider my position. However, seeing that GCC is generally the best at handling optimized code (especially for GEMM), I may be able to do the transition, since the optimized code will be used in most cases.

As for zapcc, although it is still the fastest compiler in debug mode, with the new speed of clang-4.0.1, its margin is quite small. Moreover, on optimized builds, it's not as fast as GCC. If you use clang and have access to zapcc, it's still quite a good option to save some time.

Overall, I have been quite pleased by clang-4.0.1 and GCC-7.1, the most recent versions I have been testing. It seems that they did quite some good work. I will definitely run some more tests with them and try to adapt the code. I'm still considering whether I will drop support for some older compilers.

I hope this comparison was interesting :) My next post will probably be about the difference in performance between my machine learning framework and other frameworks to train neural networks.

Partial type erasing in Deep Learning Library (DLL) to improve compilation time

Baptiste Wicht — Wed, 15 Mar 2017 06:43:44 GMT

In a previous post, I compared the compilation time of my Deep Learning Library (DLL) project with different compilers. I realized that the compilation times were quickly becoming unreasonable for this library, especially for compiling the unit tests, which clearly hurts the development of the library. Indeed, you want to be able to run the unit tests reasonably quickly after you have integrated new changes.

Reduce the compilation time

The first thing I did was to split the compilation in three executables: one for the unit tests, one for the various performance tests and one for the various other miscellaneous tests. With this, it is much faster to compile only the unit test cases.

But this can be improved significantly more. In DLL, a network is a variadic template containing the list of layers, in order. There are two main ways of declaring a neural network. In the first version, the fast version, the layers directly know their sizes:

using network_t =
    dll::dbn_desc<
        dll::dbn_layers<
            dll::rbm_desc<28 * 28, 500, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<500    , 400, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<400    , 10,  dll::momentum, dll::batch_size<64>, dll::hidden<dll::unit_type::SOFTMAX>>::layer_t>,
        dll::trainer<dll::sgd_trainer>, dll::batch_size<64>>::dbn_t;

auto network = std::make_unique<network_t>();
network->pretrain(dataset.training_images, 10);
network->fine_tune(dataset.training_images, dataset.training_labels, 10);

In my opinion, this is the best way to use DLL. This is the fastest and the clearest. Moreover, the dimensions of the network can be validated at compile time, which is always better than at runtime. However, the dimensions of the network cannot be changed at runtime. For this, there is a different version, the dynamic version:

using network_t =
    dll::dbn_desc<
        dll::dbn_layers<
            dll::dyn_rbm_desc<dll::momentum>::layer_t,
            dll::dyn_rbm_desc<dll::momentum>::layer_t,
            dll::dyn_rbm_desc<dll::momentum, dll::hidden<dll::unit_type::SOFTMAX>>::layer_t>,
        dll::batch_size<64>, dll::trainer<dll::sgd_trainer>>::dbn_t;

auto network = std::make_unique<network_t>();

network->template layer_get<0>().init_layer(28 * 28, 500);
network->template layer_get<1>().init_layer(500, 400);
network->template layer_get<2>().init_layer(400, 10);
network->template layer_get<0>().batch_size = 64;
network->template layer_get<1>().batch_size = 64;
network->template layer_get<2>().batch_size = 64;

network->pretrain(dataset.training_images, 10);
network->fine_tune(dataset.training_images, dataset.training_labels, 10);

This is a bit more verbose, but the configuration can be changed at runtime with this system. Moreover, this is also faster to compile. On the other hand, there is some performance slowdown.

There is also a third version that is a hybrid of the first two versions:

using network_t =
    dll::dyn_dbn_desc<
        dll::dbn_layers<
            dll::rbm_desc<28 * 28, 500, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<500    , 400, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<400    , 10,  dll::momentum, dll::batch_size<64>, dll::hidden<dll::unit_type::SOFTMAX>>::layer_t>,
        dll::trainer<dll::sgd_trainer>, dll::batch_size<64>>::dbn_t;

auto network = std::make_unique<network_t>();
network->pretrain(dataset.training_images, 10);
network->fine_tune(dataset.training_images, dataset.training_labels, 10);

Only one line was changed compared to the first version: dbn_desc becomes dyn_dbn_desc. What this changes is that all the layers are automatically transformed into their dynamic versions and all the parameters are propagated at runtime. This is a form of type erasing since the sizes will not be propagated at compilation time. But this is simple since the types are simply transformed from one type to another directly. Behind the scenes, it's the dynamic version using the front-end of the fast version. This is almost as fast to compile as the dynamic version, but the code is much nicer. It executes the same as the dynamic version.
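The idea of transforming a statically sized type into a dynamic one can be sketched in a few lines. The types below are hypothetical stand-ins, far simpler than the real DLL layers, and only show how compile-time sizes can be moved into runtime members so that all layers end up sharing a single type.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical static layer: the sizes are part of the type.
template <std::size_t In, std::size_t Out>
struct static_layer {
    static constexpr std::size_t inputs  = In;
    static constexpr std::size_t outputs = Out;
};

// Hypothetical dynamic layer: the sizes are runtime members, so every layer has the same type.
struct dynamic_layer {
    std::size_t inputs  = 0;
    std::size_t outputs = 0;
};

// Map any static layer to the dynamic type and carry its sizes over as values.
template <std::size_t In, std::size_t Out>
constexpr dynamic_layer to_dynamic(static_layer<In, Out>) {
    return dynamic_layer{In, Out};
}

// Two layers with different sizes are different types before the transformation...
static_assert(!std::is_same_v<static_layer<784, 500>, static_layer<500, 400>>, "distinct static types");

// ...and the same type afterwards, so templates using them are instantiated only once.
static_assert(std::is_same_v<decltype(to_dynamic(static_layer<784, 500>{})),
                             decltype(to_dynamic(static_layer<500, 400>{}))>,
              "single dynamic type");
static_assert(to_dynamic(static_layer<784, 500>{}).outputs == 500, "sizes are preserved");
```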

If we compare the compilation time of the three versions when compiling a single network and 5 different networks with different architectures, we get the following results (with clang):

Model	Time [s]
1 Fast	30
1 Dynamic	16.6
1 Hybrid	16.6
5 Fast	114
5 Dynamic	16.6
5 Hybrid	21.9

Even with a single network, the compilation time is reduced by 44%. When five different networks are compiled, the time is reduced by 85%. This can be explained easily. Indeed, for the hybrid and dynamic versions, the layers will have the same type and therefore a lot of template instantiations will only be done once instead of five times. This makes a lot of difference since almost everything is a template inside the library.

Unfortunately, this also has an impact on the runtime of the network:

Model	Pretrain [s]	Train [s]
Fast	195	114
Dynamic	203	123
Hybrid	204	122

On average, for dense models, the slowdown is between 4% and 8%. For convolutional models, it is between 10% and 25%. I will definitely work on trying to make the dynamic and especially the hybrid version faster in the future; most of the work should be on the matrix library (ETL) that is used.

Since a 20% increase in runtime is not really a problem for test cases, tests being fast already, I decided to add an option to DLL so that everything can be compiled by default in hybrid mode. By using a compilation flag, all the dbn_desc become dyn_dbn_desc and therefore each network used becomes a hybrid network. Without a single change in the code, the compilation time of the entire library can be significantly improved, as seen in the next section. This can also be used in user code to improve compilation time during debugging and experiments and can be turned off for the final training.
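One possible way to implement such a switch is a conditional alias, as in the sketch below. The macro name and the two descriptor types are made up for the illustration; this is not necessarily how the option is implemented in DLL.

```cpp
// Hypothetical descriptors standing in for dbn_desc and dyn_dbn_desc.
template <typename... Layers>
struct fast_desc {
    static constexpr bool dynamic = false;
};

template <typename... Layers>
struct hybrid_desc {
    static constexpr bool dynamic = true;
};

// EXAMPLE_FORCE_HYBRID is an invented macro name; it would be set on the
// command line, for instance with -DEXAMPLE_FORCE_HYBRID.
#ifdef EXAMPLE_FORCE_HYBRID
template <typename... Layers>
using network_desc = hybrid_desc<Layers...>;
#else
template <typename... Layers>
using network_desc = fast_desc<Layers...>;
#endif

// User code always names network_desc and never has to change.
using my_network = network_desc<int, float>;
```

User code is written once against the alias, and the build system decides which descriptor is really used.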

On my Continuous Integration system, I will build the system in both configurations. This is not really an issue, since my personal machine at home is more powerful than what I have available here.

Results

In a first experiment, I measured the difference before and after this change on the three executables of the library, with gcc:

Model	Unit [s]	Perf [s]	Misc [s]
Before	1029	192	937
After	617	143	619
Speedup	40.03%	25.52%	33.93%

It is clear that the speedups are very significant! The compilation is between 25% and 40% faster with the new option. Overall, this is a speedup of 36%! I also noticed that the compilation takes significantly less memory than before. Therefore, I decided to rerun the compiler benchmark on the library. In the previous experiment, zapcc was taking so much memory that it was impossible to use more than one thread. Let's see how it is faring now. The time to compile the full unit tests is computed for each compiler. Let's start in debug mode:

Debug	-j1	-j2	-j3	-j4
clang-3.9	527	268	182	150
gcc-4.9.3	591	303	211	176
gcc-5.3.0	588	302	209	175
zapcc-1.0	375	187	126	121

This time, zapcc is able to scale to four threads without problems. Moreover, it is always the fastest compiler, by a significant margin, in this configuration. It is followed by clang and then by gcc for which both versions are about the same speed.

If we compile again in release mode:

Release	-j1	-j2	-j3	-j4
clang-3.9	1201	615	421	356
gcc-4.9.3	1041	541	385	321
gcc-5.3.0	1114	579	412	348
zapcc-1.0	897	457	306	306

The difference in compilation time is very large: it is about twice as slow to compile with all optimizations enabled. It also takes significantly more memory. Indeed, zapcc was not able to compile with 4 threads. Nevertheless, even its results with three threads are better than those of the other compilers using four threads. zapcc is clearly the winner again on this test, followed by gcc-4.9, which is faster than gcc-5.3, which is itself faster than clang. It seems that while clang is better at frontend than gcc, it is slower for optimizations. Note that this may also be an indication that clang performs more optimizations than gcc and may not be slower.

Conclusion

By using some form of type erasure to simplify the template types at compile time, I was able to reduce the overall compilation time of my Deep Learning Library (DLL) by 36%. Moreover, this can be done by switching a simple compilation flag. This also very significantly reduces the memory used during the compilation, allowing zapcc to compile with up to three threads, compared with only one before. This makes zapcc the fastest compiler again on this benchmark. Overall, this will make debugging much easier on this library and will save me a lot of time.

In the future, I plan to try to improve compilation time even more. I have a few ideas, especially in ETL that should significantly improve the compilation time but that will require a lot of time to implement, so that will likely have to wait a while. In the coming days, I plan to work on the performance of DLL, especially for stochastic gradient descent.

If you want more information on DLL, you can check out the dll Github repository.

Disappointing zapcc performance on Deep Learning Library (DLL)

Baptiste Wicht — Thu, 09 Mar 2017 12:41:06 GMT

One week ago, zapcc 1.0 was released and I've observed it to be much faster than the other compilers in terms of compile time. This can be seen when I tested it on my Expression Templates Library (ETL). It was almost four times faster than clang 3.9 and about 2.5 times faster than GCC.

The ETL library is quite heavy to compile, but still reasonable. This is not the case for my Deep Learning Library (DLL) where compiling all the test cases takes a very long time. I have to admit that I have been going overboard with templates and such and I now have to pay the price. In practice, for the users of the library, this is not a big problem since only one or two neural networks will be compiled (and they will take hours to train), but in the test cases, there are hundreds of them and this is a huge pain. Anyway, enough with the ramble, I figured it would be very good to test zapcc on it and see what I can gain from using it.

In this article, when I speak of a compiler thread, I mean an instance of the compiler, so it's really a process in the Linux world.

Results

However, I soon realized that I would have more issues than I thought. The first problem is the memory consumed by zapcc. Indeed, it is based on clang, and I always had problems with huge memory consumption from clang on this library. zapcc has even bigger memory consumption because some information is cached between runs. The amount of memory that zapcc is able to cache can be configured in the configuration file. By default, it can use 1.5Go of memory. When zapcc goes over the memory limit, it simply wipes out its caches. This means that all the gain for the next compilation will be lost, since the cache will have to be rebuilt from scratch. This is not a hard limit for the compilation itself. Indeed, if the compilation itself takes 3Go, it will still be able to complete it, but it is likely that the cache will be wiped after the compilation.

When I tried compiling using several threads, it soon used all my memory and crashed. The same occurs with clang but I can still compile with 3 or 4 threads without too many issues on this computer. The same also occurs with GCC but it can still handle 4 or 5 threads (depending on the order of the compilation units).

The tests are performed on my desktop computer at work, which is not really good... I have 12Go of RAM (I had to ask for extra...) and an old Sandy Bridge processor, but at least I have an SSD (also had to ask for extra).

I started with testing with only one compiler thread. For zapcc, I set the maximum memory limit to 8Go. Even with such a limit, the zapcc server restarted more than 10 times during the compilation of the 84 test cases. After this first experiment, I increased the number of threads to 2 for each compiler, using a 4Go limit for zapcc. The limit is for each server and each parallel thread will spawn a new server, so the effective limit is the number of threads times the limit. Even with two threads, I was unable to finish a compilation with zapcc. This is quite disappointing for me since clang is able to run with 4 threads in parallel. Moreover, a big problem is that the servers are not always killed when there is no more memory: they just hang and use all the memory of the computer, which is evidently really inconvenient for service processes. When this happens with clang or gcc, the compiler simply crashes, the memory is released and make is interrupted. Since zapcc is not able to work with more than one thread on this computer, the results are the ones with one thread. I was also surprised to be able to compile the library with clang and four threads; this was not possible before clang-3.9.

Compiler	-j1	-j2	-j3	-j4
gcc-4.9.3	2250.95	1256.36	912.67	760.84
gcc-5.3.0	2305.37	1279.49	918.08	741.38
clang-3.9	2047.61	1102.93	899.13	730.42
zapcc-1.0	1483.73	1483.73	1483.73	1483.73
Difference against Clang	-27.55%	+25.69%	+39.37%	+50.77%
Speedup VS GCC-5.3	-35.66%	+13.75%	+38.09%	+50.03%
Speedup VS GCC-4.9	-34.08%	+15.30%	+38.50%	+48.75%

If we look at the results with only one thread, we can see that there still are some significant improvements when using zapcc, but nowhere near as good as what was seen in the compilation of ETL. Here, the compilation time is reduced by 34% compared to gcc and by 27% compared to clang. This is not bad, since it is faster than the other compilers, but I would have expected better speedups. We can see that g++-4.9 is slightly faster than g++-5.3, but this is not really a significant difference. I'm actually very surprised to find that clang is faster than g++ on this experiment. On ETL, it is always very significantly slower and before, it was also significantly slower on DLL. I was so used to this, that I stopped using it on this project. I may have to reconsider my position when working on this project.

Let's look at the results with more than one thread. Even with two threads, every compiler is faster than zapcc. Indeed, zapcc is slower than Clang by 25% and slower than GCC by about 15%. If we use more threads, the other compilers become even faster and the slowdowns of zapcc become larger. When using four threads, zapcc is about 48% slower than gcc and about 50% slower than clang. This really shows one big downside of zapcc: its very large memory consumption. When it is used to compile really heavy template code, it fails very early to use more processes. And even when there is enough memory, the speedups are not as great as for relatively simpler code.

One may argue that this is not a fair comparison since zapcc does not have the same number of threads. However, considering that this is the best zapcc can do on this machine, I would argue that this is a fair comparison in this limited experimental setting. If we were to have a big machine for compilation, which I don't have at work, the zapcc results would likely be more interesting, but in this specific limited case, it shows that zapcc suffers from its high memory consumption. It should also be taken into account that this experiment was done with almost nothing else running on the machine (no browser for instance) to have as much memory as possible available for the compilers. This is not a common use case. Most days, when I compile something, I have my browser open, which makes a large difference in memory available, and several other applications (but consoles and vim instances do not really consume memory :D).

This experiment made me realize that the compilation times for this library were quickly becoming crazy. Most of the time, the complete test suite is only compiled on my Continuous Integration machine at home which has a much faster processor and much more RAM. Therefore, it is relatively fast since it uses more threads to compile. Nevertheless, it is not a good thing that the unit tests take so much time to compile. I plan to split the test cases into several sets, because currently the real unit tests are compiled together with the performance tests and other various tests. I'll probably end up generating three executables. This will help greatly during development. Moreover, I also have a technique to decrease the compilation time by erasing some template parameters at compilation time. This is already implemented, but currently has a runtime overhead that I will try to remove before using this technique everywhere to get back to reasonable compilation times. I'll also try to see if I can find obvious compilation bottlenecks in the code.

Conclusion

To conclude, while zapcc brings some very interesting compilation speedups in some cases like in my ETL library, it also has some downsides, namely huge memory consumption. This memory consumption may prevent the use of several compiler threads and render zapcc much less interesting than other compilers.

When trying to compile my DLL library on a machine with 12Go of RAM with two zapcc threads, it was impossible for me to make it complete. While zapcc was faster with one thread than the other compilers, they were able to use up to four threads and in the end zapcc was about twice as slow as clang.

I knew that zapcc memory consumption was very large, but I would not have expected something so critical. Another feature that would be interesting in zapcc would be a hard memory limit for the servers instead of simply a limit on the cache they are able to keep in memory. This would prevent hanging the complete computer when something goes wrong.

I had a good surprise with clang, which was actually faster than GCC and also able to work with four threads in parallel. This was not the case with previous versions of clang. On ETL, it is still significantly slower than GCC though.

For now, I'll continue using clang on this DLL project and use zapcc only on my ETL project. I'll also focus on improving the compilation time on this project and make it reasonable again.

Release of zapcc 1.0 - Fast C++ compiler

Baptiste Wicht — Thu, 02 Mar 2017 13:50:04 GMT

If you remember, I recently wrote about zapcc C++ compilation speed against gcc 5.4 and clang 3.9 in which I was comparing the beta version of zapcc against gcc and clang.

I have just been informed that zapcc was released in version 1.0. I thought it was a good occasion to test it again. It will be compared against gcc-4.9, gcc-5.3 and clang-3.9. This version is based on the trunk of clang-5.0.

Again, I will use my Expression Template Library (ETL) project. This is a purely header-only library with lots of templates. I'm going to compile the full test cases. This is a perfect example for long compilation times.

The current tests are made on the last version of the library and with slightly different parameters for compilation, therefore the absolute times are not comparable, but the speedups should be comparable.

Just like last time, I have configured zapcc to let it use 2Go of RAM per caching server, which is the maximum allowed. Moreover, I killed the servers before each test.

Debug results

Let's start with a debug build, with no optimizations enabled. Every build will use four threads. This is the equivalent of doing make -j4 debug/bin/etl_test without the link step.

Compiler	Time [s]
g++-4.9.3	190.09
g++-5.3.0	200.92
clang++-3.9	313.85
zapcc++	81.25
Speedup VS Clang	3.86
Speedup VS GCC-5.3	2.47
Speedup VS GCC-4.9	2.33

The speedups are even more impressive than last time! zapcc is almost four times faster than clang-3.9 and around 2.5 times faster than GCC-5.3. Interestingly, we can see that gcc-5.3 is slightly slower than GCC-4.9.

It seems that they have made the compiler even faster!

Release results

Let's look now how the results are looking with optimizations enabled. Again, every build will use four threads. This is the equivalent of doing make -j4 release_debug/bin/etl_test without the link step.

Compiler	Time [s]
g++-4.9.3	252.99
g++-5.3.0	264.96
clang++-3.9	361.65
zapcc++	237.96
Speedup VS Clang	1.51
Speedup VS GCC-5.3	1.11
Speedup VS GCC-4.9	1.06

We can see that this time the speedups are not as interesting as they were. Very interestingly, zapcc is the compiler that suffers the most from the optimization overhead. Indeed, zapcc is three times slower in release mode than it was in debug mode. Nevertheless, it still manages to beat the three other compilers, by about 10% for GCC and 50% for clang, which is already interesting.

Conclusion

To conclude, we have observed that zapcc is always faster than the three compilers tested in this experiment. Moreover, in debug mode, the speedups are very significant: it was almost 4 times faster than clang and around 2.5 times faster than gcc.

I haven't seen any problem with the tool. It's like clang and it should generate code of the same performance, but just compile it much faster. One problem I have with zapcc is that it is not based on an already released version of clang but on the trunk. That means it is hard to compare it with the exact same version of clang and there is also a risk of running into clang bugs.

Although the prices have not been published yet, it is indicated on the website that zapcc is free for non-commercial entities, which is really great.

If you want more information, you can go to the official website of zapcc

C++ Compiler benchmark on Expression Templates Library (ETL)

Baptiste Wicht — Sun, 11 Dec 2016 13:17:30 GMT

In my Expression Templates Library (ETL) project, I have a lot of template-heavy code that needs to run as fast as possible and that is quite intensive to compile. In this post, I'm going to compare the performance of a few of the kernels produced by different compilers. I've got GCC 5.4, GCC 6.2 and clang 3.9. I also included zapcc which is based on clang 4.0.

These tests have been run on a Haswell processor. The automatic parallelization of ETL has been turned off for these tests.

Keep in mind that some of the diagrams are presented in logarithmic form.

Vector multiplication

The first kernel is a very simple one, simple element-wise multiplication of two vectors. Nothing fancy here.
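For reference, the operation being measured can be written as a plain loop. This is only a sketch of the computation with a hypothetical function name, not the way ETL evaluates it (ETL goes through expression templates):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise product: c[i] = a[i] * b[i].
std::vector<float> multiply(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] * b[i];
    }
    return c;
}
```

Each element is read once and written once, which fits with large vectors being limited by memory bandwidth rather than by the compiler.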

For small vectors, clang is significantly slower than gcc-5.4 and gcc-6.2. On vectors from 100'000 elements, the speed is comparable for each compiler, depending on the memory bandwidth. Overall, gcc-6.2 produces the fastest code here. clang-4.0 is slightly slower than clang-3.9, but nothing dramatic.

Vector exponentiation

The second kernel is computing the exponential of each element of a vector and storing it in another vector.

Interestingly, this time, clang versions are significantly faster for medium to large vectors, from 1000 elements and higher, by about 5%. There are no significant differences between the different versions of each compiler.
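As a plain loop, and again only as a sketch with a hypothetical function name rather than ETL's implementation, the operation looks like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// out[i] = exp(in[i]); the output vector must already have the right size.
void exponentiate(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = std::exp(in[i]);
    }
}
```

Here most of the time goes into evaluating the exponential itself, so how each compiler handles that call presumably matters more than the loop around it.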

Matrix-Matrix Multiplication

The next kernel I benchmarked is the matrix-matrix multiplication operation. In that case, the kernel is hand-unrolled and vectorized.
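To fix ideas, here is the textbook version of the operation on row-major matrices. It is a sketch with a hypothetical function name and is neither unrolled nor explicitly vectorized, so it is closer to a naive implementation than to the optimized kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), all matrices stored row-major.
std::vector<double> matmul(const std::vector<double>& a, const std::vector<double>& b,
                           std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    std::vector<double> c(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const double a_ip = a[i * k + p];
            // The inner loop walks B and C contiguously, which helps auto-vectorization.
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    return c;
}
```

With this loop order, how fast the code runs depends almost entirely on how well the compiler vectorizes the inner loop.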

There are few differences between the compilers. The first thing is that for some sizes such as 80x80 and 100x100, clang is significantly faster than GCC, by more than 10%. The other interesting fact is that for large matrices zapcc-clang-4.0 is always slower than clang-3.9 which is itself on par with the two GCC versions. In my opinion, it comes from a regression in clang trunk but it could also come from zapcc itself.

The results are much more interesting here! First, there is a huge regression in clang-4.0 (or in zapcc for that matter). Indeed, it is up to 6 times slower than clang-3.9. Moreover, the clang-3.9 is always significantly faster than gcc-6.2. Finally, there is a small improvement in gcc-6.2 compared to gcc 5.4.

Fast Fourier Transform

The following kernel is the performance of a hand-crafted Fast Fourier Transform implementation.

On this benchmark, gcc-6.2 is the clear winner. It is significantly faster than clang-3.9 and clang-4.0. Moreover, gcc-6.2 is also faster than gcc-5.4. On the contrary, clang-4.0 is significantly slower than clang-3.9 except on one configuration (10000 elements).

1D Convolution

This kernel is about computing the 1D valid convolution of two vectors.
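A naive version of this operation could look like the following sketch. The function name is hypothetical and the code assumes the usual definition of a valid convolution (flipped kernel, output only where the kernel fully overlaps the input):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive 1D valid convolution: for an input of size n and a kernel of size m,
// the result has n - m + 1 elements.
std::vector<double> conv1_valid(const std::vector<double>& input,
                                const std::vector<double>& kernel) {
    assert(!kernel.empty() && kernel.size() <= input.size());
    const std::size_t n = input.size();
    const std::size_t m = kernel.size();
    std::vector<double> output(n - m + 1, 0.0);
    for (std::size_t i = 0; i < output.size(); ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            // The kernel is read backwards: this is a convolution, not a correlation.
            output[i] += input[i + j] * kernel[m - 1 - j];
        }
    }
    return output;
}
```

For example, convolving {1, 2, 3, 4} with {1, 0, -1} gives {2, 2}.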

While clang-4.0 is faster than clang-3.9, it is still slightly slower than both gcc versions. On the GCC side, there is not a lot of difference except on the 1000x500 on which gcc-6.2 is 25% faster.

And here are the results with the naive implementation:

Again, on the naive version, clang is much faster than GCC, by about 65%. This is a really large speedup.

2D Convolution

This next kernel is computing the 2D valid convolution of two matrices.

There is no clear difference between the compilers in this code. Every compiler here has its ups and downs.
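The same idea in two dimensions can be sketched as below. The matrix struct and the function name are hypothetical helpers for the illustration, and the kernel is flipped in both directions as in the usual definition:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal row-major matrix, just enough for the sketch.
struct matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data; // rows * cols values

    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Naive 2D valid convolution: the output is (rows - kr + 1) x (cols - kc + 1).
matrix conv2_valid(const matrix& in, const matrix& k) {
    assert(k.rows >= 1 && k.cols >= 1);
    assert(k.rows <= in.rows && k.cols <= in.cols);
    matrix out{in.rows - k.rows + 1, in.cols - k.cols + 1, {}};
    out.data.assign(out.rows * out.cols, 0.0);
    for (std::size_t i = 0; i < out.rows; ++i) {
        for (std::size_t j = 0; j < out.cols; ++j) {
            double sum = 0.0;
            for (std::size_t a = 0; a < k.rows; ++a) {
                for (std::size_t b = 0; b < k.cols; ++b) {
                    sum += in.at(i + a, j + b) * k.at(k.rows - 1 - a, k.cols - 1 - b);
                }
            }
            out.data[i * out.cols + j] = sum;
        }
    }
    return out;
}
```

Four nested loops with a small innermost trip count are harder to vectorize well, which is where the compilers have room to differ.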

Let's look at the naive implementation of the 2D convolution (units are milliseconds here not microseconds):

This time the difference is very large! Indeed, clang versions are about 60% faster than the GCC versions! This is really impressive, even though this does not come close to the optimized version. It seems the vectorizer of clang is much more efficient than the one from GCC.

4D Convolution

The final kernel that I'm testing is the batched 4D convolution that is used a lot in Deep Learning. This is not really a 4D convolution, but a large number of 2D convolutions applied on 4D tensors.
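To make "a large number of 2D convolutions" concrete, here is a naive sketch. The tensor4 type and the function name are hypothetical, and the layout (batch, channels, height, width) with a sum over the channels is an assumption that follows the usual convention of convolutional layers, not a description of ETL's kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 4D tensor stored contiguously, last dimension fastest.
struct tensor4 {
    std::size_t d0, d1, d2, d3;
    std::vector<double> data;

    tensor4(std::size_t a, std::size_t b, std::size_t c, std::size_t d)
        : d0(a), d1(b), d2(c), d3(d), data(a * b * c * d, 0.0) {}

    double& operator()(std::size_t i, std::size_t j, std::size_t k, std::size_t l) {
        return data[((i * d1 + j) * d2 + k) * d3 + l];
    }
    double operator()(std::size_t i, std::size_t j, std::size_t k, std::size_t l) const {
        return data[((i * d1 + j) * d2 + k) * d3 + l];
    }
};

// input:   (batch, channels, height, width)
// kernels: (filters, channels, kernel height, kernel width)
// output:  (batch, filters, height - kh + 1, width - kw + 1)
tensor4 conv4_valid(const tensor4& input, const tensor4& kernels) {
    assert(input.d1 == kernels.d1); // same number of channels
    assert(kernels.d2 >= 1 && kernels.d2 <= input.d2);
    assert(kernels.d3 >= 1 && kernels.d3 <= input.d3);
    const std::size_t kh = kernels.d2;
    const std::size_t kw = kernels.d3;
    tensor4 output(input.d0, kernels.d0, input.d2 - kh + 1, input.d3 - kw + 1);
    for (std::size_t b = 0; b < input.d0; ++b)
        for (std::size_t f = 0; f < kernels.d0; ++f)
            for (std::size_t c = 0; c < input.d1; ++c)
                // One 2D valid convolution per (image, filter, channel) triple.
                for (std::size_t i = 0; i < output.d2; ++i)
                    for (std::size_t j = 0; j < output.d3; ++j) {
                        double sum = 0.0;
                        for (std::size_t a = 0; a < kh; ++a)
                            for (std::size_t w = 0; w < kw; ++w)
                                sum += input(b, c, i + a, j + w) * kernels(f, c, kh - 1 - a, kw - 1 - w);
                        output(b, f, i, j) += sum;
                    }
    return output;
}
```

The number of 2D convolutions is batch x filters x channels, which is why this kernel dominates the training time of convolutional networks.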

Again, there are very small differences between each version. The best versions are the most recent versions of the compiler gcc-6.2 and clang-4.0 on a tie.

Conclusion

Overall, we can see two trends in these results. First, when working with highly-optimized code, the choice of compiler will not make a huge difference. On this kind of kernel, gcc-6.2 tends to perform faster than the other compilers, but only by a very slight margin, except in some cases. On the other hand, when working with naive implementations, clang versions really did perform much better than GCC. The clang-compiled versions of the 1D and 2D convolutions are more than 60% faster than their GCC counterparts. This is really impressive. Overall, clang-4.0 seems to have several performance regressions, but since it's still a work in progress, I would not be surprised if these regressions are not present in the final version. Since the clang-4.0 version is in fact the clang version used by zapcc, it's also possible that zapcc is introducing new performance regressions.

Overall, my advice would be to use GCC-6.2 (or 5.4) on hand-optimized kernels and clang when you have mostly naive implementations. However, keep in mind that at least for the example shown here, the naive version optimized by the compiler never comes close to the highly-optimized version.

As ever, take this with a grain of salt: it has only been tested on one project and one machine, and you may obtain very different results on other projects and on other processors.

zapcc C++ compilation speed against gcc 5.4 and clang 3.9

Baptiste Wicht — Mon, 05 Dec 2016 17:46:09 GMT

A week ago, I compared the compilation time performance of zapcc against gcc-4.9.3 and clang-3.7. On debug builds, zapcc was about 2 times faster than gcc and 3 times faster than clang. In this post, I'm going to try some more recent compilers, namely gcc 5.4 and clang 3.9 on the same project. If you want more information on zapcc, read the previous posts, this post will concentrate on results.

Again, I use my Expression Template Library (ETL). This is a purely header-only library with lots of templates. I'm going to compile the full test cases.

The results of the two articles are not directly comparable, since they were obtained on two different computers. The one on which the present results were obtained has a less powerful processor and only 16Go of RAM compared to the 32Go of RAM of my build machine. Also take into account that the present results were obtained on a desktop machine, so there can be some perturbations from background tasks.

Just like on the previous results, it does not help using more threads than physical cores, therefore, the results were only computed on up to 4 cores on this machine.

The link time is not taken into account on the results.

Debug build

Let's start with the result of the debug build.

Compiler	-j1	-j2	-j4
g++-5.4.0	469s	230s	130s
clang++-3.9	710s	371s	218s
zapcc++	214s	112s	66s
Speedup VS Clang	3.31	3.31	3.3
Speedup VS GCC	2.19	2.05	1.96

The results are almost the same as the previous test. zapcc is 3.3 times faster to compile than Clang and around 2 times faster than GCC. It seems that GCC 5.4 is a bit faster than GCC 4.9.3 while clang 3.9 is a bit slower than clang 3.7, but nothing terribly significant.

Overall, for debug builds, zapcc can bring a very significant improvement to your compile times.

Release build

Let's see what is the status of Release builds. Since the results are comparable between the numbers of threads, the results here are just for one thread.

This is more time consuming since a lot of optimizations are enabled and more features from ETL are enabled as well.

Compiler	-j1
g++-5.4.0	782s
clang++-3.9	960s
zapcc++	640s
Speedup VS Clang	1.5
Speedup VS GCC	1.22

On a release build, the speedups are much less interesting. Nevertheless, they are still significant. zapcc is still 1.2 times faster than gcc and 1.5 times faster than clang. The speedup against clang 3.9 is significantly higher than it was in my experiment with clang 3.7; it's possible that clang 3.9 is slower or simply has new optimization passes.

Conclusion

The previous conclusion still holds with modern versions of compilers: zapcc is much faster than other compilers on Debug builds of template-heavy code. It is more than 3 times faster than clang-3.9 and about 2 times faster than gcc-5.4. Since it's based on clang, there should not be any issue compiling projects that already compile with a recent clang. Even though the speedups are less interesting on a release build, they are still significant, especially compared against clang.

I'm really interested in finding out what will be the pricing for zapcc once out of the beta or if they will be able to get even faster!

For the comparison with gcc 4.9.3 and clang 3.7, you can have a look at this article.

If you want more information about zapcc, you can go to the official website of zapcc

zapcc - a faster C++ compiler

Baptiste Wicht — Sat, 26 Nov 2016 12:17:50 GMT

Update: For a comparison against more modern compiler versions, you can read: zapcc C++ compilation speed against gcc 5.4 and clang 3.9

I just joined the private beta program of zapcc. Zapcc is a C++ compiler, based on Clang, which aims at being much faster than other C++ compilers. It does this by using a caching server that saves some of the compiler structures, which should speed up compilation a lot. The private beta is free, but once the compiler is ready, it will be a commercial compiler.

Every C++ developer knows that compilation time can quickly be an issue when programs are getting very big and especially when working with template-heavy code.

To benchmark this new compiler, I use my Expression Template Library (ETL). This is a purely header-only library with lots of templates. There are lots of test cases which is what I'm going to compile. I'm going to compare against Clang-3.7 and gcc-4.9.3.

I have configured zapcc to let it use 2Go of RAM per caching server, which is the maximum allowed. Moreover, I killed the servers before each test.

Debug build

Let's start with a debug build. In that configuration, there is no optimization going on and several of the features of the library (GPU, BLAS, ...) are disabled. This is the fastest way to compile ETL. I gathered this result on a 4 core, 8 threads, Intel processor, with an SSD.

The following table presents the results with different number of threads and the difference of zapcc compared to the other compilers:

Compiler	-j1	-j2	-j4	-j6	-j8
g++-4.9.3	350s	185s	104s	94s	91s
clang++-3.7	513s	271s	153s	145s	138s
zapcc++	158s	87s	47s	44s	42s
Speedup VS Clang	3.24	3.103	3.25	3.29	3.28
Speedup VS GCC	2.21	2.12	2.21	2.13	2.16

The result is pretty clear! zapcc is around three times faster than Clang and around two times faster than GCC. This is pretty impressive!

For those that think that Clang is always faster than GCC, keep in mind that this is not the case for template-heavy code such as this library. In all my tests, Clang has always been slower and much more memory hungry than GCC on template-heavy C++ code. And sometimes the difference is very significant.

Interestingly, we can also see that going past the physical cores is not really interesting on this computer. On some computers, the speedups are interesting, but not on this one. Always benchmark!

Release build

We have seen the results on a debug build, let's now compare on something a bit more time-consuming, a release build with all options of ETL enabled (GPU, BLAS, ...), which should make it significantly longer to compile.

Again, the table:

Compiler	-j1	-j2	-j4	-j6	-j8
g++-4.9.3	628s	336s	197s	189s	184s
clang++-3.7	663s	388s	215s	212s	205s
zapcc++	515s	281s	173s	168s	158s
Speedup VS Clang	1.28	1.38	1.24	1.26	1.29
Speedup VS GCC	1.21	1.30	1.13	1.12	1.16

This time, we can see that the difference is much lower. Zapcc is between 1.2 and 1.4 times faster than Clang and between 1.1 and 1.3 times faster than GCC. This shows that most of the speedups from zapcc are in the front end of the compiler. This is not a lot but still significant over long builds, especially if you have few threads where the absolute difference would be higher.

We can also observe that Clang is now almost on par with GCC which shows that optimization is faster in Clang while front and backend is faster in gcc.

You also have to keep in mind that zapcc memory usage is higher than Clang because of all the caching. Moreover, the servers are still up in between compilations, so this memory usage stays between builds, which may not be what you want.

As for runtime, I have not seen any significant difference in performance between the clang version and the zapcc. According to the official benchmarks and documentation, there should not be any difference in that between zapcc and the version of clang on which zapcc is based.

Incremental build

Normally, zapcc should shine at incremental building, but I was unable to show any speedup when changing a single file without killing the zapcc servers. Maybe I did something wrong in my usage of zapcc.

Conclusion

In conclusion, we can see that zapcc is always faster than both GCC and Clang, on my template-heavy library. Moreover, on debug builds, it is much faster than any of the two compilers, being more than 2 times faster than GCC and more than 3 times faster than clang. This is really great. Moreover, I have not seen any issue with the tool so far, it can seamlessly replace Clang without problem.

It's a bit weird that you cannot allocate more than 2Go to the zapcc servers.

For a program, that's really impressive. I hope that they are continuing the good work and especially that this motivates other compilers to improve the speed of compilation (especially of templates).

If you want more information, you can go to the official website of zapcc

Blazing fast unit test compilation with doctest 1.1

Baptiste Wicht — Wed, 21 Sep 2016 19:45:13 GMT

You may remember my quest for faster compilation times. I had made several changes to the Catch test framework macros in order to save some compilation at the expense of my test code looking a bit less nice:

REQUIRE(a == 9); //Before
REQUIRE_EQUALS(a, 9); //After

The first line is a little bit better, but using several optimizations, I was able to dramatically change the compilation time of the test cases of ETL. In the end, I don't think that the difference between the two lines justifies the high overhead in compilation times.

doctest

doctest is a framework quite similar to Catch but that claims to be much lighter. I tested doctest 1.0 early on, but at this point it was actually slower than Catch and especially slower than my versions of the macro.

Today, doctest 1.1 was released with promises of being even lighter than before and providing several new ways of speeding up compilation. If you want the results directly, you can take a look at the next section.

First of all, this new version improved the basic macros to make expression decomposition faster. When you use the standard REQUIRE macro, the expression is decomposed by using several template techniques and operator overloading. This is really slow to compile. By removing the need for this decomposition, the fast Catch macros are much faster to compile.

Moreover, doctest 1.1 also introduces CHECK_EQ that does not do any expression decomposition. This is close to what I did in my macros except that it is directly integrated into the framework and preserves all its features. It is also possible to bypass the expression checking code by using the FAST_CHECK_EQ macro. In that case, the exceptions are not captured. Finally, a new configuration option is introduced (DOCTEST_CONFIG_SUPER_FAST_ASSERTS) that removes some features related to automatic debugger breaks. Since I don't use the debugger features and I don't need to capture exceptions everywhere (it's sufficient for me that the test fails completely if an exception is thrown), I'm more than eager to use these new features.
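To see why decomposition costs compilation time, here is a minimal sketch of the technique. It is not the code of Catch or doctest: decomposer, lhs_holder, report_equal and the two SKETCH_ macros are hypothetical names, and real frameworks do much more (all comparison operators, stringification of arbitrary types, exception handling).

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Formats "lhs == rhs" together with the outcome of the comparison.
template <typename L, typename R>
std::string report_equal(const L& lhs, const R& rhs) {
    std::ostringstream os;
    os << lhs << " == " << rhs << (lhs == rhs ? " (passed)" : " (failed)");
    return os.str();
}

// Holds the left-hand side until operator== supplies the right-hand side.
template <typename L>
struct lhs_holder {
    const L& lhs;

    template <typename R>
    std::string operator==(const R& rhs) const {
        return report_equal(lhs, rhs);
    }
};

// ->* binds tighter than ==, so "decomposer{} ->* a == 9" parses as
// "(decomposer{} ->* a) == 9" and both operands can be captured.
struct decomposer {
    template <typename L>
    lhs_holder<L> operator->*(const L& lhs) const {
        return lhs_holder<L>{lhs};
    }
};

// With decomposition: two extra template instantiations per operand type pair.
#define SKETCH_REQUIRE(expr) (decomposer{} ->* expr)
// Without decomposition: the operands are passed directly.
#define SKETCH_REQUIRE_EQUALS(a, b) report_equal((a), (b))

inline void usage_example() {
    int a = 9;
    assert(SKETCH_REQUIRE(a == 9) == "9 == 9 (passed)");
    assert(SKETCH_REQUIRE_EQUALS(a, 9) == "9 == 9 (passed)");
    assert(SKETCH_REQUIRE(a == 8) == "9 == 8 (failed)");
}
```

Both spellings report the same thing. The second one simply skips the holder types and the overloaded operators, and over thousands of assertions that is fewer templates for the compiler to instantiate.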

Results

For evaluation, I have compiled the complete test suite of ETL, with 1 thread, using gcc 4.9.3 with various different options, starting from Catch to doctest 1.1 with all compilation time features. Here are the results, in seconds:

Version	Time	VS Catch	VS Fast Catch	VS doctest 1.0
Catch	724.22
Fast Catch	464.52	-36%
doctest 1.0	871.54	+20%	+87%
doctest 1.1	614.67	-16%	+32%	-30%
REQUIRE_EQ	493.97	-32%	+6%	-43%
FAST_REQUIRE_EQ	439.09	-39%	-6%	-50%
SUPER_FAST_ASSERTS	411.11	-43%	-12%	-53%

As you can see, doctest 1.1 is much faster to compile than doctest 1.0! This is really great news. Moreover, it is already 16% faster than Catch. When all the features are used, doctest is 12% faster than my stripped down versions of Catch macros (and 43% faster than Catch standard macros). This is really cool! It means that I don't have to do any change in the code (no need to strip macros myself) and I can gain a lot of compilation time compared to the bare Catch framework.

I really think the author of doctest did a great job with the new version. Although this was not of as much interest for me, there are also a lot of other changes in the new version. You can consult the changelog if you want more information.

Conclusion

Overall, doctest 1.1 is much faster to compile than doctest 1.0. Moreover, it offers very fast macros for test assertions that are much faster to compile than Catch versions and even faster than the versions I created myself to reduce compilation time. I really thing this is a great advance for doctest. When compiling with all the optimizations, doctest 1.1 saves me 50 seconds in compilation time compared to the fast version of Catch macro and more than 5 minutes compared to the standard version of Catch macros.

I'll probably start using doctest on my development machine. For now, I'll keep Catch as well since I need it to generate the unit test reports in XML format for Sonarqube. Once this feature appears in doctest, I'll probably drop Catch from ETL and DLL.

If you need blazing fast compilation times for your unit tests, doctest 1.1 is probably the way to go.

Improve DLL and ETL Compile Time further

Baptiste Wicht — Fri, 29 Jan 2016 16:02:34 GMT

For a while, the compilation time of my matrix/vector computation library (ETL), based on Expression Templates, has become more and more problematic. I've already worked on this problem here and there, using some general techniques (pragmas, precompiled headers, header removals and so on). In this post, I'll talk about two major improvements I have been able to make directly in the code.

Use of static_if

Remember static_if ? I was able to use it to really reduce the compile time of DLL.

I wrote a script to time each test case of the DLL project to find the test cases that took the longest to compile. Once I found the best candidate, I isolated the functions that took the longest to compile. It was quite tedious and I did it by hand, primarily by commenting out parts of the code and going deeper and deeper in the code. I was quite surprised to find that a single function call (template function of course ;) ) was responsible for 60% of the compilation time of my candidate test case. The function was instantiating a whole bunch of expression templates (to compute the free energy of several models). The function itself was not really optimizable, but what was really interesting is that this function was only used in some very rare cases and that these cases were known at compile-time :) This was a perfect case to use a static_if. And once the call was inside the static_if, the test case was indeed about 60% faster. This reduced the overall compilation time of DLL by about 30%!
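To make the idea concrete, here is a minimal C++14 sketch of a static_if built on tag dispatch and a generic lambda. It is not the implementation used in DLL: the names identity, static_if_impl, energy_model, plain_model and report are all made up for this illustration. The point is only that the body of the lambda is never instantiated when the condition is false, so the expensive call costs nothing for the models that don't need it.

```cpp
#include <type_traits>
#include <utility>

// Identity functor: passing a value through it makes the expression depend
// on a template parameter of the generic lambda, so it is only checked
// when the lambda body is actually instantiated.
struct identity {
    template <typename T>
    T&& operator()(T&& value) const {
        return std::forward<T>(value);
    }
};

// Condition is true: call the lambda, which instantiates its body
template <typename Then>
void static_if_impl(std::true_type, Then&& then) {
    then(identity{});
}

// Condition is false: never call the lambda, its body is not instantiated
template <typename Then>
void static_if_impl(std::false_type, Then&&) {}

// Hypothetical static_if (illustration only, not the DLL version)
template <bool Condition, typename Then>
void static_if(Then&& then) {
    static_if_impl(std::integral_constant<bool, Condition>{}, std::forward<Then>(then));
}

// Hypothetical model that provides the expensive function
struct energy_model {
    static constexpr bool has_free_energy = true;
    double free_energy() const { return 42.0; }
};

// Hypothetical model that does not provide it at all
struct plain_model {
    static constexpr bool has_free_energy = false;
};

template <typename Model>
double report(const Model& model) {
    double result = 0.0;
    // f(model) is a dependent expression, so free_energy() is only looked up
    // for the models that take the true branch
    static_if<Model::has_free_energy>([&](auto f) {
        result = f(model).free_energy();
    });
    return result;
}
```

With this sketch, report(energy_model{}) goes through the lambda, while report(plain_model{}) compiles even though plain_model has no free_energy() member, because the lambda body is never instantiated for it.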

This could of course also have been achieved by using two functions, one with the call, one empty and selected by SFINAE (Substitution Failure Is Not An Error). I prefer the static_if version since this really shows the intent and hides SFINAE behind nicer syntax.
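For comparison, the SFINAE alternative could look like the following sketch. Again, the model types and the function name free_energy_or_zero are invented for the example and are not taken from DLL. Two overloads are declared and std::enable_if_t removes one of them from overload resolution depending on a compile-time flag.

```cpp
#include <type_traits>

// Hypothetical model that provides the expensive function
struct energy_model {
    static constexpr bool has_free_energy = true;
    double free_energy() const { return 42.0; }
};

// Hypothetical model that does not provide it
struct plain_model {
    static constexpr bool has_free_energy = false;
};

// Overload with the call: only viable when the flag is true
template <typename Model, std::enable_if_t<Model::has_free_energy, int> = 0>
double free_energy_or_zero(const Model& model) {
    return model.free_energy();
}

// Empty overload: only viable when the flag is false
template <typename Model, std::enable_if_t<!Model::has_free_energy, int> = 0>
double free_energy_or_zero(const Model&) {
    return 0.0;
}
```

The result is the same as with a static_if, but the condition is spread over two function signatures instead of being visible at the call site.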

I was also able to use static_if at other places in the DLL code to avoid instantiating some templates, but the improvements were much less dramatic (about 1% of the total compilation time). I was very lucky to find a single function that accounted for so much compile time. After some more tests, I concluded that much of the compilation time of DLL was spent compiling the Expression Templates from my ETL library so I decided to delve into ETL code directly.

Removal of std::async

The second improvement was very surprising. I was working on improving the compilation of ETL and found out that the sum and average reductions of matrices were dramatically slow to compile, about an order of magnitude slower than standard operations on matrices. In parallel (but the two facts are linked), I also found out another weird fact when splitting a file into 10 parts (the file was comprised of 10 test cases). Compiling the 10 parts separately (and sequentially, not in multiple threads) was about 40% faster than compiling the complete file. There was no swapping so it was not a memory issue. This is not expected. Generally, it is faster to compile a big file than to compile its parts separately. The advantage of smaller files is that you can compile them in parallel and that incremental builds are faster (only a small part needs to be compiled).

By elimination, I found out that most of the time was spent inside the function that was dispatching in parallel the work for accumulating the sum of a matrix. Here is the function:

template <typename T, typename Functor, typename AccFunctor>
inline void dispatch_1d_acc(bool p, Functor&& functor, AccFunctor&& acc_functor, std::size_t first, std::size_t last){
    if(p){
        std::vector<std::future<T>> futures(threads - 1);

        auto n = last - first;
        auto batch = n / threads;

        for(std::size_t t = 0; t < threads - 1; ++t){
            futures[t] = std::async(std::launch::async, functor, first + t * batch, first + (t+1) * batch);
        }

        acc_functor(functor(first + (threads - 1) * batch, last));

        for(auto& fut : futures){
            acc_functor(fut.get());
        }
    } else {
        acc_functor(functor(first, last));
    }
}

There isn't anything really fancy about this function. It takes one functor that will be run in parallel and one functor for accumulation. It dispatches all the work in batches and then accumulates the results. I tried several things to optimize the compilation time of this function, but nothing worked. The line that was consuming all the time was the std::async line. This function was using std::async because the thread pool that I'm generally using does not support returning values from parallel functors. I decided to use a workaround and use my thread pool and I came up with this version:

template <typename T, typename Functor, typename AccFunctor>
inline void dispatch_1d_acc(bool p, Functor&& functor, AccFunctor&& acc_functor, std::size_t first, std::size_t last){
    if(p){
        std::vector<T> futures(threads - 1);
        cpp::default_thread_pool<> pool(threads - 1);

        auto n = last - first;
        auto batch = n / threads;

        auto sub_functor = [&futures, &functor](std::size_t t, std::size_t first, std::size_t last){
            futures[t] = functor(first, last);
        };

        for(std::size_t t = 0; t < threads - 1; ++t){
            pool.do_task(sub_functor, t, first + t * batch, first + (t+1) * batch);
        }

        acc_functor(functor(first + (threads - 1) * batch, last));

        pool.wait();

        for(auto fut : futures){
            acc_functor(fut);
        }
    } else {
        acc_functor(functor(first, last));
    }
}

I simply preallocate space for the results of all the threads and create a new functor that calls the input functor and saves its result inside the vector. It is less nice, but it works well. And it compiles MUCH faster. This reduced the compilation time of my biggest test case by a factor of 8 (from 344 seconds to 44 seconds). This is really crazy. It also fixed the problem where splitting the test case was faster than compiling the big file (it is now twice as fast to compile the big file as to compile all the small files separately). This reduced the total compilation time of DLL by about 400%.
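Both versions split the range in the same way, and that arithmetic can be checked in isolation. The helper below is hypothetical (batch_ranges is not part of ETL) and only lists the [first, last) ranges that would be handed out for a given number of threads, without running anything in parallel. It assumes threads is at least 1.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: reproduces the batching arithmetic of dispatch_1d_acc
// and returns the [first, last) range of each batch. Requires threads >= 1.
std::vector<std::pair<std::size_t, std::size_t>> batch_ranges(std::size_t first, std::size_t last, std::size_t threads) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;

    const std::size_t n     = last - first;
    const std::size_t batch = n / threads;

    // The first threads - 1 batches each contain exactly `batch` elements
    for (std::size_t t = 0; t + 1 < threads; ++t) {
        ranges.emplace_back(first + t * batch, first + (t + 1) * batch);
    }

    // The last batch also takes the remainder of the integer division
    ranges.emplace_back(first + (threads - 1) * batch, last);

    return ranges;
}
```

For instance, batch_ranges(0, 10, 4) gives the ranges [0, 2), [2, 4), [4, 6) and [6, 10): the last batch, which is the one processed by the calling thread in the functions above, absorbs the two extra elements.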

As of now, I still have no idea why this makes such a big difference. I have looked at the std::async code, but I haven't found a valid reason for this slowdown. If someone has any idea, I'd be very glad to discuss in the comments below.

Improving the template instantiation tree

I recently discovered the templight tool, which is a profiler for templates (pretty cool). After some time, I was able to build it and use it on ETL. For now, I haven't been able to reduce compile time a lot, but I have been able to reduce the template instantiation tree a lot: I saw that some instantiations were completely useless and I optimized the code to remove them.

I won't go into much detail here because I plan to write a post on this subject in the coming days.

Conclusion

In conclusion, I would say that it is pretty hard to improve the compile time of complex C++ programs once you have gone through all the standard methods. However, I was very happy to find that two optimizations in the source code reduced the overall compilation time of DLL by almost 500%. I will continue working on this, but for now, the compilation time is much more reasonable.

I hope the two main facts in this article were interesting. If you have similar experience, comments or ideas for further improvements, I'd be glad to discuss them with you in the comments :)

Improve ETL compile-time with Precompiled Headers

Baptiste Wicht — Sat, 20 Jun 2015 13:08:31 GMT

Very recently, I started trying to improve the compile-time of the ETL test suite. While not critical, it is always better to have tests that compile as fast as possible. In a previous post, I was able to improve the time a bit by improving the makefile, using #pragma once and avoiding <iostream> headers. With these techniques, I reduced the compile-time from 87.5 to 84.1 seconds, which is not bad, but not as good as I would have expected.

In the previous post, I had not tried to use Precompiled Headers (PCH) to improve the compile time, so I thought it would be a good time to do it.

Precompiled Headers

Precompiled Headers are an option of the compiler, where one header gets compiled. Normally, you only compile source files into object files, but you can also compile headers, although it is not the same thing. When a compiler compiles a header, it can do a lot of preprocessing (macros, includes, AST, symbols) and then store all the results into a precompiled header file. Once you compile the source files, the compiler will try to use the precompiled header file instead of the real header file. Of course, this can break the C++ standard, since with that a header cannot have different behaviour based on macros, for instance. For these reasons (and probably implementation reasons as well), precompiled headers are really limited.

If we take the case of G++, G++ will consider the precompiled header file instead of the standard header only if (for a complete list, take a look at the GCC docs):

The compilation flags are the same between the two compilations
The same compiler binary is used for the compilations
Only one precompiled header can be used in each compilation
The same macros must be defined
The include of the header must be before every possible C/C++ token

If all these conditions are met and you try to #include "header.hpp" and there is a header.hpp.gch (the precompiled file) available in the search path, then the precompiled header will be taken instead of the standard one.

With clang, it is a bit different because the precompiled header cannot be included automatically, but has to be included explicitly in the source code, meaning you have to modify your code for this technique to work. This is a bad thing in my opinion: you should never have to modify your code to profit from a compiler feature. This is why I haven't used and don't plan to use precompiled headers with clang.

How-to

Once you know all the conditions for a precompiled header to be automatically included, it is quite straightforward to use them.

To generate a PCH file is easy:

g++ options header.hpp

This will generate header.hpp.gch. When you compile a source file using header.hpp, you don't have anything to do: you just have to compile it as usual and, if all the conditions are met, the PCH file will be used instead of the other header.

Results and conclusion

I added precompiled header support into my make-utils collection of Makefile utilities and tested it on ETL. I have precompiled a header that itself included Catch and ETL. Almost all test files are including this header. With this change, I went from 84 seconds to 78 seconds. Headers are taking 1.5 seconds to be precompiled. This is a nice result I think. If your application is not as template-heavy as mine or if you have more source files, you should expect better improvements.

To conclude, even if precompiled headers are a sound way to reduce compile-time, they are really limited to some cases. I'm not a fan of the feature overall. It is not portable between compilers and not standard. Anyway, if you are really in need of saving some time, you should not hesitate too much ;)

How I improved (a bit) compile time of ETL ?

Baptiste Wicht — Tue, 16 Jun 2015 20:00:21 GMT

Recently I read several articles about C++ and compile time and I wondered if I could improve the compile time of my Expression Template Library (ETL) project. ETL is a header-only and template-heavy library. I'm not going to change the design completely or use type erasure techniques to reduce the compile time: ETL is all about performance.

As a disclaimer, don't expect fancy results from this post, I haven't been able to reduce compile time a lot, but I still wanted to share my experience.

I've used g++-4.9.2 to perform these tests.

I'm compiling the complete test suite (around 6900 source lines of code in 36 files) in release mode. Each test file includes the ETL (around 10K SLOC). Each build is run with 8 threads (make -j8). For each result, I have run a complete build 5 times and taken the best result as the final result. Everything is run on an SSD and I have more than enough RAM to handle all the compilation in parallel.

The reference build time was 87.5 seconds.

Compile and generate dependency files at the same time

To help write my makefiles, I'm using a set of functions that I have written. This includes automatic dependency generation using the -MM -MT options of the compiler. Until now, I had two targets, one to compile the cpp file into the object file and another one to generate the dependency file. I recently saw that compilers were able to do both at the same time! Clang, G++ and the Intel compiler all have -MD -MF options that let you generate the dependency file at the same time you compile your file, saving you at least one read of the file.

My compilation rule in my makefile has now become:

release/$(1)/%.cpp.o: $(1)/%.cpp
    @ mkdir -p release/$(1)/
    $(CXX) $(CXX_FLAGS) $(RELEASE_FLAGS) $(2) -MD -MF release/$(1)/$$*.cpp.d -o release/$(1)/$$*.cpp.o -c $(1)/$$*.cpp
    @ sed -i -e 's@^\(.*\)\.o:@\1.d \1.o:@' release/$(1)/$$*.cpp.d

This reduced the compilation time to 86.8 seconds. Not that much of a reduction, but it still is quite nice to know. I would have expected this to reduce the compile time more.

Use #pragma once

Normally, I'm not a fan of #pragma since it is not standard, but for now ETL only supports three compilers and only very recent versions of them, so I have the guarantee that #pragma once is available, so what the hell!

I've replaced all the include guards by single #pragma once directives.

Again, the results are not impressive, this reduced the compile time to 86.2 seconds. I would only advise to use this if you are sure of the compilers you want to support and you need the extra time.

Avoid <iostream>

I've read that the <iostream> header was one of the slowest to compile of the STL. It is included several times in my headers, only for stream operators, and it turns out that there is an <iosfwd> header that forward declares a lot of things from <iostream> and the other I/O headers.

By replacing all <iostream> includes by <iosfwd>, compile time has gone down to 84.1 seconds.
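As an illustration of the principle, here is a small sketch of a declaration that only needs <iosfwd>. The vec2 type and the to_string helper are invented for the example and have nothing to do with ETL, and the split is shown in a single listing while in practice the two halves would live in different files. A forward declaration of std::ostream is enough to declare a stream operator, and the full stream headers are only needed where the operator is defined or used.

```cpp
// --- Part that would live in the header: forward declarations are enough ---
#include <iosfwd>

struct vec2 {
    double x;
    double y;
};

// std::ostream is only forward declared here, which is sufficient
// for a reference parameter and a reference return type
std::ostream& operator<<(std::ostream& os, const vec2& v);

// --- Part that would live in a source file: full definitions are needed ---
#include <ostream>
#include <sstream>
#include <string>

std::ostream& operator<<(std::ostream& os, const vec2& v) {
    return os << "(" << v.x << ", " << v.y << ")";
}

// Small helper used to observe the output of the operator
std::string to_string(const vec2& v) {
    std::ostringstream ss;
    ss << v;
    return ss.str();
}
```

Every file that includes only the header part avoids parsing the full stream machinery. For templates that are defined directly in headers, the situation is less simple, since the body of the operator generally needs the complete stream type.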

Conclusion

By using the three techniques, I've reduced the compile time from 87.5 to 84.1 seconds. I would have honestly hoped for more improvements, but this is already a good start.

As a side note, the clang compile time is 45.2 seconds under the same conditions (it was 46.2 seconds before the optimizations). It is really much faster :) I'm still using GCC a lot since in several cases it does generate much better code, and on average the generated code is faster (on my benchmarks at least). I don't have the numbers for icc, but icc is definitely the slowest of the three. When I have it available (at work), I use it for release builds before running something. The generated executables are generally faster (I only use Intel processors) and sometimes the difference can be quite important.

If you have ideas to reduce further the compile time on this test case, I'd be glad to hear them and put them to the test.

I hope that this small experience will be helpful to some of you :)

Other techniques

There are several other techniques that you can use to reduce compile time:

Precompiled Headers are supported by both Clang and GCC, although not in a compatible way. I haven't tested this in a while, but it is quite effective and a very interesting technique. The main problem with this is that it is not standard and not compatible between compilers. But it probably is the most efficient technique when you have lots of headers and lots of templates as in my case.
Unity builds can make full rebuilds much faster. I personally don't like unity builds, especially because they are only really good for full builds and you generally don't do full rebuilds that much (I know, I know, this is also the test done in this article :) ). Moreover, they also suck at doing parallel builds.
Pimpl idioms and other type erasure techniques can reduce compile time a lot. If it is well done, it can be implemented without so much overhead.
Explicit instantiation of templates can also help, but only in the case of a user program. In the case of a library itself, you cannot do anything.
Reduce inclusions and use forward declarations, obviously...
Use tools like distcc (I very rarely use it) and ccache (I generally use it).
Update your compiler
Upgrade your computer ;)
...

GCC 4.7 vs CLang 3.1 on eddic

Baptiste Wicht — Mon, 12 Nov 2012 08:28:44 GMT

Now that eddic can be compiled with CLang, I wanted to compare the differences in compilation time and in performance of the generated executable between those two compilers. The tests are done using GCC 4.7.2 and CLang 3.1 on Gentoo.

Compilation Time

The first thing that I tested has been the compilation time of the two compilers to compile eddic with different flags. I tested the compilation in debug mode and with -O2 and -O3.

The most interesting fact in these results is that CLang is much faster than GCC. It takes half the time to compile eddic with CLang in debug mode compared to GCC. The impact of optimizations on compilation time is also larger for CLang than for GCC. For both compilers, -O3 does not seem to add a lot of overhead.

Runtime performance

Then, I tested the performance of the generated executable. I tested it on three things, the whole test suite and two test cases that I know are the slowest for the EDDI Compiler. For each case, I took the slowest value of 5 consecutive executions.

The differences are very small. In -O2, GCC performs a bit better, but in -O3, the performance is equivalent. I was a bit disappointed by the results, because I thought that there would be bigger differences. It seems that CLang is not as far from GCC as some people would like to say. It also certainly depends on the program being compiled.

Conclusion

It is clear that CLang is much faster than GCC to compile eddic. Moreover, the performance of the generated executables is almost the same.

I will continue to use CLang as my development compiler and switch between the two when I'm doing performance benchmarking. I will try to update the benchmark once new versions of GCC / CLang are available.

eddic compiles with CLang 3.1

Baptiste Wicht — Thu, 01 Nov 2012 08:11:05 GMT

I finally added support for compiling eddic with LLVM CLang 3.1 !

The current development version can be completely compiled with CLang. Starting with version 1.1.4, all versions of eddic will support GCC and CLang.

The changes have not been as painful as I first thought.

The main problem that I had was about a static const variable of a class that had no user-defined constructor. GCC allows that, but it is not standard compliant and CLang was complaining.
Another problem that I encountered was about the use of bit flags and Template Meta Programming. I simplified that by using a simple type trait and it worked. I don't really know why this did not work at first.
The remaining effort was to fix the several warnings that CLang had.
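The first problem of the list can be reproduced with a few lines. This is only a guess at what the situation looked like: the flags and config types are invented and are not taken from the eddic sources. A const object of a class type without a user-provided default constructor must have an explicit initializer, otherwise the declaration is ill-formed.

```cpp
// Hypothetical type without any user-provided constructor
struct flags {
    int value;
};

struct config {
    // Declaration of a static const member of that type
    static const flags defaults;
};

// Without an initializer, this definition is a default-initialization of a
// const object of a class type with no user-provided default constructor.
// This is the form that CLang complains about:
//
//     const flags config::defaults;
//
// An explicit (empty) initializer is accepted by both compilers
const flags config::defaults = {};
```

With the empty initializer, the object is value-initialized, so config::defaults.value is 0. Another way to fix it is to add a user-provided default constructor to the class.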

CLang also fixed a bug in my code with a warning on an assignment that was not supposed to be an assignment, thanks CLang.

The most interesting fact about CLang is that it is twice as fast as GCC at building eddic. I think I'm gonna use it during development to speed up the compile time. Moreover, even if I only worked two days with it, it seems that the error messages are indeed better than GCC's.

I haven't tried to compare the performances of eddic in both cases, but I will do that in the future, soon after the 1.1.4 version is released.

I tried the CLang static analyzer on eddic but it didn't find any bugs. Moreover, it crashed on several of my files. I haven't found out why for now, but I will continue to investigate, perhaps I'm not using it correctly.

I expect to publish the next version of eddic in the next two weeks. This version has many more improvements than I thought at first and I have less time to work now that I'm working on my Master thesis.

More information on CLang: The official site.

Back in Berkeley, California

Baptiste Wicht — Thu, 13 Sep 2012 08:35:43 GMT

I arrived yesterday in Berkeley, California.

Just like I did my Bachelor thesis in Lawrence Berkeley National Laboratory (LBNL), I will do my Master Thesis there too. The thesis will last a bit less than a semester.

During my Master Thesis I will try to use profiling samples from the Linux perf tools in GCC or Clang to optimize processor cache usage (avoid cache and page faults).

I will try to publish some posts about that during the semester if I have time.

Install the Insight Debugger on Linux Mint (works for Ubuntu too)

Baptiste Wicht — Thu, 26 Jan 2012 08:28:41 GMT

Insight is a very good debugger based on gdb. I prefer it over ddd or kdbg as I find it clearer and easier to use. Moreover, this debugger is also the one used in the book Assembly language Step by Step, for Linux. However, Insight has been removed from Debian packages already more than a year ago.

But, thanks to SevenMachines, a PPA repository is available to install it on Linux Mint (works also on Ubuntu and Ubuntu-based Linux distributions).

To add the repository to your apt sources, add the following lines to the /etc/apt/sources.list file:

deb http://ppa.launchpad.net/sevenmachines/dev/ubuntu natty main 
deb-src http://ppa.launchpad.net/sevenmachines/dev/ubuntu natty main

and update your apt sources:

sudo apt-get update

Then you can install insight:

sudo apt-get install insight

And now you are ready to use Insight as your debugger.

If you don't trust this PPA repository, you can also try to install it from the sources (http://sources.redhat.com/insight/), but it doesn't seem to be very simple to install. I wasn't able to build it on my Linux Mint 12.

Diploma Thesis : Inlining Assistance for large-scale object-oriented applications

Baptiste Wicht — Mon, 03 Oct 2011 06:44:17 GMT

One month ago, my diploma thesis has been accepted and I got my Bachelor of Science in Computer Science.

I did my diploma thesis at Lawrence Berkeley National Laboratory, Berkeley, California. I was in the team responsible for the development of the ATLAS Software for the LHC at CERN. The title of my thesis is Inlining Assistance for large-scale object-oriented applications.

The goal of this project was to create a C++ analyzer to find the best functions and call sites to inline. The input of the analyzer is a call graph generated by CallGrind of the Valgrind project.

The functions and call sites to inline are computed using a heuristic, called the temperature. This heuristic is based on the cost of calling the given function, the frequency of calls and the size of the function. The cost of calling a function is based on the number of parameters, the virtuality of the function and the shared object the function is located in.

The analyzer is also able to find clusters of call sites. A cluster is a set of hot call sites related to each other. It can also find the functions that should be moved from one library to another or the functions that should not be virtual by testing the use of each function in a class hierarchy.

To achieve this project, it has been necessary to study in detail how a function is called on the Linux platform. The inlining optimization has also been studied to know what the advantages and the problems of this technique were.

To retrieve the information about the sizes and the virtuality of the functions, it has been necessary to read the shared libraries and executable files. For that, we used libelf. The virtuality of a function is calculated by reading each virtual table and searching for the function in the virtual table's content.

The graph manipulation is done with the Boost Graph Library. As it is an advanced library, it has helped me improve my skills in specific topics like templates, traits or Template Metaprogramming.

The analyzer is able to run on the Linux platform on any program that has been compiled using gcc.

How to install a specific version of GCC on Ubuntu 11.04 (natty)

Baptiste Wicht — Fri, 17 Jun 2011 06:18:29 GMT

Sometimes you need to install a specific version of gcc for some reasons, for example when you need to have the same compiler version as the one used by your team.

In that case, the package manager doesn't help because not every version of gcc is packaged in every version of Ubuntu. So you must install it by hand, which can take a little time, and there are some things that have to be done for it to work.

I'm talking here about Ubuntu 11.04 (natty), because this is the version of Ubuntu I did the installation on. This procedure will certainly work on other versions, but you could have a problem with some dependencies that are installed in natty and not in your version or, on the contrary, have a dependency already installed.

So this article will detail every step to install a specific version of gcc