
C++ Containers Benchmark: vector/list/deque and plf::colony

More than three years ago, I wrote a benchmark of some of the STL containers, namely the vector, the list and the deque. Since this article was very popular, I decided to improve the benchmarks and collect all the results again. There are now more benchmarks and some problems have been fixed in the benchmark code. Moreover, I have also added a new container, the plf::colony. Therefore, there are four containers tested:

  • The std::vector: This is a dynamically-resized array of elements. All the elements are contiguous in memory. If an element is inserted or removed at a position other than the end, the following elements will be moved to fill the gap or to open a gap. Elements can be accessed at a random position in constant time. The array is not resized at each insert operation; instead, it is resized so that it can hold several more elements. This means that insertion at the end of the container is done in amortized constant time.
  • The std::deque: The deque is a container that offers constant time insertion both at the front and at the back of the collection. In current C++ libraries, it is implemented as a collection of dynamically allocated fixed-size arrays. Not all elements are contiguous, but depending on the size of the data type, this still has good data locality. Access to a random element is also done in constant time, but with more overhead than the vector. For insertions and removals at random positions, the elements are shifted either to the front or to the back, meaning that it is generally faster than the vector, by a factor of two on average.
  • The std::list: This is a doubly-linked list. It supports constant time insertions at any position of the collection. However, it does not support constant time random access. The elements are obviously not contiguous, since they are all allocated in nodes. For small elements, this collection has a very big memory overhead.
  • The plf::colony: This is a non-standard container which is unordered, meaning that the insertion order will not necessarily be preserved. It provides strong iterator guarantees: pointers to non-erased elements are not invalidated by insertion or erasure. It is especially tailored for high-insertion/erasure workloads. Moreover, it is also specially optimized for non-scalar types, namely structs and classes with relatively large data size (greater than 128 bits according to the official documentation). Its implementation is more complicated than the other containers. It is also implemented as a list of memory blocks, but they are of increasingly large sizes. When elements are erased, their position is not removed, but marked as erased so that it can be reused for fast insertion later on. This container uses the same conventions as the standard containers and was proposed for inclusion in the standard library, which is the main reason why it's included in this benchmark. If you want more information, you can consult the official website.
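To make the properties listed above a bit more concrete, here is a minimal sketch using only the three standard containers (plf::colony is left out since it is not part of the standard library). The function names are hypothetical helpers written for this illustration and are not part of the benchmark code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <iterator>
#include <list>
#include <vector>

// Counts how many times a vector reallocates while n elements are appended.
// Because the capacity grows by more than one element at a time, the count
// stays far below n, which is what makes push_back amortized constant time.
inline std::size_t count_reallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return reallocations;
}

// Inserting in the middle of a list does not move any other element,
// so an iterator to an existing element stays valid.
inline bool list_insert_keeps_iterators() {
    std::list<int> l{1, 2, 4};
    auto last = std::prev(l.end()); // iterator to the element 4
    l.insert(last, 3);              // constant-time insertion before 4
    return *last == 4 && l.size() == 4;
}

// A deque supports insertion at both ends and random access.
inline int deque_both_ends() {
    std::deque<int> d{2, 3};
    d.push_front(1);
    d.push_back(4);
    return d[0] + d[3]; // first element plus last element
}
```

The exact number of reallocations depends on the growth factor of the standard library in use, so the sketch only relies on it being much smaller than the number of insertions.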

In the text and results, the namespaces will be omitted. Note that I have only included sequence containers in my test. These are the most common containers in practice and these are also the containers I'm the most familiar with. I could have included multiset in this benchmark, but the interface and purpose being different, I didn't want the benchmark to be confusing.

All the examples are compiled with g++-4.9.4 (-std=c++11 -march=native -O2) and run on a Gentoo Linux machine with an Intel Core i7-4770 at 3.4GHz.

For each graph, the vertical axis represents the amount of time necessary to perform the operations, so lower values are better. The horizontal axis is always the number of elements of the collection. For some graphs, a logarithmic scale could be clearer; a button is available after each graph to change the vertical scale to a logarithmic scale.

The tests are done with several different data types. The trivial data types vary in size: they hold an array of longs and the size of the array varies to change the size of the data type. The non-trivial data type is composed of a string, just long enough to avoid SSO (Small String Optimization), even though I'm using GCC. The non-trivial data type comes in a second version with noexcept move operations. In order to keep this article relatively short (it's already probably too long :P), not all results are presented for each data type when there are no significant differences between them.
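The data types described above could look roughly like the following sketch. The names Trivial, NonTrivialString, NonTrivialStringNoexcept and moves_on_vector_growth are assumptions made for this illustration, not necessarily the ones used in the benchmark code. The last helper mirrors the rule used by std::move_if_noexcept, which explains why the noexcept version can matter when a vector grows.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Trivial type whose size is controlled by the number of longs it holds.
template <std::size_t N>
struct Trivial {
    std::array<long, N> data;
};

// Non-trivial type: owns a string long enough to be stored on the heap.
// The move operations are user-provided and NOT marked noexcept.
struct NonTrivialString {
    std::string value{"a string that is long enough to defeat the small string optimization"};

    NonTrivialString() = default;
    NonTrivialString(const NonTrivialString&) = default;
    NonTrivialString& operator=(const NonTrivialString&) = default;
    NonTrivialString(NonTrivialString&& rhs) : value(std::move(rhs.value)) {}
    NonTrivialString& operator=(NonTrivialString&& rhs) {
        value = std::move(rhs.value);
        return *this;
    }
};

// Same type, but the move operations promise not to throw.
struct NonTrivialStringNoexcept {
    std::string value{"a string that is long enough to defeat the small string optimization"};

    NonTrivialStringNoexcept() = default;
    NonTrivialStringNoexcept(const NonTrivialStringNoexcept&) = default;
    NonTrivialStringNoexcept& operator=(const NonTrivialStringNoexcept&) = default;
    NonTrivialStringNoexcept(NonTrivialStringNoexcept&& rhs) noexcept : value(std::move(rhs.value)) {}
    NonTrivialStringNoexcept& operator=(NonTrivialStringNoexcept&& rhs) noexcept {
        value = std::move(rhs.value);
        return *this;
    }
};

// Same condition as std::move_if_noexcept: elements are moved during a
// reallocation only if moving cannot throw (or if copying is impossible).
template <typename T>
constexpr bool moves_on_vector_growth() {
    return std::is_nothrow_move_constructible<T>::value || !std::is_copy_constructible<T>::value;
}

static_assert(std::is_trivially_copyable<Trivial<16>>::value, "Trivial<N> can be copied with memcpy");
static_assert(!std::is_nothrow_move_constructible<NonTrivialString>::value, "move may throw");
static_assert(std::is_nothrow_move_constructible<NonTrivialStringNoexcept>::value, "move cannot throw");
```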



Speed up TensorFlow inference by compiling it from source

The simplest way to install TensorFlow is to work in a virtual Python environment and simply use either the official TensorFlow packages in pip or one of the official wheels for distributions. There is one big problem with that technique: the binaries are precompiled so that they fit as many hardware configurations as possible. This is understandable on Google's part since generating precompiled binaries for all the possible combinations of processor capabilities would be a nightmare. This is not a problem for GPU since the CUDA libraries will take care of the differences from one graphics card to another. But it is a problem for CPU performance. Indeed, different processors have different capabilities. For instance, the vectorization capabilities are different from processor to processor (SSE, AVX, AVX2, AVX-512F, FMA, ...). All those options can make a significant difference in the performance of the programs. Although most machine learning training occurs on GPU, inference is mostly done on the CPU. Therefore, it probably remains important to be as fast as possible on CPU.

So if you care about performance on CPU, you should install TensorFlow from sources directly yourself. This will allow compilation of the TensorFlow sources with -march=native, which will enable all the hardware capabilities of the machine on which you are compiling the library.
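As a side illustration of what -march=native changes, the small C++ sketch below reports which vector extensions a translation unit was compiled for. It relies on the macros that GCC and Clang predefine according to the -march and -m flags; the function name is a hypothetical helper and this is not part of the TensorFlow build itself.

```cpp
#include <cassert>
#include <string>

// Returns the vector extensions enabled at compile time for this file.
// With generic flags only the baseline is reported; with -march=native
// the list reflects what the compiling machine supports.
inline std::string enabled_vector_extensions() {
    std::string result;
#ifdef __SSE2__
    result += "SSE2 ";
#endif
#ifdef __AVX__
    result += "AVX ";
#endif
#ifdef __AVX2__
    result += "AVX2 ";
#endif
#ifdef __FMA__
    result += "FMA ";
#endif
#ifdef __AVX512F__
    result += "AVX-512F ";
#endif
    assert(result.size() < 64); // at most five short labels are appended
    return result.empty() ? "none detected" : result;
}
```

Compiling this once with -O2 and once with -O2 -march=native and printing the result gives a quick idea of which capabilities a generic binary leaves unused on a given machine.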

Depending on your problem, this may give you some nice speedup. In my case, on a very small Recurrent Neural Network, it made inference about 20% faster. On a larger problem and depending on your processor, you may gain much more than that. If you are training on CPU, this may make a very large difference in total time.

Installing TensorFlow is sometimes a bit cumbersome. You'll likely have to compile Bazel from sources as well and depending on your processor, it may take a long time to finish. Nevertheless, I have successfully compiled TensorFlow from sources on several machines now without too many problems. Just pay close attention to the options you are setting while configuring TensorFlow, for instance CUDA configuration if you want GPU support.

I hope this little trick will help you gain some time :)

Here is the link to compile TensorFlow from source.


Update on Expression Templates Library (ETL)

It's been a while since I released version 1.0 of ETL. There is some work to do before I release the next version, but I wanted to give you a quick update on what has been going on for ETL in the last months. There have been a lot of changes in the library and the next version will be a major update when I'm done with some refactorings and improvements.

Thanks to my thesis supervisor, the project now has a logo:

ETL Logo

There are quite a few new features, although probably nothing really major. The support for square root has been improved with cubic root and inverse root. Vectors can now be transformed using floor and ceil. Cross product of vectors has been implemented as well. Batched outer product and batched bias averaging (for machine learning) are now supported. Reductions have also been improved with absolute sum and mean (asum/asum) support and min_index and max_index. argmax can now be used to get the max index in each sub dimension. Matrices can now be decomposed with a Q/R decomposition rather than only with a PALU decomposition. Matrices can now be sliced to get only a sub part of the matrix. The pooling operators have also been improved with stride and padding support. Matrices and vectors can also be shuffled. Moreover, a few adapters are now available for hermitian matrices, symmetric matrices and lower and upper matrices. So far the support for these adapters is not huge, but they are guaranteed to validate their constraints.

Several operations have been optimized for speed. All the pooling and upsample operators are now parallelized and the most used kernel (2x2 pooling) is now more optimized. 4D convolution kernels (for machine learning) have been greatly improved. There are now very specialized vectorized kernels for classic kernel configurations (for instance 3x3 or 5x5) and the selection of implementations is now smarter than before. The support of padding is now much better than before for small amounts of padding. Moreover, for small kernels the full convolution can now be evaluated using the valid convolution kernels directly with some padding, for much faster overall performance. Matrix-matrix multiplication with transposed matrices is now much faster when using BLAS kernels. Indeed, the transposition is not performed but handled inside the kernels. Moreover, the transposition itself is also much faster. Finally, access to 3D and 4D matrices is now much faster than before.

The parallelization feature of ETL has been completely reworked. Before, there was a thread pool for each algorithm that was parallelized. Now, there is a global thread engine with one thread pool. Since parallelization is not nested in ETL, this improves performance slightly by greatly diminishing the number of threads that are created throughout an application.

Vectorization has also been greatly improved in ETL. Integer operations are now automatically vectorized on processors that support this. The automatic vectorizer now is able to use non-temporal stores for very large operations. A non-temporal store bypasses the cache, thus gaining some time. Since very large matrices do not fit in cache, this is a net gain. Moreover, the alignment detection in the automatic vectorizer has also been improved. Support for Fused-Multiply-Add (FMA) operations has also been integrated in the algorithms that can make use of it. The matrix-matrix multiplications and vector-matrix multiplications now have optimized vectorized kernels. They also have versions for column-major matrices now. The old egblas version of the gemm, based on BLIS kernels, has been removed since it was only supporting double-precision and was not faster than the new vectorized algorithm. I plan to reintegrate a version of the GEMM based on BLIS in the future but with more optimizations and support for all precisions and integers. The sum and the dot product now also have specialized vectorized implementations. The min and max operations are now automatically-vectorized.
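To illustrate what a fused multiply-add brings, here is a minimal scalar sketch of a dot product built on std::fma, which computes a * b + c with a single rounding and maps to one FMA instruction on hardware that supports it. This is only an illustration of the operation; it is not ETL's implementation, which uses vectorized kernels, and dot_fma is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product where each step is a fused multiply-add:
// acc = a[i] * b[i] + acc, rounded only once per step.
inline double dot_fma(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc = std::fma(a[i], b[i], acc);
    }
    return acc;
}
```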

The GPU support has also been almost completely reworked. Now, operations can be chained without any copies between GPU and CPU. Several new operations have also been added with GPU support. Moreover, to complement operations that are not available in any of the supported NVIDIA libraries, I've created a simple library that can be used to add a few more GPU operations. Nevertheless, a lot of operations are still missing and only algorithms are available on GPU, not expressions (such as c = a + b * 1.0), which are entirely computed on CPU. I have plans to improve that further, but probably not before version 1.2.

There have also been a lot of refactorings in the code of the library. A lot of expressions now have less overhead and are specialized for performance. Moreover, temporary expressions are currently being reworked in order to be simpler, more maintainable and easier to optimize in the future.

Finally, there were also quite a few bug fixes. Most of them have been found by the use of the library in the Deep Learning Library (DLL) project.


Home Automation: Power Meter and Integration of Zwave into Domoticz

I've improved my home automation installation a bit. It's been a while since the last upgrade, but unfortunately I cannot afford as many upgrades as I would like :P

For a long time, I have wanted to monitor the power consumption of a few appliances in my apartment, especially my Linux servers, so that I could try to improve the consumption and reduce my bill in the long run. Unfortunately, there are very few options for power meters in Switzerland due to the special type of plug we have. The only option I found is a ZWave power plug. For a while, I waited to see if I could find other options because ZWave devices are quite expensive and I would have preferred to stay with simpler and cheaper RF-433 appliances. Since I didn't find anything, I ordered a ZWave USB controller from Aeon Labs (the generation 5). I also ordered two Aeon Labs Swiss Smart Plugs with power meter.

Here is an image of the Aeon Labs key:

Aeon Labs ZWave USB Key

And of the power meter in use:

ZWave power meter

Integration of ZWave into Domoticz was extremely easy. I just plugged in the USB key, restarted Domoticz (this seems necessary for it to pick up the new tty) and added new hardware "OpenZWave USB" with the correct serial port. From there, there are two main ways to add new devices. The first is to remove the USB key and use the synchronization button on both the key and the device, close to each other. The other way is to use the "Include Node" option in Domoticz and then press the synchronization button on the device to detect the new device. I used the second option since it seemed simpler and it worked perfectly. I did that for my two plugs and it worked fine. Directly after this, 5 new devices were added for each of the plugs: one for the voltage, one for the current, two for the usage (I don't know why there are two, but they are both reporting the same value) and one for the on/off switch. I was a bit afraid that only the on/off part of the smart plug would work in Domoticz, but I had absolutely no problem.

Here is, for instance, the power usage of my television system over the last 24 hours:

Power usage on television system

For now, I haven't integrated this information in any rule, but I plan to monitor this information in the coming weeks and try to improve my consumption, especially for my servers. I also plan to purchase more of these plugs once my home automation budget can be upgraded.

On another note, I also purchased a Chacon wall remote switch working in RF-433. Although it is quite cheap, I'm very disappointed by the quality of this switch. I had to straighten the pins that are attached to the battery myself because there was no contact. After that, it worked correctly and it is able to work with the RFLink module.

I have to say that I'm quite satisfied with ZWave devices after this experience. Even though I still feel they are way too expensive, they are high quality and have a good finish. I'll probably purchase more ZWave devices in the future. I'm especially interested in the Aeotec 6 in 1 sensor for temperature, humidity, motion, light, UV and vibration. This would allow me to have a lot of information in each room with only one sensor in place of several sensors in each room like I currently have.

I still have a few Milight Bulbs and LEDS to install with a secondary Milight bridge that I will install in the coming week, but I probably won't do a post about this.


Publications: Deep Learning Features for Handwritten Keyword Spotting

After my previous post about my publication on CPU performance optimization, I wanted to talk a bit about two publications on Handwritten Keyword Spotting, in which we extract features with Convolutional RBMs.

We published two different papers:

  • Keyword Spotting With Convolutional Deep Belief Networks and Dynamic Time Warping, in the Proceedings of the International Conference on Artificial Neural Networks (ICANN-2016), Barcelona, Spain
  • Mixed Handwritten and printed digit recognition in Sudoku With Convolutional Deep Belief Network (Link will come), in the Proceedings of the International Conference on Pattern Recognition (ICPR-2016), Cancun, Mexico

The second paper is mostly a large extension of the first one, so I'll focus on the complete version.

On a side note, I also co-authored a third paper:

We mostly used our existing system to generate features for a comparison between different sets of features for handwritten keyword spotting. It was my first time in China and I enjoyed the stay a lot. I also had the chance to meet my girlfriend in Shenzhen, all the more reason to mention this publication :)

Back to the main subject. The idea behind these publications is to use a Convolutional Deep Belief Network (CDBN) to extract features from the images and then pass these features to either a Dynamic Time Warping (DTW) algorithm or a Hidden Markov Model (HMM). The following image describes the overall system:

Keyword Spotting System

The features are extracted from preprocessed normalized binary images. Using a sliding window, moving from left to right, one pixel at a time, the features are extracted on each window. The feature extractor is a Convolutional Deep Belief Network, trained fully unsupervised. The features are then normalized so that each feature group sums to one and then each feature has zero mean and unit variance. The network used for feature extraction is depicted in the following image:

Convolutional Deep Belief Network features

Two Convolutional Restricted Boltzmann Machines (CRBMs) are used, each followed by a max pooling layer.

Once the features are extracted, they can be passed to the classifier for keyword spotting scoring. We tested our features with two different approaches for word scoring. The first one, Dynamic Time Warping (DTW), is a template matching strategy based on a very simple measure of distance between two sequences of different lengths. The two sequences are warped non-linearly to minimize the distance between each pair of features. A template from the training set is compared to the word image being evaluated. This works pretty well for simple data sets but fails when the writing styles of the test set are not known in the training set. The second classifier, a Hidden Markov Model (HMM), is more powerful and is trained. Character models are trained using the entire training set. From these character models, a keyword model as well as an unconstrained model (the filler model) are constructed. The probability of these two models is computed using Viterbi and the final score is computed using log-odds scoring of these two models, using the filler model as a form of normalization.
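To give an idea of how the DTW distance between two feature sequences can be computed, here is a minimal sketch of the classic dynamic programming formulation. It assumes a Euclidean distance between feature vectors and no normalization by the path length, which may differ from what is used in the papers; the names feature_t, euclidean and dtw_distance are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using feature_t = std::vector<double>; // features of one sliding window

// Euclidean distance between two feature vectors of the same dimension.
inline double euclidean(const feature_t& a, const feature_t& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Classic DTW: cost[i][j] is the minimal accumulated distance needed to
// align the first i windows of x with the first j windows of y.
inline double dtw_distance(const std::vector<feature_t>& x, const std::vector<feature_t>& y) {
    assert(!x.empty() && !y.empty());
    const double inf = std::numeric_limits<double>::infinity();
    const std::size_t n = x.size();
    const std::size_t m = y.size();

    std::vector<std::vector<double>> cost(n + 1, std::vector<double>(m + 1, inf));
    cost[0][0] = 0.0;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            // Either x stalls, y stalls, or both sequences advance together
            const double best = std::min({cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]});
            cost[i][j] = euclidean(x[i - 1], y[j - 1]) + best;
        }
    }

    return cost[n][m];
}
```

A lower distance means that the word image is closer to the template. Since a repeated window can be absorbed by the warping, two sequences of different lengths can still have a distance of zero.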

This technique was evaluated on three datasets (George Washington (GW), Parzival (PAR) and IAM offline database (IAM)). Our features were compared with three reference feature sets, one heuristic and two local feature sets.

The results for DTW:

Keyword Spotting Results with Dynamic Time Warping

Overall, our features exhibit better performance than the other reference feature sets, except for the Mean Average Precision on the PAR data set. The very low performance on PAR with DTW is explained by the fact mentioned earlier that it has poor generalization to unknown writing styles.

The results for HMM:

Keyword Spotting Results with Hidden Markov Model

With HMM, our features are always better than the other feature sets. However, the margin of improvement is smaller than when using DTW.

Overall, the proposed system proved quite powerful and was able to outperform the three tested feature sets on three datasets for keyword spotting.

You can find the C++ implementation on Github.

As for my thesis, I finished writing it about a month ago and it is now in the hands of my supervisor.

If you want to have a look, the list of my publications is available on this website.

If you want more details on this project, don't hesitate to ask here or on Github, or read the papers :)

I hope the next post about my publications will be about the finalization of my thesis :)


Partial type erasing in Deep Learning Library (DLL) to improve compilation time

In a previous post, I compared the compilation time of my Deep Learning Library (DLL) project with different compilers. I realized that the compilation times were quickly becoming unreasonable for this library, especially for compiling the unit tests, which clearly hurts the development of the library. Indeed, you want to be able to run the unit tests reasonably quickly after you have integrated new changes.

Reduce the compilation time

The first thing I did was to split the compilation into three executables: one for the unit tests, one for the various performance tests and one for the other miscellaneous tests. With this, it is much faster to compile only the unit test cases.

But this can be improved significantly more. In DLL, a network is a variadic template containing the list of layers, in order. There are two main ways of declaring a neural network. In the first version, the fast version, the layers directly know their sizes:

using network_t =
    dll::dbn_desc<
        dll::dbn_layers<
            dll::rbm_desc<28 * 28, 500, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<500    , 400, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<400    , 10,  dll::momentum, dll::batch_size<64>, dll::hidden<dll::unit_type::SOFTMAX>>::layer_t>,
        dll::trainer<dll::sgd_trainer>, dll::batch_size<64>>::dbn_t;

auto network = std::make_unique<network_t>();
network->pretrain(dataset.training_images, 10);
network->fine_tune(dataset.training_images, dataset.training_labels, 10);

In my opinion, this is the best way to use DLL. This is the fastest and the clearest. Moreover, the dimensions of the network can be validated at compile time, which is always better than at runtime. However, the dimensions of the network cannot be changed at runtime. For this, there is a different version, the dynamic version:

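using network_t =
    dll::dbn_desc<
        dll::dbn_layers<
            dll::dyn_rbm_desc<dll::momentum>::layer_t,
            dll::dyn_rbm_desc<dll::momentum>::layer_t,
            dll::dyn_rbm_desc<dll::momentum, dll::hidden<dll::unit_type::SOFTMAX>>::layer_t>,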
        dll::batch_size<64>, dll::trainer<dll::sgd_trainer>>::dbn_t;

auto network = std::make_unique<network_t>();

network->template layer_get<0>().init_layer(28 * 28, 500);
network->template layer_get<1>().init_layer(500, 400);
network->template layer_get<2>().init_layer(400, 10);
network->template layer_get<0>().batch_size = 64;
network->template layer_get<1>().batch_size = 64;
network->template layer_get<2>().batch_size = 64;

network->pretrain(dataset.training_images, 10);
network->fine_tune(dataset.training_images, dataset.training_labels, 10);

This is a bit more verbose, but the configuration can be changed at runtime with this system. Moreover, this is also faster to compile. On the other hand, there is some performance slowdown.

There is also a third version that is a hybrid of the first two versions:

using network_t =
    dll::dyn_dbn_desc<
        dll::dbn_layers<
            dll::rbm_desc<28 * 28, 500, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<500    , 400, dll::momentum, dll::batch_size<64>>::layer_t,
            dll::rbm_desc<400    , 10,  dll::momentum, dll::batch_size<64>, dll::hidden<dll::unit_type::SOFTMAX>>::layer_t>,
        dll::trainer<dll::sgd_trainer>, dll::batch_size<64>>::dbn_t;

auto network = std::make_unique<network_t>();
network->pretrain(dataset.training_images, 10);
network->fine_tune(dataset.training_images, dataset.training_labels, 10);

Only one line was changed compared to the first version: dbn_desc becomes dyn_dbn_desc. What this changes is that all the layers are automatically transformed into their dynamic versions and all the parameters are propagated at runtime. This is a form of type erasure since the sizes will not be propagated at compilation time. But this is simple since the types are simply transformed from one type to another directly. Behind the scenes, it's the dynamic version using the front-end of the fast version. This is almost as fast to compile as the dynamic version, but the code is much clearer. It executes the same as the dynamic version.
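The principle of this transformation can be sketched as follows. This is a simplified illustration, not the actual DLL code: static_layer, dynamic_layer and to_dynamic are hypothetical names, and real layers obviously carry much more than two sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Static layer: the sizes are template parameters, so every combination
// of sizes is a distinct type with its own template instantiations.
template <std::size_t Visible, std::size_t Hidden>
struct static_layer {
    static constexpr std::size_t visible = Visible;
    static constexpr std::size_t hidden  = Hidden;
};

// Dynamic layer: the sizes are runtime members, so all layers share one type.
struct dynamic_layer {
    std::size_t visible = 0;
    std::size_t hidden  = 0;

    void init_layer(std::size_t v, std::size_t h) {
        assert(v > 0 && h > 0);
        visible = v;
        hidden  = h;
    }
};

// Maps a static layer type to its dynamic counterpart and forwards the
// compile-time sizes to it at runtime.
template <typename Layer>
struct to_dynamic;

template <std::size_t V, std::size_t H>
struct to_dynamic<static_layer<V, H>> {
    using type = dynamic_layer;

    static type make() {
        type layer;
        layer.init_layer(V, H);
        return layer;
    }
};

// Two static layers with different sizes are different types...
static_assert(!std::is_same<static_layer<784, 500>, static_layer<500, 400>>::value,
              "different sizes, different types");

// ...but they are transformed into the very same dynamic type.
static_assert(std::is_same<to_dynamic<static_layer<784, 500>>::type,
                           to_dynamic<static_layer<500, 400>>::type>::value,
              "one dynamic type for all sizes");
```

Since every layer ends up with the same type, code that is templated on the layer type only needs to be instantiated once, whatever the number of networks and sizes used.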

If we compare the compilation time of the three versions when compiling a single network and 5 different networks with different architectures, we get the following results (with clang):

Model        Time [s]
1 Fast       30
1 Dynamic    16.6
1 Hybrid     16.6
5 Fast       114
5 Dynamic    16.6
5 Hybrid     21.9

Even with one single network, the compilation time is reduced by 44%. When five different networks are compiled, the time is reduced by 85%. This can be explained easily. Indeed, for the hybrid and dynamic versions, the layers will have the same type and therefore a lot of template instantiations will only be done once instead of five times. This makes a lot of difference since almost everything is a template inside the library.

Unfortunately, this also has an impact on the runtime of the network:

Model      Pretrain [s]   Train [s]
Fast       195            114
Dynamic    203            123
Hybrid     204            122

On average, for dense models, the slowdown is between 4% and 8%. For convolutional models, it is between 10% and 25%. I will definitely work on trying to make the dynamic and especially the hybrid version faster in the future; most of the work should be on the matrix library (ETL) that is used.

Since, for test cases, a 20% increase in runtime is not really a problem, tests being fast already, I decided to add an option to DLL so that everything can be compiled by default in hybrid mode. By using a compilation flag, every dbn_desc becomes a dyn_dbn_desc and therefore each network used becomes a hybrid network. Without a single change in the code, the compilation time of the entire library can be significantly improved, as seen in the next section. This can also be used in user code to improve compilation time during debugging and experiments and can be turned off for the final training.

On my Continuous Integration system, I will build the system in both configurations. This is not really an issue, since my personal machine at home is more powerful than what I have available here.


In a first experiment, I measured the difference before and after this change on the three executables of the library, with gcc:

Model      Unit [s]   Perf [s]   Misc [s]
Before     1029       192        937
After      617        143        619
Speedup    40.03%     25.52%     33.93%
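The speedup row is simply the relative reduction in time between the two rows above it. As a small worked example, the helper below (a hypothetical name, written for this illustration) computes it. For the unit tests, going from 1029 to 617 seconds gives a reduction of about 40%. Summing the three executables, the total goes from 2158 to 1379 seconds, which is a reduction of about 36%.

```cpp
#include <cassert>

// Relative reduction, in percent, when going from `before` to `after` seconds.
inline double reduction_percent(double before, double after) {
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```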

It is clear that the speedups are very significant! The compilation is between 25% and 40% faster with the new option. Overall, this is a speedup of 36%! I also noticed that the compilation takes significantly less memory than before. Therefore, I decided to rerun the compiler benchmark on the library. In the previous experiment, zapcc was taking so much memory that it was impossible to use more than one thread. Let's see how it is faring now. The time to compile the full unit tests is computed for each compiler. Let's start in debug mode:

Debug        -j1    -j2    -j3    -j4
clang-3.9    527    268    182    150
gcc-4.9.3    591    303    211    176
gcc-5.3.0    588    302    209    175
zapcc-1.0    375    187    126    121

This time, zapcc is able to scale to four threads without problems. Moreover, it is always the fastest compiler, by a significant margin, in this configuration. It is followed by clang and then by gcc for which both versions are about the same speed.

If we compile again in release mode:

Release      -j1    -j2    -j3    -j4
clang-3.9    1201   615    421    356
gcc-4.9.3    1041   541    385    321
gcc-5.3.0    1114   579    412    348
zapcc-1.0    897    457    306    306

The difference in compilation time is very large: it is about twice as slow to compile with all optimizations enabled. It also takes significantly more memory. Indeed, zapcc was not able to compile with four threads. Nevertheless, even its results with three threads are better than those of the other compilers using four threads. zapcc is clearly the winner again on this test, followed by gcc-4.9, which is faster than gcc-5.3, which is itself faster than clang. It seems that while clang is better at the frontend than gcc, it is slower for optimizations. Note that this may also be an indication that clang performs more optimizations than gcc and may not be slower.


By using some form of type erasure to simplify the template types at compile time, I was able to reduce the overall compilation time of my Deep Learning Library (DLL) by 36%. Moreover, this can be done by switching a simple compilation flag. This also very significantly reduces the memory used during the compilation, allowing zapcc to compile with up to three threads, compared with only one before. This makes zapcc the fastest compiler again on this benchmark. Overall, this will make debugging much easier on this library and will save me a lot of time.

In the future, I plan to try to improve compilation time even more. I have a few ideas, especially in ETL that should significantly improve the compilation time but that will require a lot of time to implement, so that will likely have to wait a while. In the coming days, I plan to work on the performance of DLL, especially for stochastic gradient descent.

If you want more information on DLL, you can check out the dll Github repository.


Use clang-tidy for static analysis and integration in Sonarqube

clang-tidy is an extensive C++ linter. It provides a complete framework for analysis of C++ code. Some of the checks are very simple but some of them are very complete, and most of the checks from the clang-static-analyzer are integrated into clang-tidy.


If you want to see the list of checks available in clang-tidy, you can use the -list-checks option:

clang-tidy -list-checks

You can then choose the checks you are interested in and perform an analysis of your code. For this, it is highly recommended to use a Clang compilation database; you can have a look at Bear to generate this compilation database if you don't have one yet. The usage of clang-tidy is pretty simple: you set the list of checks you want, the headers on which you want to have warnings reported and the list of source files to analyse:

clang-tidy -checks='*' -header-filter="^include" -p . src/*.cpp

You'll very likely see a lot of warnings. And you will very likely see a lot of false positives and a lot of warnings you don't agree with. For instance, there are a lot of warnings from the CPP Core Guidelines and the Google Guidelines that I don't follow in my coding. You should not take the complete list of checks as a rule; you should devise your own list of what you really want to fix in your code. If you want to disable one check X, you can use the - prefix:

clang-tidy -checks='*,-X' -header-filter="^include" -p . src/*.cpp

You can also enable the checks one by one or parts of them with *:

clang-tidy -checks='google-*' -header-filter="^include" -p . src/*.cpp

One problem with the clang-tidy tool is that it is extremely slow, especially if you enable the clang-static-analyzer checks. Moreover, if you use it as shown before, it will only use one thread for the complete set of files. This may not be an issue on small projects, but this will definitely be a big issue for large projects and template-heavy code (like my ETL project). You could create an implicit target in your Makefile to use it on each file independently and then use the -j option of make to run them in parallel, but it is not really practical.

For this, I just discovered that clang provides a Python script that does it all for us! On Gentoo, it is installed at /usr/share/ -checks='*' -header-filter="^include" -p . -j9

This will automatically run clang-tidy on each file from the compilation database and use 9 threads to perform the checks. This is definitely much faster. For me, this is the best way to run clang-tidy.

One small point I don't like is that the script always prints the list of enabled checks. To avoid this, I changed the first line below into the second one in the script:

invocation = [args.clang_tidy_binary, '-list-checks']


invocation = [args.clang_tidy_binary]

This makes it quieter.

One thing I didn't mention is that clang-tidy is able to fix some of the errors directly if you use the -fix option. Personally, I don't like this, but for a large code base and a carefully selected set of checks, this could be really useful. Note that not all the checks are automatically fixable by clang-tidy.


I have run clang-tidy on my cpp-utils library and here are some interesting results. I have not run all the checks; here is the command I used:

/usr/share/clang/ -p . -header-filter '^include/cpp_utils' -checks='cert-*,cppcoreguidelines-*,google-*,llvm-*,misc-*,modernize-*,performance-*,readability-*,-cppcoreguidelines-pro-type-reinterpret-cast,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-google-readability-namespace-comments,-llvm-namespace-comment,-llvm-include-order,-google-runtime-references' -j9 2>/dev/null  | /usr/bin/zgrep -v "^clang-tidy"

Let's go over some warnings I got:

include/cpp_utils/assert.hpp:91:103: warning: consider replacing 'long' with 'int64' [google-runtime-int]
void assertion_failed_msg(const CharT* expr, const char* msg, const char* function, const char* file, long line) {

I got this one several times. It is indeed more portable to use int64 rather than long.
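
To make the portability argument concrete, here is a minimal sketch in standard C++. It assumes the standard `std::int64_t` from `<cstdint>` as the counterpart of the `int64` named by the check, and `to_fixed_width` is a hypothetical helper that is not part of cpp-utils.

```cpp
#include <climits>
#include <cstdint>

// 'long' is only guaranteed to hold at least 32 bits: it is 64 bits wide on
// 64-bit Linux but 32 bits wide on 64-bit Windows.
static_assert(sizeof(long) * CHAR_BIT >= 32, "long has at least 32 bits");

// std::int64_t, when the platform provides it, is exactly 64 bits wide.
static_assert(sizeof(std::int64_t) * CHAR_BIT == 64, "int64_t has exactly 64 bits");

// Hypothetical helper: widen a line number to a fixed-width type.
constexpr std::int64_t to_fixed_width(long line) {
    return static_cast<std::int64_t>(line);
}
```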

include/cpp_utils/aligned_allocator.hpp:53:9: warning: use 'using' instead of 'typedef' [modernize-use-using]
        typedef aligned_allocator<U, A> other;

This one is part of the modernize checks, indicating that one should use using rather than a typedef and I completely agree.
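
As a small illustration of the two spellings, the sketch below uses a simplified stand-in for the allocator (not the real cpp-utils class); the names `rebind_old` and `rebind_new` are made up for the comparison.

```cpp
#include <cstddef>
#include <type_traits>

// Simplified stand-in for the allocator from the warning (illustrative only).
template <typename T, std::size_t A>
struct aligned_allocator {
    template <typename U>
    struct rebind_old {
        typedef aligned_allocator<U, A> other; // old style: the new name comes last
    };

    template <typename U>
    struct rebind_new {
        using other = aligned_allocator<U, A>; // modern style: reads like an assignment
    };
};

// Both declarations name exactly the same type.
static_assert(std::is_same<aligned_allocator<int, 16>::rebind_old<float>::other,
                           aligned_allocator<int, 16>::rebind_new<float>::other>::value,
              "typedef and using are equivalent here");
```

Beyond readability, the using form is also the only one that can be templated directly (alias templates), which is one more reason to prefer it.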

include/cpp_utils/aligned_allocator.hpp:79:5: warning: use '= default' to define a trivial default constructor [modernize-use-default]
    aligned_allocator() {}
                        = default;

Another one from the modernize checks that I really like. This is completely true.
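
One reason this matters beyond style is that an empty user-provided body makes the type non-trivial. Here is a minimal sketch with two hypothetical structs:

```cpp
#include <type_traits>

struct with_empty_body {
    with_empty_body() {} // user-provided: the default constructor is not trivial
    int value;
};

struct with_default {
    with_default() = default; // compiler-generated: the default constructor stays trivial
    int value;
};

static_assert(!std::is_trivially_default_constructible<with_empty_body>::value,
              "an empty body counts as user-provided");
static_assert(std::is_trivially_default_constructible<with_default>::value,
              "= default keeps the constructor trivial");
```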

I don't agree that every constructor with one argument should be explicit; sometimes you want implicit conversion. Nevertheless, this particular case is very interesting: since the constructor is variadic, it can be instantiated with a single template argument, and thus the class can be implicitly converted from anything, which is pretty bad I think.
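
The problem with such a variadic constructor can be sketched as follows. Both wrapper types are hypothetical and only serve to show the difference that explicit makes:

```cpp
#include <type_traits>

// Variadic constructor without 'explicit': with exactly one argument it acts
// as a converting constructor from any type.
struct loose_wrapper {
    template <typename... Args>
    loose_wrapper(Args&&...) {}
};

// Same constructor marked 'explicit': direct construction still works, but
// implicit conversions are rejected.
struct strict_wrapper {
    template <typename... Args>
    explicit strict_wrapper(Args&&...) {}
};

static_assert(std::is_convertible<int, loose_wrapper>::value, "int converts silently");
static_assert(std::is_convertible<const char*, loose_wrapper>::value, "so does a C string");
static_assert(!std::is_convertible<int, strict_wrapper>::value, "explicit blocks the conversion");
static_assert(std::is_constructible<strict_wrapper, int>::value, "direct construction still works");
```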

test/array_wrapper.cpp:15:18: warning: C-style casts are discouraged; use reinterpret_cast [google-readability-casting]
    float* mem = (float*) malloc(sizeof(float) * 8);
                 reinterpret_cast<float*>(         )

On this one, I completely agree, C-style casts should be avoided and much clearer C++ style casts should be preferred.
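
As a side note, clang-tidy suggests reinterpret_cast here, but since malloc returns a void*, a static_cast is sufficient. A minimal sketch, where `allocate_floats` and `allocation_round_trip` are hypothetical helpers mirroring the test code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Allocate room for n floats. std::malloc returns void*, and static_cast is
// enough to convert from void* to an object pointer.
inline float* allocate_floats(std::size_t n) {
    assert(n > 0); // malloc(0) is implementation-defined, so require n > 0
    return static_cast<float*>(std::malloc(sizeof(float) * n));
}

// Small self-check: allocate, write, read back, release.
inline bool allocation_round_trip() {
    float* mem = allocate_floats(8);
    if (mem == nullptr) {
        return false;
    }
    mem[7] = 1.5f;
    const bool ok = (mem[7] == 1.5f);
    std::free(mem);
    return ok;
}
```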

/home/wichtounet/dev/cpp_utils_test/include/cpp_utils/aligned_allocator.hpp:126:19: warning: thrown exception type is not nothrow copy constructible [cert-err60-cpp]
            throw std::length_error("aligned_allocator<T>::allocate() - Integer overflow.");

This is one of the checks I don't agree with. Even though it makes sense to prefer exceptions that are nothrow copy constructible, they should be caught by const reference anyway. Moreover, the exception here comes from the standard library.
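
For reference, this is what catching by const reference looks like; the handler itself then does not copy the exception object. It is only a sketch, and `overflow_message` is a hypothetical helper, not code from cpp-utils:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Throw the same kind of standard exception as the allocator and report its message.
inline std::string overflow_message() {
    std::string message;
    try {
        throw std::length_error("Integer overflow.");
    } catch (const std::length_error& e) { // caught by const reference
        message = e.what();
    }
    assert(!message.empty()); // the handler always runs in this sketch
    return message;
}
```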

/home/wichtounet/dev/cpp_utils_test/include/cpp_utils/aligned_allocator.hpp:141:40: warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]

In general, I agree that using const_cast should be avoided as much as possible. But there are some cases where they make sense. In this particular case, I don't modify the object itself but some memory before the object that is unrelated and I initialize myself.

I also had a few false positives, but overall nothing too bad. I'm quite satisfied with the quality of the results. I'll fix these warnings in the coming week.

Integration in Sonarqube

The sonar-cxx plugin just integrated support for clang-tidy in main. You need to build the version yourself, the 0.9.8-SNAPSHOT version. You then can use something like this in your file:


and sonar-cxx will parse the results and integrate the issues in your sonar report.

Here is an example:


You can see two of the warnings from clang-tidy :)

For now, I haven't integrated this in my Continuous Integration system because I'm still having issues with clang-tidy and the compilation database. Because the compilation database contains absolute paths to the files and to the current directory, it cannot be shared directly between servers. I have to find a way to fix that so that clang-tidy can use it on the other computer. I'll probably wait until the sonar-cxx 0.9.8 version is released before integrating all this in Sonarqube, but this is great news for this plugin :)


clang-tidy is a C++ linter that can analyze your code and check for hundreds of problems in it. With it, I have found some very interesting problems in the code of my cpp_utils library. Moreover, you can now integrate it into Sonarqube by using the sonar-cxx plugin. Since it is a bit slow, I'll probably not integrate it in my bigger projects, but I'll integrate it at least in the cpp_utils library when sonar-cxx 0.9.8 is released.


Disappointing zapcc performance on Deep Learning Library (DLL)

One week ago, zapcc 1.0 was released and I've observed it to be much faster than the other compilers in terms of compile time. This can be seen when I tested it on my Expression Templates Library (ETL). It was almost four times faster than clang 3.9 and about 2.5 times faster than GCC.

The ETL library is quite heavy to compile, but still reasonable. This is not the case for my Deep Learning Library (DLL), where compiling all the test cases takes a very long time. I have to admit that I have been going overboard with templates and such and I now have to pay the price. In practice, for the users of the library, this is not a big problem since only one or two neural networks will be compiled (and they will take hours to train), but in the test cases, there are hundreds of them and this is a huge pain. Anyway, enough with the ramble, I figured it would be very good to test zapcc on it and see what I can gain from using it.

In this article, when I speak of a compiler thread, I mean an instance of the compiler, so it's really a process in the Linux world.


However, I soon realized that I would have more issues than I thought. The first problem is the memory consumed by zapcc. Indeed, it is based on clang and I always had problems with huge memory consumption from clang on this library, and zapcc has even bigger memory consumption because some information is cached between runs. The amount of memory that zapcc is able to cache can be configured in the configuration file. By default, it can use 1.5GB of memory. When zapcc goes over the memory limit, it simply wipes out its caches. This means that all the gain for the next compilation will be lost, since the cache will have to be rebuilt from scratch. This is not a hard limit for the compilation itself. Indeed, if the compilation itself takes 3GB, it will still be able to complete, but it is likely that the cache will be wiped after the compilation.

When I tried compiling using several threads, it soon used all my memory and crashed. The same occurs with clang but I can still compile with 3 or 4 threads without too many issues on this computer. The same also occurs with GCC but it can still handle 4 or 5 threads (depending on the order of the compilation units).

The tests are performed on my desktop computer at work, which is not really good... I have 12GB of RAM (I had to ask for extra...) and an old Sandy Bridge processor, but at least I have an SSD (also had to ask for extra).

I started by testing with only one compiler thread. For zapcc, I set the maximum memory limit to 8GB. Even with such a limit, the zapcc server restarted more than 10 times during the compilation of the 84 test cases. After this first experiment, I increased the number of threads to 2 for each compiler, using a 4GB limit for zapcc. The limit is for each server and each parallel thread will spawn a new server, so the effective limit is the number of threads times the limit. Even with two threads, I was unable to finish a compilation with zapcc. This is quite disappointing for me since clang is able to run with 4 threads in parallel. Moreover, a big problem is that the servers are not always killed when there is no more memory; they just hang and use all the memory of the computer, which is evidently really inconvenient for service processes. When this happens with clang or gcc, the compiler simply crashes, the memory is released and make is interrupted. Since zapcc is not able to work with more than one thread on this computer, the results reported for it are the ones with one thread. I was also surprised to be able to compile the library with clang and four threads; this was not possible before clang-3.9.
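
The effective limit mentioned above is simple arithmetic, sketched here in gigabytes. The function name is hypothetical and the sketch assumes one caching server per compiler thread, each with the same limit:

```cpp
// Worst-case memory that the zapcc caches alone may hold, in gigabytes.
// Assumption: one caching server per compiler thread, all with the same limit.
constexpr double effective_cache_limit_gb(int threads, double per_server_limit_gb) {
    return threads * per_server_limit_gb;
}

// The two configurations tried in this experiment.
static_assert(effective_cache_limit_gb(1, 8.0) == 8.0, "one thread with an 8 gigabyte limit");
static_assert(effective_cache_limit_gb(2, 4.0) == 8.0, "two threads with a 4 gigabyte limit each");
```

In both configurations, the caches alone may take up to two thirds of the 12 gigabytes available on this machine, before counting the memory needed by the compilations themselves.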

Compiler                  -j1       -j2       -j3       -j4
gcc-4.9.3                 2250.95   1256.36   912.67    760.84
gcc-5.3.0                 2305.37   1279.49   918.08    741.38
clang-3.9                 2047.61   1102.93   899.13    730.42
zapcc-1.0                 1483.73   1483.73   1483.73   1483.73
Difference against Clang  -27.55%   +25.69%   +39.37%   +50.77%
Speedup VS GCC-5.3        -35.66%   +13.75%   +38.09%   +50.03%
Speedup VS GCC-4.9        -34.08%   +15.30%   +38.50%   +48.75%

If we look at the results with only one thread, we can see that there still are some significant improvements when using zapcc, but nowhere near as good as what was seen in the compilation of ETL. Here, the compilation time is reduced by 34% compared to gcc and by 27% compared to clang. This is not bad, since it is faster than the other compilers, but I would have expected better speedups. We can see that g++-4.9 is slightly faster than g++-5.3, but this is not really a significant difference. I'm actually very surprised to find that clang is faster than g++ on this experiment. On ETL, it is always very significantly slower and before, it was also significantly slower on DLL. I was so used to this, that I stopped using it on this project. I may have to reconsider my position when working on this project.

Let's look at the results with two or more threads. Even with two threads, every other compiler is faster than zapcc with its single thread. Indeed, zapcc is slower than Clang by about 25% and slower than GCC by about 15%. If we use more threads, the other compilers become even faster and the slowdown of zapcc becomes more pronounced. When using four threads, zapcc is about 48% slower than gcc and about 50% slower than clang. This really shows one big downside of zapcc: its very large memory consumption. When it is used to compile really heavy template code, it very quickly becomes unable to use more processes. And even when there is enough memory, the speedups are not as great as for relatively simpler code.

One may argue that this is not a fair comparison since zapcc does not have the same number of threads. However, considering that this is the best zapcc can do on this machine, I would argue that this is a fair comparison in this limited experimental setting. If we were to have a big machine for compilation, which I don't have at work, the zapcc results would likely be more interesting, but in this specific limited case, it shows that zapcc suffers from its high memory consumption. It should also be taken into account that this experiment was done with almost nothing else running on the machine (no browser for instance) to have as much memory as possible available for the compilers. This is not a common use case. Most days, when I compile something, I have my browser open, which makes a large difference in memory available, and several other applications (but consoles and vim instances do not really consume memory :D).

This experiment made me realize that the compilation times for this library were quickly becoming crazy. Most of the time, the complete test suite is only compiled on my Continuous Integration machine at home, which has a much faster processor and much more RAM. Therefore, it is relatively fast since it uses more threads to compile. Nevertheless, it is not a good thing that the unit tests take so much time to compile. I plan to split the test cases into several sets because, currently, the real unit tests are compiled together with the performance tests and various other tests. I'll probably end up generating three executables. This will help greatly during development. Moreover, I also have a technique to decrease the compilation time by erasing some template parameters at compilation time. This is already implemented, but it currently has a runtime overhead that I will try to remove before using this technique everywhere to get back to reasonable compilation times. I'll also try to see if I can find obvious compilation bottlenecks in the code.


To conclude, while zapcc brings some very interesting compilation speedups in some cases like in my ETL library, it also has some downsides, namely huge memory consumption. This memory consumption may prevent the use of several compiler threads and render zapcc much less interesting than other compilers.

When trying to compile my DLL library on a machine with 12GB of RAM with two zapcc threads, it was impossible for me to make it complete. While zapcc was faster with one thread than the other compilers, they were able to use up to four threads and in the end zapcc was about two times slower than clang.

I knew that zapcc memory consumption was very large, but I would not have expected something so critical. Another feature that would be interesting in zapcc would be the ability to set a hard memory limit for the servers instead of simply a limit on the cache they are able to keep in memory. This would prevent hanging the complete computer when something goes wrong.

I had a good surprise with clang, which was actually faster than GCC and also able to work with four threads in parallel. This was not the case with previous versions of clang. On ETL, it is still significantly slower than GCC though.

For now, I'll continue using clang on this DLL project and use zapcc only on my ETL project. I'll also focus on improving the compilation time on this project and make it reasonable again.


Migrated from owncloud 5 to Nextcloud 11

For several years now I've been using Owncloud running on one of my servers. I'm simply using it as a synchronization tool; I don't use any of the tons of fancy features they keep adding. Apart from several synchronization issues, I haven't had too many issues with it.

However, I have had a very bad time with updates of Owncloud. The last time I tried, already long ago, was to upgrade from 5.0 to 6.0 and I never succeeded without losing all the configuration and having to do the resync. Therefore, I still have an Owncloud 5.0 running. Since then, I have to say that I've been lazy and didn't try again to upgrade it. Recently, I've received several mails indicating that this is a security threat.

Since I was not satisfied with updates in Owncloud and its security has been challenged recently, I figured it would be a good moment to move to Nextcloud, which is a very active fork of Owncloud created by developers of Owncloud.

I haven't even tried to do an upgrade from such an old version to the latest version of Nextcloud; it was doomed to fail. Therefore, I made a new clean installation. Since I only use the sync feature of the tool, it does not really matter: it is just some time lost to sync everything again, but nothing too bad.

I configured a new PostgreSQL on one of my servers for the new database and then installed Nextcloud 11 on Gentoo. It's a bit of a pain to get a working Nginx configuration for Nextcloud. I don't advise doing it by hand; better take one from the official documentation, you'll also gain some security. One very bad thing in the installation process is that you cannot choose the database prefix; it's set like in Owncloud. The problem with that is that you cannot install both Owncloud and Nextcloud on the same database, which would be more practical for testing purposes. It's a bit silly in my opinion, but not a big problem in the end. Other than these two points, everything went well and it installed pretty nicely. Then, you should have your user ready to go.

Nextcloud view

As for the interface, I don't think there is a lot to tell here. Most of it is what you would expect from this kind of tool. Moreover, I very rarely use the web interface or any of the features other than the sync feature. One thing that is pretty cool I think is the monitoring graphs in the Admin section of the interface. You can see the number of users connected, the memory used and the CPU load. It's pretty useful if you share your Nextcloud between a lot of different users.

I didn't have any issue with the sync either. I used the nextcloud-client package on Gentoo and it worked perfectly right away. It took about 10 minutes to sync everything again (about 5GB). I'll have to do the same thing on my other computer as well, but I don't think I'll have any issue.

So far, I cannot say that this is better than Owncloud; I just hope the next upgrades will fare better than they did on Owncloud. Moreover, I also hope that the security they promise is really there and that I won't have any problem with it. I'll see in the future!


Release of zapcc 1.0 - Fast C++ compiler

If you remember, I recently wrote about zapcc C++ compilation speed against gcc 5.4 and clang 3.9 in which I was comparing the beta version of zapcc against gcc and clang.

I have just been informed that zapcc was released in version 1.0. I thought it was a good occasion to test it again. It will be compared against gcc-4.9, gcc-5.3 and clang-3.9. This version is based on the trunk of clang-5.0.

Again, I will use my Expression Template Library (ETL) project. This is a purely header-only library with lots of templates. I'm going to compile the full test cases. This is a perfect example for long compilation times.

The current tests are made on the latest version of the library and with slightly different parameters for compilation. Therefore, the absolute times are not comparable with the previous article, but the speedups should be comparable.

Just like last time, I have configured zapcc to let it use 2GB of RAM per caching server, which is the maximum allowed. Moreover, I killed the servers before each test.

Debug results

Let's start with a debug build, with no optimizations enabled. Every build will use four threads. This is the equivalent of doing make -j4 debug/bin/etl_test without the link step.

g++-4.9.3             190.09s
g++-5.3.0             200.92s
clang++-3.9           313.85s
zapcc++               81.25s
Speedup VS Clang      3.86
Speedup VS GCC-5.3    2.47
Speedup VS GCC-4.9    2.33

The speedups are even more impressive than last time! zapcc is almost four times faster than clang-3.9 and around 2.5 times faster than GCC-5.3. Interestingly, we can see that gcc-5.3 is slightly slower than GCC-4.9.
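
The speedup rows are simply the ratio between the build time of the other compiler and the build time of zapcc. A minimal sketch of that arithmetic, with a hypothetical `speedup` helper and the times taken from the table above:

```cpp
// Speedup of zapcc over another compiler: how many times shorter its build is.
constexpr double speedup(double other_seconds, double zapcc_seconds) {
    return other_seconds / zapcc_seconds;
}

// Debug build figures from the table, checked to two decimals.
static_assert(speedup(313.85, 81.25) > 3.855 && speedup(313.85, 81.25) < 3.865,
              "about 3.86 versus clang-3.9");
static_assert(speedup(200.92, 81.25) > 2.465 && speedup(200.92, 81.25) < 2.475,
              "about 2.47 versus gcc-5.3");
```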

It seems that they have made the compiler even faster!

Release results

Let's now look at the results with optimizations enabled. Again, every build will use four threads. This is the equivalent of doing make -j4 release_debug/bin/etl_test without the link step.

g++-4.9.3             252.99s
g++-5.3.0             264.96s
clang++-3.9           361.65s
zapcc++               237.96s
Speedup VS Clang      1.51
Speedup VS GCC-5.3    1.11
Speedup VS GCC-4.9    1.06

We can see that this time the speedups are not as interesting as they were. Very interestingly, zapcc is the compiler that suffers the most from the optimization overhead. Indeed, zapcc is almost three times slower in release mode than it was in debug mode. Nevertheless, it still manages to beat the three other compilers, being about 10% faster than GCC and 50% faster than clang, which is already interesting.


To conclude, we have observed that zapcc is always faster than the three compilers tested in this experiment. Moreover, in debug mode, the speedups are very significant: it was almost 4 times faster than clang and around 2.5 times faster than gcc.

I haven't seen any problem with the tool; it's like clang and it should generate code of the same performance, but just compile it much faster. One problem I have with zapcc is that it is not based on an already released version of clang but on the trunk. That means it is hard to compare it with the exact same version of clang and there is also a risk of running into clang bugs.

Although the prices have not been published yet, it is indicated on the website that zapcc is free for non-commercial entities, which is really great.

If you want more information, you can go to the official website of zapcc.