
Simplify your type traits with C++14 variable templates
   Posted:


Often, if you write templated code, you have to write and use a lot of different traits. In this article, I'll focus on traits that represent values, typically a boolean value. For instance, std::is_const, std::is_same or std::is_reference are type traits provided by the STL. They give you information about a certain type at compile time. If you need to write a type trait, let's say is_float, here is how you might do it in C++11:

template <typename T>
struct is_float {
    static constexpr bool value = std::is_same<T, float>::value;
};

or a bit nicer with an alias template and std::integral_constant:

template <typename T>
using is_float = std::integral_constant<bool, std::is_same<T, float>::value>;

or, since is_same is itself a type trait, you can also directly alias it:

template <typename T>
using is_float = std::is_same<T, float>;

This makes for some very nice syntax, but we still have a type rather than a value.

Note that in some cases, you cannot use the using technique, since an alias cannot be specialized and you often need specialization to write more advanced traits.
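For instance, here is a minimal sketch (illustration only) of the struct form being partially specialized so that const-qualified floats are also detected, something the alias form cannot express:

template <typename T>
struct is_float {
    static constexpr bool value = std::is_same<T, float>::value;
};

// Partial specialization: strip const and delegate to the primary template
template <typename T>
struct is_float<const T> {
    static constexpr bool value = is_float<T>::value;
};

With this, is_float<const float>::value is true, whereas the plain alias would give false.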

And then you would use your traits to do something specific based on that information. For instance with a very basic example:

template <typename T>
void test(T t){
    if (is_float<T>::value){
        std::cout << "I'm a float" << std::endl;
    } else {
        std::cout << "I'm not a float" << std::endl;
    }
}

Really nothing fancy here, but that will be enough for our examples.

Even though all this works pretty well, it can be improved on two points. First, every time you use a trait, you need to access its value member (via ::value). Secondly, every time you declare a new trait, you have to declare a new type or a type alias. But all you want is a boolean value.

C++14 introduced a new feature: variable templates. As their name indicates, they are variables, parametrized with a type. This allows us to write type traits without using a type alias or a struct, meaning we have a real value instead of a type. If we rewrite our is_float trait with a variable template, we have the following:

template <typename T>
constexpr bool is_float = std::is_same<T, float>::value;

I think it's much nicer: the intent is clearly stated and there is no unnecessary code. Moreover, it's also nicer to use:

template <typename T>
void test(T t){
    if (is_float<T>){
        std::cout << "I'm a float" << std::endl;
    } else {
        std::cout << "I'm not a float" << std::endl;
    }
}

No more ::value everywhere :) I think it's really cool.

Note that, unlike alias templates, variable templates can be specialized, either fully or partially, so there is no more limitation on that side.
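As a small sketch of what that looks like (illustration only):

// Primary template: not a floating point type by default
template <typename T>
constexpr bool is_floating = false;

// Full specializations
template <>
constexpr bool is_floating<float> = true;

template <>
constexpr bool is_floating<double> = true;

// Partial specialization: a const-qualified type is floating if the base type is
template <typename T>
constexpr bool is_floating<const T> = is_floating<T>;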

Interestingly, variable templates are used in C++17 to provide a value helper for each type trait. For instance, std::is_same has a std::is_same_v helper that is a variable template. With that, we can simplify our trait a bit more:

template <typename T>
constexpr bool is_float = std::is_same_v<T, float>;

Personally, I replaced all the type traits inside ETL with variable templates. If you don't want to go that far, you can also introduce helpers like in the C++17 STL and start using the wrappers when you see fit, so that you don't break any code.
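For instance, a minimal sketch of such a non-breaking helper, assuming the struct-based is_float from earlier is kept as-is:

// The existing struct-based trait stays, so no calling code breaks
template <typename T>
struct is_float : std::is_same<T, float> {};

// New C++14 variable template wrapper, in the style of the C++17 _v helpers
template <typename T>
constexpr bool is_float_v = is_float<T>::value;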

If you want to use this feature, you need a C++14 compiler, such as any version from the GCC 5 family or clang 3.6. Although I haven't tested it, it should also work on Microsoft VS2015 Update 2.

Unfortunately, there is a bug in both clang (fixed in clang 3.7) and GCC (fixed in GCC 6 only) that you may encounter if you use variable templates inside template classes or reference a variable template from another variable template. If you plan to use variable templates inside a template, like this:

#include <iostream>
#include <type_traits>

template <typename T>
struct outer_traits {
    template <typename X>
    static constexpr bool sub_traits = std::is_same<T, X>::value;
};

template <typename T, typename X>
constexpr bool outer_helper = outer_traits<T>::template sub_traits<X>;

int main(){
    std::cout << outer_helper<float, float>;

    return 0;
}

you will encounter a completely unhelpful error message with the GCC 5 family, such as:

test.cpp: In instantiation of ‘constexpr const bool outer_helper<float, float>’:
test.cpp:14:22:   required from here
test.cpp:11:20: error: ‘template<class X> constexpr const bool outer_traits<float>::sub_traits<X>’ is not a function template
     constexpr bool outer_helper = outer_traits<T>::template sub_trait
                    ^
test.cpp:11:20: error: ‘sub_traits<X>’ is not a member of ‘outer_traits<float>’

It comes from a bug in the handling of variable templates as dependent names. If you don't run into this case, you can use the GCC 5 family directly; otherwise, you'll have to use the GCC 6 family.

I hope this can help some of you to improve your type traits or at least to discover the power of the new variable templates. Personally, I've rewritten all the traits from the ETL library using this new feature and I'm pretty satisfied with the result. Of course, that means that the compiler support was reduced, but since I don't have many users, it's not a real issue.


How to fix mdadm RAID5 / RAID6 growing stuck at 0K/s ?
   Posted:


I just started growing my RAID6 array again, from 12 to 13 disks, and I encountered a new issue. The reshape started, but with a speed of 0K/s. After some searching, I found a very simple solution:

echo max > /sys/block/md0/md/sync_max

And the reshape started directly at 50M/s :)

The solution is the same if you are growing any type of RAID level with parity (RAID5, RAID6, ...).

Normally, the issues I have are related to the speed not being very good. I've previously written a post about how to speed up RAID5 / RAID6 growing with mdadm. Although RAID5 / RAID6 growing, or any other reshape operation, will never be very fast, you can still speed up the process a lot, from a few days to a few hours. Currently, my reshape is running at 48M/s and I'm looking at around 16 hours of reshape, but I have 13 disks of 3TB, so it's not so bad.

I hope this very simple tip can be helpful to some of you :)


DLL: Blazing Fast Neural Network Library
   Posted:


A few weeks ago, I talked about all the new features of my Deep Learning Library (DLL) project. I mentioned that, in several experiments, DLL was always significantly faster than some popular deep learning frameworks such as TensorFlow. I'll now go into more detail about this comparison and provide all the results. So far, the paper we wrote about these results has not been published, so I won't provide the paper directly yet.

For those that may not know, DLL is the project I've been developing to support my Ph.D. thesis. It is a neural network framework that supports Fully-Connected Neural Networks (FCNN), Convolutional Neural Networks (CNN), Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), Convolutional RBMs (CRBM) and Convolutional DBNs (CDBN). It also supports a large variety of options such as Dropout, Batch Normalization and adaptive learning rates. You can read the previous post if you want more information about the new features of the framework. And, as those of you who read my blog frequently may know, I'm a bit obsessed with performance optimization, so I've spent a considerable amount of time optimizing the performance of neural network training on CPU. Since, at the beginning of my thesis, I had no access to a GPU for training, I've focused on CPU. Although there is now support for GPU, the gains are not yet significant enough.

Evaluation

To see how fast (or not) the library is, it was compared against five popular machine learning libraries:

  1. Caffe, installed from sources
  2. TensorFlow 1.0, from pip
  3. Keras 2.0, from pip
  4. Torch, installed from sources
  5. DeepLearning4J 0.7, from Maven

I've run four different experiments with all these frameworks and compared the efficiency of each of them for training the same neural networks with the same options. In each case, the training and testing errors have also been compared to ensure that each framework is doing roughly the same thing. I won't present the details here, but in each experiment DLL showed around the same accuracy as the other frameworks. I will only focus on the speed results in this article.

Each experiment is done once with only CPU and once with a GPU. For DLL, I only report the CPU time in both modes, since it's more stable and more optimized.

The code for the evaluation is available online on the Github repository of the frameworks project.

MNIST: Fully Connected Neural Network

The first experiment is performed on the MNIST data set. It consists of 60'000 grayscale images of size 28x28. The goal is to classify each image of a digit from 0 to 9. To solve this task, I trained a very small fully-connected neural network with 500 hidden units in the first layer, 250 in the second and 10 final hidden units (or output units) for classification. The first two layers use the logistic sigmoid activation function and the last layer uses the softmax activation function. The network is trained for 50 epochs with a categorical cross-entropy loss, with mini-batches of 100 images. Here are the results of this experiment:

Training time performance for the different frameworks on the Fully-Connected Neural Network experiment, on MNIST. All the times are in seconds.

On CPU, the DLL framework is the clear winner here! It's about 35% faster than TensorFlow and Keras, which come in second place. Torch is more than four times slower than DLL, and the last two frameworks (Caffe and DeepLearning4J) are five times slower than DLL! Once we add a GPU to the system, the results are very different. Caffe is now the fastest framework, three times faster than DLL. DLL is less than two times slower than Keras and TensorFlow. Interestingly, DLL is still faster than Torch and DeepLearning4J.

MNIST: Convolutional Neural Network

Although a Fully-Connected Neural Network is an interesting tool, the trend now is to use Convolutional Neural Networks, which have proved very efficient at solving a lot of problems. The second experiment uses the same data set. Again, it's a rather small network. The first layer is a convolutional layer with 8 5x5 kernels, followed by a max pooling layer with a 2x2 kernel. They are followed by one more convolutional layer with 8 5x5 kernels and a 2x2 max pooling layer. These first four layers are followed by two fully-connected layers, the first with 150 hidden units and the last one with 10 output units. The activation functions are the same as for the first network, as is the training procedure. This takes significantly longer to train than the first network because of the higher complexity of the convolutional layers compared to the fully-connected layers, even though they have far fewer weights. The results are presented in the next figure:

Training time performance for the different frameworks on the Convolutional Neural Network experiment, on MNIST. All the times are in seconds.

Again, on CPU, DLL is the clear winner, by a lot! It's already 3.6 times faster than the second-place frameworks, Keras and TensorFlow, more than four times faster than Caffe and Torch, and 8 times faster than DeepLearning4J, which proves very slow on this experiment. Once a GPU is added, Keras and TensorFlow are about twice as fast as DLL. However, DLL is still faster than the other frameworks even though they are taking advantage of the GPU.

CIFAR-10

The second data set that is tested is the CIFAR-10 data set. It's an object recognition task with 10 classes for classification. The training set is composed of 50'000 colour images of 32x32 pixels. The network used for this data set is similar in architecture to the previous network, but has more parameters. The first convolutional layer now has 12 5x5 kernels and the second convolutional layer has 24 3x3 kernels. The pooling layers are the same. The first fully-connected layer has 64 hidden units and the last one has 10 output units. The last layer again uses a softmax activation function while the other layers use Rectified Linear Units (ReLU). The training is done in the same manner as for the first two networks. Unfortunately, it was not possible to train DeepLearning4J on this data set, even though there is official support for it. Since I've had no answer to my question regarding this issue, the results are simply removed from this experiment. It may not seem so, but it takes considerably longer to train this network because of the larger number of input channels and the larger number of convolutional kernels in each layer. Let's get to the results now:

Training time performance for the different frameworks on the Convolutional Neural Network experiment, on CIFAR-10. All the times are in seconds.

DLL is still the fastest on CPU, but the margin is smaller than before. It's about 40% faster than TensorFlow and Keras, twice as fast as Torch and 2.6 times faster than Caffe. Once a GPU is added, DLL is about as fast as Torch but slower than the other three frameworks. TensorFlow and Keras are about four times faster than DLL while Caffe is about twice as fast as DLL. We can see that with this larger network, the GPU becomes more interesting and that there is a smaller margin for improvement compared to the other frameworks.

ImageNet

The last experiment is made on the ImageNet data set. I used the ILSVRC 2012 subset, which consists of "only" about 1.2 million images for training. I've resized all the images to 256x256 pixels, which makes for 250 times more colour values than an MNIST image. These dimensions and the number of images make it impractical to keep the dataset in memory, so the images must be loaded in batches from the disk. No random cropping or mirroring was performed. The network is much larger to solve this task. It starts with 5 pairs of convolutional and max pooling layers. The convolutional layers have 3x3 kernels, 16 for the first two layers and 32 for the three following ones. The five max pooling layers use 2x2 kernels. Each convolutional layer uses zero-padding so that its output features have the same dimensions as the input. They are followed by two fully-connected layers, the first with 2048 hidden units and the last with 1000 output units (one for each class). Except for the last layer, which uses softmax, all the layers use ReLU. The network is trained with mini-batches of 128 images (except for DeepLearning4J and Torch, which can only use 64 images with the amount of RAM available on my machine). To ease the comparison, I report the time necessary to train one batch of data (or two for DeepLearning4J and Torch). The results, presented in logarithmic scale because of DeepLearning4J's disastrous results, are as follows:

Training time performance for the different frameworks on the Convolutional Neural Network experiment, on ImageNet. The times are the time necessary to train a batch of 128 images. All the times are in milliseconds.

For this final experiment, DLL is again significantly faster than all the other frameworks. It's about 40% faster than Keras, twice as fast as TensorFlow and Caffe, and more than three times faster than Torch. Although 40% may not seem like much, don't forget that this kind of training may take days, so it can save you a lot of time. All the frameworks are much faster than DeepLearning4J. Based on several posts on the internet, I suspect this comes from the GPU model I used (GTX 960), but all the other frameworks seem to handle this card pretty well.

Conclusion

I hope this is not too much of a bragging post :P We can see that my efforts to make the code as fast as possible have paid off :) As shown in the experiments, my DLL framework is always the fastest framework when the neural network is trained on CPU. I'm quite pleased with the results, since I've done a lot of work to optimize the speed as much as possible and since I'm competing with well-known libraries that have been developed by several people. Moreover, the accuracies of the trained networks are similar to those of the networks trained with the other frameworks. Even when the other frameworks are using a GPU, the library still remains competitive, although never the fastest.

As a next step (I've no idea when I'll have the time though), I want to focus on GPU speed. This will mostly come from better support of the GPU in the ETL library on which DLL is based. I have many ideas to improve it a lot, but it will take a lot of time.

If you want more information on the DLL library, you can have a look at its Github repository and especially at the few examples. You can also have a look at my posts about DLL. Finally, don't hesitate to comment or contact me through Github issues if you have comments or problems with this post, the library or anything ;)


Compiler benchmark GCC and Clang on C++ library (ETL)
   Posted:


It's been a while since I've done a benchmark of different compilers on C++ code. Since I've recently released version 1.1 of my ETL project (an optimized matrix/vector computation library with expression templates), I've decided to use it as the base of my benchmark. It's a C++14 library with a lot of templates. I'm going to compile the full test suite (124 test cases), directly on the last release (1.1) code. I'm going to compile once in debug mode and once in release_debug mode (release plus debug symbols and assertions) and record the times for each compiler. The tests were compiled with support for every option in ETL to account for maximal compilation time. Each compilation was done using four threads (make -j4). I'm also going to run a few of the benchmarks to see the difference in runtime performance between the code generated by each compiler. The benchmark will be compiled in release mode and its compilation time recorded as well.

I'm going to test the following compilers:

  • GCC-4.9.4
  • GCC-5.4.0
  • GCC-6.3.0
  • GCC-7.1.0
  • clang-3.9.1
  • clang-4.0.1
  • zapcc-1.0 (commercial, based on clang-5.0 trunk)

All have been installed directly using Portage (the Gentoo package manager), except for clang-4.0.1, which has been installed from sources, and zapcc, since it does not have a Gentoo package. Since the clang package on Gentoo does not support multislotting, I had to install one version from source and the other from the package manager. This is also the reason I'm testing fewer versions of clang; it's simply less practical.

For the purpose of these tests, the exact same options have been used for all the compilers. Normally, I use different options for clang than for GCC (mainly more aggressive vectorization options on clang). This may not lead to the best performance for each compiler, but it allows for comparison between the results at default optimization levels. Here are the main options used:

  • In debug mode: -g
  • In release_debug mode: -g -O2
  • In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer

In each case, a lot of warnings are enabled and the ETL options are the same.

All the results have been gathered on a Gentoo machine running an Intel Core i7-2600 (Sandy Bridge...) @3.4GHz with 4 cores and 8 threads, 12GB of RAM and an SSD. Although I do my best to isolate the benchmark from perturbations as much as possible and my benchmark code is quite sound, it may well be that some results are not totally accurate. Moreover, some of the benchmarks use multithreading, which may add some noise and unpredictability. When I was not sure about the results, I ran the benchmarks several times to confirm them, and overall I'm confident in the results.

Compilation Time

Let's start with the results of the performance of the compilers themselves:

Compiler Debug Release_Debug Benchmark
g++-4.9.4 402s 616s 100s
g++-5.4.0 403s 642s 95s
g++-6.3.0 399s 683s 102s
g++-7.1.0 371s 650s 105s
clang++-3.9.1 380s 807s 106s
clang++-4.0.1 260s 718s 92s
zapcc++-1.0 221s 649s 108s

Note: For Release_Debug and Benchmark, I only use three threads with zapcc, because 12GB of RAM is not enough memory for four threads.

There are some very significant differences between the different compilers. Overall, clang-4.0.1 is by far the fastest free compiler in debug mode. When the tests are compiled with optimizations, however, clang falls behind. It's quite impressive how clang-4.0.1 manages to be so much faster than clang-3.9.1, both in debug mode and release mode. Really great work by the clang team here! With these optimizations, clang-4.0.1 is almost on par with gcc-7.1 in release mode. For GCC, it seems that the cost of optimization has been going up quite significantly. However, GCC 7.1 seems to have made optimization faster and standard compilation much faster as well. If we take zapcc into account, it's the fastest compiler in debug mode, but it's slower than several gcc versions in release mode.

Overall, I'm quite impressed by the performance of clang-4.0.1, which seems really fast! I'll definitely do more tests with this new version of the compiler in the near future. It's also good to see that g++-7.1 made the build faster than gcc-6.3. However, the fastest gcc version for optimized builds is still gcc-4.9.4, which is already an old branch with low C++ standard support.

Runtime Performance

Let's now take a look at the quality of the generated code. For some of the benchmarks, I've included two versions of the algorithm: std is the simplest algorithm (the naive one) and vec is the hand-crafted vectorized and optimized implementation. All the tests were done on single-precision floating points.
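As an illustration of the distinction (this is not ETL's actual code), the std dot product is essentially the textbook loop below, while the vec version performs the same computation with explicit SIMD intrinsics and unrolling:

#include <cstddef>

// Naive single-precision dot product, the kind of code the "std" kernels use
float dot_std(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}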

Dot product

The first benchmark that is run is to compute the dot product between two vectors. Let's look first at the naive version:

dot (std) 100 500 1000 10000 100000 1000000 2000000 3000000 4000000 5000000 10000000
g++-4.9.4 64.96ns 97.12ns 126.07ns 1.89us 25.91us 326.49us 1.24ms 1.92ms 2.55ms 3.22ms 6.36ms
g++-5.4.0 72.96ns 101.62ns 127.89ns 1.90us 23.39us 357.63us 1.23ms 1.91ms 2.57ms 3.20ms 6.32ms
g++-6.3.0 73.31ns 102.88ns 130.16ns 1.89us 24.314us 339.13us 1.47ms 2.16ms 2.95ms 3.70ms 6.69ms
g++-7.1.0 70.20ns 104.09ns 130.98ns 1.90us 23.96us 281.47us 1.24ms 1.93ms 2.58ms 3.19ms 6.33ms
clang++-3.9.1 64.69ns 98.69ns 128.60ns 1.89us 23.33us 272.71us 1.24ms 1.91ms 2.56ms 3.19ms 6.37ms
clang++-4.0.1 60.31ns 96.34ns 128.90ns 1.89us 22.87us 270.21us 1.23ms 1.91ms 2.55ms 3.18ms 6.35ms
zapcc++-1.0 61.14ns 96.92ns 125.95ns 1.89us 23.84us 285.80us 1.24ms 1.92ms 2.55ms 3.16ms 6.34ms

The differences between the different compilers are not very significant. The clang-based compilers seem to produce the fastest code. Interestingly, there seems to have been a big regression in gcc-6.3 for large containers, but it has been fixed in gcc-7.1.

dot (vec) 100 500 1000 10000 100000 1000000 2000000 3000000 4000000 5000000 10000000
g++-4.9.4 48.34ns 80.53ns 114.97ns 1.72us 22.79us 354.20us 1.24ms 1.89ms 2.52ms 3.19ms 6.55ms
g++-5.4.0 47.16ns 77.70ns 113.66ns 1.72us 22.71us 363.86us 1.24ms 1.89ms 2.52ms 3.19ms 6.56ms
g++-6.3.0 46.39ns 77.67ns 116.28ns 1.74us 23.39us 452.44us 1.45ms 2.26ms 2.87ms 3.49ms 7.52ms
g++-7.1.0 49.70ns 80.40ns 115.77ns 1.71us 22.46us 355.16us 1.21ms 1.85ms 2.49ms 3.14ms 6.47ms
clang++-3.9.1 46.13ns 78.01ns 114.70ns 1.66us 22.82us 359.42us 1.24ms 1.88ms 2.53ms 3.16ms 6.50ms
clang++-4.0.1 45.59ns 74.90ns 111.29ns 1.57us 22.47us 351.31us 1.23ms 1.85ms 2.49ms 3.12ms 6.45ms
zapcc++-1.0 45.11ns 75.04ns 111.28ns 1.59us 22.46us 357.32us 1.25ms 1.89ms 2.53ms 3.15ms 6.47ms

If we look at the optimized version, the differences are even smaller. Again, the clang-based compilers are producing the fastest executables, but they are closely followed by gcc, except for gcc-6.3, in which we can still see the same regression as before.

Logistic Sigmoid

The next test is to check the performance of the sigmoid operation. In that case, the evaluator of the library will try to use parallelization and vectorization to compute it. Let's see how the different compilers fare:

sigmoid 10 100 1000 10000 100000 1000000
g++-4.9.4 8.16us 5.23us 6.33us 29.56us 259.72us 2.78ms
g++-5.4.0 7.07us 5.08us 6.39us 29.44us 266.27us 2.96ms
g++-6.3.0 7.13us 5.32us 6.45us 28.99us 261.81us 2.86ms
g++-7.1.0 7.03us 5.09us 6.24us 28.61us 252.78us 2.71ms
clang++-3.9.1 7.30us 5.25us 6.57us 30.24us 256.75us 1.99ms
clang++-4.0.1 7.47us 5.14us 5.77us 26.03us 235.87us 1.81ms
zapcc++-1.0 7.51us 5.26us 6.48us 28.86us 258.31us 1.95ms

Interestingly, we can see that gcc-7.1 is the fastest for small vectors while clang-4.0 produces the best code for larger vectors. However, except for the biggest vector size, the difference is not really significant. Apparently, there is a regression in zapcc (or clang-5.0), since it's slower than clang-4.0, at about the same level as clang-3.9.

y = alpha * x + y (axpy)

The third benchmark is the well-known axpy (y = alpha * x + y). This is entirely resolved by expression templates in the library; no specific algorithm is used. Let's see the results:

saxpy 10 100 1000 10000 100000 1000000
g++-4.9.4 38.1ns 61.6ns 374ns 3.65us 40.8us 518us
g++-5.4.0 35.0ns 58.1ns 383ns 3.87us 43.2us 479us
g++-6.3.0 34.3ns 59.4ns 371ns 3.57us 40.4us 452us
g++-7.1.0 34.8ns 59.7ns 399ns 3.78us 43.1us 547us
clang++-3.9.1 32.3ns 53.8ns 297ns 3.21us 38.3us 466us
clang++-4.0.1 32.4ns 59.8ns 296ns 3.31us 38.2us 475us
zapcc++-1.0 32.0ns 54.0ns 333ns 3.32us 38.7us 447us

Even on the biggest vector, this is a very fast operation once vectorized and parallelized. At this speed, some of the observed differences may not be highly significant. Again, the clang-based versions are the fastest on this code, but by a small margin. There also seems to be a slight regression in gcc-7.1, but again it is quite small.

Matrix Matrix multiplication (GEMM)

The next benchmark is testing the performance of a Matrix-Matrix Multiplication, an operation known as GEMM in the BLAS nomenclature. In that case, we test both the naive and the optimized vectorized implementation. To save some horizontal space, I've split the tables in two.
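For reference, here is roughly what the naive (std) kernel computes; this is an illustrative sketch of the algorithm, not ETL's actual implementation:

#include <cstddef>

// Naive single-precision GEMM: C = A * B, with A (n x k), B (k x m) and C (n x m),
// all stored in row-major order.
void sgemm_std(const float* A, const float* B, float* C,
               std::size_t n, std::size_t k, std::size_t m) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            float acc = 0.0f;
            for (std::size_t l = 0; l < k; ++l) {
                acc += A[i * k + l] * B[l * m + j];
            }
            C[i * m + j] = acc;
        }
    }
}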

sgemm (std) 10 20 40 60 80 100
g++-4.9.4 7.04us 50.15us 356.42us 1.18ms 3.41ms 5.56ms
g++-5.4.0 8.14us 74.77us 513.64us 1.72ms 4.05ms 7.92ms
g++-6.3.0 8.03us 64.78us 504.41us 1.69ms 4.02ms 7.87ms
g++-7.1.0 7.95us 65.00us 508.84us 1.69ms 4.02ms 7.84ms
clang++-3.9.1 3.58us 28.59us 222.36us 0.73ms 1.77ms 3.41ms
clang++-4.0.1 4.00us 25.47us 190.56us 0.61ms 1.45ms 2.80ms
zapcc++-1.0 4.00us 25.38us 189.98us 0.60ms 1.43ms 2.81ms
sgemm (std) 200 300 400 500 600 700 800 900 1000 1200
g++-4.9.4 44.16ms 148.88ms 455.81ms 687.96ms 1.47s 1.98s 2.81s 4.00s 5.91s 9.52s
g++-5.4.0 63.17ms 213.01ms 504.83ms 984.90ms 1.70s 2.70s 4.03s 5.74s 7.87s 14.90s
g++-6.3.0 64.04ms 212.12ms 502.95ms 981.74ms 1.69s 2.69s 4.13s 5.85s 8.10s 14.08s
g++-7.1.0 62.57ms 210.72ms 499.68ms 974.94ms 1.68s 2.67s 3.99s 5.68s 7.85s 13.49s
clang++-3.9.1 27.48ms 90.85ms 219.34ms 419.53ms 0.72s 1.18s 1.90s 2.44s 3.36s 5.84s
clang++-4.0.1 22.01ms 73.90ms 175.02ms 340.70ms 0.58s 0.93s 1.40s 1.98s 2.79s 4.69s
zapcc++-1.0 22.33ms 75.80ms 181.27ms 359.13ms 0.63s 1.02s 1.52s 2.24s 3.21s 5.62s

This time, the differences between the compilers are very significant. The clang compilers are leading the way by a large margin here, with clang-4.0 being the fastest of them (by another nice margin). Indeed, clang-4.0.1 produces code that is, on average, about twice as fast as the code generated by the best GCC compiler. Very interestingly as well, we can see a huge regression starting from GCC-5.4 that is still there in GCC-7.1. Indeed, the best GCC version, among the tested versions, is again GCC-4.9.4. Clang is really doing an excellent job of compiling the GEMM code.

sgemm (vec) 10 20 40 60 80 100
g++-4.9.4 264.27ns 0.95us 3.28us 14.77us 23.50us 60.37us
g++-5.4.0 271.41ns 0.99us 3.31us 14.811us 24.116us 61.00us
g++-6.3.0 279.72ns 1.02us 3.27us 15.39us 24.29us 61.99us
g++-7.1.0 273.74ns 0.96us 3.81us 15.55us 31.35us 71.11us
clang++-3.9.1 296.67ns 1.34us 4.18us 19.93us 33.15us 82.60us
clang++-4.0.1 322.68ns 1.38us 4.17us 20.19us 34.17us 83.64us
zapcc++-1.0 307.49ns 1.41us 4.10us 19.72us 33.72us 84.80us
sgemm (vec) 200 300 400 500 600 700 800 900 1000 1200
g++-4.9.4 369.52us 1.62ms 2.91ms 7.17ms 11.74ms 22.91ms 34.82ms 51.67ms 64.36ms 111.15ms
g++-5.4.0 387.54us 1.60ms 2.97ms 7.36ms 12.11ms 24.37ms 35.37ms 52.27ms 65.72ms 112.74ms
g++-6.3.0 384.43us 1.74ms 3.12ms 7.16ms 12.44ms 24.15ms 34.87ms 52.59ms 70.074ms 119.22ms
g++-7.1.0 458.05us 1.81ms 3.44ms 7.86ms 13.43ms 24.70ms 36.54ms 53.47ms 66.87ms 117.25ms
clang++-3.9.1 494.52us 1.96ms 4.80ms 8.88ms 18.20ms 29.37ms 41.24ms 60.72ms 72.28ms 123.75ms
clang++-4.0.1 511.24us 2.04ms 4.11ms 9.46ms 15.34ms 27.23ms 38.27ms 58.14ms 72.78ms 128.60ms
zapcc++-1.0 492.28us 2.03ms 3.90ms 9.00ms 14.31ms 25.72ms 37.09ms 55.79ms 67.88ms 119.92ms

As for the optimized version, it seems that the two families are reversed. Indeed, GCC is doing a better job than clang here, and although the margin is not as big as before, it's still significant. We can still observe a small regression in GCC versions because the 4.9 version is again the fastest. As for clang versions, it seems that clang-5.0 (used in zapcc) has had some performance improvements for this case.

For this case of matrix-matrix multiplication, it's very impressive that the differences in the non-optimized code are so significant. And it's also impressive that each family of compilers has its own strength, clang being seemingly much better at handling unoptimized code while GCC is better at handling vectorized code.

Convolution (2D)

The last benchmark that I considered is the case of the valid convolution on 2D images. The code is quite similar to the GEMM code but more complicated to optimize due to cache locality.
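As with the GEMM, here is a sketch of what the naive (std) valid convolution computes (illustration only, not ETL's kernel):

#include <cstddef>

// Naive 2D valid convolution: the kernel is applied at every position where it
// fully fits inside the image. Whether the kernel is flipped (true convolution)
// or not (cross-correlation) does not change the performance characteristics.
void conv2_valid_std(const float* img, std::size_t ih, std::size_t iw,
                     const float* ker, std::size_t kh, std::size_t kw,
                     float* out /* (ih - kh + 1) x (iw - kw + 1) */) {
    const std::size_t oh = ih - kh + 1;
    const std::size_t ow = iw - kw + 1;
    for (std::size_t i = 0; i < oh; ++i) {
        for (std::size_t j = 0; j < ow; ++j) {
            float acc = 0.0f;
            for (std::size_t ki = 0; ki < kh; ++ki) {
                for (std::size_t kj = 0; kj < kw; ++kj) {
                    acc += img[(i + ki) * iw + (j + kj)] * ker[ki * kw + kj];
                }
            }
            out[i * ow + j] = acc;
        }
    }
}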

sconv2_valid (std) 100x50 105x50 110x55 115x55 120x60 125x60 130x65 135x65 140x70
g++-4.9.4 27.93ms 33.68ms 40.62ms 48.23ms 57.27ms 67.02ms 78.45ms 92.53ms 105.08ms
g++-5.4.0 37.60ms 44.94ms 54.24ms 64.45ms 76.63ms 89.75ms 105.08ms 121.66ms 140.95ms
g++-6.3.0 37.10ms 44.99ms 54.34ms 64.54ms 76.54ms 89.87ms 105.35ms 121.94ms 141.20ms
g++-7.1.0 37.55ms 45.08ms 54.39ms 64.48ms 76.51ms 92.02ms 106.16ms 125.67ms 143.57ms
clang++-3.9.1 15.42ms 18.59ms 22.21ms 26.40ms 31.03ms 36.26ms 42.35ms 48.87ms 56.29ms
clang++-4.0.1 15.48ms 18.67ms 22.34ms 26.50ms 31.27ms 36.58ms 42.61ms 49.33ms 56.80ms
zapcc++-1.0 15.29ms 18.37ms 22.00ms 26.10ms 30.75ms 35.95ms 41.85ms 48.42ms 55.74ms

In that case, we can observe the same as for the GEMM. The clang-based versions are producing significantly faster code than the GCC versions. Moreover, we can also observe the same large regression starting from GCC-5.4.

sconv2_valid (vec) 100x50 105x50 110x55 115x55 120x60 125x60 130x65 135x65 140x70
g++-4.9.4 878.32us 1.07ms 1.20ms 1.68ms 2.04ms 2.06ms 2.54ms 3.20ms 4.14ms
g++-5.4.0 853.73us 1.03ms 1.15ms 1.36ms 1.76ms 2.05ms 2.44ms 2.91ms 3.13ms
g++-6.3.0 847.95us 1.02ms 1.14ms 1.35ms 1.74ms 1.98ms 2.43ms 2.90ms 3.12ms
g++-7.1.0 795.82us 0.93ms 1.05ms 1.24ms 1.60ms 1.77ms 2.20ms 2.69ms 2.81ms
clang++-3.9.1 782.46us 0.93ms 1.05ms 1.26ms 1.60ms 1.84ms 2.21ms 2.65ms 2.84ms
clang++-4.0.1 767.58us 0.92ms 1.04ms 1.25ms 1.59ms 1.83ms 2.20ms 2.62ms 2.83ms
zapcc++-1.0 782.49us 0.94ms 1.06ms 1.27ms 1.62ms 1.83ms 2.24ms 2.65ms 2.85ms

This time, clang manages to produce excellent results. Indeed, all the produced executables are significantly faster than the versions produced by GCC, except for GCC-7.1, which produces similar results. The other versions of GCC are falling behind, it seems. It seems that it was only for the GEMM that clang had a lot of trouble handling the optimized code.

Conclusion

Clang seems to have recently done a lot of optimizations regarding compilation time. Indeed, clang-4.0.1 is much faster for compilation than clang-3.9. Although GCC-7.1 is faster than GCC-6.3, all the GCC versions are slower than GCC-4.9.4 which is the fastest at compiling code with optimizations. GCC-7.1 is the fastest GCC version for compiling code in debug mode.

In some cases, there is almost no difference in the generated code between the different compilers. However, for more complex algorithms such as the matrix-matrix multiplication or the two-dimensional convolution, the differences can be quite significant. In my tests, Clang has shown itself to be much better at compiling the unoptimized code. However, and especially in the GEMM case, it seems to be worse than GCC at handling the hand-optimized code. I will investigate that case and try to tailor the code so that clang has a better time with it.

For me, it's really weird that the GCC regression, apparently starting from GCC-5.4, has still not been fixed in GCC 7.1. I was thinking of dropping support for GCC-4.9 in order to go full C++14 support, but now I may have to reconsider my position. However, seeing that GCC is generally the best at handling optimized code (especially for GEMM), I may be able to do the transition, since the optimized code will be used in most cases.

As for zapcc, although it is still the fastest compiler in debug mode, with the new speed of clang-4.0.1 its margin is quite small. Moreover, on optimized builds, it's not as fast as GCC. If you use clang and have access to zapcc, it's still quite a good option to save some time.

Overall, I have been quite pleased by clang-4.0.1 and GCC-7.1, the most recent versions I have been testing. It seems that they did quite some good work. I will definitely run some more tests with them and try to adapt the code. I'm still considering whether I will drop support for some older compilers.

I hope this comparison was interesting :) My next post will probably be about the difference in performance between my machine learning framework and other frameworks to train neural networks.


Expression Templates Library (ETL) 1.1
   Posted:


ETL Logo

It took me longer than I thought, but I'm glad to announce the release of version 1.1 of my Expression Templates Library (ETL) project. This is a major new release with many improvements and new features. It's been almost one month since the last, and first, release (1.0). I should have done some minor releases in the meantime, but at least the library is now in good shape for a major version.

It may be interesting to note that my machine learning framework (DLL), based on the ETL library, has been shown to be faster than all the tested popular frameworks (TensorFlow, Keras, Caffe, Torch, DeepLearning4J) for training various neural networks on CPU. I'll post more details in another post in the coming weeks, but this shows that special attention has been paid to performance in this library and that it is well adapted to machine learning.

For those of you who don't follow my blog, ETL is a library providing Expression Templates for computations on matrices and vectors. For instance, if you have three matrices A, B and C, you could write C++ code like this:

C = (2.0 * (A + B)) / sum(A)

Or given vectors b, v, h and a matrix W, you could write code like this:

h = sigmoid(b + v * W)

The goal of such a library is two-fold. First, it makes the expressions more readable and as close to math as possible. Second, it allows the library to compute the expressions as fast as possible. In the first example, the framework will compute the sum using a vectorized algorithm and then compute the overall expression using, yet again, vectorized code. The expression can also be computed in parallel if the matrices are big enough. In the second example, the vector-matrix multiplication will be computed first, using either hand-coded optimized vectorized code or a BLAS routine (depending on configuration options). Then, the whole expression will be executed using vectorized code.
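As a minimal usage sketch (assuming etl::dyn_matrix is the dynamic matrix type and "etl/etl.hpp" the main header; check the documentation for the exact names), the first expression above could look like this in user code:

#include "etl/etl.hpp"

int main() {
    etl::dyn_matrix<float> A(100, 100);
    etl::dyn_matrix<float> B(100, 100);
    etl::dyn_matrix<float> C(100, 100);

    // ... fill A and B ...

    // The right-hand side only builds an expression template; the actual
    // (vectorized, possibly parallel) evaluation happens on assignment to C.
    C = (2.0f * (A + B)) / etl::sum(A);

    return 0;
}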

Features

Many new features have been integrated into the library.

The support for machine learning operations has been improved. There are now specific helpers for machine learning in the etl::ml namespace, with names that are standard in machine learning. A real transposed convolution has been implemented, with support for padding and stride. Batched outer product and batched bias averaging are also now supported. The activation function support has been improved and the derivatives have been reviewed. The pooling operators have also been improved with stride and padding support. Unrelated to machine learning, 2D and 3D pooling can now also be done on higher-dimensional matrices.

New functions are also available for matrices and vectors. The support for square root has been improved with cubic root and inverse root. Support has also been added for floor and ceil. Moreover, comparison operators are now available as well as global functions such as approx_equals.

New reductions have also been added, with support for absolute sum and mean (asum/asum) and for min_index and max_index, which return the index of the minimum element, respectively the maximum. Finally, argmax can now be used to get the max index in each sub-dimension of a matrix; argmax on a vector is equivalent to max_index.

Support for shuffling has also been added. By default, shuffling a vector means shuffling all elements, and shuffling a matrix means shuffling the sub-matrices (only the first dimension is shuffled), but shuffling a matrix as a vector is also possible. Shuffling two vectors or two matrices in parallel is also possible; in that case, the same permutation is applied to both containers. As a side note, all operations using random generation are also available with an additional parameter for the random generator, which can help improve reproducibility or simply tune the random generator.

I've also included support for matrix adapters. There are adapters for hermitian matrices, symmetric matrices and lower and upper triangular matrices. For now, the framework does not take advantage of this information (this will be done later), but it guarantees the different constraints on the content.

There are also a few more minor features. Maybe not so minor: matrices can now be sliced into sub-matrices. A matrix can be divided into several sub-matrices, and modifying a sub-matrix will modify the source matrix. The sub-matrices are available in 2D, 3D and 4D for now. There are also some other ways of slicing matrices and vectors: it is possible to obtain a slice of their memory or a slice of their first dimension. Another new feature is that it is now possible to compute the cross product of vectors. Matrices can be decomposed into their Q/R decomposition rather than only their PALU decomposition. Special support has been added for matrices and vectors of booleans; in that case, they support logical operators such as and, not and or.

Performance

I've always considered the performance of this library to be a feature in itself. I consider the library to be quite fast, especially its convolution support, even though there is still room for improvement. Therefore, many improvements have been made to the performance of the library since the last release. As said before, this library was used in a machine learning framework which then proved faster than the most popular neural network frameworks on CPU. I'll present here the most important new performance improvements, in no particular order, every bit being important in my opinion.

First, several operations have been optimized to be faster.

Multiplications of matrices, or of matrices and vectors, are now much faster if one of the matrices is transposed. Instead of performing the slow transposition, different kernels are used in order to maximize performance without doing any transposition, although sometimes the transposition is performed when it is faster. This leads to very significant improvements, up to 10 times faster in the best case. This is done for the vectorized kernels and also for BLAS and CUBLAS calls. These new kernels are also directly used when matrices of different storage order are mixed. For instance, multiplying a column-major matrix with a row-major matrix and storing the result in a column-major matrix is now much more efficient than before. Moreover, the transpose operation itself is also much faster than before.

A lot of machine learning operations have also been highly optimized. All the pooling and upsample operators are now parallelized and the most used kernel (2x2 pooling) is now more optimized. The 4D convolution kernels (for machine learning) have been greatly improved. There are now very specialized vectorized kernels for classic kernel configurations (for instance 3x3 or 5x5) and the selection of implementations is smarter than before. The support for padding is now much better for small amounts of padding. Moreover, for small kernels the full convolution can now be evaluated using the valid convolution kernels directly with some padding, for much faster overall performance. The exponential operation is now vectorized, which allows operations such as sigmoid or softmax to be much faster.

Matrices and vectors now automatically use aligned memory. This means that vectorized code can use aligned operations, which may be slightly faster. Moreover, matrices and vectors are now padded to a multiple of the vector size. This allows removing the final non-vectorized remainder loop from the vectorized code. This is only done at the end of matrices, when they are accessed in a flat way. Contrary to some frameworks, the inner dimensions of the matrix are not padded. Finally, accesses to 3D and 4D matrices are now much faster than before.
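To illustrate the idea, here is a minimal sketch (not ETL's actual code), assuming AVX with 8 floats per vector: when the buffers are aligned and the flat size is padded to a multiple of the vector width, the whole loop can use aligned vector operations and needs no scalar tail loop.

#include <cstddef>
#include <immintrin.h>

// a += b element-wise; both buffers are assumed 32-byte aligned and
// padded_size is assumed to be a multiple of 8 (the AVX width for floats).
void add_inplace(float* a, const float* b, std::size_t padded_size) {
    for (std::size_t i = 0; i < padded_size; i += 8) {
        __m256 va = _mm256_load_ps(a + i);              // aligned load
        __m256 vb = _mm256_load_ps(b + i);              // aligned load
        _mm256_store_ps(a + i, _mm256_add_ps(va, vb));  // aligned store
    }
}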

Then, the parallelization feature of ETL has been completely reworked. Before, there was a thread pool for each algorithm that was parallelized. Now, there is a global thread engine with one thread pool. Since parallelization is not nested in ETL, this improves performance slightly by greatly diminishing the number of threads that are created throughout an application. Another big difference in parallel dispatching is that it can now detect good splits based on alignment, so that each split is aligned. This then allows the vectorization process to use aligned stores and loads instead of unaligned ones, which may be faster on some processors.

Vectorization has also been greatly improved in ETL. Integer operations are now automatically vectorized on processors that support this; before, only floating point operations were vectorized. The automatic vectorizer is now able to use non-temporal stores for very large operations. A non-temporal store bypasses the cache, thus gaining some time; since very large matrices do not fit in cache anyway and the cache would end up being overwritten, this is a net gain. Moreover, the alignment detection in the automatic vectorizer has also been improved. Support for Fused-Multiply-Add (FMA) operations has been integrated in the algorithms that can make use of it (multiplications and convolutions). The matrix-matrix and vector-matrix multiplications now have highly optimized vectorized kernels, and they also have versions for column-major matrices. I plan to reintegrate a version of the GEMM based on BLIS in the future, but with more optimizations and support for all precisions and integers; for now, my version is still slower than the simple vectorized version. The sum and the dot product operations now also have specialized vectorized implementations. The min and max operations are now automatically vectorized. Several other algorithms also have their own vectorized implementations.

Last, but not least, the GPU support has been almost completely reworked. Now, several operations can be chained without any copies between GPU and CPU. Several new operations have also been added with GPU support (convolutions, pooling, sigmoid, ReLU, ...). Moreover, to complement operations that are not available in any of the supported NVIDIA libraries, I've created a simple library that can be used to add a few more GPU operations. Nevertheless, a lot of operations are still missing, and only algorithms are available on the GPU, not expressions (such as c = a + b * 1.0), which are still entirely computed on CPU. I have plans to improve that further, probably for version 1.2. The different contexts necessary for the NVIDIA libraries can now be cached (using an option of ETL), leading to much faster code. Only the main handle can be cached so far; I plan to try to cache all the descriptors, but I don't know yet when that will be ready. Finally, an option is also available to reuse GPU memory instead of directly releasing it to CUDA. This uses a custom memory pool and can save some time. Since this needs to be cleaned up (by a call to etl::exit() or using ETL_PROLOGUE), it is only activated on demand.

Other changes

There have also been a lot of refactorings in the code of the library. A lot of expressions now have less overhead and are specialized for performance. Moreover, temporary expressions have been totally reworked to be simpler, more maintainable and easier to optimize in the future. It's also probably easier to add new expressions to the framework now, although that could be made even simpler. There is also less duplicated code in the different expressions. In particular, there are now no more separate SSE and AVX variants in the code: all the optimized algorithms now use the vectorization system of the library.

I also tried my best to reduce the compilation time, based on the unit tests. This is still not great but better than before. For user code, the next version should be much faster to compile since I plan to disable forced selection of implementations by default and only enable it on demand.

Finally, there were also quite a few bug fixes. Most of them were found through the use of the library in the Deep Learning Library (DLL) project. Some were very small edge cases. For instance, the transposition algorithm was not working on GPU for rectangular column-major matrices. There was also a slight bug in the Q/R decomposition and in the pooling of 4D matrices.

What's next ?

Next time, I may do some minor releases, but I don't have a complete plan yet. For the next major release (probably 1.2), here is what is planned:

  • Review the system for selection of algorithms to reduce compilation time
  • Review the GPU system to allow more complete support for standard operators
  • Switch to C++17: there are many improvements that could be done to the code with C++17 features
  • Add support for convolution on mixed types (float/double)
  • More tests for sparse matrix
  • More algorithms support for sparse matrix
  • Reduce the compilation time of the library in general
  • Reduce the compilation and execution time of the unit tests

These are pretty big changes, especially the first two, so maybe it'll be split into several releases. It will really depend on the time I have. As for C++17, I really want to try it and I have a lot of points that could profit from the switch, but that would mean setting GCC 7.1 and Clang 3.9 as minimum requirements, which may not be reasonable for every user.

Download ETL

You can download ETL on Github. If you are only interested in the 1.1 version, you can look at the Releases page or clone the tag 1.1. There are several branches:

  • master Is the eternal development branch, may not always be stable
  • stable Is a branch always pointing to the last tag, no development here

For future releases, there will always be tags pointing to the corresponding commits. I'm not following the git flow way; I'd rather try to have a more linear history with one eternal development branch, rather than a useless develop branch or a load of other branches for releases.

The documentation is a bit sparse. There are a few examples and the Wiki, but there is still work to be done. If you have questions on how to use or configure the library, please don't hesitate to ask.

Don't hesitate to comment on this post if you have any comments on this library or any questions. You can also open an Issue on Github if you have a problem using this library, or propose a Pull Request if you have any contribution you'd like to make.

Hope this may be useful to some of you :)


Update on Deep Learning Library (DLL): Dropout, Batch Normalization, Adaptive Learning Rates, ...
   Posted:


It's been a while since I've posted something on this, especially since I had one month of vacation. This year, I've been able to integrate a great number of changes into my Deep Learning Library (DLL) project. It has seen a lot of refactoring and a lot of new features, making it look like a real neural network library now. In this post, I'll try to outline the latest new features and changes of the library.

For those that don't know, DLL is a library for neural network training, written in C++ and for C++. You can train Fully-Connected Neural Networks and Convolutional Neural Networks. The focus of the framework is on speed and ease of use in C++.

As for my ETL project and again thanks to my thesis supervisor, the project now has a logo:

DLL Logo

Adaptive Learning Rates

Before, the framework only supported simple SGD and Momentum updates for the different parameters of the network. Moreover, it was not very extensible. Therefore, I reworked the system to be able to configure an optimizer for each network to train. Once that was done, the first thing I did was to add support for Nesterov Accelerated Gradients (NAG) as a third optimizer. After this, I realized it was then easy to integrate support for more advanced optimizers, including support for adaptive learning rates. This means that the learning rate will be adapted for each parameter depending on what the network is learning. Some of the optimizers don't even need any learning rate. So far, I've implemented support for the following optimizers: Adagrad, RMSProp, Adam (with and without bias correction), Adamax (Adam with infinite norm), Nadam (Adam with Nesterov momentum) and Adadelta (no more learning rate). The user can now choose the optimizer of their choice, for instance NADAM, as a parameter of the network:

// Use a Nadam optimizer
dll::updater<dll::updater_type::NADAM>

Another improvement in the same domain is that the learning rate can also be decayed over time automatically by the optimizer.

If you want more information on the different optimizers, you can have a look at this very good article: An overview of gradient descent optimization algorithms from Sebastian Ruder.

Better loss support

Before, DLL automatically used a Categorical Cross Entropy Loss, but it was not possible to change it and it was not even possible to see the loss over time. Now, the current value of the loss is displayed after each epoch of training and the loss used for training is configurable. So far, only three different losses are supported, but it is not difficult to add new losses to the system. The three supported losses are: Categorical Cross Entropy Loss, Binary Cross Entropy Loss and Mean Squared Error Loss.

Again, each network can specify the loss to use:

// Use a Binary Cross Entropy Loss
dll::loss<dll::loss_function::BINARY_CROSS_ENTROPY>

Dropout

Dropout is a relatively new technique for neural network training. It is especially made to reduce overfitting, since a large number of sub-networks will be trained, and it should prevent co-adaptation between different neurons. The technique is relatively simple: it randomly sets some of the input neurons of a layer to zero. At each batch, a new mask is used, and this should lead to a large number of sub-networks being trained.

Here is an example of an MLP with Dropout (p=0.5):

using network_t = dll::dyn_dbn_desc<
    dll::dbn_layers<
        dll::dense_desc<28 * 28, 500>::layer_t,
        dll::dropout_layer_desc<50>::layer_t,
        dll::dense_desc<500, 250>::layer_t,
        dll::dropout_layer_desc<50>::layer_t,
        dll::dense_desc<250, 10, dll::activation<dll::function::SOFTMAX>>::layer_t>
    , dll::updater<dll::updater_type::MOMENTUM>     // Momentum
    , dll::batch_size<100>                          // The mini-batch size
    , dll::shuffle                                  // Shuffle before each epoch
>::dbn_t;

Batch Normalization

Batch Normalization is another new technique for training neural networks. This technique ensures that each layer receives inputs that look kind of similar. This is a very large advantage, since it reduces the difference in impact of hyperparameters on different layers. Google reported much faster training with this technique, by getting rid of Dropout and by increasing the learning rate of training.

Here is an example of using Batch Normalization in a CNN:

using network_t = dll::dyn_dbn_desc<
    dll::dbn_layers<
        dll::conv_desc<1, 28, 28, 8, 5, 5>::layer_t,
        dll::batch_normalization_layer_4d_desc<8, 24, 24>::layer_t,
        dll::mp_layer_2d_desc<8, 24, 24, 2, 2>::layer_t,
        dll::conv_desc<8, 12, 12, 8, 5, 5>::layer_t,
        dll::batch_normalization_layer_4d_desc<8, 8, 8>::layer_t,
        dll::mp_layer_2d_desc<8, 8, 8, 2, 2>::layer_t,
        dll::dense_desc<8 * 4 * 4, 150>::layer_t,
        dll::batch_normalization_layer_2d_desc<150>::layer_t,
        dll::dense_desc<150, 10, dll::activation<dll::function::SOFTMAX>>::layer_t>
    , dll::updater<dll::updater_type::ADADELTA>     // Adadelta
    , dll::batch_size<100>                          // The mini-batch size
    , dll::shuffle                                  // Shuffle the dataset before each epoch
>::dbn_t;

You may notice that the layer is a 4D one, so it should only be used after a convolutional layer (or after the input). If you want to use it after fully-connected layers, you can use the 2D version, which works the same way.

Better dataset support

At the beginning, I designed DLL so that the user could directly pass data for training in the form of STL containers such as std::vector. This is good in some cases, but sometimes the user does not know how to read the data, or does not want to be bothered with it. Therefore, several data set readers are now available. Moreover, the entire system has been reworked to use generators for data. A generator is simply a concept that has some data to produce. The advantage of this new system is that data augmentation is now supported everywhere and much more efficiently than before. It is now possible to perform random cropping and mirroring of images, for instance. Moreover, the data augmentation can be done in a secondary thread, so as to be sure that there is always enough data available for the training.

The library now has a powerful dataset reader for both MNIST and CIFAR-10, and the reader for ImageNet is almost ready. The project has already been used and tested with these three datasets. Moreover, support for directly passing STL containers has been maintained. In this case, a generator is simply created around the data provided in the container and the generator is then passed to the system for training.

Here for instance is how to read MNIST data and scale (divide) all pixel values by 255:

// Load the dataset
auto dataset = dll::make_mnist_dataset(0, dll::batch_size<100>{}, dll::scale_pre<255>{});
dataset.display();

// Train the network
net->fine_tune(dataset.train(), 25);

// Test the network
net->evaluate(dataset.test());

Much faster performance

I've spent quite a lot of time improving the performance of the framework. I've focused on every part of training in order to make training of neural networks as fast as possible. I've also made a comparison of the framework against several popular machine learning frameworks (Caffe, TensorFlow, Keras, Torch and DeepLearning4J). For instance, here are the results of a small CNN experiment on MNIST with all the different frameworks, in CPU mode and in GPU mode:

DLL Comparison Against other frameworks

As you can see, DLL is by far the fastest framework on CPU. On GPU, there is still some work to be done, but this is already ongoing (although a lot of work remains). This is confirmed on each of the four experiments performed on MNIST, CIFAR-10 and ImageNet, although the margin is smaller for larger networks (still about 40% faster than TensorFlow and Keras, which are the fastest frameworks after DLL on CPU in my tests).

Overall, DLL is between 2 and 4 times faster than before and is always the fastest framework for neural network training when training is performed on CPU.

I proposed a talk about these optimizations and the performance results for Meeting C++ this year, but it has unfortunately not been accepted. We have also submitted a publication about the framework to a conference taking place later this year.

Examples

The project now has a few examples (available here). They are well designed and I try to keep them up to date with the latest changes to the framework.

For instance, here is the CNN example for MNIST (without includes):

int main(int /*argc*/, char* /*argv*/ []) {
    // Load the dataset
    auto dataset = dll::make_mnist_dataset(0, dll::batch_size<100>{}, dll::scale_pre<255>{});

    // Build the network

    using network_t = dll::dyn_dbn_desc<
        dll::dbn_layers<
            dll::conv_desc<1, 28, 28, 8, 5, 5>::layer_t,
            dll::mp_layer_2d_desc<8, 24, 24, 2, 2>::layer_t,
            dll::conv_desc<8, 12, 12, 8, 5, 5>::layer_t,
            dll::mp_layer_2d_desc<8, 8, 8, 2, 2>::layer_t,
            dll::dense_desc<8 * 4 * 4, 150>::layer_t,
            dll::dense_desc<150, 10, dll::activation<dll::function::SOFTMAX>>::layer_t>
        , dll::updater<dll::updater_type::MOMENTUM>     // Momentum
        , dll::batch_size<100>                          // The mini-batch size
        , dll::shuffle                                  // Shuffle the dataset before each epoch
    >::dbn_t;

    auto net = std::make_unique<network_t>();

    net->learning_rate = 0.1;

    // Display the network and dataset
    net->display();
    dataset.display();

    // Train the network
    net->fine_tune(dataset.train(), 25);

    // Test the network on test set
    net->evaluate(dataset.test());

    return 0;
}

Reproducible results

And last, but maybe not least, I've finally unified all the random number generation code. This means that DLL can now set a global seed, and that two trainings of the same network on the same data with the same seed will produce exactly the same result.

The usage is extremely simple:

dll::set_seed(42);
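
For instance, the seed can be set once at the beginning of the program, before the dataset and the network are created. The following is a minimal sketch based on the MNIST example above (it reuses the network_t type from that example); running the program twice should then produce identical results.

// Set the global seed before any data or network is created
dll::set_seed(42);

// Load the dataset and train as usual
auto dataset = dll::make_mnist_dataset(0, dll::batch_size<100>{}, dll::scale_pre<255>{});

auto net = std::make_unique<network_t>();
net->fine_tune(dataset.train(), 25);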

Conclusion

After all these changes, I truly feel that the library is now in a much better state and could be used in several kinds of projects. I hope it will be useful to more people. Moreover, as the performance results show, the framework is now extremely efficient at training neural networks on CPU.

If you want more information, you can consult the dll Github repository. You can also add a comment to this post. If you find any problem with the project or have a specific question or request, don't hesitate to open an issue on Github.


Jenkins Tip: Send notifications on fixed builds in declarative pipeline
   Posted:


In my previous post, I presented some news about Jenkins and mentioned that I switched to declarative pipelines and Github Organization support for my projects.

The main issue I had with this system is that I lost the ability to get notifications on builds that recover. Normally, I would get an email indicating that build X was back to normal, but I hadn't found a way to get this behaviour with declarative pipelines.

By following a few posts on StackOverflow, I now have a solution. It turns out to be the same problem that was already present in scripted pipelines: the status of the current build is not set early enough for the notification. Basically, you have to set the status yourself. Here is what such a declarative pipeline looks like:

pipeline {
    agent any

    stages {

        // Normal Stages

        stage ('success'){
            steps {
                script {
                    currentBuild.result = 'SUCCESS'
                }
            }
        }
    }

    post {
        failure {
            script {
                currentBuild.result = 'FAILURE'
            }
        }

        always {
            step([$class: 'Mailer',
                notifyEveryUnstableBuild: true,
                recipients: "[email protected]",
                sendToIndividuals: true])
        }
    }
}

There are two important things here. First, a new stage (success) is added that simply sets the result of the current build to SUCCESS once the previous stages are done. It must be the last stage of the pipeline. This could also be added as the last step of the last stage instead of adding a new stage, but I think it's clearer like this. The second thing is the failure block, in which the result of the current build is set to FAILURE. With these two things, the Mailer plugin now sends a notification when a build has been fixed.

I hope this will help some of you. I personally think it should be much easier than that: all this boilerplate pollutes a pipeline that should be kept maintainable. For now, however, it seems to be the nicest way to achieve this, short of handling all the conditions in the post block and sending the mail directly from there, which would result in even more boilerplate code.


Jenkins Declarative Pipeline and Awesome Github Integration
   Posted:


This post is about some Jenkins news and how I've updated my own Jenkins usage. It may be a bit of an enthusiastic post ;)

In the early days of Jenkins, the best way to define the commands to be executed for your builds was simply to write them in the Jenkins interface. This worked quite well. Later on, Jenkins introduced the notion of Pipeline: instead of a single set of commands, the build is defined as a multi-stage pipeline of commands, written as a Groovy script. One big advantage of this is that all the code for creating the build lives inside the repository, which makes each build reproducible. It also makes it possible to define complex pipelines of commands and gives a clean view of which steps are failing and how much of the build time each step takes. For instance, here is a view of the pipeline steps for my DLL project:

Jenkins stage view for DLL pipeline.

I think that's pretty cool :)

They recently added a new feature, declarative pipelines. Instead of scripting the Pipeline in Groovy, the new system uses its own, completely declarative syntax to put blocks together, perform actions at specific points, set up the environment and so on. I think the new syntax is much nicer than the Groovy scripted Pipeline, so I started converting my scripts. I'll give an example in a few paragraphs, but first I'd like to talk about Github integration.

Before, every time I created a new project, I had to add it to Jenkins by creating a new Jenkins project, setting the link to the Github project and configuring a few other things. This is not so bad, but what if you want to build several branches and keep track of the status of each branch, and maybe of the Pull Requests as well? All of this is now very simple. You can declare the Github organizations (and users) you are building projects from, and the projects inside the organization will be detected automatically as long as they contain a Jenkinsfile. That means you never have to create a project yourself or handle branches. Indeed, all the created projects can now handle multiple branches. For instance, here is the status of the two current branches of my dll project:

Jenkins branches for DLL Github project

It's maybe not the best example since one branch is failing and the other is unstable, but you can see that you can track the builds for each branch in a nice way.

A very good feature of this integration is that Jenkins will now automatically mark commits on Github with the status of your builds, at no extra cost! For instance, here is the status of my ETL project after I configured it on Jenkins and ran the first builds:

Jenkins marking commits in Github

Pretty cool I think :)

Another nice thing in Jenkins is the Blue Ocean interface. This is an alternative interface, especially well-suited for multi-branch projects and pipelines. It looks much more modern and I think it's quite good. Here are a few views of it:

The Activity view for the last events of the project:

Jenkins Blue Ocean Activity view for DLL project

The Branches view for the status of each branch:

Jenkins Blue Ocean Branches view for DLL project

The view of the status of a build:

Jenkins Blue Ocean view of a build for DLL project

The status of the tests for a given build:

Jenkins Blue Ocean view of a build tests for DLL project

It's likely that it won't appeal to everyone, but I think it's pretty nice.

Getting back to declarative pipelines, here is the one for my Expression Templates Library (ETL) project:

pipeline {
    agent any

    environment {
       CXX = "g++-4.9.4"
       LD = "g++-4.9.4"
       ETL_MKL = 'true'
    }

    stages {
        stage ('git'){
            steps {
                checkout([
                    $class: 'GitSCM',
                    branches: scm.branches,
                    doGenerateSubmoduleConfigurations: false,
                    extensions: scm.extensions + [[$class: 'SubmoduleOption', disableSubmodules: false, recursiveSubmodules: true, reference: '', trackingSubmodules: false]],
                    submoduleCfg: [],
                    userRemoteConfigs: scm.userRemoteConfigs])
            }
        }

        stage ('pre-analysis') {
            steps {
                sh 'cppcheck --xml-version=2 -j3 --enable=all --std=c++11 `git ls-files "*.hpp" "*.cpp"` 2> cppcheck_report.xml'
                sh 'sloccount --duplicates --wide --details include/etl test workbench > sloccount.sc'
                sh 'cccc include/etl/*.hpp test/*.cpp workbench/*.cpp || true'
            }
        }

        stage ('build'){
            steps {
                sh 'make clean'
                sh 'make -j6 release'
            }
        }

        stage ('test'){
            steps {
                sh 'ETL_THREADS=-j6 ETL_GPP=g++-4.9.4 LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:/opt/intel/mkl/lib/intel64:/opt/intel/lib/intel64\" ./scripts/test_runner.sh'
                archive 'catch_report.xml'
                junit 'catch_report.xml'
            }
        }

        stage ('sonar-master'){
            when {
                branch 'master'
            }
            steps {
                sh "/opt/sonar-runner/bin/sonar-runner"
            }
        }

        stage ('sonar-branch'){
            when {
                not {
                    branch 'master'
                }
            }
            steps {
                sh "/opt/sonar-runner/bin/sonar-runner -Dsonar.branch=${env.BRANCH_NAME}"
            }
        }

        stage ('bench'){
            steps {
                build job: 'etl - benchmark', wait: false
            }
        }
    }

    post {
        always {
            step([$class: 'Mailer',
                notifyEveryUnstableBuild: true,
                recipients: "[email protected]",
                sendToIndividuals: true])
        }
    }
}

There is nothing really fancy about it; it's probably an average pipeline. Moreover, since I'm not an expert on pipelines and have only just discovered declarative pipelines, it may not be optimal, but it works. As you'll see, there are some problems I haven't been able to fix.

The first part declares the environment variables for the build. Then the build stages are listed. The first stage checks out the code from the SCM. This ugly piece of code is there to allow checking out the submodules; it is the only solution I have found so far. It's very ugly, but it works. The second stage simply runs some basic static analysis. The next stage is the classical build step. Then the tests are run. In that case, I'm using a script because the tests are compiled with several different sets of options and it was much easier to put that in a script than in the pipeline; it also means I can run them standalone. The environment variables on the line that runs the script are another problem I haven't been able to fix so far: if I declare them in an environment block, they are not passed to the script for some reason, so I had to use this ugly line.

The next two blocks are for the Sonar analysis. If you are starting with Sonar, you can simply use the second block, which passes the branch information to Sonar. Unfortunately, Sonar is very limited in terms of Git branches: each branch is considered a totally different project, which means that the false positives flagged in the master branch will not be reused in the other branches. Therefore, I kept a clean master project and several different projects for the other branches. Once Sonar improves its branch handling, if it ever does, I'll be able to get rid of one of these conditional stages. The last stage simply triggers the benchmark job.

Finally, the post block uses the Mailer plugin to send information about failed builds. Again, there is a problem here, since this does not send "Back to normal" notifications as it used to before. I've asked a question about this on StackOverflow, but haven't received an answer so far. I'll post a better solution on this blog once I have one. If any of you have solutions to these problems, don't hesitate to post in the comments below or to contact me on Github.

That's it! I really think Jenkins is getting even better with all this cool stuff, and I advise you to try it out!


C++ Containers Benchmark: vector/list/deque and plf::colony
   Posted:


More than three years ago, I wrote a benchmark of some of the STL containers, namely the vector, the list and the deque. Since that article was very popular, I decided to improve the benchmarks and collect all the results again. There are now more benchmarks, and some problems in the benchmark code have been fixed. Moreover, I have also added a new container, plf::colony. Therefore, four containers are tested:

  • The std::vector: This is a dynamically-resized array of elements. All the elements are contiguous in memory. If an element is inserted or removed at a position other than the end, the following elements are moved to close or open the gap. Elements at random positions can be accessed in constant time. The array is resized so that it can hold several more elements rather than being resized at each insert operation, which means that insertion at the end of the container is done in amortized constant time.
  • The std::deque: The deque is a container that offers constant-time insertion both at the front and at the back of the collection. In current C++ standard libraries, it is implemented as a collection of dynamically allocated fixed-size arrays. Not all elements are contiguous, but depending on the size of the data type, it still has good data locality. Access to a random element is also done in constant time, but with more overhead than for the vector. For insertions and removals at random positions, the elements are shifted either towards the front or towards the back, which means it is generally faster than the vector, about twice as fast on average.
  • The std::list: This is a doubly-linked list. It supports constant-time insertion at any position of the collection. However, it does not support constant-time random access. The elements are obviously not contiguous, since each of them is allocated in its own node. For small elements, this container has a very large memory overhead.
  • The plf::colony: This is a non-standard, unordered container, meaning that the insertion order will not necessarily be preserved. It provides strong iterator guarantees: pointers and iterators to non-erased elements are not invalidated by insertion or erasure. It is especially tailored for workloads with a lot of insertions and erasures. Moreover, it is specifically optimized for non-scalar types, namely structs and classes with a relatively large data size (greater than 128 bits according to the official documentation). Its implementation is more complicated than that of the other containers: it is also made of a list of memory blocks, but of increasingly large sizes. When elements are erased, their position is not removed but marked as erased so that it can be reused for fast insertion later on. This container uses the same conventions as the standard containers and was proposed for inclusion in the standard library, which is the main reason it's included in this benchmark. If you want more information, you can consult the official website. A short usage sketch is shown right after this list.
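
As a quick illustration of the plf::colony conventions mentioned in the last point, here is a minimal sketch. It assumes the single-header version of the library with the header name plf_colony.h; the header name and exact API details may differ between versions.

#include <iostream>

#include "plf_colony.h"

int main() {
    plf::colony<int> colony;

    // Insert a few elements; the insertion order is not necessarily preserved
    auto first  = colony.insert(1);
    auto second = colony.insert(2);
    colony.insert(3);

    // Erasing an element does not invalidate iterators to the other elements
    colony.erase(second);

    // first is still a valid iterator here
    std::cout << *first << std::endl;

    // Iterate over the remaining elements (order is unspecified)
    for (auto value : colony) {
        std::cout << value << std::endl;
    }

    return 0;
}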

In the text and results, the namespaces are omitted. Note that I have only included sequence containers in my test. These are the most common containers in practice, and they are also the containers I'm the most familiar with. I could have included multiset in this benchmark, but since its interface and purpose are different, I didn't want to make the benchmark confusing.

All the examples are compiled with g++-4.9.4 (-std=c++11 -march=native -O2) and run on a Gentoo Linux machine with an Intel Core i7-4770 at 3.4GHz.

For each graph, the vertical axis represents the time necessary to perform the operations, so lower values are better. The horizontal axis is always the number of elements in the collection. For some graphs, a logarithmic scale can be clearer; a button is available after each graph to switch the vertical axis to a logarithmic scale.

The tests are done with several different data types. The trivial data types vary in size: they hold an array of longs, and the size of this array is varied to change the size of the data type. The non-trivial data type is composed of a string (just long enough to avoid SSO, the Small String Optimization, even though I'm using GCC). The non-trivial data type also comes in a second version with noexcept move operations. Not all results are presented for every data type when there is no significant difference between them, in order to keep this article relatively short (it's probably already too long :P).
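
To make this setup more concrete, here is a sketch of what such test data types could look like. These definitions are purely illustrative and are not the exact types used in the benchmark code.

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Trivial data type: the array of longs controls the overall size of the type
template <std::size_t N>
struct trivial_type {
    long data[N];
};

// Non-trivial data type: holds a string long enough to defeat the Small String Optimization
struct non_trivial_type {
    std::string value;

    non_trivial_type() : value("a string that is long enough to avoid the SSO") {}
};

// Second non-trivial version, with noexcept move operations
struct non_trivial_type_noexcept {
    std::string value;

    non_trivial_type_noexcept() : value("a string that is long enough to avoid the SSO") {}

    non_trivial_type_noexcept(const non_trivial_type_noexcept&) = default;
    non_trivial_type_noexcept& operator=(const non_trivial_type_noexcept&) = default;

    non_trivial_type_noexcept(non_trivial_type_noexcept&& rhs) noexcept : value(std::move(rhs.value)) {}

    non_trivial_type_noexcept& operator=(non_trivial_type_noexcept&& rhs) noexcept {
        value = std::move(rhs.value);
        return *this;
    }
};

int main() {
    // The containers under test are then simply filled with values of these types
    std::vector<trivial_type<4>> small_trivial(1000);
    std::vector<non_trivial_type> non_trivial(1000);
    std::vector<non_trivial_type_noexcept> non_trivial_noexcept(1000);

    return 0;
}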

Read more…


Speed up TensorFlow inference by compiling it from source
   Posted:


The simplest way to install TensorFlow is to work in a virtual Python environment and simply use either the official TensorFlow packages from pip or one of the official wheels for your distribution. There is one big problem with that technique: the binaries are precompiled so that they fit as many hardware configurations as possible. This is understandable on Google's part, since generating precompiled binaries for all possible combinations of processor capabilities would be a nightmare. This is not a problem for the GPU, since the CUDA libraries take care of the differences from one graphics card to another. But it is a problem for CPU performance. Indeed, different processors have different capabilities. For instance, the vectorization capabilities differ from processor to processor (SSE, AVX, AVX2, AVX-512F, FMA, ...). All those options can make a significant difference in the performance of a program. Although most machine learning training happens on the GPU, inference is mostly done on the CPU. Therefore, it remains important to be as fast as possible on the CPU.

So if you care about CPU performance, you should compile and install TensorFlow from source yourself. This allows compiling the TensorFlow sources with -march=native, which enables all the hardware capabilities of the machine on which you are compiling the library.

Depending on your problem, this may give you some nice speedup. In my case, on a very small Recurrent Neural Network, it made inference about 20% faster. On a larger problem and depending on your processor, you may gain much more than that. If you are training on CPU, this may make a very large difference in total time.

Installing TensorFlow from source is sometimes a bit cumbersome. You'll likely have to compile Bazel from source as well, and depending on your processor, it may take a long time to finish. Nevertheless, I have now successfully compiled TensorFlow from source on several machines without too many problems. Just pay close attention to the options you set while configuring TensorFlow, for instance the CUDA configuration if you want GPU support.

I hope this little trick will help you gain some time :)

Here is the link to the instructions for compiling TensorFlow from source.
