Expression Templates Library (ETL) 1.2 - Complete GPU support

ETL Logo

I'm happy to announce version 1.2 of my Expression Templates Library (ETL), two months after the release of version 1.1. This version features much better GPU support, a few new features and a lot of changes in the internal code.

GPU Support

Previously, only algorithms such as the 4D convolution or the matrix-matrix multiplication were computed on the GPU, and many operations caused copies between the CPU and GPU versions of the data. Now, support for the basic operations has also been completed, so an expression like this:

C = sigmoid(2.0 * (A + B)) / sum(A)

can be computed entirely on the GPU.

Each matrix and vector container has a secondary GPU memory space. During execution, the status of both memory spaces is tracked and, when necessary, copies are made between the two spaces. In the best case, there are only the initial copies to the GPU and then everything is done on the GPU. I also considered using Unified Memory instead of this system, but it is a problem for fast (statically-sized) matrices and I'd rather not have two different systems.
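To give an idea of the mechanism, here is a simplified sketch of the bookkeeping (this is not ETL's actual code and the names are made up for illustration): each container carries two validity flags that drive the copies.

// Simplified illustration of the dual memory space bookkeeping.
// The names are hypothetical, not ETL's real API.
struct gpu_aware_container {
    float* cpu_memory = nullptr; // host memory
    float* gpu_memory = nullptr; // device memory
    bool cpu_up_to_date = true;  // the host copy is valid
    bool gpu_up_to_date = false; // the device copy is valid

    // Called before a GPU kernel reads this container
    void ensure_gpu_up_to_date() {
        if (!gpu_up_to_date) {
            // ... copy host -> device ...
            gpu_up_to_date = true;
        }
    }

    // Called when an element is accessed with operator[] or operator()
    void ensure_cpu_up_to_date() {
        if (!cpu_up_to_date) {
            // ... copy device -> host ...
            cpu_up_to_date = true;
        }
    }

    // Called after a GPU kernel has written this container
    void invalidate_cpu() { cpu_up_to_date = false; }
};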

If you have an expression such as c = a + b * 2, it can be computed entirely on the GPU; however, it will be computed as two GPU operations, roughly:

t1 = b * 2
c = a + t1

This is not perfect in terms of performance, but it is done without any copies between CPU and GPU memory. I plan to improve this system with slightly more complex compound operations to avoid too many GPU kernel launches, but there will always be more operations than on the CPU, where everything can easily be done in one pass.

A few expressions cannot be computed on the GPU, such as random generation. A few transformations are also not fully GPU-compatible. Moreover, if you access an element with operator [] or (), this will invalidate the GPU memory and force an update of the CPU memory.

GPU operations are not implemented directly in ETL; they come from various libraries. ETL uses NVIDIA cuBLAS, cuDNN and cuFFT for most algorithms. Moreover, for the remaining operations, I've implemented a library of simple GPU kernels: ETL-GPU-BLAS (EGBLAS). You can have a look at egblas if you are interested.

My Deep Learning Library (DLL) project is based on ETL and its performance mostly depends on ETL's performance. Now that ETL fully supports the GPU, the GPU performance of DLL is much improved. You may remember that a few weeks ago I posted very good CPU performance results for DLL. I've now run the tests again to see DLL's GPU performance. Here is the performance for training a small CNN on the MNIST data set:

Performances for training a Convolutional Neural Network on MNIST

As you can see, the performance on the GPU is now excellent. DLL's performance is on par with TensorFlow and Keras!

The next results are for training a much larger CNN on ImageNet, with the time necessary to train a single batch:

Performances for training a Convolutional Neural Network on Imagenet

Again, using the new version of ETL inside DLL has led to excellent performance. The framework is again on par with TensorFlow and Keras and faster than all the other frameworks. The remaining difference between DLL and TensorFlow/Keras comes from the inefficient way the dataset is read in those two frameworks, so the performance of the three frameworks themselves is about the same.

Other Changes

The library also has a few other new features. Logarithms of base 2 and base 10 are now supported, in addition to the base e logarithm that was already available. Categorical Cross Entropy (CCE) computation is also available now: the CCE loss and error can be computed for one or many samples. Convolutions have also been improved in that you can now mix different data types for the image and the kernel, and use different storage orders as well. Nevertheless, the most optimized version remains the one with the same storage order and the same data type.

I've also made a major change in the way implementations are selected for each operation. The tests and the benchmarks use a system to force the selection of a specific implementation. This system is now disabled by default, which makes compilation much faster. Since it's not necessary in most cases, regular users of the library will benefit from the much faster compilation.

Overall, the support for complex numbers has been improved in ETL. More routines support them and etl::complex is better supported throughout the code. I'll keep working on this in the future to make it complete.

The internal code also has a few changes. First, all the traits have been rewritten to use variable templates instead of struct traits. This makes the code much nicer, in my opinion. Moreover, I've started experimenting with C++17 if constexpr. Most of the if conditions that can be turned into if constexpr have been annotated with comments that I can quickly enable or disable, so that I can test the impact of C++17, especially on compilation time.

Finally, a few bugs have been fixed. ETL now works better with parallel BLAS libraries: there should no longer be issues with double parallelization in ETL and BLAS. There was a slight bug in the column-major matrix-matrix multiplication kernel. Binary operations with different types on the left and right hand sides were also problematic with vectorization. The last bug was about the GPU status when ETL containers were moved.

What's next ?

I don't yet know exactly which features I'm going to focus on for the next version of ETL. In the near future, I plan to focus a bit more on my Deep Learning Library (DLL), for which I should release version 1.0 soon. I also plan to start supporting Recurrent Neural Networks in it, so that will take me quite some time.

Nevertheless, I'm still planning to consider the switch to C++17, since it is a bit faster to compile ETL with if constexpr. The next version of ETL will also probably have GPU support for integers, at least for the cases that are handled by the etl-gpu-blas library, i.e. the standard operators. I also plan to improve the support for complex numbers, especially in terms of performance and tests. Hopefully, I will also have time (and motivation) to start working on the sparse capabilities of ETL. They really need many more unit tests and the performance should be improved as well.

Download ETL

You can download ETL on Github. If you are only interested in the 1.2 version, you can look at the Releases page or clone the tag 1.2. There are several branches:

  • master is the eternal development branch; it may not always be stable

  • stable is a branch that always points to the last tag; no development happens there

For each future release, there will always be a tag pointing to the corresponding commit. You can also access previous releases on Github or via the release tags.

The documentation is still a bit sparse. There are a few examples and the Wiki, but there is still work to be done. If you have questions on how to use or configure the library, please don't hesitate to ask.

Don't hesitate to comment on this post if you have any remarks or questions about this library. You can also open an Issue on Github if you have a problem using the library, or propose a Pull Request if you have a contribution you'd like to make.

Hope this may be useful to some of you :)

C++11 Performance tip: Update on when to use std::pow ?

A few days ago, I published a post comparing the performance of std::pow against direct multiplications. When not compiling with -ffast-math, direct multiplication was significantly faster than std::pow, around two orders of magnitude faster when comparing x * x * x and std::pow(x, 3). One comment I received was to test for which n std::pow(x, n) becomes faster than multiplying in a loop. Since std::pow uses a special algorithm to perform the computation rather than simple loop-based multiplication, there may be a point after which it's more interesting to use the algorithm rather than the loop. So I decided to do the tests. You can also find the results in the original article, which I've updated.

First, our pow function:

// Loop-based integer power: n multiplications in a simple loop
double my_pow(double x, size_t n){
    double r = 1.0;

    while(n > 0){
        r *= x;
        --n;
    }

    return r;
}

And now, let's see the performance. I've compiled my benchmark with GCC 4.9.3, running on my old Sandy Bridge processor. Here are the results for 1000 calls to each function:

We can see that between n=100 and n=110, std::pow(x, n) starts to be faster than my_pow(x, n). From this point on, you should only use std::pow(x, n). Interestingly too, the time for std::pow(x, n) is decreasing. Let's see how the performance evolves with a higher range of n:

We can see that the std::pow time remains stable while our loop-based pow function keeps increasing linearly. At n=1000, std::pow is one order of magnitude faster than my_pow.

Overall, if you do not care much about extreme accuracy, you may consider using your own pow function for small-ish (integer) values of n. After n=100, it becomes more interesting to use std::pow.
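As a concrete illustration of this rule of thumb, the choice could be wrapped in a small helper (just a sketch; the n=100 threshold is the crossover measured on my machine and compiler and may differ on yours):

#include <cmath>
#include <cstddef>

// Use loop-based multiplication for small integer exponents and fall
// back to std::pow above the measured crossover point (around n=100)
double int_pow(double x, std::size_t n){
    if (n > 100) {
        return std::pow(x, n);
    }

    double r = 1.0;
    while (n > 0) {
        r *= x;
        --n;
    }
    return r;
}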

If you want more results on the subject, you can take a look at the original article.

If you are interested in the code of this benchmark, it's available online: bench_pow_my_pow.cpp

How I made my Deep Learning Library 38% faster to compile (Optimization and C++17 if constexpr)

My Deep Learning Library (DLL) project is a C++ library for training and using artificial neural networks (you can take a look at this post about DLL if you want more information).

While I made a lot of effort to make it as fast as possible to train and run neural networks, the compilation time has been steadily going up and is becoming quite annoying. This library is heavily templated and all the matrix operations are done using my Expression Templates Library (ETL) which is more than template-heavy itself.

In this post, I'll present two techniques with which I've been able to reduce the total compilation of the DLL unit tests by up to 38%.

Read more…

C++11 Performance tip: When to use std::pow ?

Update: I've added a new section for larger values of n.

Recently, I've been wondering about the performance of std::pow(x, n). I'm talking here about the case when n is an integer. In the case when n is not an integer, I believe you should always use std::pow or another specialized library.

In the case when n is an integer, you can actually replace the call with the direct equivalent (for instance std::pow(x, 3) = x * x * x). If n is very large, you'd rather write a loop of course ;) In practice, we generally use powers of two and three much more often than powers of 29, although that could happen. Of course, it especially makes sense to wonder about this if the pow is used inside a loop. If you only use it once, outside a loop, it won't make any difference to the overall performance.

Since I'm mostly interested in single precision performance (neural networks are only about single precision), the first benchmarks will be using float.

Read more…

budgetwarrior 0.4.2 - Budget summary and improved fortune reports

Almost three years ago, I published version 0.4.1 of budgetwarrior. Since then, I've been using this tool almost every day to manage my personal budget. It is the only tool I use to keep track of my expenses and earnings and it works great for me. I recently felt that it was missing a few features, so I added them, polished a few other things and released a new version with all the new stuff. This new version is nothing fancy, but a nice upgrade of the tool.

Don't pay too much attention to the values in the images since I've randomized all the data for the purpose of this post (new feature, by the way :P).

New summary view

I've added a new report with budget summary:

/images/budgetwarrior_042_summary.png

This view gives concise information about the current state of your accounts. It also gives information about your yearly and monthly objectives. Finally, it shows the last two fortune values that you've set. I think this makes a nice kind of dashboard with most of the information in one place. If your terminal is large enough, the three parts are shown side by side.

Improved fortune report

I've made a few improvements to the budget fortune view:

/images/budgetwarrior_042_fortune.png

It now displays the time between the different fortune values and computes the average savings (or losses) per day in each interval, as well as the overall average since the first value.

Various changes

The balance does not propagate over the years anymore. This mainly changes the behaviour of budget overview. I don't think it was very smart to propagate it all the time, so the balance now starts at zero for each year. If you want the old behaviour, you can set the multi_year_balance=true option in the .budgetrc configuration file.

The recurring expenses no longer use an internal configuration value. This does not change the behaviour, but it means that if you sync between different machines, a lot of possible conflicts are avoided :)

I've fixed a few bugs with inconsistencies between the different views and reports. Another fixed bug is that budget report was not always displaying the first month of the year correctly.

The graphs displayed in budget report are now automatically adapted to the width of your terminal. Finally, the budget overview command also displays more information about the comparison with the previous month.

Installation

If you are on Gentoo, you can install it using layman:

layman -a wichtounet
emerge -a budgetwarrior

If you are on Arch Linux, you can use this AUR repository.

For other systems, you'll have to install from sources:

git clone --recursive git://github.com/wichtounet/budgetwarrior.git
cd budgetwarrior
make
sudo make install

Conclusion

A brief tutorial is available on Github: Starting guide.

If you are interested in the sources, you can download them on Github: budgetwarrior.

If you have any suggestion for a new feature or an improvement to the tool, or if you found a bug, please post an issue on Github; I'd be glad to help you. You can also post a comment directly on this post :)

If you have any other comment, don't hesitate to contact me, either by leaving a comment on this post or by email.

I hope that this application can be useful to some of you command-line adepts :)

C++11 Concurrency Tutorial - Part 5: Futures

I've recently been reminded that a long time ago I was writing a series of tutorials on C++11 concurrency. For some reason, I never continued the series. The next post was supposed to be about futures, so I'm finally going to write it :)

Here are the links to the current posts of the C++11 Concurrency Tutorial:

In this post, we are going to talk about futures, more precisely std::future<T>. What is a future ? It's a very nice and simple mechanism to work with asynchronous tasks. It also has the advantage of decoupling you from the threads themselves: you can do multithreading without using std::thread directly. The future itself is a structure pointing to a result that will be computed in the future. How do you create a future ? The simplest way is to use std::async, which creates an asynchronous task and returns a std::future.

Let's start with the simplest of the examples:

#include <thread>
#include <future>
#include <iostream>

int main(){
    auto future = std::async(std::launch::async, [](){
        std::cout << "I'm a thread" << std::endl;
    });

    future.get();

    return 0;
}

Nothing really special here. std::async will execute the task that we give it (here a lambda) and return a std::future. Once you call get() on a future, it will wait until the result is available and then return it. The get() function is therefore blocking. Since the lambda is a void lambda, the returned future is of type std::future<void> and get() returns void as well. It is very important to know that you cannot call get() several times on the same future. Once the result is consumed, you cannot consume it again! If you want to use the result several times, you need to store it yourself after you've called get().
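For instance (a trivial illustration), you consume the result once and keep it in a local variable if you need it again:

#include <future>
#include <iostream>

int main(){
    auto future = std::async(std::launch::async, [](){ return 42; });

    auto result = future.get(); // consume the result exactly once

    // Reuse the stored value as many times as needed
    std::cout << result << std::endl;
    std::cout << result + 1 << std::endl;

    return 0;
}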

Let's see with something that returns a value and actually takes some time before returning it:

#include <thread>
#include <future>
#include <iostream>
#include <chrono>

int main(){
    auto future = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::seconds(5));
        return 42;
    });

    // Do something else ?

    std::cout << future.get() << std::endl;

    return 0;
}

This time, the future will be of the type std::future<int> and thus get() will also return an int. std::async will again launch a task asynchronously and future.get() will wait for the answer. What is interesting is that you can do something else before the call to future.get().

But get() is not the only interesting function in std::future. You also have wait(), which is almost the same as get() but does not consume the result. For instance, you can wait for several futures and then consume their results together. But, more interesting are the wait_for(duration) and wait_until(timepoint) functions. The first one waits for the result for at most the given duration and then returns, and the second one waits for the result until at most the given time point. I think wait_for is more useful in practice, so let's discuss it further. Finally, an interesting function is bool valid(). When you use get() on the future, it will consume the result, making valid() return false. So, if you intend to check a future multiple times, you should use valid() first.

One possible scenario would be if you have several asynchronous tasks, which is a common scenario. You can imagine that you want to process the results as fast as possible, so you want to ask the futures for their result several times. If no result is available, maybe you want to do something else. Here is a possible implementation:

#include <thread>
#include <future>
#include <iostream>
#include <chrono>

int main(){
    auto f1 = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::seconds(9));
        return 42;
    });

    auto f2 = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::seconds(3));
        return 13;
    });

    auto f3 = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::seconds(6));
        return 666;
    });

    auto timeout = std::chrono::milliseconds(10);

    while(f1.valid() || f2.valid() || f3.valid()){
        if(f1.valid() && f1.wait_for(timeout) == std::future_status::ready){
            std::cout << "Task1 is done! " << f1.get() << std::endl;
        }

        if(f2.valid() && f2.wait_for(timeout) == std::future_status::ready){
            std::cout << "Task2 is done! " << f2.get() << std::endl;
        }

        if(f3.valid() && f3.wait_for(timeout) == std::future_status::ready){
            std::cout << "Task3 is done! " << f3.get() << std::endl;
        }

        std::cout << "I'm doing my own work!" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
        std::cout << "I'm done with my own work!" << std::endl;
    }

    std::cout << "Everything is done, let's go back to the tutorial" << std::endl;

    return 0;
}

The three tasks are started asynchronously with std::async and the resulting std::future objects are stored. Then, as long as one of the tasks is not complete, we query each of the three tasks and try to process its result. If no result is available, we simply do something else. This example is important to understand; it covers pretty much every concept of futures.

One interesting thing that remains is that you can pass parameters to your task via std::async. Indeed, all the extra parameters that you pass to std::async will be passed to the task itself. Here is an example of spawning tasks in a loop with different parameters:

#include <thread>
#include <future>
#include <iostream>
#include <chrono>
#include <vector>

int main(){
    std::vector<std::future<size_t>> futures;

    for (size_t i = 0; i < 10; ++i) {
        futures.emplace_back(std::async(std::launch::async, [](size_t param){
            std::this_thread::sleep_for(std::chrono::seconds(param));
            return param;
        }, i));
    }

    std::cout << "Start querying" << std::endl;

    for (auto &future : futures) {
      std::cout << future.get() << std::endl;
    }

    return 0;
}

Pretty practical :) All the created std::future<size_t> are stored in a std::vector and then they are all queried for their result.

Overall, I think std::future and std::async are great tools that can simplify your asynchronous code a lot. They allow you to do pretty advanced stuff while keeping the complexity of the code to a minimum.

I hope this long-overdue post is going to be interesting to some of you :) The code for this post is available on Github.

I do not yet know if there will be a next installment in the series. I've covered pretty much everything that is available in C++11 for concurrency. I may cover the parallel algorithms of C++17 in a following post. If you have any suggestion for the next post, don't hesitate to post a comment or contact me directly by email.

Simplify your type traits with C++14 variable templates

If you write templated code, you often have to write and use a lot of different traits. In this article, I'll focus on the traits that represent values, typically a boolean value. For instance, std::is_const, std::is_same or std::is_reference are type traits provided by the STL. They give you some information at compile time about a certain type. If you need to write a type trait, let's say is_float, here is how you would maybe do it in C++11:

template <typename T>
struct is_float {
    static constexpr bool value = std::is_same<T, float>::value;
};

or a bit nicer with a template type alias and std::integral_constant:

template <typename T>
using is_float = std::integral_constant<bool, std::is_same<T, float>::value>;

or, since is_same is itself a type trait, you can also directly alias it:

template <typename T>
using is_float = std::is_same<T, float>;

This makes for some very nice syntax, but we still have a type rather than a value.

Note that in some cases, you cannot use the using technique since it cannot be specialized and you often need specialization to write some more advanced traits.

And then you would use your traits to do something specific based on that information. For instance with a very basic example:

template <typename T>
void test(T t){
    if (is_float<T>::value){
        std::cout << "I'm a float" << std::endl;
    } else {
        std::cout << "I'm not a float" << std::endl;
    }
}

Really nothing fancy here, but that will be enough as examples.

Even though all this works pretty well, it can be improved on two points. First, every time you use a trait, you need to access the value member (via ::value). Secondly, every time you declare a new trait, you have to declare a new type or a type alias. But all you want is a boolean value.

C++14 introduced a new feature: variable templates. As their name indicates, they are variables parametrized with a type. This allows us to write type traits without using a type alias or a struct, meaning we have a real value instead of a type. If we rewrite our is_float trait with a variable template, we have the following:

template <typename T>
constexpr bool is_float = std::is_same<T, float>::value;

I think it's much nicer, the intent is clearly stated and there is no unnecessary code. Moreover, it's also nicer to use:

template <typename T>
void test(T t){
    if (is_float<T>){
        std::cout << "I'm a float" << std::endl;
    } else {
        std::cout << "I'm not a float" << std::endl;
    }
}

No more ::value everywhere :) I think it's really cool.

Note that, unlike alias templates, variable templates can be specialized, either fully or partially, so there is no limitation on that side.
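For instance, here is a small self-contained example (the trait name is mine, purely for illustration) of a variable template with a partial specialization, something that is not possible with an alias template:

// Primary variable template: false for any type
template <typename T>
constexpr bool is_pointer_trait = false;

// Partial specialization: true for all pointer types
template <typename T>
constexpr bool is_pointer_trait<T*> = true;

static_assert(!is_pointer_trait<int>, "int is not a pointer");
static_assert(is_pointer_trait<int*>, "int* is a pointer");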

Interestingly, variable templates are used in C++17 to provide a value helper for each type trait. For instance, std::is_same has a std::is_same_v helper that is a variable template. With that, we can simplify our trait a bit more:

template <typename T>
constexpr bool is_float = std::is_same_v<T, float>;

Personally, I replaced all the type traits inside ETL with variable templates. If you don't want to go that far, you can also introduce helpers like the C++17 STL does and start using the wrappers where you see fit, so that you don't break any code.

If you want to use this feature, you need a C++14 compiler, such as any version of the GCC 5 family or clang 3.6. Although I haven't tested it, it should also work with Microsoft VS2015 Update 2.

Unfortunately, there is a bug in both clang (fixed in clang 3.7) and GCC (fixed in GCC 6 only) that you may encounter if you use variable templates inside template classes or inside other variable templates. If you plan to use a variable template inside a template, such as something like this:

#include <iostream>

template <typename T>
struct outer_traits {
    template <typename X>
    static constexpr bool sub_traits = std::is_same<T, X>::value;
};

template <typename T, typename X>
constexpr bool outer_helper = outer_traits<T>::template sub_traits<X>;

int main(){
    std::cout << outer_helper<float, float>;

    return 0;
}

you will encounter a not-at-all-helpful error message with the GCC 5 family, such as:

test.cpp: In instantiation of ‘constexpr const bool outer_helper<float, float>’:
test.cpp:14:22:   required from here
test.cpp:11:20: error: ‘template<class X> constexpr const bool outer_traits<float>::sub_traits<X>’ is not a function template
     constexpr bool outer_helper = outer_traits<T>::template sub_trait
                    ^
test.cpp:11:20: error: ‘sub_traits<X>’ is not a member of ‘outer_traits<float>’

It comes from a bug in the handling of variable templates as dependent names. If you don't run into this case, you can use the GCC 5 family directly; otherwise, you'll have to use the GCC 6 family.

I hope this can help some of you improve your type traits, or at least discover the power of the new variable templates. Personally, I've rewritten all the traits of the ETL library using this new feature and I'm pretty satisfied with the result. Of course, that means that the compiler support was reduced, but since I don't have many users, it's not a real issue.

How to fix mdadm RAID5 / RAID6 growing stuck at 0K/s ?

I just started growing my RAID6 array again, from 12 to 13 disks, and I encountered a new issue. The reshape started, but with a speed of 0K/s. After some searching, I found a very simple solution:

echo max > /sys/block/md0/md/sync_max

And the reshape started directly at 50M/s :)

The solution is the same if you are growing any type of RAID level with parity (RAID5, RAID6, ...).

Normally, the issues I have are related to the speed not being very good. I've written a post in the past about how to speed up RAID5 / RAID6 growing with mdadm. Although RAID5 / RAID6 growing, or any other reshape operation, will never be very fast, you can still speed up the process a lot, from a few days to a few hours. Currently, my reshape is working at 48M/s and I'm looking at around 16 hours of reshape, but I have 13 disks of 3TB, so it's not so bad.

I hope this very simple tip can be helpful to some of you :)

DLL: Blazing Fast Neural Network Library

A few weeks ago, I talked about all the new features of my Deep Learning Library (DLL) project. I mentioned that, in several experiments, DLL was consistently and significantly faster than some popular deep learning frameworks such as TensorFlow. I'll now go into more detail about this comparison and provide all the results. The paper we wrote about these results has not been published yet, so I won't provide the paper itself for now.

For those who may not know, DLL is the project I've been developing to support my Ph.D. thesis. It is a neural network framework that supports Fully-Connected Neural Networks (FCNN), Convolutional Neural Networks (CNN), Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), Convolutional RBMs (CRBM) and Convolutional DBNs (CDBN). It also supports a large variety of options such as Dropout, Batch Normalization and adaptive learning rates. You can read the previous post if you want more information about the new features of the framework. And, as those of you who read my blog frequently may know, I'm a bit obsessed with performance optimization, so I've spent a considerable amount of time optimizing the performance of neural network training on CPU. Since, at the beginning of my thesis, I had no access to a GPU for training, I focused on the CPU. Although there is now support for GPU, the gains are not yet important enough.

Evaluation

To see how fast, or not, the library was, it was compared against five popular machine learning libraries:

  1. Caffe, installed from sources

  2. TensorFlow 1.0, from pip

  3. Keras 2.0, from pip

  4. Torch, installed from sources

  5. DeepLearning4J 0.7, from Maven

I've run four different experiments with all these frameworks and compared the efficiency of each of them for training the same neural networks with the same options. In each case, the training and testing errors have also been compared to ensure that each framework is doing roughly the same thing. I won't present the details here, but in each experiment DLL reached about the same accuracy as the other frameworks. I will only focus on the speed results in this article.

Each experiment is done once with only CPU and once with a GPU. For DLL, I only report the CPU time in both modes, since it's more stable and more optimized.

The code for the evaluation is available online on the Github repository of the frameworks project.

MNIST: Fully Connected Neural Network

The first experiment is performed on the MNIST data set. It consists of 60'000 grayscale images of size 28x28. The goal is to classify each image as a digit from 0 to 9. To solve this task, I trained a very small fully-connected neural network with 500 hidden units in the first layer, 250 in the second and 10 final units (the output units) for classification. The first two layers use the logistic sigmoid activation function and the last layer uses the softmax activation function. The network is trained for 50 epochs with a categorical cross entropy loss, with mini-batches of 100 images. Here are the results of this experiment:

Training time performance for the different frameworks on the Fully-Connected Neural Network experiment, on MNIST. All the times are in seconds.

In CPU mode, the DLL framework is the clear winner! It's about 35% faster than TensorFlow and Keras, which come in second place. Torch is more than four times slower than DLL and the last two frameworks (Caffe and DeepLearning4J) are five times slower than DLL! Once we add a GPU to the system, the results are very different. Caffe is now the fastest framework, three times faster than DLL. DLL is less than two times slower than Keras and TensorFlow. Interestingly, DLL is still faster than Torch and DeepLearning4J.

MNIST: Convolutional Neural Network

Although a fully-connected neural network is an interesting tool, the trend now is to use Convolutional Neural Networks, which have proved very efficient at solving a lot of problems. The second experiment uses the same data set. Again, it's a rather small network. The first layer is a convolutional layer with 8 5x5 kernels, followed by a max pooling layer with a 2x2 kernel. They are followed by one more convolutional layer with 8 5x5 kernels and a 2x2 max pooling layer. These first four layers are followed by two fully-connected layers, the first with 150 hidden units and the last one with 10 output units. The activation functions are the same as for the first network, as is the training procedure. This network takes significantly longer to train than the first one because of the higher complexity of the convolutional layers compared to the fully-connected layers, even though they have far fewer weights. The results are presented in the next figure:

Training time performance for the different frameworks on the Convolutional Neural Network experiment, on MNIST. All the times are in seconds.

Again, on CPU, DLL is the clear winner, by a lot! It's already 3.6 times faster than the second-fastest frameworks, Keras and TensorFlow, more than four times faster than Caffe and Torch, and 8 times faster than DeepLearning4J, which proves very slow on this experiment. Once a GPU is added, Keras and TensorFlow are about twice as fast as DLL. However, DLL is still faster than the other frameworks, even though they take advantage of the GPU.

CIFAR-10

The second data set that is tested is the CIFAR-10 data set. It's an object recognition task with 10 classes. The training set is composed of 50'000 colour images of 32x32 pixels. The network used for this data set is similar in architecture to the previous network, but has more parameters. The first convolutional layer now has 12 5x5 kernels and the second convolutional layer has 24 3x3 kernels. The pooling layers are the same. The first fully-connected layer has 64 hidden units and the last one has 10 output units. The last layer again uses a softmax activation function while the other layers use Rectified Linear Units (ReLU). The training is done in the same manner as for the first two networks. Unfortunately, it was not possible to train DeepLearning4J on this data set, even though there is official support for it. Since I've had no answer to my question regarding this issue, the results were simply removed from this experiment. It may not seem so, but it takes considerably longer to train this network because of the larger number of input channels and the larger number of convolutional kernels in each layer. Let's get to the results now:

Training time performance for the different frameworks on the Convolutional Neural Network experiment, on CIFAR-10. All the times are in seconds.

DLL is still the fastest on CPU, but the margin is smaller than before. It's about 40% faster than TensorFlow and Keras, twice as fast as Torch and 2.6 times faster than Caffe. Once a GPU is added, DLL is about as fast as Torch but slower than the other three frameworks. TensorFlow and Keras are about four times faster than DLL while Caffe is about twice as fast as DLL. We can see that with this larger network, the GPU becomes more interesting and that there is a smaller margin for improvement compared to the other frameworks.

ImageNet

The last experiment is made on the ImageNet data set. I used the ILSVRC 2012 subset, which consists of "only" about 1.2 million training images. I resized all the images to 256x256 pixels, which makes for about 250 times more colour values than an MNIST image. The dimensions and the number of images make it impractical to keep the dataset in memory, so the images must be loaded in batches from the disk. No random cropping or mirroring was performed. The network is much larger in order to solve this task. It starts with 5 pairs of convolutional and max pooling layers. The convolutional layers have 3x3 kernels, 16 for the first two layers and 32 for the following three. The five max pooling layers use 2x2 kernels. Each convolutional layer uses zero-padding so that its output features have the same dimensions as the input. They are followed by two fully-connected layers, the first one with 2048 hidden units and the last one with 1000 output units (one for each class). Except for the last layer, which uses softmax, all layers use ReLU. The network is trained with mini-batches of 128 images (except for DeepLearning4J and Torch, which can only use 64 images with the amount of RAM available on my machine). To ease the comparison, I report the time necessary to train one batch of data (or two for DeepLearning4J and Torch). The results, presented on a logarithmic scale because of DeepLearning4J's disastrous results, are as follows:

Training time performance for the different frameworks on the Convolutional Neural Network experiment, on ImageNet. The times are the time necessary to train a batch of 128 images. All the times are in milliseconds.

For this final experiment, DLL is again significantly faster than all the other frameworks. It's about 40% faster than Keras, twice as fast as TensorFlow and Caffe, and more than three times faster than Torch. Although 40% may not seem like much, don't forget that this kind of training may take days, so it can save you a lot of time. All the frameworks are much faster than DeepLearning4J. Based on several posts on the internet, I suspect that this comes from the model of GPU I have been using (GTX 960), but all the other frameworks seem to handle this card pretty well.

Conclusion

I hope this is not too much of a bragging post :P We can see that my efforts to make the code as fast as possible have paid off :) As shown in the experiments, my DLL framework is always the fastest framework when the neural network is trained on CPU. I'm quite pleased with these results, since I've done a lot of work to optimize the speed as much as possible and since I'm competing with well-known libraries developed by many people. Moreover, the accuracy of the trained networks is similar to that of the networks trained with the other frameworks. Even when the other frameworks use the GPU, the library still remains competitive, although never the fastest.

In the next step (I've no idea when I'll have the time, though), I want to focus on GPU speed. This will mostly come from better GPU support in the ETL library on which DLL is based. I have many ideas to improve it a lot, but it will take a lot of time.

If you want more information on the DLL library, you can have a look at its Github repository and especially at the few examples. You can also have a look at my posts about DLL. Finally, don't hesitate to comment or contact me through Github issues if you have comments or problems with this post, the library or anything ;)

Compiler benchmark GCC and Clang on C++ library (ETL)

It's been a while since I've done a benchmark of different compilers on C++ code. Since I've recently released version 1.1 of my ETL project (an optimized matrix/vector computation library with expression templates), I've decided to use it as the base of my benchmark. It's a C++14 library with a lot of templates. I'm going to compile the full test suite (124 test cases), directly on the code of the last release (1.1). I'm going to compile once in debug mode and once in release_debug (release plus debug symbols and assertions) and record the times for each compiler. The tests were compiled with support for every option in ETL to account for the maximal compilation time. Each compilation was made using four threads (make -j4). I'm also going to run a few of the benchmarks to see the difference in runtime performance between the code generated by each compiler. The benchmark will be compiled in release mode and its compilation time recorded as well.

I'm going to test the following compilers:

  • GCC-4.9.4

  • GCC-5.4.0

  • GCC-6.3.0

  • GCC-7.1.0

  • clang-3.9.1

  • clang-4.0.1

  • zapcc-1.0 (commercial, based on clang-5.0 trunk)

All compilers have been installed directly using Portage (the Gentoo package manager), except for clang-4.0.1, which has been installed from sources, and zapcc, since it does not have a Gentoo package. Since the clang package on Gentoo does not support multislotting, I had to install one version from source and the other from the package manager. This is also the reason I'm testing fewer versions of clang: it's simply less practical.

For the purpose of these tests, the exact same options have been used for all the compilers. Normally, I use different options for clang than for GCC (mainly more aggressive vectorization options on clang). This may not lead to the best performance for each compiler, but it allows comparing the results at the default optimization levels. Here are the main options used:

  • In debug mode: -g

  • In release_debug mode: -g -O2

  • In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer

In each case, a lot of warnings are enabled and the ETL options are the same.

All the results have been gathered on a Gentoo machine with an Intel Core i7-2600 (Sandy Bridge...) @3.4GHz with 4 cores and 8 threads, 12GB of RAM and an SSD. Although I did my best to isolate the benchmark from perturbations and my benchmark code is quite sound, it may well be that some results are not totally accurate. Moreover, some of the benchmarks use multithreading, which may add some noise and unpredictability. When I was not sure about the results, I ran the benchmarks several times to confirm them, and overall I'm confident in the results.

Compilation Time

Let's start with the results of the performance of the compilers themselves:

Compiler      | Debug | Release_Debug | Benchmark
g++-4.9.4     | 402s  | 616s          | 100s
g++-5.4.0     | 403s  | 642s          | 95s
g++-6.3.0     | 399s  | 683s          | 102s
g++-7.1.0     | 371s  | 650s          | 105s
clang++-3.9.1 | 380s  | 807s          | 106s
clang++-4.0.1 | 260s  | 718s          | 92s
zapcc++-1.0   | 221s  | 649s          | 108s

Note: For Release_Debug and Benchmark, I only used three threads with zapcc, because 12GB of RAM is not enough memory for four threads.

There are some very significant differences between the compilers. Overall, clang-4.0.1 is by far the fastest free compiler in debug mode. When the tests are compiled with optimizations, however, clang falls behind. It's quite impressive how clang-4.0.1 manages to be so much faster than clang-3.9.1, both in debug mode and in release mode. Really great work by the clang team here! With these improvements, clang-4.0.1 is almost on par with gcc-7.1 in release mode. For GCC, it seems that the cost of optimization has been going up quite significantly. However, GCC 7.1 has made optimized compilation faster again and standard compilation much faster as well. If we take zapcc into account, it's the fastest compiler in debug mode, but it's slower than several gcc versions in release mode.

Overall, I'm quite impressed by the performance of clang-4.0.1, which seems really fast! I'll definitely run more tests with this new version of the compiler in the near future. It's also good to see that g++-7.1 builds faster than gcc-6.3. However, the fastest gcc version for optimized builds is still gcc-4.9.4, which is already an old branch with limited C++ standard support.

Runtime Performance

Let's now take a look at the quality of the generated code. For some of the benchmarks, I've included two versions of the algorithm: std is the simplest (naive) algorithm and vec is the hand-crafted vectorized and optimized implementation. All the tests were done on single-precision floating points.

Dot product

The first benchmark computes the dot product of two vectors. Let's look at the naive version first:
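Conceptually, the std version is nothing more than the textbook loop below (just a sketch to show what is being measured, not ETL's actual code):

#include <cstddef>

// Naive single-precision dot product: one multiply-add per element
float dot_std(const float* a, const float* b, std::size_t n){
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}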

dot (std)     | 100     | 500      | 1000     | 10000  | 100000   | 1000000  | 2000000 | 3000000 | 4000000 | 5000000 | 10000000
g++-4.9.4     | 64.96ns | 97.12ns  | 126.07ns | 1.89us | 25.91us  | 326.49us | 1.24ms  | 1.92ms  | 2.55ms  | 3.22ms  | 6.36ms
g++-5.4.0     | 72.96ns | 101.62ns | 127.89ns | 1.90us | 23.39us  | 357.63us | 1.23ms  | 1.91ms  | 2.57ms  | 3.20ms  | 6.32ms
g++-6.3.0     | 73.31ns | 102.88ns | 130.16ns | 1.89us | 24.314us | 339.13us | 1.47ms  | 2.16ms  | 2.95ms  | 3.70ms  | 6.69ms
g++-7.1.0     | 70.20ns | 104.09ns | 130.98ns | 1.90us | 23.96us  | 281.47us | 1.24ms  | 1.93ms  | 2.58ms  | 3.19ms  | 6.33ms
clang++-3.9.1 | 64.69ns | 98.69ns  | 128.60ns | 1.89us | 23.33us  | 272.71us | 1.24ms  | 1.91ms  | 2.56ms  | 3.19ms  | 6.37ms
clang++-4.0.1 | 60.31ns | 96.34ns  | 128.90ns | 1.89us | 22.87us  | 270.21us | 1.23ms  | 1.91ms  | 2.55ms  | 3.18ms  | 6.35ms
zapcc++-1.0   | 61.14ns | 96.92ns  | 125.95ns | 1.89us | 23.84us  | 285.80us | 1.24ms  | 1.92ms  | 2.55ms  | 3.16ms  | 6.34ms

The differences between the compilers are not very significant here. The clang-based compilers seem to produce the fastest code. Interestingly, there seems to have been a big regression in gcc-6.3 for large containers, which has been fixed in gcc-7.1.

dot (vec)     | 100     | 500     | 1000     | 10000  | 100000  | 1000000  | 2000000 | 3000000 | 4000000 | 5000000 | 10000000
g++-4.9.4     | 48.34ns | 80.53ns | 114.97ns | 1.72us | 22.79us | 354.20us | 1.24ms  | 1.89ms  | 2.52ms  | 3.19ms  | 6.55ms
g++-5.4.0     | 47.16ns | 77.70ns | 113.66ns | 1.72us | 22.71us | 363.86us | 1.24ms  | 1.89ms  | 2.52ms  | 3.19ms  | 6.56ms
g++-6.3.0     | 46.39ns | 77.67ns | 116.28ns | 1.74us | 23.39us | 452.44us | 1.45ms  | 2.26ms  | 2.87ms  | 3.49ms  | 7.52ms
g++-7.1.0     | 49.70ns | 80.40ns | 115.77ns | 1.71us | 22.46us | 355.16us | 1.21ms  | 1.85ms  | 2.49ms  | 3.14ms  | 6.47ms
clang++-3.9.1 | 46.13ns | 78.01ns | 114.70ns | 1.66us | 22.82us | 359.42us | 1.24ms  | 1.88ms  | 2.53ms  | 3.16ms  | 6.50ms
clang++-4.0.1 | 45.59ns | 74.90ns | 111.29ns | 1.57us | 22.47us | 351.31us | 1.23ms  | 1.85ms  | 2.49ms  | 3.12ms  | 6.45ms
zapcc++-1.0   | 45.11ns | 75.04ns | 111.28ns | 1.59us | 22.46us | 357.32us | 1.25ms  | 1.89ms  | 2.53ms  | 3.15ms  | 6.47ms

If we look at the optimized version, the differences are even smaller. Again, the clang-based compilers produce the fastest executables, but they are closely followed by gcc, except for gcc-6.3, where we can still see the same regression as before.

Logistic Sigmoid

The next test checks the performance of the sigmoid operation. In that case, the evaluator of the library will try to use parallelization and vectorization to compute it. Let's see how the different compilers fare:
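The element-wise operation itself is very simple; conceptually it is just the loop below (a sketch, not ETL's code). The interesting part is how well each compiler handles the vectorized and parallelized evaluation that ETL wraps around such loops:

#include <cmath>
#include <cstddef>

// Naive element-wise logistic sigmoid
void sigmoid_std(const float* x, float* y, std::size_t n){
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = 1.0f / (1.0f + std::exp(-x[i]));
    }
}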

sigmoid       | 10     | 100    | 1000   | 10000   | 100000   | 1000000
g++-4.9.4     | 8.16us | 5.23us | 6.33us | 29.56us | 259.72us | 2.78ms
g++-5.4.0     | 7.07us | 5.08us | 6.39us | 29.44us | 266.27us | 2.96ms
g++-6.3.0     | 7.13us | 5.32us | 6.45us | 28.99us | 261.81us | 2.86ms
g++-7.1.0     | 7.03us | 5.09us | 6.24us | 28.61us | 252.78us | 2.71ms
clang++-3.9.1 | 7.30us | 5.25us | 6.57us | 30.24us | 256.75us | 1.99ms
clang++-4.0.1 | 7.47us | 5.14us | 5.77us | 26.03us | 235.87us | 1.81ms
zapcc++-1.0   | 7.51us | 5.26us | 6.48us | 28.86us | 258.31us | 1.95ms

Interestingly, we can see that gcc-7.1 is the fastest for small vectors while clang-4.0 produces the best code for larger vectors. However, except for the biggest vector size, the differences are not really significant. Apparently, there is a regression in zapcc (or clang-5.0 trunk), since it's slower than clang-4.0, at the same level as clang-3.9.

y = alpha * x + y (axpy)

The third benchmark is the well-known axpy operation (y = alpha * x + y). This is entirely resolved by the expression templates in the library; no specific algorithm is used. Let's see the results:
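Since the whole expression is handled by the expression templates, it is evaluated as a single fused loop without any temporary vector, conceptually equivalent to the following sketch:

#include <cstddef>

// What y = alpha * x + y boils down to once the expression
// templates have fused the operations into one loop
void saxpy_std(float alpha, const float* x, float* y, std::size_t n){
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}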

saxpy         | 10     | 100    | 1000  | 10000  | 100000 | 1000000
g++-4.9.4     | 38.1ns | 61.6ns | 374ns | 3.65us | 40.8us | 518us
g++-5.4.0     | 35.0ns | 58.1ns | 383ns | 3.87us | 43.2us | 479us
g++-6.3.0     | 34.3ns | 59.4ns | 371ns | 3.57us | 40.4us | 452us
g++-7.1.0     | 34.8ns | 59.7ns | 399ns | 3.78us | 43.1us | 547us
clang++-3.9.1 | 32.3ns | 53.8ns | 297ns | 3.21us | 38.3us | 466us
clang++-4.0.1 | 32.4ns | 59.8ns | 296ns | 3.31us | 38.2us | 475us
zapcc++-1.0   | 32.0ns | 54.0ns | 333ns | 3.32us | 38.7us | 447us

Even on the biggest vector, this is a very fast operation once vectorized and parallelized. At this speed, some of the observed differences may not be highly significant. Again, the clang-based versions are the fastest on this code, but by a small margin. There also seems to be a slight regression in gcc-7.1, but again quite a small one.

Matrix Matrix multiplication (GEMM)

The next benchmark tests the performance of a matrix-matrix multiplication, an operation known as GEMM in the BLAS nomenclature. In this case, we test both the naive and the optimized, vectorized implementations. To save some horizontal space, I've split the tables in two.
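For reference, the naive (std) kernel is essentially the classic triple loop below (a sketch, not ETL's actual implementation); this is exactly the kind of code where the compiler's loop optimizations make a large difference:

#include <cstddef>

// Naive single-precision GEMM: C = A * B with row-major n x n matrices
void sgemm_std(const float* A, const float* B, float* C, std::size_t n){
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}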

sgemm (std)   | 10     | 20      | 40       | 60     | 80     | 100
g++-4.9.4     | 7.04us | 50.15us | 356.42us | 1.18ms | 3.41ms | 5.56ms
g++-5.4.0     | 8.14us | 74.77us | 513.64us | 1.72ms | 4.05ms | 7.92ms
g++-6.3.0     | 8.03us | 64.78us | 504.41us | 1.69ms | 4.02ms | 7.87ms
g++-7.1.0     | 7.95us | 65.00us | 508.84us | 1.69ms | 4.02ms | 7.84ms
clang++-3.9.1 | 3.58us | 28.59us | 222.36us | 0.73ms | 1.77ms | 3.41ms
clang++-4.0.1 | 4.00us | 25.47us | 190.56us | 0.61ms | 1.45ms | 2.80ms
zapcc++-1.0   | 4.00us | 25.38us | 189.98us | 0.60ms | 1.43ms | 2.81ms

sgemm (std)   | 200     | 300      | 400      | 500      | 600   | 700   | 800   | 900   | 1000  | 1200
g++-4.9.4     | 44.16ms | 148.88ms | 455.81ms | 687.96ms | 1.47s | 1.98s | 2.81s | 4.00s | 5.91s | 9.52s
g++-5.4.0     | 63.17ms | 213.01ms | 504.83ms | 984.90ms | 1.70s | 2.70s | 4.03s | 5.74s | 7.87s | 14.90s
g++-6.3.0     | 64.04ms | 212.12ms | 502.95ms | 981.74ms | 1.69s | 2.69s | 4.13s | 5.85s | 8.10s | 14.08s
g++-7.1.0     | 62.57ms | 210.72ms | 499.68ms | 974.94ms | 1.68s | 2.67s | 3.99s | 5.68s | 7.85s | 13.49s
clang++-3.9.1 | 27.48ms | 90.85ms  | 219.34ms | 419.53ms | 0.72s | 1.18s | 1.90s | 2.44s | 3.36s | 5.84s
clang++-4.0.1 | 22.01ms | 73.90ms  | 175.02ms | 340.70ms | 0.58s | 0.93s | 1.40s | 1.98s | 2.79s | 4.69s
zapcc++-1.0   | 22.33ms | 75.80ms  | 181.27ms | 359.13ms | 0.63s | 1.02s | 1.52s | 2.24s | 3.21s | 5.62s

This time, the differences between the compilers are very significant. The clang compilers lead the way by a large margin here, with clang-4.0 being the fastest of them (by another nice margin). Indeed, clang-4.0.1 produces code that is, on average, about twice as fast as the code generated by the best GCC compiler. Very interestingly as well, we can see a huge regression starting from GCC-5.4 that is still present in GCC-7.1. Indeed, the best GCC version among the tested versions is again GCC-4.9.4. Clang is really doing an excellent job of compiling the GEMM code.

sgemm (vec)   | 10       | 20     | 40     | 60       | 80       | 100
g++-4.9.4     | 264.27ns | 0.95us | 3.28us | 14.77us  | 23.50us  | 60.37us
g++-5.4.0     | 271.41ns | 0.99us | 3.31us | 14.811us | 24.116us | 61.00us
g++-6.3.0     | 279.72ns | 1.02us | 3.27us | 15.39us  | 24.29us  | 61.99us
g++-7.1.0     | 273.74ns | 0.96us | 3.81us | 15.55us  | 31.35us  | 71.11us
clang++-3.9.1 | 296.67ns | 1.34us | 4.18us | 19.93us  | 33.15us  | 82.60us
clang++-4.0.1 | 322.68ns | 1.38us | 4.17us | 20.19us  | 34.17us  | 83.64us
zapcc++-1.0   | 307.49ns | 1.41us | 4.10us | 19.72us  | 33.72us  | 84.80us

sgemm (vec)   | 200      | 300    | 400    | 500    | 600     | 700     | 800     | 900     | 1000     | 1200
g++-4.9.4     | 369.52us | 1.62ms | 2.91ms | 7.17ms | 11.74ms | 22.91ms | 34.82ms | 51.67ms | 64.36ms  | 111.15ms
g++-5.4.0     | 387.54us | 1.60ms | 2.97ms | 7.36ms | 12.11ms | 24.37ms | 35.37ms | 52.27ms | 65.72ms  | 112.74ms
g++-6.3.0     | 384.43us | 1.74ms | 3.12ms | 7.16ms | 12.44ms | 24.15ms | 34.87ms | 52.59ms | 70.074ms | 119.22ms
g++-7.1.0     | 458.05us | 1.81ms | 3.44ms | 7.86ms | 13.43ms | 24.70ms | 36.54ms | 53.47ms | 66.87ms  | 117.25ms
clang++-3.9.1 | 494.52us | 1.96ms | 4.80ms | 8.88ms | 18.20ms | 29.37ms | 41.24ms | 60.72ms | 72.28ms  | 123.75ms
clang++-4.0.1 | 511.24us | 2.04ms | 4.11ms | 9.46ms | 15.34ms | 27.23ms | 38.27ms | 58.14ms | 72.78ms  | 128.60ms
zapcc++-1.0   | 492.28us | 2.03ms | 3.90ms | 9.00ms | 14.31ms | 25.72ms | 37.09ms | 55.79ms | 67.88ms  | 119.92ms

For the optimized version, it seems that the two families are reversed. Indeed, GCC does a better job than clang here, and although the margin is not as big as before, it's still significant. We can still observe a small regression in the GCC versions, because the 4.9 version is again the fastest. As for the clang versions, it seems that clang-5.0 (used in zapcc) has had some performance improvements for this case.

For this case of matrix-matrix multiplication, it's very impressive that the differences in the non-optimized code are so significant. And it's also impressive that each family of compilers has its own strength, clang being seemingly much better at handling unoptimized code while GCC is better at handling vectorized code.

Convolution (2D)

The last benchmark I considered is the valid convolution on 2D images. The code is quite similar to the GEMM code but is more complicated to optimize because of cache locality.
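As a reminder of what is being measured, the naive valid convolution is essentially the quadruple loop below (again only a sketch, not the library's code); the vec version reorders and vectorizes this computation to get better cache behaviour:

#include <cstddef>

// Naive 2D valid convolution of an i1 x i2 image with a k1 x k2 kernel
// (kernel flipped, as in a true convolution). Output is (i1-k1+1) x (i2-k2+1).
void conv2_valid_std(const float* img, std::size_t i1, std::size_t i2,
                     const float* ker, std::size_t k1, std::size_t k2,
                     float* out){
    const std::size_t o1 = i1 - k1 + 1;
    const std::size_t o2 = i2 - k2 + 1;

    for (std::size_t r = 0; r < o1; ++r) {
        for (std::size_t c = 0; c < o2; ++c) {
            float acc = 0.0f;
            for (std::size_t a = 0; a < k1; ++a) {
                for (std::size_t b = 0; b < k2; ++b) {
                    acc += img[(r + a) * i2 + (c + b)]
                         * ker[(k1 - 1 - a) * k2 + (k2 - 1 - b)];
                }
            }
            out[r * o2 + c] = acc;
        }
    }
}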

sconv2_valid (std) | 100x50  | 105x50  | 110x55  | 115x55  | 120x60  | 125x60  | 130x65   | 135x65   | 140x70
g++-4.9.4          | 27.93ms | 33.68ms | 40.62ms | 48.23ms | 57.27ms | 67.02ms | 78.45ms  | 92.53ms  | 105.08ms
g++-5.4.0          | 37.60ms | 44.94ms | 54.24ms | 64.45ms | 76.63ms | 89.75ms | 105.08ms | 121.66ms | 140.95ms
g++-6.3.0          | 37.10ms | 44.99ms | 54.34ms | 64.54ms | 76.54ms | 89.87ms | 105.35ms | 121.94ms | 141.20ms
g++-7.1.0          | 37.55ms | 45.08ms | 54.39ms | 64.48ms | 76.51ms | 92.02ms | 106.16ms | 125.67ms | 143.57ms
clang++-3.9.1      | 15.42ms | 18.59ms | 22.21ms | 26.40ms | 31.03ms | 36.26ms | 42.35ms  | 48.87ms  | 56.29ms
clang++-4.0.1      | 15.48ms | 18.67ms | 22.34ms | 26.50ms | 31.27ms | 36.58ms | 42.61ms  | 49.33ms  | 56.80ms
zapcc++-1.0        | 15.29ms | 18.37ms | 22.00ms | 26.10ms | 30.75ms | 35.95ms | 41.85ms  | 48.42ms  | 55.74ms

In this case, we can observe the same pattern as for the GEMM: the clang-based versions produce significantly faster code than the GCC versions. Moreover, we can also observe the same large regression starting from GCC-5.4.

sconv2_valid (vec) | 100x50   | 105x50 | 110x55 | 115x55 | 120x60 | 125x60 | 130x65 | 135x65 | 140x70
g++-4.9.4          | 878.32us | 1.07ms | 1.20ms | 1.68ms | 2.04ms | 2.06ms | 2.54ms | 3.20ms | 4.14ms
g++-5.4.0          | 853.73us | 1.03ms | 1.15ms | 1.36ms | 1.76ms | 2.05ms | 2.44ms | 2.91ms | 3.13ms
g++-6.3.0          | 847.95us | 1.02ms | 1.14ms | 1.35ms | 1.74ms | 1.98ms | 2.43ms | 2.90ms | 3.12ms
g++-7.1.0          | 795.82us | 0.93ms | 1.05ms | 1.24ms | 1.60ms | 1.77ms | 2.20ms | 2.69ms | 2.81ms
clang++-3.9.1      | 782.46us | 0.93ms | 1.05ms | 1.26ms | 1.60ms | 1.84ms | 2.21ms | 2.65ms | 2.84ms
clang++-4.0.1      | 767.58us | 0.92ms | 1.04ms | 1.25ms | 1.59ms | 1.83ms | 2.20ms | 2.62ms | 2.83ms
zapcc++-1.0        | 782.49us | 0.94ms | 1.06ms | 1.27ms | 1.62ms | 1.83ms | 2.24ms | 2.65ms | 2.85ms

This time, clang manages to produce excellent results. Indeed, all the produced executables are significantly faster than the versions produced by GCC, except for GCC-7.1, which produces similar results. The other versions of GCC are falling behind, it seems. Apparently, it was only for the GEMM that clang had a lot of trouble handling the optimized code.

Conclusion

Clang seems to have recently done a lot of optimizations regarding compilation time. Indeed, clang-4.0.1 is much faster for compilation than clang-3.9. Although GCC-7.1 is faster than GCC-6.3, all the GCC versions are slower than GCC-4.9.4 which is the fastest at compiling code with optimizations. GCC-7.1 is the fastest GCC version for compiling code in debug mode.

In some cases, there is almost no difference in the generated code between the compilers. However, for more complex algorithms such as the matrix-matrix multiplication or the two-dimensional convolution, the differences can be quite significant. In my tests, clang has shown itself to be much better at compiling the unoptimized code. However, especially in the GEMM case, it seems to be worse than GCC at handling the hand-optimized code. I will investigate that case and try to tailor the code so that clang has a better time with it.

For me, it's really weird that the GCC regression, apparently starting from GCC-5.4, has still not been fixed in GCC 7.1. I was thinking of dropping support for GCC-4.9 in order to go full C++14 support, but now I may have to reconsider my position. However, seeing that GCC is generally the best at handling optimized code (especially for GEMM), I may be able to do the transition, since the optimized code will be used in most cases.

As for zapcc, although it is still the fastest compiler in debug mode, with the new speed of clang-4.0.1 its margin is quite small. Moreover, on optimized builds, it's not as fast as GCC. If you use clang and can have access to zapcc, it's still quite a good option to save some time.

Overall, I have been quite pleased by clang-4.0.1 and GCC-7.1, the most recent versions I have been testing. It seems that they did quite some good work. I will definitely run some more tests with them and try to adapt the code. I'm still considering whether I will drop support for some older compilers.

I hope this comparison was interesting :) My next post will probably be about the difference in performance between my machine learning framework and other frameworks to train neural networks.