DLL New Features: Embeddings and Merge layers

I've just finished integrating new features into DLL, my deep learning library. I've added support for an embeddings layer, a group layer and a merge layer. These features are not yet released, but they are available in the master branch.

Embeddings are used more and more these days to learn dense representations of characters or words. An embedding layer in a neural network transforms labels into vectors. It's generally used as the first layer of the network. The embeddings are learned as part of the network.
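To make the idea concrete, here is a minimal sketch of what an embedding lookup does, written in plain C++ and independent of DLL. The names embedding_table and embed are hypothetical and not part of the library, and the vector length of 4 is only for illustration. In a real network, the values in the table would be learned during training rather than set by hand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t vocabulary = 26; // One label per lowercase letter
constexpr std::size_t dimension  = 4;  // Length of each embedding vector (illustrative)

// One row of weights per label; training would adjust these values
using embedding_table = std::array<std::array<float, dimension>, vocabulary>;

// Replace every character of the word by its row in the table
std::vector<std::array<float, dimension>> embed(const embedding_table& table, const std::string& word){
    std::vector<std::array<float, dimension>> output;
    output.reserve(word.size());

    for(char c : word){
        const auto label = static_cast<std::size_t>(c - 'a');
        assert(label < vocabulary && "only lowercase letters are supported");
        output.push_back(table[label]);
    }

    return output;
}
```

A word of 15 characters therefore becomes a 15 x dimension matrix, which is the kind of input a convolutional layer can work on.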

The merge layer makes it possible to create branches in the network. The input is passed to each sub-layer and then the outputs of the sub-layers are concatenated to form the output of the merge layer. This can be very useful to apply different convolutional filter sizes to the same input.

The group layer is a simple utility to group layers together. It is mostly used with merge layers to form several branches.

I've put together a new example that uses these features for text classification. The dataset is totally synthetic for now, but the example can easily be reproduced with a real text classification dataset. This kind of model is called a Character Convolutional Neural Network.

Here is the code for the example:

constexpr size_t embedding = 16; // The length of the embedding vector
constexpr size_t length = 15;    // The word (or sequence) length

using embedding_network_t = dll::dyn_network_desc<
    dll::network_layers<
        // The embedding layer
        dll::embedding_layer<26, length, embedding>

        // The convolutional layers
        , dll::merge_layer<
            0 // The merge dimension
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 3, embedding>
                , dll::mp_2d_layer<16, length - 3 + 1, 1, length - 3 + 1, 1>
            >
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 4, embedding>
                , dll::mp_2d_layer<16, length - 4 + 1, 1, length - 4 + 1, 1>
            >
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 5, embedding>
                , dll::mp_2d_layer<16, length - 5 + 1, 1, length - 5 + 1, 1>
            >
        >

        // The final softmax layer
        , dll::dense_layer<48, 10, dll::softmax>
    >
    , dll::updater<dll::updater_type::NADAM>     // Nesterov Adam (NADAM)
    , dll::batch_size<50>                        // The mini-batch size
    , dll::shuffle                               // Shuffle before each epoch
>::network_t;

auto net = std::make_unique<embedding_network_t>();

// Display the network and dataset
net->display();

// Train the network for performance sake
net->fine_tune(samples, labels, 50);

// Test the network on train set
net->evaluate(samples, labels);

The network starts with an embedding layer. The embedding is then passed to three convolutional layers with different filter sizes, each followed by a pooling layer. The outputs of the three layers are merged at the end of the merge layer. Finally, a softmax layer is used for classification.
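The input size of 48 for the final dense layer follows from the shapes of the three branches. The following sketch reproduces that arithmetic at compile time. The helper names conv_output and branch_output are hypothetical and only serve the illustration; they are not part of DLL.

```cpp
#include <cstddef>

constexpr std::size_t length  = 15; // The word (or sequence) length
constexpr std::size_t filters = 16; // Number of filters in each convolutional layer

// Length of the convolution output for a kernel covering `kernel` characters
constexpr std::size_t conv_output(std::size_t kernel){
    return length - kernel + 1;
}

// The pooling window spans the whole convolution output,
// so each filter is reduced to a single value
constexpr std::size_t branch_output(std::size_t kernel){
    return filters * (conv_output(kernel) / conv_output(kernel));
}

// The merge layer concatenates the outputs of the three branches
constexpr std::size_t merged_output = branch_output(3) + branch_output(4) + branch_output(5);

static_assert(conv_output(3) == 13, "15 - 3 + 1");
static_assert(merged_output == 48, "matches the input size of the dense layer");
```

If you add a branch or change the number of filters, the first template parameter of the dense layer has to be updated accordingly.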

This kind of model can be very powerful and is used regularly. These new features allow a much larger variety of models to be built with the DLL library.

The full code with the dataset generation can be found online: char_cnn.cpp

The next feature I want to focus on is recurrent neural networks. I'll probably try a single RNN layer first and then upgrade to multi-layers and LSTM and maybe GRU.


I successfully defended my Ph.D.

I'm happy to announce that I've successfully defended my thesis "Deep Learning Features for Image Processing". After four years, I officially defended it in front of the thesis committee last Friday, and two days ago I successfully defended it publicly in front of my friends, family and colleagues.

I'm now a "Doctor of Philosophy in Computer Science" :)

I will update my thesis with the last comments in November and send the final version to the university. At that point, I'll publish it on this website as well.


Budgetwarrior: Track assets and portfolio, savings rates and auto-completion

This last month, I've been reading quite a few blogs about personal finance and I've decided to integrate more features into budgetwarrior. This post is about three new features that I've integrated. It's not yet a new release, so if you want to test this version, you'll have to compile it from the master branch on Git.

As it was last time, the values on my screenshots have all been randomized.

If you have several assets with different distributions, I believe it is very valuable to have them all shown at the same time, especially if you want to change the distribution of your portfolio or if you plan big changes in it.

Track assets

The first feature I've added lets you precisely track each of your assets independently. You can also track the allocation of your portfolio in terms of stocks, bonds and cash. The tool also lets you set the desired distribution of your assets and will compute the changes that you should make in order to comply with your desired distribution.

First, you need to define all your assets (your accounts, funds, stocks, ...) and their distribution with budget asset add. You can also set a currency for each asset. The default currency is now CHF, but you can set it in the configuration file, for instance default_currency=USD. You can see your assets using budget asset:

View of your assets

You can then set the value of your assets using budget asset value add. The system will save all the values of your assets. For now, only the last value is displayed in the application. In the future, I plan to add new reports for the evolution of the portfolio over time. You can see your current net worth with budget asset value:

View of your portfolio

The different currencies will all be converted to the default currency.

Savings rate

The second change I made is to compute the savings rate of each month and year. The savings rate is simply the portion of your income that you are able to save each month. The savings rate for a year is simply the average of the savings rates of each month.
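As an illustration, the computation can be sketched as follows. This is not the code of budgetwarrior: the month_totals structure and both function names are hypothetical, and the sketch assumes that the amount saved in a month is the income minus the expenses.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical totals for one month
struct month_totals {
    double income;
    double expenses;
};

// Portion of the income that was not spent, in percent
double savings_rate(const month_totals& month){
    if(month.income <= 0.0){
        return 0.0; // Assumption: a month without income has a rate of zero
    }

    return 100.0 * (month.income - month.expenses) / month.income;
}

// Yearly rate as the plain average of the monthly rates
double yearly_savings_rate(const std::vector<month_totals>& months){
    if(months.empty()){
        return 0.0;
    }

    double sum = 0.0;
    for(const auto& month : months){
        sum += savings_rate(month);
    }

    return sum / static_cast<double>(months.size());
}
```

Note that averaging the monthly rates is not the same as dividing the total savings of the year by the total income: a month with a small income weighs as much as a month with a large one.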

The savings rate of the month can be seen with budget overview month:

Savings rate of the month

The savings rates of each month can also be seen in the overview of the year with budget overview year:

Savings rate of the year

This shows the savings rate of each month, the average of the year and the average of the current year up to the current month.

The savings rate is a very important metric of your budget. In my case, it's currently way too low and it made me realize I really need to save more. Any savings rate below 10% is too low. There is no rule as to how high it should be, but I'd like to increase mine to at least 20% next year.


The last feature is mostly a quality-of-life improvement. Some of the inputs in the console can now be completed. It's not really auto-completion per se, but you can cycle through the list of possible values using the UP and DOWN keys.

This makes it much easier to set some values such as asset names (in budget asset value add for instance), account names and objective types and sources. I'm trying to make the input of values easier.


I don't know exactly what else will be integrated in the future, but I may already improve some visualizations for asset values. If I learn something new about personal finance that I can integrate into the tool, I'll do it as well.

If you are interested in the sources or want to install this version, you can download them on Github: budgetwarrior.

The new features are in the master branch.

If you have a suggestion for a new feature or you found a bug, please post an issue on Github. I'd be glad to help you.

If you have any comment, don't hesitate to contact me, either by leaving a comment on this post or by email.


Deep Learning Library 1.0 - Fast Neural Network Library

DLL Logo

I'm very happy to announce the release of the first version of Deep Learning Library (DLL): DLL 1.0. DLL is a neural network library with a focus on speed and ease of use.

I started working on this library about 4 years ago for my Ph.D. thesis. I needed a good library to train and use Restricted Boltzmann Machines (RBMs) and at that time there was no good support for them. Therefore, I decided to write my own. It now has very complete support for the RBM and the Convolutional RBM (CRBM) models. Stacks of RBMs (or Deep Belief Networks (DBNs)) can be pretrained using Contrastive Divergence and then either fine-tuned with mini-batch gradient descent or Conjugate Gradient, or used as a feature extractor. Over the years, the library has been extended to handle Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). The library is also able to train regular auto-encoders. Several advanced layers such as Dropout or Batch Normalization are also available, as well as adaptive learning rate techniques such as Adadelta and Adam. The library also has integrated support for a few datasets: MNIST, CIFAR-10 and ImageNet.

This library is used through a C++ interface. The library is fully header-only. It requires a C++14 compiler, which means a minimum of clang 3.9 or GCC 6.3.

In this post, I'm going to present a few examples of using the library and give some information about the performance of the library and the roadmap for the project.

Read more…


Expression Templates Library (ETL) 1.2 - Complete GPU support

ETL Logo

I'm happy to announce version 1.2 of my Expression Templates Library (ETL), two months after I released version 1.1. This version features much better GPU support, a few new features and a lot of changes in the internal code.

GPU Support

Before, only algorithms such as 4D convolution or matrix-matrix multiplication were computed on the GPU and lots of operations were causing copies between the CPU and GPU versions. Now, the support for basic operations has also been completed and therefore, expressions like this:

C = sigmoid(2.0 * (A + B)) / sum(A)

can be computed entirely on the GPU.

Each matrix and vector container has a secondary GPU memory space. During the execution, the status of both memory spaces is managed and, when necessary, copies are made between the two spaces. In the best case, there should only be initial copies to the GPU and then everything should be done on the GPU. I've also considered using Unified Memory in place of this system, but this is a problem for fast matrix and I'd rather not have two different systems.

If you have an expression such as c = a + b * 2, it can be entirely computed on the GPU. However, it will be computed in two GPU operations such as:

t1 = b * 2
c = a + t1

This is not perfect in terms of performance, but it will be done without any copies between CPU and GPU memory. I plan to improve this system with slightly more complex operations to avoid too many GPU operations, but there will always be more operations than on the CPU, where this can easily be done in one go.
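The difference between the two strategies can be sketched on the CPU with plain loops. This is only an illustration with hypothetical function names (two_steps and fused), not ETL code, and it assumes that both vectors have the same size.

```cpp
#include <cstddef>
#include <vector>

// Two separate passes with a temporary, mirroring the two GPU operations
std::vector<float> two_steps(const std::vector<float>& a, const std::vector<float>& b){
    std::vector<float> t1(b.size());
    for(std::size_t i = 0; i < b.size(); ++i){
        t1[i] = b[i] * 2.0f; // t1 = b * 2
    }

    std::vector<float> c(a.size());
    for(std::size_t i = 0; i < a.size(); ++i){
        c[i] = a[i] + t1[i]; // c = a + t1
    }

    return c;
}

// One fused pass without temporary, as an expression template can do on the CPU
std::vector<float> fused(const std::vector<float>& a, const std::vector<float>& b){
    std::vector<float> c(a.size());
    for(std::size_t i = 0; i < a.size(); ++i){
        c[i] = a[i] + b[i] * 2.0f; // c = a + b * 2
    }

    return c;
}
```

Both functions return the same values. The first one reads and writes memory twice and needs a temporary, which corresponds to the cost of launching one GPU operation per sub-expression.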

There are a few expressions that are not computable on the GPU, such as random generations. A few transformations are also not fully compatible with GPU. Moreover, if you access an element with operators [] or (), this will invalidate the GPU memory and force an update to the CPU memory.

GPU operations are not implemented directly in ETL; they come from various libraries. ETL uses NVIDIA CUDNN, CUFFT and CUBLAS for most algorithms. Moreover, for other operations, I've implemented a library with simple GPU operations: ETL-GPU-BLAS (EGBLAS). You can have a look at egblas if you are interested.

My Deep Learning Library (DLL) project is based on ETL and its performance mostly depends on ETL's performance. Now that ETL fully supports GPU, the GPU performance of DLL is much improved. You may remember that a few weeks ago I posted about the very high CPU performance of DLL. Now, I've run the tests again to see the GPU performance of DLL. Here is the performance for training a small CNN on the MNIST data set:

Performances for training a Convolutional Neural Network on MNIST

As you can see, the performance on GPU is now excellent. DLL's performance is on par with TensorFlow and Keras!

The next results are for training a much larger CNN on ImageNet, with the time necessary to train a single batch:

Performances for training a Convolutional Neural Network on Imagenet

Again, using the new version of ETL inside DLL has led to excellent performance. The framework is again on par with TensorFlow and Keras and faster than all the other frameworks. The large difference between DLL and TensorFlow and Keras is due to the inefficiency of reading the dataset in these two frameworks, so the performance of the three frameworks themselves is about the same.

Other Changes

The library also has a few other new features. Logarithms of base 2 and base 10 are now supported in addition to the base e that was already available. Categorical Cross Entropy (CCE) computation is also available now: the CCE loss and error can be computed for one or many samples. Convolutions have also been improved in that you can use mixed types in both the image and the kernel, and different storage orders as well. Nevertheless, the most optimized version remains the version with the same storage order and the same data type.

I've also made a major change in the way implementations are selected for each operation. The tests and the benchmark use a system to force the selection of an algorithm. This system is now disabled by default. Since it's not necessary in most cases, this makes regular use of the library much faster to compile.

Overall, the support for complex numbers has been improved in ETL. There are more routines that are supported and etl::complex is better supported throughout the code. I'll still work on this in the future to make it totally complete.

The internal code also has a few new changes. First, all traits have been rewritten to use variable templates instead of struct traits. This makes the code much nicer in my opinion. Moreover, I've started experimenting with C++17 if constexpr. Most of the if conditions that can be transformed to if constexpr have been annotated with comments that I can quickly enable or disable so that I can test the impact of C++17, especially on compilation time.
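For readers who have not used these two features, here is a small standalone sketch of the difference. The trait and function names are hypothetical and are not the ones used in ETL, and the block needs a C++17 compiler because of if constexpr.

```cpp
#include <type_traits>

// Struct-based trait: the value has to be read through ::value
template <typename T>
struct is_single_precision_trait : std::is_same<T, float> {};

// Variable template (C++14): the trait is used like a constant
template <typename T>
constexpr bool is_single_precision = std::is_same<T, float>::value;

// C++17 if constexpr: only the selected branch is instantiated
template <typename T>
constexpr int vector_width(){
    if constexpr (is_single_precision<T>){
        return 8; // 8 floats fit in a 256-bit register
    } else {
        return 4; // 4 doubles fit in a 256-bit register
    }
}

static_assert(is_single_precision_trait<float>::value, "struct trait");
static_assert(is_single_precision<float>, "variable template");
static_assert(vector_width<double>() == 4, "if constexpr selects the second branch");
```

Before C++17, the same selection would typically require two overloads with std::enable_if or tag dispatching, which means more template instantiations for the compiler to process.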

Finally, a few bugs have been fixed. ETL now works better with parallel BLAS libraries. There should no longer be issues with double parallelization in ETL and BLAS. There was a slight bug in the Column-Major matrix-matrix multiplication kernel. Binary operations with different types in the left and right hand sides were also problematic with vectorization. The last bug was about GPU status in case ETL containers were moved.

What's next ?

I don't yet know exactly on which features I'm going to focus for the next version of ETL. I plan to focus a bit more in the near future on Deep Learning Library (DLL) for which I should release the version 1.0 soon. I also plan to start support for Recurrent Neural Networks on it, so that will take me quite some time.

Nevertheless, I'm still planning to consider the switch to C++17, since it is a bit faster to compile ETL with if constexpr. The next version of ETL will also probably have GPU support for integers, at least in the cases that depend on the etl-gpu-blas library, which are the standard operators. I also plan to improve the support for complex numbers, especially in terms of performance and tests. Hopefully, I will also have time (and motivation) to start working on the sparse capabilities of ETL. It really needs many more unit tests and the performance should be improved as well.

Download ETL

You can download ETL on Github. If you are only interested in the 1.2 version, you can look at the Releases page or clone the tag 1.2. There are several branches:

  • master is the eternal development branch, may not always be stable
  • stable is a branch always pointing to the last tag, no development here

For future releases, there will always be tags pointing to the corresponding commits. You can also access previous releases on Github or via the release tags.

The documentation is still a bit sparse. There are a few examples and the Wiki, but there is still work to be done. If you have questions on how to use or configure the library, please don't hesitate to ask.

Don't hesitate to comment on this post if you have any comment on this library or any question. You can also open an Issue on Github if you have a problem using this library or propose a Pull Request if you have any contribution you'd like to make to the library.

Hope this may be useful to some of you :)


C++11 Performance tip: Update on when to use std::pow ?

A few days ago, I published a post comparing the performance of std::pow against direct multiplications. When not compiling with -ffast-math, direct multiplication was significantly faster than std::pow, around two orders of magnitude faster when comparing x * x * x and std::pow(x, 3). One comment that I got suggested testing for which n std::pow(x, n) becomes faster than multiplying in a loop. Since std::pow uses a special algorithm to perform the computation rather than simple loop-based multiplication, there may be a point after which it's more interesting to use the algorithm rather than a loop. So I decided to do the tests. You can also find the results in the original article, which I've updated.

First, our pow function:

double my_pow(double x, size_t n){
    double r = 1.0;

    // Multiply n times by x
    while(n > 0){
        r *= x;
        --n;
    }

    return r;
}

And now, let's see the performance. I've compiled my benchmark with GCC 4.9.3 and ran it on my old Sandy Bridge processor. Here are the results for 1000 calls to each function:

We can see that between n=100 and n=110, std::pow(x, n) starts to be faster than my_pow(x, n). From this point on, you should only use std::pow(x, n). Interestingly, the time for std::pow(x, n) is also decreasing. Let's see how the performance evolves with a higher range of n:

We can see that the pow function time still remains stable while our loop-based pow function still increases linearly. At n=1000, std::pow is one order of magnitude faster than my_pow.

Overall, if you do not care much about extreme accuracy, you may consider using your own pow function for small-ish (integer) n values. After n=100, it becomes more interesting to use std::pow.

If you want more results on the subject, you can take a look at the original article.
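If you want to apply this advice in your own code, one option is a small wrapper that switches between the two implementations. This is only a sketch: hybrid_pow and pow_threshold are hypothetical names, and the threshold of 100 comes from the measurements in this post (GCC 4.9.3 on a Sandy Bridge processor), so it should be measured again for any other compiler or machine.

```cpp
#include <cmath>
#include <cstddef>

// Loop-based power, same as my_pow in the post
double my_pow(double x, std::size_t n){
    double r = 1.0;

    while(n > 0){
        r *= x;
        --n;
    }

    return r;
}

// Crossover point observed in this post; it will differ on other setups
constexpr std::size_t pow_threshold = 100;

// Use the loop for small exponents and std::pow for large ones
double hybrid_pow(double x, std::size_t n){
    if(n <= pow_threshold){
        return my_pow(x, n);
    }

    return std::pow(x, static_cast<double>(n));
}
```

Keep in mind that the loop accumulates one rounding error per multiplication, so its result can differ slightly from std::pow for values that are not exactly representable.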

If you are interested in the code of this benchmark, it's available online: bench_pow_my_pow.cpp


How I made my Deep Learning Library 38% faster to compile (Optimization and C++17 if constexpr)

My Deep Learning Library (DLL) project is a C++ library for training and using artificial neural networks (you can take a look at this post about DLL if you want more information).

While I made a lot of effort to make it as fast as possible to train and run neural networks, the compilation time has been steadily going up and is becoming quite annoying. This library is heavily templated and all the matrix operations are done using my Expression Templates Library (ETL) which is more than template-heavy itself.

In this post, I'll present two techniques with which I've been able to reduce the total compilation time of the DLL unit tests by up to 38%.

Read more…


C++11 Performance tip: When to use std::pow ?

Update: I've added a new section for larger values of n.

Recently, I've been wondering about the performance of std::pow(x, n). I'm talking here about the case when n is an integer. In the case when n is not an integer, I believe, you should always use std::pow or use another specialized library.

In the case when n is an integer, you can actually replace it with the direct equivalent (for instance std::pow(x, 3) = x * x * x). If n is very large, you'd rather write a loop of course ;) In practice, we generally use powers of two and three much more often than a power of 29, although that could happen. Of course, it especially makes sense to wonder about this if the pow is used inside a loop. If you only use it once outside a loop, there won't be any difference in the overall performance.

Since I'm mostly interested in single precision performance (neural networks are only about single precision), the first benchmarks will be using float.

Read more…


budgetwarrior 0.4.2 - Budget summary and improved fortune reports

Almost three years ago, I published version 0.4.1 of budgetwarrior. Since then, I've been using this tool almost every day to manage my personal budget. This is the only tool I use to keep track of my expenses and earnings and it works great for me. I recently felt that it was missing a few features, so I added them, polished a few things as well and released a new version with all the new stuff. This new version is probably nothing fancy, but it is a nice upgrade of the tool.

Don't pay too much attention to the values in the images since I've randomized all the data for the purpose of this post (new feature, by the way :P).

New summary view

I've added a new report with budget summary:


This view gives concise information about the current state of your accounts. It also gives information about your yearly and monthly objectives. Finally, it also gives information about the last two fortune values that you've set. I think this makes a great kind of dashboard to view most of the information. If your terminal is large enough, the three parts will be shown side by side.

Improved fortune report

I've made a few improvements to the budget fortune view:


It now displays the time between the different fortune values and it computes the average savings (or average losses) per day in each interval and on average since the first value.

Various changes

The balance does not propagate over the years anymore. This should mainly change the behaviour of budget overview. I don't think it was very smart to propagate it all the time. The balance now starts at zero for each year. If you want the old system, you can use the multi_year_balance=true option in the .budgetrc configuration file.

The recurring expenses no longer use an internal configuration value. This does not change anything for the behaviour, but it means that if you sync between different machines, it will avoid a lot of possible conflicts :)

I fixed a few bugs with inconsistencies between the different views and reports. Another bug was that budget report was not always displaying the first month of the year correctly. This is now fixed.

The graphs displayed in budget report are now automatically adapted to the width of your terminal. Finally, the budget overview command also displays more information about the comparison with the previous month.


If you are on Gentoo, you can install it using layman:

layman -a wichtounet
emerge -a budgetwarrior

If you are on Arch Linux, you can use this AUR repository.

For other systems, you'll have to install from sources:

git clone --recursive git://
cd budgetwarrior
sudo make install


A brief tutorial is available on Github: Starting guide.

If you are interested in the sources, you can download them on Github: budgetwarrior.

If you have any suggestion for a new feature or an improvement to the tool or you found a bug, please post an issue on Github. I'd be glad to help you. You can also post a comment directly on this post :)

If you have any other comment, don't hesitate to contact me, either by leaving a comment on this post or by email.

I hope that this application can be useful to some of you command-line adepts :)


C++11 Concurrency Tutorial - Part 5: Futures

I've recently been reminded that a long time ago I was doing a series of tutorials on C++11 Concurrency. For some reason, I haven't continued these tutorials. The next post in the series was supposed to be about Futures, so I'm finally going to do it :)

Here are the links to the current posts of the C++11 Concurrency Tutorial:

In this post, we are going to talk about futures, more precisely std::future<T>. What is a future? It's a very nice and simple mechanism to work with asynchronous tasks. It also has the advantage of decoupling you from the threads themselves: you can do multithreading without using std::thread. The future itself is a structure pointing to a result that will be computed in the future. How do you create a future? The simplest way is to use std::async, which will create an asynchronous task and return a std::future.

Let's start with the simplest of the examples:

#include <thread>
#include <future>
#include <iostream>

int main(){
    auto future = std::async(std::launch::async, [](){
        std::cout << "I'm a thread" << std::endl;
    });

    // Wait for the task to complete
    future.get();

    return 0;
}

Nothing really special here. std::async will execute the task that we give it (here a lambda) and return a std::future. Once you use the get() function on a future, it will wait until the result is available and return this result to you once it is. The get() function is therefore blocking. Since the lambda is a void lambda, the returned future is of type std::future<void> and get() returns void as well. It is very important to know that you cannot call get() several times on the same future. Once the result is consumed, you cannot consume it again! If you want to use the result several times, you need to store it yourself after you called get().

Let's see with something that returns a value and actually takes some time before returning it:

#include <thread>
#include <future>
#include <iostream>
#include <chrono>

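int main(){
    auto future = std::async(std::launch::async, [](){
        // Simulate a computation that takes some time
        std::this_thread::sleep_for(std::chrono::seconds(5));
        return 42;
    });

    // Do something else ?

    std::cout << future.get() << std::endl;

    return 0;
}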

This time, the future will be of the type std::future<int> and thus get() will also return an int. std::async will again launch a task in an asynchronous way and future.get() will wait for the answer. What is interesting is that you can do something else before the call to future.get().

But get() is not the only interesting function in std::future. You also have wait(), which is almost the same as get() but does not consume the result. For instance, you can wait for several futures and then consume their results together. But more interesting are the wait_for(duration) and wait_until(timepoint) functions. The first one waits for the result for at most the given duration and then returns, and the second one waits for the result at most until the given time point. I think that wait_for is more useful in practice, so let's discuss it further. Finally, an interesting function is bool valid(). When you use get() on the future, it will consume the result, making valid() return false. So, if you intend to check a future multiple times, you should use valid() first.
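Here is a short sketch of how these functions behave on a single future. The function name consume_once and the durations are arbitrary choices for the illustration; it returns true if the future behaved as described above.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Returns true if the future is valid before get() and invalid after it
bool consume_once(){
    auto future = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return 42;
    });

    const bool valid_before = future.valid();

    // wait_until() gives up at a fixed time point instead of after a duration
    const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
    const bool ready = future.wait_until(deadline) == std::future_status::ready;

    // wait() blocks until the result is available but leaves it in place
    future.wait();
    const bool valid_after_wait = future.valid();

    // get() consumes the result, the future is not valid anymore afterwards
    const int result = future.get();
    const bool valid_after_get = future.valid();

    return valid_before && ready && valid_after_wait && result == 42 && !valid_after_get;
}
```

Calling get() a second time on this future would be undefined behaviour, which is why checking valid() first matters when a future is queried in a loop.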

One possible scenario would be if you have several asynchronous tasks, which is a common scenario. You can imagine that you want to process the results as fast as possible, so you want to ask the futures for their result several times. If no result is available, maybe you want to do something else. Here is a possible implementation:

#include <thread>
#include <future>
#include <iostream>
#include <chrono>

int main(){
    auto f1 = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::seconds(9)); // Simulate a long task
        return 42;
    });

    auto f2 = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::seconds(3)); // Simulate a long task
        return 13;
    });

    auto f3 = std::async(std::launch::async, [](){
        std::this_thread::sleep_for(std::chrono::seconds(6)); // Simulate a long task
        return 666;
    });

    auto timeout = std::chrono::milliseconds(10);

    // Loop as long as at least one result has not been consumed
    while(f1.valid() || f2.valid() || f3.valid()){
        if(f1.valid() && f1.wait_for(timeout) == std::future_status::ready){
            std::cout << "Task1 is done! " << f1.get() << std::endl;
        }

        if(f2.valid() && f2.wait_for(timeout) == std::future_status::ready){
            std::cout << "Task2 is done! " << f2.get() << std::endl;
        }

        if(f3.valid() && f3.wait_for(timeout) == std::future_status::ready){
            std::cout << "Task3 is done! " << f3.get() << std::endl;
        }

        std::cout << "I'm doing my own work!" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
        std::cout << "I'm done with my own work!" << std::endl;
    }

    std::cout << "Everything is done, let's go back to the tutorial" << std::endl;

    return 0;
}

The three tasks are started asynchronously with std::async and the resulting std::future objects are stored. Then, as long as one of the tasks is not complete, we query each of the three tasks and try to process its result. If no result is available, we simply do something else. This example is important to understand, since it covers pretty much every concept of futures.

One interesting thing that remains is that you can pass parameters to your task via std::async. Indeed, all the extra parameters that you pass to std::async will be passed to the task itself. Here is an example of spawning tasks in a loop with different parameters:

#include <thread>
#include <future>
#include <iostream>
#include <chrono>
#include <vector>

int main(){
    std::vector<std::future<size_t>> futures;

    // Spawn ten tasks, each one receiving its own parameter
    for (size_t i = 0; i < 10; ++i) {
        futures.emplace_back(std::async(std::launch::async, [](size_t param){
            return param;
        }, i));
    }

    std::cout << "Start querying" << std::endl;

    for (auto &future : futures) {
        std::cout << future.get() << std::endl;
    }

    return 0;
}

Pretty practical :) All the created std::future<size_t> objects are stored in a std::vector and are then all queried for their result.

Overall, I think std::future and std::async are great tools that can simplify your asynchronous code a lot. They allow you to do pretty advanced stuff while keeping the complexity of the code to a minimum.

I hope this long-overdue post is going to be interesting to some of you :) The code for this post is available on Github.

I do not yet know if there will be a next installment in the series. I've covered pretty much everything that is available in C++11 for concurrency. I may cover the parallel algorithms of C++17 in a following post. If you have any suggestion for the next post, don't hesitate to post a comment or contact me directly by email.