Skip to main content

DLL: Pretty printing and live output

I've improved a lot the display of my Deep Learning Library (DLL). I know this is generally not the most important point in a machine learning framework, but the first impression being important. Therefore, I decided it was time to get a nicer output in the console for training networks.

A network or a dataset can be displayed using the display() function. I've added a display_pretty() function to them to display it more nicely. I've also added the dll::dump_timers_nice() function to do the same for dll::dump_timers().

I've also improved the display for the results of the batches during training. Now, the display is updated every 100ms and it also displays the current estimated time until the end of the epoch. With that, the user should have a much better idea on what's going on during training, especially when training networks when the epochs are taking a long time to complete.

Here is a full output of the training of fully-connected network on MNIST (mnist_mlp.cpp <>):

 | Index | Layer                | Parameters | Output Shape |
 | 0     | Dense(SIGMOID) (dyn) |     392000 | [Bx500]      |
 | 1     | Dropout(0.50)(dyn)   |          0 | [Bx500]      |
 | 2     | Dense(SIGMOID) (dyn) |     125000 | [Bx250]      |
 | 3     | Dropout(0.50)(dyn)   |          0 | [Bx250]      |
 | 4     | Dense(SOFTMAX) (dyn) |       2500 | [Bx10]       |
                Total Parameters:     519500

 | mnist | Size  | Batches | Augmented Size |
 | train | 60000 | 600     | 60000          |
 | test  | 10000 | 100     | 10000          |

Train the network with "Stochastic Gradient Descent"
    Updater: NADAM
 Early Stop: Goal(error)

With parameters:

epoch   0/50 batch  600/ 600 - error: 0.04623 loss: 0.15097 time 3230ms
epoch   1/50 batch  600/ 600 - error: 0.03013 loss: 0.09947 time 3188ms
epoch   2/50 batch  600/ 600 - error: 0.02048 loss: 0.06565 time 3102ms
epoch   3/50 batch  600/ 600 - error: 0.01593 loss: 0.05258 time 3189ms
epoch   4/50 batch  600/ 600 - error: 0.01422 loss: 0.04623 time 3160ms
epoch   5/50 batch  600/ 600 - error: 0.01112 loss: 0.03660 time 3131ms
epoch   6/50 batch  600/ 600 - error: 0.01078 loss: 0.03546 time 3200ms
epoch   7/50 batch  600/ 600 - error: 0.01003 loss: 0.03184 time 3246ms
epoch   8/50 batch  600/ 600 - error: 0.00778 loss: 0.02550 time 3222ms
epoch   9/50 batch  600/ 600 - error: 0.00782 loss: 0.02505 time 3119ms
epoch  10/50 batch  600/ 600 - error: 0.00578 loss: 0.02056 time 3284ms
epoch  11/50 batch  600/ 600 - error: 0.00618 loss: 0.02045 time 3220ms
epoch  12/50 batch  600/ 600 - error: 0.00538 loss: 0.01775 time 3444ms
epoch  13/50 batch  600/ 600 - error: 0.00563 loss: 0.01803 time 3304ms
epoch  14/50 batch  600/ 600 - error: 0.00458 loss: 0.01598 time 3577ms
epoch  15/50 batch  600/ 600 - error: 0.00437 loss: 0.01436 time 3228ms
epoch  16/50 batch  600/ 600 - error: 0.00360 loss: 0.01214 time 3180ms
epoch  17/50 batch  600/ 600 - error: 0.00405 loss: 0.01309 time 3090ms
epoch  18/50 batch  600/ 600 - error: 0.00408 loss: 0.01346 time 3045ms
epoch  19/50 batch  600/ 600 - error: 0.00337 loss: 0.01153 time 3071ms
epoch  20/50 batch  600/ 600 - error: 0.00297 loss: 0.01021 time 3131ms
epoch  21/50 batch  600/ 600 - error: 0.00318 loss: 0.01103 time 3076ms
epoch  22/50 batch  600/ 600 - error: 0.00277 loss: 0.00909 time 3090ms
epoch  23/50 batch  600/ 600 - error: 0.00242 loss: 0.00818 time 3163ms
epoch  24/50 batch  600/ 600 - error: 0.00267 loss: 0.00913 time 3229ms
epoch  25/50 batch  600/ 600 - error: 0.00295 loss: 0.00947 time 3156ms
epoch  26/50 batch  600/ 600 - error: 0.00252 loss: 0.00809 time 3066ms
epoch  27/50 batch  600/ 600 - error: 0.00227 loss: 0.00773 time 3156ms
epoch  28/50 batch  600/ 600 - error: 0.00203 loss: 0.00728 time 3158ms
epoch  29/50 batch  600/ 600 - error: 0.00240 loss: 0.00753 time 3114ms
epoch  30/50 batch  600/ 600 - error: 0.00263 loss: 0.00864 time 3099ms
epoch  31/50 batch  600/ 600 - error: 0.00210 loss: 0.00675 time 3096ms
epoch  32/50 batch  600/ 600 - error: 0.00163 loss: 0.00628 time 3120ms
epoch  33/50 batch  600/ 600 - error: 0.00182 loss: 0.00611 time 3045ms
epoch  34/50 batch  600/ 600 - error: 0.00125 loss: 0.00468 time 3140ms
epoch  35/50 batch  600/ 600 - error: 0.00183 loss: 0.00598 time 3093ms
epoch  36/50 batch  600/ 600 - error: 0.00232 loss: 0.00711 time 3068ms
epoch  37/50 batch  600/ 600 - error: 0.00170 loss: 0.00571 time 3057ms
epoch  38/50 batch  600/ 600 - error: 0.00162 loss: 0.00530 time 3115ms
epoch  39/50 batch  600/ 600 - error: 0.00155 loss: 0.00513 time 3226ms
epoch  40/50 batch  600/ 600 - error: 0.00150 loss: 0.00501 time 2987ms
epoch  41/50 batch  600/ 600 - error: 0.00122 loss: 0.00425 time 3117ms
epoch  42/50 batch  600/ 600 - error: 0.00108 loss: 0.00383 time 3102ms
epoch  43/50 batch  600/ 600 - error: 0.00165 loss: 0.00533 time 2977ms
epoch  44/50 batch  600/ 600 - error: 0.00142 loss: 0.00469 time 3009ms
epoch  45/50 batch  600/ 600 - error: 0.00098 loss: 0.00356 time 3055ms
epoch  46/50 batch  600/ 600 - error: 0.00127 loss: 0.00409 time 3076ms
epoch  47/50 batch  600/ 600 - error: 0.00132 loss: 0.00438 time 3068ms
epoch  48/50 batch  600/ 600 - error: 0.00130 loss: 0.00459 time 3045ms
epoch  49/50 batch  600/ 600 - error: 0.00107 loss: 0.00365 time 3103ms
Restore the best (error) weights from epoch 45
Training took 160s

Evaluation Results
   error: 0.01740
    loss: 0.07861
evaluation took 67ms

 | %        | Timer                         | Count  | Total     | Average   |
 | 100.000% | net:train:ft                  | 1      | 160.183s  | 160.183s  |
 | 100.000% | net:trainer:train             | 1      | 160.183s  | 160.183s  |
 |  99.997% | net:trainer:train:epoch       | 50     | 160.178s  | 3.20356s  |
 |  84.422% | net:trainer:train:epoch:batch | 30000  | 135.229s  | 4.50764ms |
 |  84.261% | sgd::train_batch              | 30000  | 134.971s  | 4.49904ms |
 |  44.404% | sgd::grad                     | 30000  | 71.1271s  | 2.3709ms  |
 |  35.453% | sgd::forward                  | 30000  | 56.7893s  | 1.89298ms |
 |  32.245% | sgd::update_weights           | 90000  | 51.6505s  | 573.894us |
 |  32.226% | sgd::apply_grad:nadam         | 180000 | 51.6211s  | 286.783us |
 |  28.399% | dense:dyn:forward             | 180300 | 45.4903s  | 252.303us |
 |  17.642% | dropout:train:forward         | 60000  | 28.2595s  | 470.99us  |
 |  13.707% | net:trainer:train:epoch:error | 50     | 21.957s   | 439.14ms  |
 |  12.148% | dense:dyn:gradients           | 90000  | 19.4587s  | 216.207us |
 |   4.299% | sgd::backward                 | 30000  | 6.88546s  | 229.515us |
 |   3.301% | dense:dyn:backward            | 60000  | 5.28729s  | 88.121us  |
 |   0.560% | dense:dyn:errors              | 60000  | 896.471ms | 14.941us  |
 |   0.407% | dropout:backward              | 60000  | 651.523ms | 10.858us  |
 |   0.339% | dropout:test:forward          | 60000  | 542.799ms | 9.046us   |
 |   0.161% | net:compute_loss:CCE          | 60100  | 257.915ms | 4.291us   |
 |   0.099% | sgd::error                    | 30000  | 158.33ms  | 5.277us   |

I hope this will make the output of the machine learning framework more useful.

All this support is now in the master branch of the DLL project if you want to check it out. You can also check out the example online: mnist_mlp.cpp

You can access the project on Github.


Inventor on four new research patents

During the first years of my thesis I worked on CTI research project with the American company Verisign, which has also an office near my school. A CTI research project is a project that is partially funded by the Commission on Innovation and Technology (CTI) where a school and a company work together. I was quite lucky to work on this project with the awesome people at Verisign Fribourg. After the success of the project, Verisign filled several patents regarding various points of the projects.

I'm quite happy now that these four patents are now approved and published. They They have been approved by both the United States Patent and Trademark Office (USPTO) and European Patent Office (EPO). The parents have been cl=¬ aimed by Verisign, I'm only one of the inventor, I got no claim on the patent. But it's still a great thing.

Here are the names of the four patents:

  • Systems and methods for automatic phonetization of domain names¬
  • Construction of phonetic representation of a string of characters¬
  • Method for writing a foreign language in a pseudo language phonetically resembling native language of the speaker¬
  • Construction of a phonetic representation of a generated string of characters¬

You can take a look at them on USPTO or EPO or on Google Patents, but the way a patent is written make it relatively hard to follow, it's more on a lawyer level or maybe I'm simply not used to patents anymore.

All these patents come from the research done during the CTI project with Verisign. In this project, name suggestions were generated from the phonetic sound of the name. The idea being to generate names that sounds the same as another input (airmix could become rmix or rmics). We are using various technologies to make this work: IG-Tree, Viterbi and HMM. And since we used a model with an encoder and a decoder, we can also mix languages. For instance, write something in French the way a English work would work (for instance school could become scoule).

These patents concludes a very interesting and successful project. I'm now working on yet another CTI research project with Verisign and it will surely be as successful as the first one.


Initial support for Recurrent Neural Network (RNN) in DLL

I'm happy to announce that I just merged support for Recurrent Neural Networks (RNNs) into my Deep Learning Library (DLL) machine learning framework.

It's nothing fancy yet, but forward propagation of RNN and basic Backpropagation Through Time (BPTT) are now supported. For now, only existing classification loss is supported for RNN. I plan to add support for sequence-to-sequence loss in order to be able to train models able to generate characters, but I don't know when I'll be able to work on that. I also plan to add support for other types of cells such as LSTM and GRU (maybe NAS) in the future.

For example, here is a simple RNN used on MNIST:

#include "dll/neural/dense_layer.hpp"
#include "dll/neural/recurrent_layer.hpp"
#include "dll/neural/recurrent_last_layer.hpp"
#include "dll/network.hpp"
#include "dll/datasets.hpp"

int main(int /*argc*/, char* /*argv*/ []) {
    // Load the dataset
    auto dataset = dll::make_mnist_dataset_nc(dll::batch_size<100>{}, dll::scale_pre<255>{});

    constexpr size_t time_steps      = 28;
    constexpr size_t sequence_length = 28;
    constexpr size_t hidden_units    = 100;

    // Build the network

    using network_t = dll::dyn_network_desc<
            dll::recurrent_layer<time_steps, sequence_length, hidden_units, dll::last_only>,
            dll::recurrent_last_layer<time_steps, hidden_units>,
            dll::dense_layer<hidden_units, 10, dll::softmax>
        , dll::updater<dll::updater_type::ADAM>      // Adam
        , dll::batch_size<100>                       // The mini-batch size

    auto net = std::make_unique<network_t>();

    // Display the network and dataset

    // Train the network for performance sake
    net->fine_tune(dataset.train(), 50);

    // Test the network on test set

    return 0;

The network starts with recurrent layer, followed by a layer that extracts only the last layer and finally a dense layer with a softmax function. The recurrent layer has support to change the activation function, change the initializer for the two weights matrices of the RNN and the number of steps for BPTT truncation.

Here is a possible result:

Network with 3 layers
    RNN(dyn): 28x28 -> TANH -> 28x100
    RNN(last): 28x100 -> 100
    Dense(dyn): 100 -> SOFTMAX -> 10
Total parameters: 13800
Train the network with "Stochastic Gradient Descent"
    Updater: ADAM
 Early Stop: Goal(error)

With parameters:

Epoch   0/50 - Classification error: 0.11635 Loss: 0.39999 Time 4717ms
Epoch   1/50 - Classification error: 0.11303 Loss: 0.36994 Time 4702ms
Epoch   2/50 - Classification error: 0.06732 Loss: 0.23469 Time 4702ms
Epoch   3/50 - Classification error: 0.04865 Loss: 0.17091 Time 4696ms
Epoch   4/50 - Classification error: 0.05957 Loss: 0.20437 Time 4706ms
Epoch   5/50 - Classification error: 0.05022 Loss: 0.16888 Time 4696ms
Epoch   6/50 - Classification error: 0.03912 Loss: 0.13743 Time 4698ms
Epoch   7/50 - Classification error: 0.04097 Loss: 0.14509 Time 4706ms
Epoch   8/50 - Classification error: 0.03938 Loss: 0.13397 Time 4694ms
Epoch   9/50 - Classification error: 0.03525 Loss: 0.12284 Time 4706ms
Epoch  10/50 - Classification error: 0.03927 Loss: 0.13770 Time 4694ms
Epoch  11/50 - Classification error: 0.03315 Loss: 0.11315 Time 4711ms
Epoch  12/50 - Classification error: 0.05037 Loss: 0.17123 Time 4711ms
Epoch  13/50 - Classification error: 0.02927 Loss: 0.10042 Time 4780ms
Epoch  14/50 - Classification error: 0.03322 Loss: 0.11027 Time 4746ms
Epoch  15/50 - Classification error: 0.03397 Loss: 0.11585 Time 4684ms
Epoch  16/50 - Classification error: 0.02938 Loss: 0.09984 Time 4708ms
Epoch  17/50 - Classification error: 0.03262 Loss: 0.11152 Time 4690ms
Epoch  18/50 - Classification error: 0.02872 Loss: 0.09753 Time 4672ms
Epoch  19/50 - Classification error: 0.02548 Loss: 0.08605 Time 4691ms
Epoch  20/50 - Classification error: 0.02245 Loss: 0.07797 Time 4693ms
Epoch  21/50 - Classification error: 0.02705 Loss: 0.08984 Time 4684ms
Epoch  22/50 - Classification error: 0.02422 Loss: 0.08164 Time 4688ms
Epoch  23/50 - Classification error: 0.02645 Loss: 0.08804 Time 4690ms
Epoch  24/50 - Classification error: 0.02927 Loss: 0.09739 Time 4715ms
Epoch  25/50 - Classification error: 0.02578 Loss: 0.08669 Time 4702ms
Epoch  26/50 - Classification error: 0.02785 Loss: 0.09368 Time 4700ms
Epoch  27/50 - Classification error: 0.02472 Loss: 0.08237 Time 4695ms
Epoch  28/50 - Classification error: 0.02125 Loss: 0.07324 Time 4690ms
Epoch  29/50 - Classification error: 0.01977 Loss: 0.06635 Time 4688ms
Epoch  30/50 - Classification error: 0.03635 Loss: 0.12140 Time 4689ms
Epoch  31/50 - Classification error: 0.02862 Loss: 0.09704 Time 4698ms
Epoch  32/50 - Classification error: 0.02463 Loss: 0.08158 Time 4686ms
Epoch  33/50 - Classification error: 0.02565 Loss: 0.08771 Time 4697ms
Epoch  34/50 - Classification error: 0.02278 Loss: 0.07634 Time 4718ms
Epoch  35/50 - Classification error: 0.02105 Loss: 0.07075 Time 4697ms
Epoch  36/50 - Classification error: 0.02770 Loss: 0.09358 Time 4711ms
Epoch  37/50 - Classification error: 0.02627 Loss: 0.08805 Time 4742ms
Epoch  38/50 - Classification error: 0.02282 Loss: 0.07712 Time 4708ms
Epoch  39/50 - Classification error: 0.02305 Loss: 0.07661 Time 4697ms
Epoch  40/50 - Classification error: 0.02243 Loss: 0.07773 Time 4700ms
Epoch  41/50 - Classification error: 0.02467 Loss: 0.08234 Time 4712ms
Epoch  42/50 - Classification error: 0.01808 Loss: 0.06186 Time 4691ms
Epoch  43/50 - Classification error: 0.02388 Loss: 0.07917 Time 4681ms
Epoch  44/50 - Classification error: 0.02162 Loss: 0.07508 Time 4699ms
Epoch  45/50 - Classification error: 0.01877 Loss: 0.06289 Time 4735ms
Epoch  46/50 - Classification error: 0.02263 Loss: 0.07969 Time 4764ms
Epoch  47/50 - Classification error: 0.02100 Loss: 0.07207 Time 4684ms
Epoch  48/50 - Classification error: 0.02425 Loss: 0.08076 Time 4752ms
Epoch  49/50 - Classification error: 0.02328 Loss: 0.07803 Time 4718ms
Restore the best (error) weights from epoch 42
Training took 235s
Evaluation Results
   error: 0.03000
    loss: 0.12260
evaluation took 245ms

Nothing fancy, but this example is not necessarily optimized.

All this support is now in the master branch of the DLL project if you want to check it out. You can also check out the example online: mnist_rnn.cpp

You can access the project on Github.


DLL New Features: Embeddings and Merge layers

I've just finished integrating new features into DLL, my deep learning library. I've added support for an embeddings layer, a group layer and a merge layer. This is not yet released, but available in the master branch.

Embeddings are used more and more these days to learn dense representation of characters or word. An embedding layer in a neural network transform labels into a vector. It's generally used as the first layer of the network. The embedding are learned as part of the network.

The merge layer allows to create branches in the network. The input is passed to each sub layer and then the output of each layer is concatenated to form the output of the merged layers. This can be very useful to use different convolutional filter sizes.

The group layer is a simple utility to group layers together. This is mostly to use with merge layers to form several branches.

I've put together a new example to use these features on text classification. The dataset is totally synthetic for now, but this can easily be reproduced with a normal text classification dataset. This kind of model is called a Character Convolutional Neural Network.

Here is the code for example:

constexpr size_t embedding = 16; // The length of the embedding vector
constexpr size_t length = 15;    // The word (or sequence) length

using embedding_network_t = dll::dyn_network_desc<
        // The embedding layer
        dll::embedding_layer<26, length, embedding>

        // The convolutional layers
        , dll::merge_layer<
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 3, embedding>
                , dll::mp_2d_layer<16, length - 3 + 1, 1, length - 3 + 1, 1>
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 4, embedding>
                , dll::mp_2d_layer<16, length - 4 + 1, 1, length - 4 + 1, 1>
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 5, embedding>
                , dll::mp_2d_layer<16, length - 5 + 1, 1, length - 5 + 1, 1>

        // The final softmax layer
        , dll::dense_layer<48, 10, dll::softmax>
    , dll::updater<dll::updater_type::NADAM>     // Nesterov Adam (NADAM)
    , dll::batch_size<50>                        // The mini-batch size
    , dll::shuffle                               // Shuffle before each epoch

auto net = std::make_unique<embedding_network_t>();

// Display the network and dataset

// Train the network for performance sake
net->fine_tune(samples, labels, 50);

// Test the network on train set
net->evaluate(samples, labels);

The network starts with an embedding layer. The embedding is then passed to three convolutional layers with different filter sizes, each followed by a pooling layer. The outputs of the three layers are merged at the end of the merge layer. Finally, a softmax layer is used for classification.

This kind of model can be very powerful and is used regularly. These new features make for a much larger variety of models that can be build with the DLL library.

The full code with the dataset generation can be found online: char_cnn.cpp

The next feature I want to focus on is recurrent neural networks. I'll probably try a single RNN layer first and then upgrade to multi-layers and LSTM and maybe GRU.


I successfully defended my Ph.D.

I'm happy to announce that I've successfully defended my thesis "Deep Learning Features for Image Processing". After four years, I've defended it officially in front of the thesis committed last Friday and then again two days ago I've successfully publicly defended in front of my friends, family and colleagues.

I'm now a "Doctor of Philosophy in Computer Science :)

I will update my thesis with the last comments in November and send the final version to the university. At which point, I'll publish it on this website as well.


Budgetwarrior: Track assets and portfolio, savings rates and auto-completion

This last month, I've been reading quite a few blogs about personal finance and I've decided to integrate more features into budgetwarrior. This post is about three new features that I've integrated. It's not yet a new release, so if you want to test this version, you'll have to compile it from the master branch on Git.

As it was last time, the values on my screenshots have all been randomized.

If you have several assets with different distributions, I believe it is a great value to have them all shown at the same time. Especially if you want to change the distribution of your portfolio or if you plan big changes in it.

Track assets

The first feature I've added is a feature to precisely track each of your assets independently. And you can also track the allocation of your portfolio in terms of stocks, bonds and cash. The tool also lets you set the desired distribution of your assets and will compute the difference that you should make in order to comply to your desired distribution.

First, you need to define all your asset classes (your accounts, funds, and stocks, ...) and their distribution with budget asset add. It also supports to set a currency. The default currency is now CHF, but you can set it in the configuration file, for instance default_currency=USD. You can see your assets using budget asset:

View of your assets

You can then set the value of your assets using budget asset value add. The system will save all the values of your assets. For now, only the last value is used in the application to display. In the future, I plan to add new reports for evolution of the portfolio over time. You can see your current net worth with the budget asset value:

View of your portfolio

The different currencies will all be converted to the default currency.

Savings rate

The second change I did is to compute the savings rate of each month and year. The savings rate is simply the portion of your income that you are able to save each month. The savings rate for a year is simple the average of the savings rate of each month.

The savings rate of the month can be seen with budget overview month:

Savings rate of the month

The saving rates of each month can also be seen in the overview of the year with budget overview year:

Savings rate of the year

This shows the savings rate of each month, the average of the year and the average of the current year up to the current month.

The savings rate is a very important metric of your budget. In my case, it's currently way too low and made me realize I really need to save more. Any savings rate below 10% is too low. There are no rule as too much it should be, but I'd like to augment mine to at least 20% next year.


The last feature is mostly some quality-of-life improvement. Some of the inputs in the console can now be completed. It's not really auto-completion per se, but you can cycle through the list of possible values using the UP and DOWN.

This makes it much easier to set some values such as asset names (in budget asset value add for instance), account names and objective types and sources. I'm trying to make the input of values easier.


I don't know exactly what else will be integrated in this feature, but I may already improve some visualization for asset values. If I learn something new about personal finance that I may integrate in the tool, I'll do it as well.

If you are interested by the sources or want to install this version, you can download them on Github: budgetwarrior.

The new features are in the master branch.

If you have a suggestion for a new features or you found a bug, please post an issue on Github, I'd be glad to help you.

If you have any comment, don't hesitate to contact me, either by letting a comment on this post or by email.


Deep Learning Library 1.0 - Fast Neural Network Library

DLL Logo

I'm very happy to announce the release of the first version of Deep Learning Library (DLL) 1.0. DLL is a neural network library with a focus on speed and ease of use.

I started working on this library about 4 years ago for my Ph.D. thesis. I needed a good library to train and use Restricted Boltzmann Machines (RBMs) and at this time there was no good support for it. Therefore, I decided to write my own. It now has very complete support for the RBM and the Convolutional RBM (CRBM) models. Stacks of RBMs (or Deep Belief Networks (DBNs)) can be pretrained using Contrastive Divergence and then either fine-tuned with mini-batch gradient descent or Conjugate Gradient or used as a feature extractor. Over the years, the library has been extended to handle Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). The network is also able to train regular auto-encoders. Several advanced layers such as Dropout or Batch Normalization are also available as well as adaptive learning rates techniques such as Adadelta and Adam. The library also has integrated support for a few datasets: MNIST, CIFAR-10 and ImageNet.

This library can be used using a C++ interface. The library is fully header-only. It requires a C++14 compiler, which means a minimum of clang 3.9 or GCC 6.3.

In this post, I'm going to present a few examples on using the library and give some information about the performance of the library and the roadmap for the project.

Read more…


Expression Templates Library (ETL) 1.2 - Complete GPU support

ETL Logo

I'm happy to announce the version 1.2 of my Expression Templates Library (ETL): ETL 1.2, two months after I released the version 1.1. This version features much better GPU Support, a few new features and a lot of changes in the internal code.

GPU Support

Before, only algorithms such as 4D convolution or matrix-matrix multiplication were computed in the GPU and lots of operations were causing copies between CPU and GPU version. Now, the support for basic operations has also been completed and therefore, expressions like this:

C = sigmoid(2.0 * (A + B)) / sum(A)

Can be computed entirely on GPU.

Each matrix and vector containers have a secondary GPU memory space. During the execution, the status of both memory spaces is being managed and when necessary, copies are made between two spaces. In the best case, there should only be initial copies to the GPU and then everything should be done on the GPU. I've also considered using Unified Memory in place of this system, but this is a problem for fast matrix and I'd rather not have two different systems.

If you have an expression such as c = a + b * 2, it can be entirely computed on GPU, however, it will be computed in two GPU operations such as:

t1 = b * 2
c = a + t1

This is not perfect in terms of performance but this will be done without any copies between CPU and GPU memory. I plan to improve this system with a bit more complex operations to avoid too many GPU operations, but there will always be more operations than in CPU where this can easily be done in one go.

There are a few expressions that are not computable on the GPU, such as random generations. A few transformations are also not fully compatible with GPU. Moreover, if you access an element with operators [] or (), this will invalidate the GPU memory and force an update to the CPU memory.

GPU operations are not implemented directly in ETL, there are coming from various libraries. ETL is using NVIDIA CUDNN, CUFFT and CUDNN for most algorithms. Moreover, for other operations, I've implemented a libraries with simple GPU operations: ETL-GPU-BLAS (EGBLAS). You can have a look at egblas if you are interested.

My Deep Learning Library (DLL) project is based on ETL and its performances are mostly dependent on ETL's performances. Now that ETL fully supports GPU, the GPU performance of DLL is much improved. You may remember a few weeks ago I posted very high CPU performance of DLL. Now, I've run again the tests to see the GPU performance with DLL. Here is the performance for training a small CNN on the MNIST data set:

Performances for training a Convolutional Neural Network on MNIST

As you can see, the performances on GPU are now excellent. DLL's performances are on par with Tensorflow and Keras!

The next results are for training a much larger CNN on ImageNet, with the time necessary to train a single batch:

Performances for training a Convolutional Neural Network on Imagenet

Again, using the new version of ETL inside DLL has led to excellent performance. The framework is again on par with TensorFlow and Keras and faster than all the other frameworks. The large difference between DLL and Tensorflow and Keras is due to the inefficiency of reading the dataset in the two frameworks, so the performance of the three framework themselves are about the same.

Other Changes

The library also has a few other new features. Logarithms of base 2 and base 10 are now supported in complement to the base e that was already available before. Categorical Cross Entropy (CCE) computation is also available now, the CCE loss and error can be computed for one or many samples. Convolutions have also been improved in that you can use mixed types in both the image and the kernel and different storage order as well. Nevertheless, the most optimized version remains the version with the same storage order and the same data type.

I've also made a major change in the way implementations are selected for each operation. The tests and the benchmark are using a system to force the selection of an algorithm. This system is now disabled by default. This makes the compilation much faster by default. Since it's not necessary in most cases, this will help regular use cases of the library by compiling much faster.

Overall, the support for complex numbers has been improved in ETL. There are more routines that are supported and etl::complex is better supported throughout the code. I'll still work on this in the future to make it totally complete.

The internal code also has a few new changes. First, all traits have been rewritten to use variable templates instead of struct traits. This makes the code much nicer in my opinion. Moreover, I've started experimenting with C++17 if constexpr. Most of the if conditions that can be transformed to if constexpr have been annotated with comments that I can quickly enable or disable so that I can test the impact of C++17, especially on compilation time.

Finally, a few bugs have been fixed. ETL is now working better with parallel BLAS library. There should not be issues with double parallelization in ETL and BLAS. There was a slight bug in the Column-Major matrix-matrix multiplication kernel. Binary operations with different types in the left and right hand sides was also problematic with vectorization. The last bug was about GPU status in case ETL containers were moved.

What's next ?

I don't yet know exactly on which features I'm going to focus for the next version of ETL. I plan to focus a bit more in the near future on Deep Learning Library (DLL) for which I should release the version 1.0 soon. I also plan to start support for Recurrent Neural Networks on it, so that will take me quite some time.

Nevertheless, I'm still planning to consider the switch to C++17, since it is a bit faster to compile ETL with if constexpr. The next version of ETL will also probably have GPU-support for integers, at least in the cases that depend on the etl-gpu-blas library, which is the standard operators. I also plan to improve the support for complex numbers, especially in terms of performance and tests. Hopefully, I will have also time (and motivation) to start working on the sparse capabilities of ETL. It really needs much more unit tests and the performance should be improved as well.

Download ETL

You can download ETL on Github. If you only interested in the 1.2 version, you can look at the Releases pages or clone the tag 1.2. There are several branches:

  • master Is the eternal development branch, may not always be stable
  • stable Is a branch always pointing to the last tag, no development here

For the future release, there always will tags pointing to the corresponding commits. You can also have access to previous releases on Github or via the release tags.

The documentation is still a bit sparse. There are a few examples and the Wiki, but there still is work to be done. If you have questions on how to use or configure the library, please don't hesitate.

Don't hesitate to comment this post if you have any comment on this library or any question. You can also open an Issue on Github if you have a problem using this library or propose a Pull Request if you have any contribution you'd like to make to the library.

Hope this may be useful to some of you :)


C++11 Performance tip: Update on when to use std::pow ?

A few days ago, I published a post comparing the performance of std::pow against direct multiplications. When not compiling with -ffast-math, direct multiplication was significantly faster than std::pow, around two orders of magnitude faster when comparing x * x * x and code:std::pow(x, 3). One comment that I've got was to test for which n is code:std::pow(x, n) becoming faster than multiplying in a loop. Since std::pow is using a special algorithm to perform the computation rather than be simply loop-based multiplications, there may be a point after which it's more interesting to use the algorithm rather than a loop. So I decided to do the tests. You can also find the result in the original article, which I've updated.

First, our pow function:

double my_pow(double x, size_t n){
    double r = 1.0;

    while(n > 0){
        r *= x;

    return r;

And now, let's see the performance. I've compiled my benchmark with GCC 4.9.3 and running on my old Sandy Bridge processor. Here are the results for 1000 calls to each functions:

We can see that between n=100 and n=110, std::pow(x, n) starts to be faster than my_pow(x, n). At this point, you should only use std::pow(x, n). Interestingly too, the time for std::pow(x, n) is decreasing. Let's see how is the performance with higher range of n:

We can see that the pow function time still remains stable while our loop-based pow function still increases linearly. At n=1000, std::pow is one order of magnitude faster than my_pow.

Overall, if you do not care much about extreme accuracy, you may consider using you own pow function for small-ish (integer) n values. After n=100, it becomes more interesting to use std::pow.

If you want more results on the subject, you take a look at the original article.

If you are interested in the code of this benchmark, it's available online: bench_pow_my_pow.cpp


How I made my Deep Learning Library 38% faster to compile (Optimization and C++17 if constexpr)

My Deep Learning Library (DLL) project is a C++ library for training and using artificial neural networks (you can take a look at this post about DLL if you want more information).

While I made a lot of effort to make it as fast as possible to train and run neural networks, the compilation time has been steadily going up and is becoming quite annoying. This library is heavily templated and all the matrix operations are done using my Expression Templates Library (ETL) which is more than template-heavy itself.

In this post, I'll present two techniques with which I've been able to reduce the total compilation of the DLL unit tests by up to 38%.

Read more…