Skip to main content

Advanced GPU Patterns Optimization in ETL
   Posted:


The GPU performance of my Expression Templates Library (ETL) is pretty good when most of the time is spent inside expensive operations such as Matrix-Matrix Multiplication or convolutions. However, when most of the time is spent in linear kernels, performance is not great because this will invoke a lot of CUDA kernels. Indeed, the way it is done is that each sub expressions compute its result in a temporary GPU vector (or matrix) and these temporaries are passed through the expressions. For instance, this expression:

yy = 1.1 * x + 1.2 * y

will be executed on the GPU as something like this:

t1 = 1.1 * x
t2 = 1.2 * y
yy = t1 + t2

that will results in three GPU kernels being invoked. In the CPU case, the complete expression will be executed as one CPU kernel, that is constructed with Expression Templates. Unfortunately, a CUDA kernel cannot be constructed in the same way since the CUDA compiler does not support general template metaprogramming. That's why I've implemented by using small kernels for each expression.

Fortunately, we can do better with a bit more meta-programming. Indeed, there are some patterns that are repeated a lot and that easily be implemented in CUDA kernels. I've started detecting a few of these patterns and for each of them a single CUDA kernel is executed. For instance, each of the following expressions can be executed with a single kernel:

yy = 1.1 * x + y
yy = x + 1.1 * y
yy = 1.1 * y + 1.2 * y
yy = 1.1 * x * y
yy = x / (1.1 * y)

This results in significantly performance improvement for these expressions!

I have tested these new improvements in my Deep Learning Library (DLL) project (not yet merged) and it resulted in 25% faster momentum computation and 17% faster Nesterov Adam (NADAM).

I'm going to continue to investigate which kernels need to be made faster for DLL and try to improve the overall performance. Currently, the GPU performance of DLL is very good for large convolutional networks, but could be improved for small fully-connected networks. Indeed, in that case, quite some time is spent outside the matrix-matrix multiplication and inside serial expressions for which GPU could be improved. Once I'm done with my optimizations, I'll probably post again on the blog with the latest results.

All these new optimizations are now in the master branch of the ETL project if you want to check it out. You can access the project on Github.

Comments

Initial support for Long Short Term Memory (LSTM) in DLL
   Posted:


I'm really happy to announce that I just merged support for

Long Short Term Memory (LSTM) cells into my Deep Learning Library (DLL) machine learning framework. Two weeks ago, I already merged suport for Recurrent Neural network (RNN).

It's nothing fancy yet, but forward propagation of LSTM and basic Backpropagation Through Time (BPTT) are now supported. It was not really complicated to implemenet the forward pass but the backward pass is much complicated for an LSTM than for a RNN. It took me quite a long time to figure out all the gradients formulas and the documentation on that is quite scarce.

For now, still only existing classification loss is supported for RNN and LSTM. As I said last time, I still plan to add support for sequence-to-sequence loss in order to be able to train models able to generate characters. However, I don't know when I'll be able to work on that. Now that I've got the code for LSTM, I should be able to implement a GRU cell and NAS cell quite easily I believe.

For example, here is a simple LSTM used on MNIST for classification:

#include "dll/neural/dense_layer.hpp"
#include "dll/neural/lstm_layer.hpp"
#include "dll/neural/recurrent_last_layer.hpp"
#include "dll/network.hpp"
#include "dll/datasets.hpp"

int main(int /*argc*/, char* /*argv*/ []) {
    // Load the dataset
    auto dataset = dll::make_mnist_dataset_nc(dll::batch_size<100>{}, dll::scale_pre<255>{});

    constexpr size_t time_steps      = 28;
    constexpr size_t sequence_length = 28;
    constexpr size_t hidden_units    = 100;

    // Build the network

    using network_t = dll::dyn_network_desc<
        dll::network_layers<
            dll::lstm_layer<time_steps, sequence_length, hidden_units, dll::last_only>,
            dll::recurrent_last_layer<time_steps, hidden_units>,
            dll::dense_layer<hidden_units, 10, dll::softmax>
        >
        , dll::updater<dll::updater_type::ADAM>      // Adam
        , dll::batch_size<100>                       // The mini-batch size
    >::network_t;

    auto net = std::make_unique<network_t>();

    // Display the network and dataset
    net->display();
    dataset.display();

    // Train the network for performance sake
    net->fine_tune(dataset.train(), 50);

    // Test the network on test set
    net->evaluate(dataset.test());

    return 0;
}

The network is quite similar to the one used previously with an RNN, just replace rnn with lstm and that's it. It starts with LSTM layer, followed by a layer extracting the last time step and finally a dense layer with a softmax function. The network is trained with Adam for 50 epochs. You can change the activation function , the initializer for the weights and the biases and number of steps for BPTT truncation.

Here is the result I got on my last run:

------------------------------------------------------------
| Index | Layer                | Parameters | Output Shape |
------------------------------------------------------------
| 0     | LSTM (TANH) (dyn)    |      51200 | [Bx28x100]   |
| 1     | RNN(last)            |          0 | [Bx100]      |
| 2     | Dense(SOFTMAX) (dyn) |       1000 | [Bx10]       |
------------------------------------------------------------
              Total Parameters:      52200

--------------------------------------------
| mnist | Size  | Batches | Augmented Size |
--------------------------------------------
| train | 60000 | 600     | 60000          |
| test  | 10000 | 100     | 10000          |
--------------------------------------------

Network with 3 layers
    LSTM(dyn): 28x28 -> TANH -> 28x100
    RNN(last): 28x100 -> 100
    Dense(dyn): 100 -> SOFTMAX -> 10
Total parameters: 52200
Dataset
Training: In-Memory Data Generator
              Size: 60000
           Batches: 600
Testing: In-Memory Data Generator
              Size: 10000
           Batches: 100

Train the network with "Stochastic Gradient Descent"
    Updater: ADAM
       Loss: CATEGORICAL_CROSS_ENTROPY
 Early Stop: Goal(error)

With parameters:
          epochs=50
      batch_size=100
   learning_rate=0.001
           beta1=0.9
           beta2=0.999

epoch   0/50 batch  600/ 600 - error: 0.07943 loss: 0.28504 time 20910ms
epoch   1/50 batch  600/ 600 - error: 0.06683 loss: 0.24021 time 20889ms
epoch   2/50 batch  600/ 600 - error: 0.04828 loss: 0.18233 time 21061ms
epoch   3/50 batch  600/ 600 - error: 0.04407 loss: 0.16665 time 20839ms
epoch   4/50 batch  600/ 600 - error: 0.03515 loss: 0.13290 time 22108ms
epoch   5/50 batch  600/ 600 - error: 0.03207 loss: 0.12019 time 21393ms
epoch   6/50 batch  600/ 600 - error: 0.02973 loss: 0.11239 time 28199ms
epoch   7/50 batch  600/ 600 - error: 0.02653 loss: 0.10455 time 37039ms
epoch   8/50 batch  600/ 600 - error: 0.02482 loss: 0.09657 time 23127ms
epoch   9/50 batch  600/ 600 - error: 0.02177 loss: 0.08422 time 41766ms
epoch  10/50 batch  600/ 600 - error: 0.02453 loss: 0.09382 time 29765ms
epoch  11/50 batch  600/ 600 - error: 0.02575 loss: 0.09796 time 21449ms
epoch  12/50 batch  600/ 600 - error: 0.02107 loss: 0.07833 time 42056ms
epoch  13/50 batch  600/ 600 - error: 0.01877 loss: 0.07171 time 24673ms
epoch  14/50 batch  600/ 600 - error: 0.02095 loss: 0.08481 time 20878ms
epoch  15/50 batch  600/ 600 - error: 0.02040 loss: 0.07578 time 41515ms
epoch  16/50 batch  600/ 600 - error: 0.01580 loss: 0.06083 time 25705ms
epoch  17/50 batch  600/ 600 - error: 0.01945 loss: 0.07046 time 20903ms
epoch  18/50 batch  600/ 600 - error: 0.01728 loss: 0.06683 time 41828ms
epoch  19/50 batch  600/ 600 - error: 0.01577 loss: 0.05947 time 27810ms
epoch  20/50 batch  600/ 600 - error: 0.01528 loss: 0.05883 time 21477ms
epoch  21/50 batch  600/ 600 - error: 0.01345 loss: 0.05127 time 44718ms
epoch  22/50 batch  600/ 600 - error: 0.01410 loss: 0.05357 time 25174ms
epoch  23/50 batch  600/ 600 - error: 0.01268 loss: 0.04765 time 23827ms
epoch  24/50 batch  600/ 600 - error: 0.01342 loss: 0.05004 time 47232ms
epoch  25/50 batch  600/ 600 - error: 0.01730 loss: 0.06872 time 22532ms
epoch  26/50 batch  600/ 600 - error: 0.01337 loss: 0.05016 time 30114ms
epoch  27/50 batch  600/ 600 - error: 0.01842 loss: 0.07049 time 40136ms
epoch  28/50 batch  600/ 600 - error: 0.01262 loss: 0.04639 time 21793ms
epoch  29/50 batch  600/ 600 - error: 0.01403 loss: 0.05292 time 34096ms
epoch  30/50 batch  600/ 600 - error: 0.01185 loss: 0.04456 time 35420ms
epoch  31/50 batch  600/ 600 - error: 0.01098 loss: 0.04180 time 20909ms
epoch  32/50 batch  600/ 600 - error: 0.01337 loss: 0.04687 time 30113ms
epoch  33/50 batch  600/ 600 - error: 0.01415 loss: 0.05292 time 37393ms
epoch  34/50 batch  600/ 600 - error: 0.00982 loss: 0.03615 time 20962ms
epoch  35/50 batch  600/ 600 - error: 0.01178 loss: 0.04830 time 29305ms
epoch  36/50 batch  600/ 600 - error: 0.00882 loss: 0.03408 time 38293ms
epoch  37/50 batch  600/ 600 - error: 0.01148 loss: 0.04341 time 20841ms
epoch  38/50 batch  600/ 600 - error: 0.00960 loss: 0.03701 time 29204ms
epoch  39/50 batch  600/ 600 - error: 0.00850 loss: 0.03094 time 39802ms
epoch  40/50 batch  600/ 600 - error: 0.01473 loss: 0.05136 time 20831ms
epoch  41/50 batch  600/ 600 - error: 0.01007 loss: 0.03579 time 29856ms
epoch  42/50 batch  600/ 600 - error: 0.00943 loss: 0.03370 time 38200ms
epoch  43/50 batch  600/ 600 - error: 0.01205 loss: 0.04409 time 21162ms
epoch  44/50 batch  600/ 600 - error: 0.00980 loss: 0.03674 time 32279ms
epoch  45/50 batch  600/ 600 - error: 0.01068 loss: 0.04133 time 38448ms
epoch  46/50 batch  600/ 600 - error: 0.00913 loss: 0.03478 time 20797ms
epoch  47/50 batch  600/ 600 - error: 0.00985 loss: 0.03759 time 28885ms
epoch  48/50 batch  600/ 600 - error: 0.00912 loss: 0.03295 time 41120ms
epoch  49/50 batch  600/ 600 - error: 0.00930 loss: 0.03438 time 21282ms
Restore the best (error) weights from epoch 39
Training took 1460s

Evaluation Results
   error: 0.02440
    loss: 0.11315
evaluation took 1000ms

Again, nothing fancy yet, but this example has not been optimized for performance nor for accuracy.

I also made a few changes to the RNN layer. I added support for biases and improved the code as well for performance and readability.

All this support is now in the master branch of the DLL project if you want to check it out. You can also check out the example online: mnist_lstm.cpp

You can access the project on Github.

Currently I'm working on the GPU performance again. The performance of some is still not as good as I want it to be, especially complex operation like used in Adam and Nadam. Currently, there are many calls to GPU BLAS libraries and I want to try to extract some more optimized patterns. Once it's done, I'll post more on that later on the blog.

Comments

DLL: Pretty printing and live output
   Posted:


I've improved a lot the display of my Deep Learning Library (DLL). I know this is generally not the most important point in a machine learning framework, but the first impression being important. Therefore, I decided it was time to get a nicer output in the console for training networks.

A network or a dataset can be displayed using the display() function. I've added a display_pretty() function to them to display it more nicely. I've also added the dll::dump_timers_nice() function to do the same for dll::dump_timers().

I've also improved the display for the results of the batches during training. Now, the display is updated every 100ms and it also displays the current estimated time until the end of the epoch. With that, the user should have a much better idea on what's going on during training, especially when training networks when the epochs are taking a long time to complete.

Here is a full output of the training of fully-connected network on MNIST (mnist_mlp.cpp <https://github.com/wichtounet/dll/blob/master/examples/src/mnist_mlp.cpp>):

 ------------------------------------------------------------
 | Index | Layer                | Parameters | Output Shape |
 ------------------------------------------------------------
 | 0     | Dense(SIGMOID) (dyn) |     392000 | [Bx500]      |
 | 1     | Dropout(0.50)(dyn)   |          0 | [Bx500]      |
 | 2     | Dense(SIGMOID) (dyn) |     125000 | [Bx250]      |
 | 3     | Dropout(0.50)(dyn)   |          0 | [Bx250]      |
 | 4     | Dense(SOFTMAX) (dyn) |       2500 | [Bx10]       |
 ------------------------------------------------------------
                Total Parameters:     519500

 --------------------------------------------
 | mnist | Size  | Batches | Augmented Size |
 --------------------------------------------
 | train | 60000 | 600     | 60000          |
 | test  | 10000 | 100     | 10000          |
 --------------------------------------------

Train the network with "Stochastic Gradient Descent"
    Updater: NADAM
       Loss: CATEGORICAL_CROSS_ENTROPY
 Early Stop: Goal(error)

With parameters:
          epochs=50
      batch_size=100
   learning_rate=0.002
           beta1=0.9
           beta2=0.999

epoch   0/50 batch  600/ 600 - error: 0.04623 loss: 0.15097 time 3230ms
epoch   1/50 batch  600/ 600 - error: 0.03013 loss: 0.09947 time 3188ms
epoch   2/50 batch  600/ 600 - error: 0.02048 loss: 0.06565 time 3102ms
epoch   3/50 batch  600/ 600 - error: 0.01593 loss: 0.05258 time 3189ms
epoch   4/50 batch  600/ 600 - error: 0.01422 loss: 0.04623 time 3160ms
epoch   5/50 batch  600/ 600 - error: 0.01112 loss: 0.03660 time 3131ms
epoch   6/50 batch  600/ 600 - error: 0.01078 loss: 0.03546 time 3200ms
epoch   7/50 batch  600/ 600 - error: 0.01003 loss: 0.03184 time 3246ms
epoch   8/50 batch  600/ 600 - error: 0.00778 loss: 0.02550 time 3222ms
epoch   9/50 batch  600/ 600 - error: 0.00782 loss: 0.02505 time 3119ms
epoch  10/50 batch  600/ 600 - error: 0.00578 loss: 0.02056 time 3284ms
epoch  11/50 batch  600/ 600 - error: 0.00618 loss: 0.02045 time 3220ms
epoch  12/50 batch  600/ 600 - error: 0.00538 loss: 0.01775 time 3444ms
epoch  13/50 batch  600/ 600 - error: 0.00563 loss: 0.01803 time 3304ms
epoch  14/50 batch  600/ 600 - error: 0.00458 loss: 0.01598 time 3577ms
epoch  15/50 batch  600/ 600 - error: 0.00437 loss: 0.01436 time 3228ms
epoch  16/50 batch  600/ 600 - error: 0.00360 loss: 0.01214 time 3180ms
epoch  17/50 batch  600/ 600 - error: 0.00405 loss: 0.01309 time 3090ms
epoch  18/50 batch  600/ 600 - error: 0.00408 loss: 0.01346 time 3045ms
epoch  19/50 batch  600/ 600 - error: 0.00337 loss: 0.01153 time 3071ms
epoch  20/50 batch  600/ 600 - error: 0.00297 loss: 0.01021 time 3131ms
epoch  21/50 batch  600/ 600 - error: 0.00318 loss: 0.01103 time 3076ms
epoch  22/50 batch  600/ 600 - error: 0.00277 loss: 0.00909 time 3090ms
epoch  23/50 batch  600/ 600 - error: 0.00242 loss: 0.00818 time 3163ms
epoch  24/50 batch  600/ 600 - error: 0.00267 loss: 0.00913 time 3229ms
epoch  25/50 batch  600/ 600 - error: 0.00295 loss: 0.00947 time 3156ms
epoch  26/50 batch  600/ 600 - error: 0.00252 loss: 0.00809 time 3066ms
epoch  27/50 batch  600/ 600 - error: 0.00227 loss: 0.00773 time 3156ms
epoch  28/50 batch  600/ 600 - error: 0.00203 loss: 0.00728 time 3158ms
epoch  29/50 batch  600/ 600 - error: 0.00240 loss: 0.00753 time 3114ms
epoch  30/50 batch  600/ 600 - error: 0.00263 loss: 0.00864 time 3099ms
epoch  31/50 batch  600/ 600 - error: 0.00210 loss: 0.00675 time 3096ms
epoch  32/50 batch  600/ 600 - error: 0.00163 loss: 0.00628 time 3120ms
epoch  33/50 batch  600/ 600 - error: 0.00182 loss: 0.00611 time 3045ms
epoch  34/50 batch  600/ 600 - error: 0.00125 loss: 0.00468 time 3140ms
epoch  35/50 batch  600/ 600 - error: 0.00183 loss: 0.00598 time 3093ms
epoch  36/50 batch  600/ 600 - error: 0.00232 loss: 0.00711 time 3068ms
epoch  37/50 batch  600/ 600 - error: 0.00170 loss: 0.00571 time 3057ms
epoch  38/50 batch  600/ 600 - error: 0.00162 loss: 0.00530 time 3115ms
epoch  39/50 batch  600/ 600 - error: 0.00155 loss: 0.00513 time 3226ms
epoch  40/50 batch  600/ 600 - error: 0.00150 loss: 0.00501 time 2987ms
epoch  41/50 batch  600/ 600 - error: 0.00122 loss: 0.00425 time 3117ms
epoch  42/50 batch  600/ 600 - error: 0.00108 loss: 0.00383 time 3102ms
epoch  43/50 batch  600/ 600 - error: 0.00165 loss: 0.00533 time 2977ms
epoch  44/50 batch  600/ 600 - error: 0.00142 loss: 0.00469 time 3009ms
epoch  45/50 batch  600/ 600 - error: 0.00098 loss: 0.00356 time 3055ms
epoch  46/50 batch  600/ 600 - error: 0.00127 loss: 0.00409 time 3076ms
epoch  47/50 batch  600/ 600 - error: 0.00132 loss: 0.00438 time 3068ms
epoch  48/50 batch  600/ 600 - error: 0.00130 loss: 0.00459 time 3045ms
epoch  49/50 batch  600/ 600 - error: 0.00107 loss: 0.00365 time 3103ms
Restore the best (error) weights from epoch 45
Training took 160s

Evaluation Results
   error: 0.01740
    loss: 0.07861
evaluation took 67ms

 -----------------------------------------------------------------------------
 | %        | Timer                         | Count  | Total     | Average   |
 -----------------------------------------------------------------------------
 | 100.000% | net:train:ft                  | 1      | 160.183s  | 160.183s  |
 | 100.000% | net:trainer:train             | 1      | 160.183s  | 160.183s  |
 |  99.997% | net:trainer:train:epoch       | 50     | 160.178s  | 3.20356s  |
 |  84.422% | net:trainer:train:epoch:batch | 30000  | 135.229s  | 4.50764ms |
 |  84.261% | sgd::train_batch              | 30000  | 134.971s  | 4.49904ms |
 |  44.404% | sgd::grad                     | 30000  | 71.1271s  | 2.3709ms  |
 |  35.453% | sgd::forward                  | 30000  | 56.7893s  | 1.89298ms |
 |  32.245% | sgd::update_weights           | 90000  | 51.6505s  | 573.894us |
 |  32.226% | sgd::apply_grad:nadam         | 180000 | 51.6211s  | 286.783us |
 |  28.399% | dense:dyn:forward             | 180300 | 45.4903s  | 252.303us |
 |  17.642% | dropout:train:forward         | 60000  | 28.2595s  | 470.99us  |
 |  13.707% | net:trainer:train:epoch:error | 50     | 21.957s   | 439.14ms  |
 |  12.148% | dense:dyn:gradients           | 90000  | 19.4587s  | 216.207us |
 |   4.299% | sgd::backward                 | 30000  | 6.88546s  | 229.515us |
 |   3.301% | dense:dyn:backward            | 60000  | 5.28729s  | 88.121us  |
 |   0.560% | dense:dyn:errors              | 60000  | 896.471ms | 14.941us  |
 |   0.407% | dropout:backward              | 60000  | 651.523ms | 10.858us  |
 |   0.339% | dropout:test:forward          | 60000  | 542.799ms | 9.046us   |
 |   0.161% | net:compute_loss:CCE          | 60100  | 257.915ms | 4.291us   |
 |   0.099% | sgd::error                    | 30000  | 158.33ms  | 5.277us   |
 -----------------------------------------------------------------------------

I hope this will make the output of the machine learning framework more useful.

All this support is now in the master branch of the DLL project if you want to check it out. You can also check out the example online: mnist_mlp.cpp

You can access the project on Github.

Comments

Inventor on four new research patents
   Posted:


During the first years of my thesis I worked on CTI research project with the American company Verisign, which has also an office near my school. A CTI research project is a project that is partially funded by the Commission on Innovation and Technology (CTI) where a school and a company work together. I was quite lucky to work on this project with the awesome people at Verisign Fribourg. After the success of the project, Verisign filled several patents regarding various points of the projects.

I'm quite happy now that these four patents are now approved and published. They They have been approved by both the United States Patent and Trademark Office (USPTO) and European Patent Office (EPO). The parents have been cl=¬ aimed by Verisign, I'm only one of the inventor, I got no claim on the patent. But it's still a great thing.

Here are the names of the four patents:

  • Systems and methods for automatic phonetization of domain names¬
  • Construction of phonetic representation of a string of characters¬
  • Method for writing a foreign language in a pseudo language phonetically resembling native language of the speaker¬
  • Construction of a phonetic representation of a generated string of characters¬

You can take a look at them on USPTO or EPO or on Google Patents, but the way a patent is written make it relatively hard to follow, it's more on a lawyer level or maybe I'm simply not used to patents anymore.

All these patents come from the research done during the CTI project with Verisign. In this project, name suggestions were generated from the phonetic sound of the name. The idea being to generate names that sounds the same as another input (airmix could become rmix or rmics). We are using various technologies to make this work: IG-Tree, Viterbi and HMM. And since we used a model with an encoder and a decoder, we can also mix languages. For instance, write something in French the way a English work would work (for instance school could become scoule).

These patents concludes a very interesting and successful project. I'm now working on yet another CTI research project with Verisign and it will surely be as successful as the first one.

Comments

Initial support for Recurrent Neural Network (RNN) in DLL
   Posted:


I'm happy to announce that I just merged support for Recurrent Neural Networks (RNNs) into my Deep Learning Library (DLL) machine learning framework.

It's nothing fancy yet, but forward propagation of RNN and basic Backpropagation Through Time (BPTT) are now supported. For now, only existing classification loss is supported for RNN. I plan to add support for sequence-to-sequence loss in order to be able to train models able to generate characters, but I don't know when I'll be able to work on that. I also plan to add support for other types of cells such as LSTM and GRU (maybe NAS) in the future.

For example, here is a simple RNN used on MNIST:

#include "dll/neural/dense_layer.hpp"
#include "dll/neural/recurrent_layer.hpp"
#include "dll/neural/recurrent_last_layer.hpp"
#include "dll/network.hpp"
#include "dll/datasets.hpp"

int main(int /*argc*/, char* /*argv*/ []) {
    // Load the dataset
    auto dataset = dll::make_mnist_dataset_nc(dll::batch_size<100>{}, dll::scale_pre<255>{});

    constexpr size_t time_steps      = 28;
    constexpr size_t sequence_length = 28;
    constexpr size_t hidden_units    = 100;

    // Build the network

    using network_t = dll::dyn_network_desc<
        dll::network_layers<
            dll::recurrent_layer<time_steps, sequence_length, hidden_units, dll::last_only>,
            dll::recurrent_last_layer<time_steps, hidden_units>,
            dll::dense_layer<hidden_units, 10, dll::softmax>
        >
        , dll::updater<dll::updater_type::ADAM>      // Adam
        , dll::batch_size<100>                       // The mini-batch size
    >::network_t;

    auto net = std::make_unique<network_t>();

    // Display the network and dataset
    net->display();

    // Train the network for performance sake
    net->fine_tune(dataset.train(), 50);

    // Test the network on test set
    net->evaluate(dataset.test());

    return 0;
}

The network starts with recurrent layer, followed by a layer that extracts only the last layer and finally a dense layer with a softmax function. The recurrent layer has support to change the activation function, change the initializer for the two weights matrices of the RNN and the number of steps for BPTT truncation.

Here is a possible result:

Network with 3 layers
    RNN(dyn): 28x28 -> TANH -> 28x100
    RNN(last): 28x100 -> 100
    Dense(dyn): 100 -> SOFTMAX -> 10
Total parameters: 13800
Train the network with "Stochastic Gradient Descent"
    Updater: ADAM
       Loss: CATEGORICAL_CROSS_ENTROPY
 Early Stop: Goal(error)

With parameters:
          epochs=50
      batch_size=100
   learning_rate=0.001
           beta1=0.9
           beta2=0.999

Epoch   0/50 - Classification error: 0.11635 Loss: 0.39999 Time 4717ms
Epoch   1/50 - Classification error: 0.11303 Loss: 0.36994 Time 4702ms
Epoch   2/50 - Classification error: 0.06732 Loss: 0.23469 Time 4702ms
Epoch   3/50 - Classification error: 0.04865 Loss: 0.17091 Time 4696ms
Epoch   4/50 - Classification error: 0.05957 Loss: 0.20437 Time 4706ms
Epoch   5/50 - Classification error: 0.05022 Loss: 0.16888 Time 4696ms
Epoch   6/50 - Classification error: 0.03912 Loss: 0.13743 Time 4698ms
Epoch   7/50 - Classification error: 0.04097 Loss: 0.14509 Time 4706ms
Epoch   8/50 - Classification error: 0.03938 Loss: 0.13397 Time 4694ms
Epoch   9/50 - Classification error: 0.03525 Loss: 0.12284 Time 4706ms
Epoch  10/50 - Classification error: 0.03927 Loss: 0.13770 Time 4694ms
Epoch  11/50 - Classification error: 0.03315 Loss: 0.11315 Time 4711ms
Epoch  12/50 - Classification error: 0.05037 Loss: 0.17123 Time 4711ms
Epoch  13/50 - Classification error: 0.02927 Loss: 0.10042 Time 4780ms
Epoch  14/50 - Classification error: 0.03322 Loss: 0.11027 Time 4746ms
Epoch  15/50 - Classification error: 0.03397 Loss: 0.11585 Time 4684ms
Epoch  16/50 - Classification error: 0.02938 Loss: 0.09984 Time 4708ms
Epoch  17/50 - Classification error: 0.03262 Loss: 0.11152 Time 4690ms
Epoch  18/50 - Classification error: 0.02872 Loss: 0.09753 Time 4672ms
Epoch  19/50 - Classification error: 0.02548 Loss: 0.08605 Time 4691ms
Epoch  20/50 - Classification error: 0.02245 Loss: 0.07797 Time 4693ms
Epoch  21/50 - Classification error: 0.02705 Loss: 0.08984 Time 4684ms
Epoch  22/50 - Classification error: 0.02422 Loss: 0.08164 Time 4688ms
Epoch  23/50 - Classification error: 0.02645 Loss: 0.08804 Time 4690ms
Epoch  24/50 - Classification error: 0.02927 Loss: 0.09739 Time 4715ms
Epoch  25/50 - Classification error: 0.02578 Loss: 0.08669 Time 4702ms
Epoch  26/50 - Classification error: 0.02785 Loss: 0.09368 Time 4700ms
Epoch  27/50 - Classification error: 0.02472 Loss: 0.08237 Time 4695ms
Epoch  28/50 - Classification error: 0.02125 Loss: 0.07324 Time 4690ms
Epoch  29/50 - Classification error: 0.01977 Loss: 0.06635 Time 4688ms
Epoch  30/50 - Classification error: 0.03635 Loss: 0.12140 Time 4689ms
Epoch  31/50 - Classification error: 0.02862 Loss: 0.09704 Time 4698ms
Epoch  32/50 - Classification error: 0.02463 Loss: 0.08158 Time 4686ms
Epoch  33/50 - Classification error: 0.02565 Loss: 0.08771 Time 4697ms
Epoch  34/50 - Classification error: 0.02278 Loss: 0.07634 Time 4718ms
Epoch  35/50 - Classification error: 0.02105 Loss: 0.07075 Time 4697ms
Epoch  36/50 - Classification error: 0.02770 Loss: 0.09358 Time 4711ms
Epoch  37/50 - Classification error: 0.02627 Loss: 0.08805 Time 4742ms
Epoch  38/50 - Classification error: 0.02282 Loss: 0.07712 Time 4708ms
Epoch  39/50 - Classification error: 0.02305 Loss: 0.07661 Time 4697ms
Epoch  40/50 - Classification error: 0.02243 Loss: 0.07773 Time 4700ms
Epoch  41/50 - Classification error: 0.02467 Loss: 0.08234 Time 4712ms
Epoch  42/50 - Classification error: 0.01808 Loss: 0.06186 Time 4691ms
Epoch  43/50 - Classification error: 0.02388 Loss: 0.07917 Time 4681ms
Epoch  44/50 - Classification error: 0.02162 Loss: 0.07508 Time 4699ms
Epoch  45/50 - Classification error: 0.01877 Loss: 0.06289 Time 4735ms
Epoch  46/50 - Classification error: 0.02263 Loss: 0.07969 Time 4764ms
Epoch  47/50 - Classification error: 0.02100 Loss: 0.07207 Time 4684ms
Epoch  48/50 - Classification error: 0.02425 Loss: 0.08076 Time 4752ms
Epoch  49/50 - Classification error: 0.02328 Loss: 0.07803 Time 4718ms
Restore the best (error) weights from epoch 42
Training took 235s
Evaluation Results
   error: 0.03000
    loss: 0.12260
evaluation took 245ms

Nothing fancy, but this example is not necessarily optimized.

All this support is now in the master branch of the DLL project if you want to check it out. You can also check out the example online: mnist_rnn.cpp

You can access the project on Github.

Comments

DLL New Features: Embeddings and Merge layers
   Posted:


I've just finished integrating new features into DLL, my deep learning library. I've added support for an embeddings layer, a group layer and a merge layer. This is not yet released, but available in the master branch.

Embeddings are used more and more these days to learn dense representation of characters or word. An embedding layer in a neural network transform labels into a vector. It's generally used as the first layer of the network. The embedding are learned as part of the network.

The merge layer allows to create branches in the network. The input is passed to each sub layer and then the output of each layer is concatenated to form the output of the merged layers. This can be very useful to use different convolutional filter sizes.

The group layer is a simple utility to group layers together. This is mostly to use with merge layers to form several branches.

I've put together a new example to use these features on text classification. The dataset is totally synthetic for now, but this can easily be reproduced with a normal text classification dataset. This kind of model is called a Character Convolutional Neural Network.

Here is the code for example:

constexpr size_t embedding = 16; // The length of the embedding vector
constexpr size_t length = 15;    // The word (or sequence) length

using embedding_network_t = dll::dyn_network_desc<
    dll::network_layers<
        // The embedding layer
        dll::embedding_layer<26, length, embedding>

        // The convolutional layers
        , dll::merge_layer<
            0
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 3, embedding>
                , dll::mp_2d_layer<16, length - 3 + 1, 1, length - 3 + 1, 1>
            >
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 4, embedding>
                , dll::mp_2d_layer<16, length - 4 + 1, 1, length - 4 + 1, 1>
            >
            , dll::group_layer<
                  dll::conv_layer<1, length, embedding, 16, 5, embedding>
                , dll::mp_2d_layer<16, length - 5 + 1, 1, length - 5 + 1, 1>
            >
        >

        // The final softmax layer
        , dll::dense_layer<48, 10, dll::softmax>
    >
    , dll::updater<dll::updater_type::NADAM>     // Nesterov Adam (NADAM)
    , dll::batch_size<50>                        // The mini-batch size
    , dll::shuffle                               // Shuffle before each epoch
>::network_t;

auto net = std::make_unique<embedding_network_t>();

// Display the network and dataset
net->display();

// Train the network for performance sake
net->fine_tune(samples, labels, 50);

// Test the network on train set
net->evaluate(samples, labels);

The network starts with an embedding layer. The embedding is then passed to three convolutional layers with different filter sizes, each followed by a pooling layer. The outputs of the three layers are merged at the end of the merge layer. Finally, a softmax layer is used for classification.

This kind of model can be very powerful and is used regularly. These new features make for a much larger variety of models that can be build with the DLL library.

The full code with the dataset generation can be found online: char_cnn.cpp

The next feature I want to focus on is recurrent neural networks. I'll probably try a single RNN layer first and then upgrade to multi-layers and LSTM and maybe GRU.

Comments

I successfully defended my Ph.D.
   Posted:


I'm happy to announce that I've successfully defended my thesis "Deep Learning Features for Image Processing". After four years, I've defended it officially in front of the thesis committed last Friday and then again two days ago I've successfully publicly defended in front of my friends, family and colleagues.

I'm now a "Doctor of Philosophy in Computer Science :)

I will update my thesis with the last comments in November and send the final version to the university. At which point, I'll publish it on this website as well.

Comments

Budgetwarrior: Track assets and portfolio, savings rates and auto-completion
   Posted:


This last month, I've been reading quite a few blogs about personal finance and I've decided to integrate more features into budgetwarrior. This post is about three new features that I've integrated. It's not yet a new release, so if you want to test this version, you'll have to compile it from the master branch on Git.

As it was last time, the values on my screenshots have all been randomized.

If you have several assets with different distributions, I believe it is a great value to have them all shown at the same time. Especially if you want to change the distribution of your portfolio or if you plan big changes in it.

Track assets

The first feature I've added is a feature to precisely track each of your assets independently. And you can also track the allocation of your portfolio in terms of stocks, bonds and cash. The tool also lets you set the desired distribution of your assets and will compute the difference that you should make in order to comply to your desired distribution.

First, you need to define all your asset classes (your accounts, funds, and stocks, ...) and their distribution with budget asset add. It also supports to set a currency. The default currency is now CHF, but you can set it in the configuration file, for instance default_currency=USD. You can see your assets using budget asset:

View of your assets

You can then set the value of your assets using budget asset value add. The system will save all the values of your assets. For now, only the last value is used in the application to display. In the future, I plan to add new reports for evolution of the portfolio over time. You can see your current net worth with the budget asset value:

View of your portfolio

The different currencies will all be converted to the default currency.

Savings rate

The second change I did is to compute the savings rate of each month and year. The savings rate is simply the portion of your income that you are able to save each month. The savings rate for a year is simple the average of the savings rate of each month.

The savings rate of the month can be seen with budget overview month:

Savings rate of the month

The saving rates of each month can also be seen in the overview of the year with budget overview year:

Savings rate of the year

This shows the savings rate of each month, the average of the year and the average of the current year up to the current month.

The savings rate is a very important metric of your budget. In my case, it's currently way too low and made me realize I really need to save more. Any savings rate below 10% is too low. There are no rule as too much it should be, but I'd like to augment mine to at least 20% next year.

Auto-completion

The last feature is mostly some quality-of-life improvement. Some of the inputs in the console can now be completed. It's not really auto-completion per se, but you can cycle through the list of possible values using the UP and DOWN.

This makes it much easier to set some values such as asset names (in budget asset value add for instance), account names and objective types and sources. I'm trying to make the input of values easier.

Conclusion

I don't know exactly what else will be integrated in this feature, but I may already improve some visualization for asset values. If I learn something new about personal finance that I may integrate in the tool, I'll do it as well.

If you are interested by the sources or want to install this version, you can download them on Github: budgetwarrior.

The new features are in the master branch.

If you have a suggestion for a new features or you found a bug, please post an issue on Github, I'd be glad to help you.

If you have any comment, don't hesitate to contact me, either by letting a comment on this post or by email.

Comments

Deep Learning Library 1.0 - Fast Neural Network Library
   Posted:


DLL Logo

I'm very happy to announce the release of the first version of Deep Learning Library (DLL) 1.0. DLL is a neural network library with a focus on speed and ease of use.

I started working on this library about 4 years ago for my Ph.D. thesis. I needed a good library to train and use Restricted Boltzmann Machines (RBMs) and at this time there was no good support for it. Therefore, I decided to write my own. It now has very complete support for the RBM and the Convolutional RBM (CRBM) models. Stacks of RBMs (or Deep Belief Networks (DBNs)) can be pretrained using Contrastive Divergence and then either fine-tuned with mini-batch gradient descent or Conjugate Gradient or used as a feature extractor. Over the years, the library has been extended to handle Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). The network is also able to train regular auto-encoders. Several advanced layers such as Dropout or Batch Normalization are also available as well as adaptive learning rates techniques such as Adadelta and Adam. The library also has integrated support for a few datasets: MNIST, CIFAR-10 and ImageNet.

This library can be used using a C++ interface. The library is fully header-only. It requires a C++14 compiler, which means a minimum of clang 3.9 or GCC 6.3.

In this post, I'm going to present a few examples on using the library and give some information about the performance of the library and the roadmap for the project.

Read more…

Comments

Expression Templates Library (ETL) 1.2 - Complete GPU support
   Posted:


ETL Logo

I'm happy to announce the version 1.2 of my Expression Templates Library (ETL): ETL 1.2, two months after I released the version 1.1. This version features much better GPU Support, a few new features and a lot of changes in the internal code.

GPU Support

Before, only algorithms such as 4D convolution or matrix-matrix multiplication were computed in the GPU and lots of operations were causing copies between CPU and GPU version. Now, the support for basic operations has also been completed and therefore, expressions like this:

C = sigmoid(2.0 * (A + B)) / sum(A)

Can be computed entirely on GPU.

Each matrix and vector containers have a secondary GPU memory space. During the execution, the status of both memory spaces is being managed and when necessary, copies are made between two spaces. In the best case, there should only be initial copies to the GPU and then everything should be done on the GPU. I've also considered using Unified Memory in place of this system, but this is a problem for fast matrix and I'd rather not have two different systems.

If you have an expression such as c = a + b * 2, it can be entirely computed on GPU, however, it will be computed in two GPU operations such as:

t1 = b * 2
c = a + t1

This is not perfect in terms of performance but this will be done without any copies between CPU and GPU memory. I plan to improve this system with a bit more complex operations to avoid too many GPU operations, but there will always be more operations than in CPU where this can easily be done in one go.

There are a few expressions that are not computable on the GPU, such as random generations. A few transformations are also not fully compatible with GPU. Moreover, if you access an element with operators [] or (), this will invalidate the GPU memory and force an update to the CPU memory.

GPU operations are not implemented directly in ETL, there are coming from various libraries. ETL is using NVIDIA CUDNN, CUFFT and CUDNN for most algorithms. Moreover, for other operations, I've implemented a libraries with simple GPU operations: ETL-GPU-BLAS (EGBLAS). You can have a look at egblas if you are interested.

My Deep Learning Library (DLL) project is based on ETL and its performances are mostly dependent on ETL's performances. Now that ETL fully supports GPU, the GPU performance of DLL is much improved. You may remember a few weeks ago I posted very high CPU performance of DLL. Now, I've run again the tests to see the GPU performance with DLL. Here is the performance for training a small CNN on the MNIST data set:

Performances for training a Convolutional Neural Network on MNIST

As you can see, the performances on GPU are now excellent. DLL's performances are on par with Tensorflow and Keras!

The next results are for training a much larger CNN on ImageNet, with the time necessary to train a single batch:

Performances for training a Convolutional Neural Network on Imagenet

Again, using the new version of ETL inside DLL has led to excellent performance. The framework is again on par with TensorFlow and Keras and faster than all the other frameworks. The large difference between DLL and Tensorflow and Keras is due to the inefficiency of reading the dataset in the two frameworks, so the performance of the three framework themselves are about the same.

Other Changes

The library also has a few other new features. Logarithms of base 2 and base 10 are now supported in complement to the base e that was already available before. Categorical Cross Entropy (CCE) computation is also available now, the CCE loss and error can be computed for one or many samples. Convolutions have also been improved in that you can use mixed types in both the image and the kernel and different storage order as well. Nevertheless, the most optimized version remains the version with the same storage order and the same data type.

I've also made a major change in the way implementations are selected for each operation. The tests and the benchmark are using a system to force the selection of an algorithm. This system is now disabled by default. This makes the compilation much faster by default. Since it's not necessary in most cases, this will help regular use cases of the library by compiling much faster.

Overall, the support for complex numbers has been improved in ETL. There are more routines that are supported and etl::complex is better supported throughout the code. I'll still work on this in the future to make it totally complete.

The internal code also has a few new changes. First, all traits have been rewritten to use variable templates instead of struct traits. This makes the code much nicer in my opinion. Moreover, I've started experimenting with C++17 if constexpr. Most of the if conditions that can be transformed to if constexpr have been annotated with comments that I can quickly enable or disable so that I can test the impact of C++17, especially on compilation time.

Finally, a few bugs have been fixed. ETL is now working better with parallel BLAS library. There should not be issues with double parallelization in ETL and BLAS. There was a slight bug in the Column-Major matrix-matrix multiplication kernel. Binary operations with different types in the left and right hand sides was also problematic with vectorization. The last bug was about GPU status in case ETL containers were moved.

What's next ?

I don't yet know exactly on which features I'm going to focus for the next version of ETL. I plan to focus a bit more in the near future on Deep Learning Library (DLL) for which I should release the version 1.0 soon. I also plan to start support for Recurrent Neural Networks on it, so that will take me quite some time.

Nevertheless, I'm still planning to consider the switch to C++17, since it is a bit faster to compile ETL with if constexpr. The next version of ETL will also probably have GPU-support for integers, at least in the cases that depend on the etl-gpu-blas library, which is the standard operators. I also plan to improve the support for complex numbers, especially in terms of performance and tests. Hopefully, I will have also time (and motivation) to start working on the sparse capabilities of ETL. It really needs much more unit tests and the performance should be improved as well.

Download ETL

You can download ETL on Github. If you only interested in the 1.2 version, you can look at the Releases pages or clone the tag 1.2. There are several branches:

  • master Is the eternal development branch, may not always be stable
  • stable Is a branch always pointing to the last tag, no development here

For the future release, there always will tags pointing to the corresponding commits. You can also have access to previous releases on Github or via the release tags.

The documentation is still a bit sparse. There are a few examples and the Wiki, but there still is work to be done. If you have questions on how to use or configure the library, please don't hesitate.

Don't hesitate to comment this post if you have any comment on this library or any question. You can also open an Issue on Github if you have a problem using this library or propose a Pull Request if you have any contribution you'd like to make to the library.

Hope this may be useful to some of you :)

Comments