
budgetwarrior 1.0.1: Allocation tracking, Retirement calculator and bug fixes

I'm happy to announce the release of budgetwarrior 1.0.1. This new version contains a series of improvements over the 1.0 version and some new features.

I haven't been very active this last month. I have been working a bit on budgetwarrior for features I needed for my budget. I've also been contacted with questions about my thor operating system, and since then I've been doing some work on thor as well.

This new version of budgetwarrior has quite a few new features even though it's a minor version.

Note: The data from all the views is totally randomized and does not make sense ;)

Retirement Calculator

The biggest novelty in this version is the addition of a retirement calculator. This is still very basic, but it may give information on how close (or far) you are from early retirement. Here is what the view gives you:

Retirement Status

Using your annual withdrawal rate and expected Rate Of Return, it can compute how many years you will need to reach your goal of Financial Independence (FI). It will also give you your FI ratio and some more information about your savings rate, income, expenses and so on. It's nothing very fancy but it can be very useful.
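To give an idea of the kind of computation involved, here is a minimal sketch in C++. This is not the actual code of budgetwarrior: the function names and the model (a fixed yearly savings amount, returns compounded once per year, a target equal to the yearly expenses divided by a strictly positive withdrawal rate) are assumptions made for the illustration.

```cpp
#include <cassert>

// Hypothetical helper: net worth needed so that withdrawing `withdrawal_rate`
// of it every year covers `yearly_expenses` (a 4% rate means 25x the expenses).
constexpr double fi_target(double yearly_expenses, double withdrawal_rate) {
    return yearly_expenses / withdrawal_rate;
}

// Hypothetical helper: how far the current net worth is from the target (1.0 = FI reached).
constexpr double fi_ratio(double net_worth, double yearly_expenses, double withdrawal_rate) {
    return net_worth / fi_target(yearly_expenses, withdrawal_rate);
}

// Hypothetical helper: number of full years needed to reach the target,
// assuming constant yearly savings and a constant rate of return.
// Returns -1 if the target is not reached within `max_years`.
inline int years_to_fi(double net_worth, double yearly_savings, double yearly_expenses,
                       double withdrawal_rate, double rate_of_return, int max_years = 100) {
    const double target = fi_target(yearly_expenses, withdrawal_rate);

    for (int year = 0; year <= max_years; ++year) {
        if (net_worth >= target) {
            return year;
        }

        // One more year: the existing capital grows, then the savings are added
        net_worth = net_worth * (1.0 + rate_of_return) + yearly_savings;
    }

    return -1;
}
```

With these assumptions, someone with no capital, no returns, 10000 of yearly savings and a target of 50000 needs five years, which is easy to check by hand.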

New features

I've also added a few graphs based on the budget information. The first is the visualization of the expenses over time:

Expenses over time graph

This can be pretty useful to see how your expenses are evolving. Even if your income is going up, expenses should not necessarily go up (you should save more!).

Another new view can show your asset allocation over time and the current asset allocation of your entire net worth or specifically for your portfolio.

Asset allocation

This is also really useful if you want to have a global view of your asset allocation into bonds, stocks and such.

There are also two other new minor features. You can now search expenses by name. This is really useful once you start having many expenses. Another new view is the Full aggregate view. Before, you could aggregate your expenses by month or year; now they can also be aggregated since the beginning of the budget. With this, you can see how much you have spent on coffee since you started keeping track of your budget. For me, it's a lot! Both these features are available in the command line and in the web interface.


There are also a few improvements with this new version. You can now set a default account (in the configuration file with default_account=X). It will be set by default in both the web view and the console view. The rebalance view has been made clearer. I've added a second batch update view with only the assets that are being used (amount > 0). And lastly, the yearly overview is now correctly displaying the previous year's savings rate.

Finally, there are also a few bug fixes. These are the main reason I decided to release now. If you were using assets with different currencies, several views were not correctly using the exchange rate to display them. Moreover, the average expenses in the monthly overview were not correct. Finally, if you were editing old expenses after having archived the accounts, they could be edited with the wrong account.


If you are on Gentoo, you can install it using layman:

layman -a wichtounet
emerge -a budgetwarrior

If you are on Arch Linux, you can use this AUR repository (wait a few days for the new version to be updated).

For other systems, you'll have to install from sources:

git clone --recursive git://
cd budgetwarrior
git checkout 1.0.1
sudo make install

If you want to test the server mode, the default username is admin and the default password is 1234. You can change them in the configuration file with web_user and web_password.


Although it's a minor version, it improves and fixes quite a few things, especially for the web view. I encourage you to try it out. Don't hesitate to leave me a comment if you fail to use it or don't understand something ;)

There are still a few things that I want to do, as I said when I introduced the web version. The website still needs to be made faster. And the communication between the console and the server can also be improved.

If you are interested in the sources, you can download them on Github: budgetwarrior.

If you have a suggestion or you found a bug, please post an issue on Github.

If you have any comment, don't hesitate to contact me, either by leaving a comment on this post or by email.


I got rid of Vivaldi browser for Google Chrome

About a year ago, I switched from Firefox to Vivaldi. This week, I decided to get rid of Vivaldi and replace it with Google Chrome. In this post, I'm going to outline the reasons why I got rid of it.

Before, I switched to Vivaldi because Firefox was dropping support for XUL/XPCOM extensions and I was using Pentadactyl. In fact, Pentadactyl was the only reason I was using Firefox. It was slow and bloated and a bit unstable, but the extension was making it worth it. Since they are dropping support for such extensions, I did not want to use Firefox anymore. So I switched to Vivaldi with Vimium. It's not as great as Firefox plus Pentadactyl. But it's a more customizable version of Google Chrome, on which it's based.

But, in that year or so of using Vivaldi, I have had many issues. Some of them were not too bad and there were some workarounds. But they continued to pile up and none of them was fixed, so now I've decided it's too much.

Since the beginning, it always has been slow. It's not really bad, but still noticeable compared to Chrome. Especially opening Vivaldi is pretty bad. This is something I can live with, but they should really do something to make it faster.

The thing that I had the most issues with is multimedia. For instance Youtube (but all the other platforms have the same issues).

The first problem with media is to get a video in fullscreen. Most of the time, when I press the fullscreen button on Youtube, it grays out the screen and I have to press ESC. If I do that around five to ten times, it finally goes fullscreen. It may be because of my multi-monitor setup but Google Chrome has no issues whatsoever with that. It's pretty painful to do, but again I could live with it since I don't use full screen a lot.

A second problem I had with media is that it was running too fast. I'm not kidding, really too fast, not too slow. The media was running about twice too fast, you could see the seconds going fast on Youtube. I have never seen this issue in any other tool, but it was happening at every start of Vivaldi. The fix was to restart Vivaldi every time this happened and the video then played normally.

Another problem I had from the beginning is to make all HTML5 videos work. You have to download the binary plugins from Chrome to let Vivaldi play all HTML5 videos. It's not a big deal, but the problem is that they are overwritten after each update of Vivaldi. So you have to do it all the time.

A new media issue I had on the last update of Vivaldi is with Flash. At the beginning it was working even if it was outdated. I just had to confirm to run it with a warning. But, since the last update, I only had the warning that it was outdated. But I could not confirm to use it, the option was not here anymore. And it was still happening after I updated Flash... The only option to run Flash was to use a private navigation window...

And finally, I had another big issue with the last version of Vivaldi as well. The browser keeps crashing on my work computer. It can stay up a few minutes and then crash. The complete interface is not updated. I can still press the tabs and I can see the title of the window change, but the interface does not update. Again, it may come from my special window manager (I use awesome), but it's the only application not working...

With all these issues and especially the last two new problems, I decided it was time to cut the losses. So I reinstalled Google Chrome, transferred my plugins and everything worked like a charm. I still use Vimium to use vim bindings so my usage of the browser does not change. Of course, I don't have the customization that I had with Vivaldi. I would really really like to get rid of the address bar in the browser. I would also like to significantly reduce the size of the tab bar. But I prefer to live without these improvements than with so many bugs. I think Vivaldi is a good idea, but with a terrible implementation.

I also considered qutebrowser as an alternative. But for now it's still missing many features that I don't want to get rid of. So I will stay with Google Chrome for the time being.

What about you? Do you have any experience with Vivaldi?


Decrease DLL neural network compilation time with C++17

Just last week, I migrated my Expression Templates Library (ETL) to C++17. It is now also done for my Deep Learning Library (DLL). In ETL, this resulted in much nicer code overall, but no real improvement in compilation time.

The objective of the migration of DLL was two-fold. First, I also wanted to simplify some code, especially with if constexpr. But I also especially wanted to try to reduce the compilation time. In the past, I've already tried a few changes with C++17, with good results on the compilation of the entire test suite. While this is very good, this is not very representative of users of the library. Indeed, normally you'll have only one network in your source file not several. The new changes will especially help in the case of many networks, but less in the case of a single network per source file.

This time, I decided to test the compilation on the examples. I've tested the nine official examples from the DLL library:

  1. mnist_dbn: A fully-connected Deep Belief Network (DBN) on the MNIST data set with three layers
  2. char_cnn: A special CNN with embeddings and merge and group layers for text recognition
  3. imagenet_cnn: A 12 layers Convolutional Neural Network (CNN) for Imagenet
  4. mnist_ae: A simple two-layers auto-encoder for MNIST
  5. mnist_cnn: A simple 6 layers CNN for MNIST
  6. mnist_deep_ae: A deep auto-encoder for MNIST, only fully-connected
  7. mnist_lstm: A Recurrent Neural Network (RNN) with Long Short Term Memory (LSTM) cells
  8. mnist_mlp: A simple fully-connected network for MNIST, with dropout
  9. mnist_rnn: A simple RNN with simple cells for MNIST

This is really representative of what users can do with the library and I think it's a much better benchmark for compilation time.

For reference, you can find the source code of all the examples online.


Let's start with the results. I've tested this at different stages of the migration with clang 5 and GCC 7.2. I tested the following steps:

  1. The original C++14 version
  2. Simply compiling in c++17 mode (-std=c++17)
  3. Using the C++17 version of the ETL library
  4. Upgrading DLL to C++17 (without ETL)
  5. ETL and DLL in C++17 versions

I've compiled each example independently in release_debug mode. Here are the results for G++ 7.2:

Example            0       1       2       3       4       5       6       7       8
C++14              37.818  32.944  33.511  15.403  29.998  16.911  24.745  18.974  19.006
-std=c++17         38.358  32.409  32.707  15.810  30.042  16.896  24.635  19.134  19.027
ETL C++17          36.045  31.000  30.942  15.322  28.840  16.747  24.151  18.208  18.939
DLL C++17          35.251  32.577  32.854  15.653  29.758  16.851  24.606  19.098  19.146
Final C++17        32.289  31.133  30.939  15.232  28.753  16.526  24.326  18.116  17.819
Final Improvement  14.62%  5.49%   7.67%   1.11%   4.15%   2.27%   1.69%   4.52%   6.24%

The difference by just enabling c++17 is not significant. On the other hand, some significant gain can be obtained by using the C++17 version of ETL, especially for the DBN version and for the CNN versions. Except for the DBN case, the migration of DLL to C++17 did not bring any significant advantage. When everything is combined, the gains are more important :) In the best case, the example is 14.6% faster to compile.
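For reference, the last row of the table is the relative reduction between the C++14 time and the final C++17 time. Here is a small sketch of that computation; the function name is only chosen for the illustration:

```cpp
#include <cassert>

// Relative reduction of the compilation time, in percent.
constexpr double improvement_percent(double before, double after) {
    return 100.0 * (before - after) / before;
}

// First column of the G++ table: from 37.818 down to 32.289, about 14.62%
static_assert(improvement_percent(37.818, 32.289) > 14.61);
static_assert(improvement_percent(37.818, 32.289) < 14.63);
```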

Let's see if it's the same with clang++ 5.0:

Example            0       1       2       3       4       5       6       7       8
C++14              40.690  34.753  35.488  16.146  31.926  17.708  29.806  19.207  20.858
-std=c++17         40.502  34.664  34.990  16.027  31.510  17.630  29.465  19.161  20.860
ETL C++17          37.386  33.008  33.896  15.519  30.269  16.995  28.897  18.383  19.809
DLL C++17          37.252  34.592  35.250  16.131  31.782  17.606  29.595  19.126  20.782
Final C++17        34.470  33.154  33.881  15.415  30.279  17.078  28.808  18.497  19.761
Final Improvement  15.28%  4.60%   4.52%   4.52%   5.15%   3.55%   3.34%   3.69%   5.25%

First of all, as I have seen time after time, clang is still slower than GCC. It's not a big difference, but still significant. Overall, the gains are a bit higher on clang than on GCC, but not by much. Interestingly, the migration of DLL to C++17 is less interesting in terms of compilation time for clang. It even seems to slow down compilation on some examples. On the other hand, the migration of ETL is more important than on GCC.

Overall, every example is faster to compile using both libraries in C++17, but we don't have spectacular speed-ups. With clang, we have speedups from 3.3% to 15.3%. With GCC, we have speedups from 1.1% to 14.6%. It's not very high, but I'm already satisfied with these results.

C++17 in DLL

Overall, the migration of DLL to C++17 was quite similar to that of ETL. You can take a look at my previous article if you want more details on C++17 features I've used.

I've replaced a lot of SFINAE functions with if constexpr. I've also replaced a lot of static_if with if constexpr. There was a large number of these in DLL's code. I also enabled all the constexpr that were commented out, waiting for this exact time :)

I was also thinking that I could replace a lot of meta-programming stuff with fold expressions. While I was able to replace a few of them, most of them were harder to replace with fold expressions. Indeed, the variadic pack is often hidden behind another class and therefore the pack is not directly usable from the network class or the group and merge layers classes. I didn't want to start a big refactoring just to use a C++17 feature, the current state of this code is fine.

I made some use of structured bindings as well, but again not as much as I was thinking. In fact, a lot of the time, I'm assigning the elements of a pair or tuple to existing variables, not declaring new variables, and unfortunately you can only use structured bindings with an auto declaration.
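To make the limitation concrete, here is a small sketch. The find_best function and the variable names are hypothetical, they are not taken from DLL. When the variables already exist, std::tie remains the tool to use, while structured bindings always declare new names:

```cpp
#include <cassert>
#include <tuple>
#include <utility>

// Hypothetical function returning two results at once
inline std::pair<int, bool> find_best() {
    return {42, true};
}

inline int tie_example() {
    int best   = 0;     // These variables already exist...
    bool found = false;

    std::tie(best, found) = find_best(); // ...so they are assigned with std::tie

    return found ? best : -1;
}

inline int binding_example() {
    // Structured bindings always introduce new names, through an auto declaration
    auto [best, found] = find_best();

    return found ? best : -1;
}
```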

Overall, the code is significantly better now, but there was less impact than there was on ETL. It's also a smaller code base, so maybe this is normal and my expectations were too high ;)


The trunk of DLL is now a C++17 library :) I think this improves the quality of the code by a nice margin! Even though there is still some work to be done to improve the code, especially for the DBN pretraining code, the quality is quite good now. Moreover, the switch to C++17 made neural networks using the DLL library faster to compile, from 1.1% in the worst case to 15.3% in the best case! I don't know when I will release the next version of DLL, but it will take some time. I'll especially have to polish the RNN support and add a sequence to sequence loss before I release the 1.1 version of DLL.

I'm quite satisfied with C++17 even if I would have liked a bit more features to play with! I'm already a big fan of if constexpr, this can make the code much nicer and fold expressions are much more intuitive than their previous recursive template counterpart.

I may also consider migrating some parts of the cpp-utils library, but if I do, it will only be through the use of conditionals in order not to break the other projects that are based on the library.


C++17 Migration of Expression Templates Library (ETL)

I've finally decided to migrate my Expression Templates Library (ETL) project to C++17. I've been talking about doing that for a long time and I've published several releases without making the change, but the next version will be a C++17 library. The reason why I didn't want to rush the change was that this means the library needs a very recent compiler that may not be available to everybody. Indeed, after this change, the ETL library now needs at least GCC 7.1 or Clang 4.0.

I've already made some experiments in the past. For instance, by using if constexpr, I've managed to speed up compilation by 38% and I've also written an article about the fold expressions introduced in C++17. But I haven't migrated a full library yet. This is now done with ETL. In this article, I'll try to give some examples of improvements by using C++17.

This will only cover the C++17 features I'm using in the updated ETL library, I won't cover all of the new C++17 features.

if constexpr

The most exciting new thing in C++17 for me is the if constexpr statement. This is a really really great thing. In essence, it's a normal if statement, but with one very important difference. The statement that is not taken (the else if the condition is true, or the if constexpr if the condition is false) is discarded. And what is interesting is what happens to discarded statements:

  1. The body of a discarded statement does not participate in return type deduction.
  2. The discarded statement is not instantiated
  3. The discarded statement can odr-use a variable that is not defined

Personally, I'm especially interested in points 1 and 2. Let's start with an example where point 1 is useful. In ETL, I have a make_temporary function. This function either forwards an ETL container or creates a temporary container from an ETL expression. This is based on a compile-time trait. The return type of the function is not the same in both cases. What you did in those cases before C++17 was to use SFINAE and make two functions:

template <typename E, cpp_enable_iff(is_dma<E>)>
decltype(auto) make_temporary(E&& expr) {
    return std::forward<E>(expr);
}

template <typename E, cpp_enable_iff(!is_dma<E>)>
decltype(auto) make_temporary(E&& expr) {
    return force_temporary(std::forward<E>(expr));
}

One version of the function will forward and the other version will force a temporary and the return type can be different since these are two different functions. This is not bad, but still requires two functions where you only want to write one. However, in C++17, we can do much better using if constexpr:

template <typename E>
decltype(auto) make_temporary(E&& expr) {
    if constexpr (is_dma<E>) {
        return std::forward<E>(expr);
    } else {
        return force_temporary(std::forward<E>(expr));
    }
}

I think this version is really superior to the previous one. We only have one function and the logic is much clearer!
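Since make_temporary depends on ETL internals, here is a self-contained sketch of the same idea that you can compile on its own. The get_value function is purely hypothetical and not part of ETL. Its return type is deduced from the only branch that is kept:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical example: returns the pointed value for a pointer,
// and the value itself for anything else.
template <typename T>
auto get_value(T t) {
    if constexpr (std::is_pointer_v<T>) {
        return *t; // Discarded when T is not a pointer, so *t is never instantiated
    } else {
        return t;  // Discarded when T is a pointer
    }
}

// The deduced return type only depends on the branch that is kept
static_assert(std::is_same_v<decltype(get_value(1.5)), double>);
static_assert(std::is_same_v<decltype(get_value(static_cast<int*>(nullptr))), int>);
```

With a normal if, this function would not compile: both return statements would take part in the return type deduction and *t would be instantiated even when T is not a pointer.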

Let's now see an advantage of the point 2. In ETL, there are two kinds of matrices, matrices with compile-time dimensions (fast matrices) and matrices with runtime dimensions (dynamic matrices). When they are used, for instance for a matrix-multiplication, I use static assertions for fast matrices and runtime assertions for dynamic matrices. Here is an example for the validation of the matrix-matrix multiplication:

template <typename C, cpp_disable_iff(all_fast<A, B, C>)>
static void check(const A& a, const B& b, const C& c) {
    static_assert(all_2d<A,B,C>, "Matrix multiplication needs matrices");
    cpp_assert(
        dim<1>(a) == dim<0>(b)         //interior dimensions
            && dim<0>(a) == dim<0>(c)  //exterior dimension 1
            && dim<1>(b) == dim<1>(c), //exterior dimension 2
        "Invalid sizes for multiplication");
    cpp_unused(a);
    cpp_unused(b);
    cpp_unused(c);
}

template <typename C, cpp_enable_iff(all_fast<A, B, C>)>
static void check(const A& a, const B& b, const C& c) {
    static_assert(all_2d<A,B,C>, "Matrix multiplication needs matrices");
    static_assert(
        dim<1, A>() == dim<0, B>()         //interior dimensions
            && dim<0, A>() == dim<0, C>()  //exterior dimension 1
            && dim<1, B>() == dim<1, C>(), //exterior dimension 2
        "Invalid sizes for multiplication");
    cpp_unused(a);
    cpp_unused(b);
    cpp_unused(c);
}

Again, we use SFINAE to distinguish the two different cases. In that case, we cannot use a normal if since the value of the dimensions cannot be taken at compile-time for dynamic matrices. More precisely, some templates cannot be instantiated for dynamic matrices. As for the cpp_unused calls, we have to use them in the static version because the parameters are never used there, and in the dynamic version because the parameters won't be used if the assertions are not enabled. Let's use if constexpr to avoid having two functions:

template <typename C>
static void check(const A& a, const B& b, const C& c) {
    static_assert(all_2d<A,B,C>, "Matrix multiplication needs matrices");

    if constexpr (all_fast<A, B, C>) {
        static_assert(dim<1, A>() == dim<0, B>()         //interior dimensions
                          && dim<0, A>() == dim<0, C>()  //exterior dimension 1
                          && dim<1, B>() == dim<1, C>(), //exterior dimension 2
                      "Invalid sizes for multiplication");
    } else {
        cpp_assert(dim<1>(a) == dim<0>(b)         //interior dimensions
                       && dim<0>(a) == dim<0>(c)  //exterior dimension 1
                       && dim<1>(b) == dim<1>(c), //exterior dimension 2
                   "Invalid sizes for multiplication");
    }

    cpp_unused(a);
    cpp_unused(b);
    cpp_unused(c);
}


Since the discarded statement won't be instantiated, we can now use a single function! We also avoid duplicating the first static assertion and the unused statements. Pretty great, right? But we can do better with C++17. Indeed, it added a nice new attribute, [[maybe_unused]]. Let's see what this gives us:

template <typename C>
static void check([[maybe_unused]] const A& a, [[maybe_unused]] const B& b, [[maybe_unused]] const C& c) {
    static_assert(all_2d<A,B,C>, "Matrix multiplication needs matrices");

    if constexpr (all_fast<A, B, C>) {
        static_assert(dim<1, A>() == dim<0, B>()         //interior dimensions
                          && dim<0, A>() == dim<0, C>()  //exterior dimension 1
                          && dim<1, B>() == dim<1, C>(), //exterior dimension 2
                      "Invalid sizes for multiplication");
    } else {
        cpp_assert(dim<1>(a) == dim<0>(b)         //interior dimensions
                       && dim<0>(a) == dim<0>(c)  //exterior dimension 1
                       && dim<1>(b) == dim<1>(c), //exterior dimension 2
                   "Invalid sizes for multiplication");
    }
}

No more need for the cpp_unused trick :) This attribute tells the compiler that a variable or parameter may sometimes be unused, so that it does not emit a warning for it. The only thing that is not great with this attribute is that it's too long: 16 characters. It almost doubles the width of my check function signature. Imagine if you have more parameters, you'll soon have to use several lines. I wish there was a way to set an attribute for all parameters together, or a shortcut. I'm considering whether to use a short macro in place of it, but I haven't decided yet.

Just a note, if you have else if statements, you need to set them as constexpr as well! This was a bit weird for me, but you can figure it as if the condition is constexpr, then the if (or else if) is constexpr as well.
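Here is a small self-contained sketch of such a chain. The element_count function is hypothetical, it's not ETL code, and it only serves to show where the constexpr keyword has to be repeated:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper: number of elements held by a value
template <typename T>
constexpr std::size_t element_count([[maybe_unused]] const T& value) {
    if constexpr (std::is_arithmetic_v<T>) {
        return 1; // A scalar counts as one element
    } else if constexpr (std::is_array_v<T>) { // constexpr is needed here as well
        return std::extent_v<T>; // The size is part of the type
    } else {
        return value.size(); // Only instantiated for container-like types
    }
}

static_assert(element_count(3.14) == 1);

// Usage with a container: element_count(std::vector<int>{1, 2, 3}) takes the last branch
```

If the second condition was written with a plain else if, the last branch would be instantiated for arrays as well and value.size() would fail to compile for them.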

Overall, I'm really satisfied with the new if constexpr! This really makes the code much nicer in many cases, especially if you abuse metaprogramming like I do.

You may remember that I've coded a version of static if in the past with C++14. This was able to solve point 2, but not point 1, and was much uglier. Now we have a good solution to it. I've replaced two of these in the current code with the new if constexpr.

Fold expressions

For me, fold expressions are the second major feature of C++17. I won't go into too much detail here, since I've already talked about fold expressions in the past. But I'll show two examples of refactorings I've been able to do with this.

Here was the size() function of a static matrix in ETL before:

static constexpr size_t size() {
   return mul_all<Dims...>;
}

The Dims parameter pack comes from the declaration of fast_matrix:

template <typename T, typename ST, order SO, size_t... Dims>
struct fast_matrix_impl;

And the mul_all is a simple helper that multiplies each value of the variadic parameter pack:

template <size_t F, size_t... Dims>
struct mul_all_impl final : std::integral_constant<size_t, F * mul_all_impl<Dims...>::value> {};

template <size_t F>
struct mul_all_impl<F> final : std::integral_constant<size_t, F> {};

template <size_t F, size_t... Dims>
constexpr size_t mul_all = mul_all_impl<F, Dims...>::value;

Before C++17, the only way to compute this result at compilation time was to use template recursion, either with types or with constexpr functions. I think this is pretty heavy just for computing a product. Now, with fold expressions, we can manipulate the parameter pack directly and rewrite our size function:

static constexpr size_t size() {
    return (Dims * ...);
}

This is much better! This clearly states that all the values of the parameter pack should be multiplied together. For instance, 1,2,3 will become 1*(2*3), since (Dims * ...) is a right fold.
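Here is a self-contained sketch of this fold that can be compiled on its own. The static_shape struct is a hypothetical stand-in for fast_matrix, and the two subtraction helpers are only there to show how the parentheses are placed: (pack op ...) is a right fold and (... op pack) is a left fold. For a multiplication, both give the same result.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a matrix type with compile-time dimensions
template <std::size_t... Dims>
struct static_shape {
    static constexpr std::size_t size() {
        return (Dims * ...); // Right fold: D1 * (D2 * (... * Dn))
    }
};

static_assert(static_shape<2, 3, 4>::size() == 24);

// With a non-associative operator, the direction of the fold matters
template <int... V>
constexpr int right_sub = (V - ...); // 1 - (2 - 3)

template <int... V>
constexpr int left_sub = (... - V); // (1 - 2) - 3

static_assert(right_sub<1, 2, 3> == 2);
static_assert(left_sub<1, 2, 3> == -4);
```

Note that a fold over * without an initial value needs at least one element in the pack, so static_shape<> would not compile in this sketch.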

Another place where I was using this was to code a trait that tests if a set of booleans are all true at compile time:

template <bool... B>
constexpr bool and_v = std::is_same<
    cpp::tmp_detail::bool_list<true, B...>,
    cpp::tmp_detail::bool_list<B..., true>>::value;

I was using a nice trick here to test if all booleans are true. I don't remember where I picked it up, but it's quite nice and very fast to compile.

This was used for instance to test that a set of expressions are all single-precision floating points:

template <typename... E>
constexpr bool all_single_precision = and_v<(is_single_precision<E>)...>;

Now, we can get rid of the and_v trait and use the parameter pack directly:

template <typename... E>
constexpr bool all_single_precision = (is_single_precision<E> && ...);

I think using fold expressions results in much clearer syntax and better code and it's a pretty nice feature overall :)
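If you want to play with both versions, here is a self-contained sketch. The bool_list struct is a local stand-in for the one from cpp-utils, and the and_trick and and_fold names are only chosen for the comparison. The trick works because adding true at the front or at the back of the list only gives the same type when every value in the list is true:

```cpp
#include <cassert>
#include <type_traits>

// Local stand-in for cpp::tmp_detail::bool_list
template <bool...>
struct bool_list {};

// Old version: the two lists are the same type only if all the values are true
template <bool... B>
constexpr bool and_trick = std::is_same<bool_list<true, B...>, bool_list<B..., true>>::value;

// C++17 version: a fold over &&
template <bool... B>
constexpr bool and_fold = (B && ...);

static_assert(and_trick<true, true, true> && and_fold<true, true, true>);
static_assert(!and_trick<true, false, true> && !and_fold<true, false, true>);
static_assert(and_trick<> && and_fold<>); // Both are true for an empty pack
```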

As a note here, I'd like to mention, that you can also use this syntax to call a function on each argument that you have, which makes for much nicer syntax as well and I'll be using that in DLL once I migrate it to C++17.
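A minimal sketch of that idiom, using a fold over the comma operator. The push_all and make_values functions are hypothetical helpers, they are not taken from ETL or DLL:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: appends every argument to the vector, in order
template <typename T, typename... Args>
void push_all(std::vector<T>& out, Args&&... args) {
    // Fold over the comma operator: one push_back call per argument, from left to right
    (out.push_back(std::forward<Args>(args)), ...);
}

inline std::vector<int> make_values() {
    std::vector<int> values;
    push_all(values, 1, 2, 3);
    return values;
}
```

Before C++17, the usual ways to get the same effect were a recursive function template or the expansion of the pack inside a dummy initializer list.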


There are also a few more C++17 features that I've used to improve ETL, but that have a bit less impact.

A very nice feature of C++17 is the support for structured bindings. Often you end up with a function that returns several parts of information in the form of a pair or a tuple or even a fixed-size array. You can use an object for this, but if you don't, you end up with code that is not terribly nice:

size_t index;
bool result;
float alpha;
std::tie(index, result, alpha) = my_function();

It's not terribly bad, but in these cases, you should be hoping for something better. With C++17, you can do better:

auto [index, result, alpha] = my_function();

Now you can directly use auto to deduce the types of the three variables at once and you can get all the results in the variables at once as well :) I think this is really nice and can really profit some projects. In ETL, I've almost no use for this, but I'm going to be using that a bit more in DLL.
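Here is a complete sketch that can be compiled as is. The body of my_function and the use_tuple and use_array helpers are invented for the example. It also shows that the same syntax works with a fixed-size array, and that auto& binds references to the elements instead of copies:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical implementation standing in for my_function()
inline std::tuple<std::size_t, bool, float> my_function() {
    return std::make_tuple(std::size_t{3}, true, 0.5f);
}

inline float use_tuple() {
    // The three types are deduced at once
    auto [index, result, alpha] = my_function();

    return result ? alpha * static_cast<float>(index) : 0.0f;
}

inline int use_array() {
    std::array<int, 3> dims{2, 3, 4};

    // With auto&, the names refer to the elements of dims
    auto& [rows, cols, depth] = dims;
    rows = 5;

    return dims[0] * cols * depth;
}
```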

Something really nice to clean up the code in C++17 is the ability to declare nested namespaces in one line. Before, if you had a nested namespace etl::impl::standard for instance, you would do:

namespace etl {
namespace impl {
namespace standard {

// Something inside etl::impl::standard

} // end of namespace standard
} // end of namespace impl
} // end of namespace etl

In C++17, you can do:

namespace etl::impl::standard {

// Something inside etl::impl::standard

} // end of namespace etl::impl::standard

I think it's pretty neat :)

Another very small change is the ability to use the typename keyword in place of the class keyword when declaring template template parameters. Before, you had to declare:

template <template <typename> class X>

now you can also use:

template <template <typename> typename X>

It's just some syntactic sugar, but I think it's quite nice.

The last improvement that I want to talk about is one that probably very few know about but it's pretty neat. Since C++11, you can use the alignas(X) specifier for types and objects to specify on how many bytes you want to align these. This is pretty nice if you want to align on the stack. However, this won't always work for dynamic memory allocation. Imagine this struct:

struct alignas(128)  test_struct  { char data; };

If you declare an object of this type on the stack, you have the guarantee that it will be aligned on 128 bytes. However, if you use new to allocate it on the heap, you don't have such guarantee. Indeed, the problem is that 128 is greater than the maximum default alignment. This is called an over-aligned type. In such cases, the result will be aligned on the max alignment of your system. Since C++17, new supports aligned dynamic memory allocation of over-aligned types. Therefore, you can use a simple alignas to allocate dynamic over-aligned types :) I need this in ETL for matrices that need to be aligned for vectorized code. Before, I was using a larger array with some padding in order to find an aligned element inside, but that is not very nice, now the code is much better.
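Here is a small sketch to check this behaviour with a C++17 compiler. The heap_allocation_is_aligned function is only a hypothetical helper for the test, and the second static assertion assumes a platform where the default alignment of new is smaller than 128 bytes, which is the usual case:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

struct alignas(128) test_struct { char data; };

static_assert(alignof(test_struct) == 128);

// The type is over-aligned when its alignment exceeds what plain new guarantees
static_assert(alignof(test_struct) > __STDCPP_DEFAULT_NEW_ALIGNMENT__);

// Hypothetical helper: allocates the struct on the heap and checks its address
inline bool heap_allocation_is_aligned() {
    // In C++17, this new expression uses the aligned version of operator new
    auto ptr = std::make_unique<test_struct>();

    return reinterpret_cast<std::uintptr_t>(ptr.get()) % alignof(test_struct) == 0;
}
```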

Compilation Time

I've done a few tests to see how much impact these new features have on compilation time. Here, I'm benchmarking the compilation of the entire test suite in different compilation modes. I enabled most compilation options (all GPU and BLAS options) in order to make sure almost all of the library is compiled.

Since I'm a bit short on time before going on vacation, I've only gathered the results with g++. Here are the results with G++ 7.2.0:

            debug   release  release_debug
C++14       862s    1961s    1718s
C++17       892s    2018s    1745s
Difference  +3.4%   +2.9%    +1.5%

Overall, I'm a bit disappointed by these results: it's around 3% slower to compile the C++17 version than the C++14 version. I was thinking that this would at least be as fast to compile as before. It seems that currently, with G++ 7.2, if constexpr is slower to compile than the equivalent SFINAE functions. I didn't do individual benchmarks of all the features I've migrated, therefore it may not be coming from if constexpr, but since it's the greatest change by far, it's the most likely candidate. Once I have a little more time, after my vacation, I'll try to see if that is also the case with clang.

Keep in mind that we are compiling the test suite here. The ETL test suite is using the manual selection mode of the library in order to be able to test all the possible implementations of each operation. This makes a considerable difference in performance. I expect better compilation time when this is used in automatic selection mode (the default mode). In the default mode, a lot more code can be disabled with if constexpr. I will test this next with the DLL library which I will also migrate to C++17.


This concludes this report on the migration of my ETL library from C++14 to C++17. Overall, I'm really satisfied with the improvement of the code, it's much better. I'm a bit disappointed by the slight increase (around 3%) in compilation time, but it's not dramatic either. I'm still hoping that once it's used in DLL, I will see a decrease in compilation, but we'll see that when I'll be done with the migration of DLL to C++17 which may take some time since I'll have two weeks vacation in China starting Friday.

The new version is available only through the master branch. It will be released as the 1.3 version probably when I integrate some new features, but in itself will not be released as new version. You can take a look in the Github etl repository if you are interested.


budgetwarrior 1.0: Web interface and asset tracking!

I'm happy to announce the release of budgetwarrior 1.0. This is a major change over the previous version.

Web Interface

Until now, budgetwarrior could only be used in command line. This is fine for me, but not for everybody. Since I wanted to share my budget with my girlfriend, I needed something less nerdy ;)

Therefore, I added support for a web interface for budgetwarrior. Every feature of the console application is now available in the web version. Moreover, since the web version offers slightly better graphical capabilities, I added a few more graphs and somewhat more information at some places. I'm not nearly an expert in web interface, but I think I managed to get something not too bad together. There are still some things to improve that I'll go through in the future but so far the web interface is pretty satisfying and it is mobile friendly!

The web server is coded in C++ (who would have guessed...) and is embedded in the application, you need to use the command server to use it:

budget server

and the server will be launched (by default at localhost:8080). You can configure the port with server_port=X in the configuration file and the listen address with server_listen=X. You can access your server at http://localhost:8080.

Here is what this will display:

Web interface index

Note: All the data is randomized

The main page shows your assets, the current net worth, your monthly cash-flow and the state of your objectives.

The menu will give you access to all the features of the application. You can add expenses and earnings, see reports, manage your assets and your objectives and so on. Basically, you can do everything you did in the application, but you have access to more visualization tools than you would on the console. For instance, you can access your fortune over time:

Web interface fortune graph

or see how your portfolio does in terms of currency:

Web interface portofolio currency breakdown

Normally, unless I forgot something (in which case, I'll fix it), everything should be doable from the web interface. This is simply easier for people that are not as fond of the console as I am ;)

The management is still the same, the server will write to the same file the base application uses. Therefore, you cannot use the server and the command line application on the same machine at the same time. Nevertheless, if the server is not running, you can still use the command line application. This could be useful if you want to use the web visualization while still using the command line tool for managing the budget.

The default user and password is admin:1234, but you can of course change it using web_password and web_user in the configuration. You can also disable the security if you are sure of yourself by setting server_secure=false in the configuration.

Currently, it does not protect against concurrent modifications of the same data. It is very unlikely to happen with only a few people using the applications, but I plan to improve that in the future.

Server mode

Although it's not possible to use both the server and the command line application at the same time, it's possible to use the command line application in server mode. In this case, instead of reading and writing the data from the hard disk, the application will send requests to the server to read and write the data it needs. With this, you can use both the server and the command line application at the same time!

While running, the server exposes a simple API that can be used to get all the information about the budget data and that can also be used to add new expenses, earnings and so on directly to the server data. The API is also protected by authentication.

Currently, the server does not support HTTPS. However, you can run it behind a proxy such as nginx which is running in HTTPS. This is what I'm doing. The server mode supports SSL from the client to the server, you just have to set server_ssl=true in the configuration.

This is the mode I'm currently using and will continue using. With this, I can quickly do some modifications using the command line and if I want to see advanced visualization, I just have to open my browser and everything is updated. Moreover, in the future, other people involved with my budget will be able to access the web interface. This also solves the synchronization problem in a much better way than before.

Just as it was the case with the server, this is not made to be used in parallel by different users. This should be perfectly fine for a small household.

Assets Tracking

A few months ago, I added a feature to track assets in budgetwarrior. You can define the list of the assets you possess. The tool will then help you track the value of your assets. You can set your desired distribution of bonds, cash and stocks and the tool will help you see if you need to rebalance your assets. This will let you compute your net worth, with budget asset value:

View of your assets

Moreover, you can also set a few of your assets as your portfolio assets. These assets have a desired distribution and are handled differently. These are the assets you directly manage yourself, your investment portfolio. You can then track their value and see if they need rebalancing. For instance, here is a randomized rebalancing of your portfolio, with budget asset rebalance:

View of the needed rebalance

All these features are now available in the web version as well.

Better console usability

A few months ago, I added some quality-of-life improvements to the console application. You can now cycle through the list of possible values, for instance for accounts, directly in the console! This is done with the UP and DOWN keys. Now, I also added auto-completion with the TAB key. You can write Ins<TAB> and it will complete to Insurances if you have an Insurances account in your budget. This makes it much faster to enter new expenses or to update asset values.
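The completion itself boils down to a prefix search among the known values. Here is a minimal sketch of that idea, assuming that the first matching candidate wins; the complete function is a hypothetical name and this is not the actual code of budgetwarrior.

```cpp
#include <string>
#include <vector>

// Return the first candidate starting with the typed prefix,
// or the prefix itself when no candidate matches
std::string complete(const std::string& prefix, const std::vector<std::string>& candidates) {
    for (const auto& candidate : candidates) {
        if (candidate.compare(0, prefix.size(), prefix) == 0) {
            return candidate;
        }
    }

    return prefix;
}
```

With the accounts Food, Insurances and Car, completing Ins gives Insurances, while an unknown prefix is left untouched.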


If you are on Gentoo, you can install it using layman:

layman -a wichtounet
emerge -a budgetwarrior

If you are on Arch Linux, you can use this AUR repository (wait a few days for the new version to be updated).

For other systems, you'll have to install from sources:

git clone --recursive git://
cd budgetwarrior
git checkout 1.0
sudo make install


Overall, even though I'm not a fan of web development, it was quite fun to add all these features to budgetwarrior and I think it made the application much better. This is a very significant change to the project since it almost doubled in number of source lines of code, but I think it's a change that was needed.

I think these changes really make budgetwarrior more useful to a wider group of people and I'm pretty happy to have finally come around and implemented them. I still have a few things I plan to improve in the near future. First, I want to make the website a bit faster, there are many scripts and stylesheets that are being loaded and make the site a bit bloated. I'll also enable gzip compression of the website to speed things up. I will also ensure that the server can handle requests concurrently without corrupting the data (this should be simple since we don't need high performance). I may also add a new module to budgetwarrior to track your progress towards retirement if this is something you are interested in, but I haven't decided in what form exactly. Finally, I will also try to optimize the requests that are being done between the server and the client when run in server mode. Indeed, it currently downloads almost all the data from the server, which is far from optimal.

If you are interested in the sources, you can download them on Github: budgetwarrior.

If you have a suggestion or you found a bug, please post an issue on Github.

If you have any comment, don't hesitate to contact me, either by leaving a comment on this post or by email.


My thesis is available: Deep Learning Feature Extraction for Image Processing

I'm happy to say that I've finally put my thesis online and updated my Publications page.

I should have done that earlier but it slipped my mind, so there it is!

My thesis (Deep Learning Feature Extraction for Image Processing) is now available to download. Here is the abstract of the thesis:

In this thesis, we propose to use methodologies that automatically learn how to extract relevant features from images. We are especially interested in evaluating how these features compare against handcrafted features. More precisely, we are interested in the unsupervised training that is used for the Restricted Boltzmann Machine (RBM) and Convolutional RBM (CRBM) models. These models relaunched the Deep Learning interest of the last decade. During the time of this thesis, the auto-encoders approach, especially Convolutional Auto-Encoders (CAE) have been used more and more. Therefore, one objective of this thesis is also to compare the CRBM approach with the CAE approach.

The scope of this work is defined by several machine learning tasks. The first one, handwritten digit recognition, is analysed to see how much the unsupervised pretraining technique introduced with the Deep Belief Network (DBN) model improves the training of neural networks. The second, detection and recognition of Sudoku in images, is evaluating the efficiency of DBN and Convolutional DBN (CDBN) models for classification of images of poor quality. Finally, features are learned fully unsupervised from images for a keyword spotting task and are compared against well-known handcrafted features. Moreover, the thesis was also oriented around a software engineering axis. Indeed, a complete machine learning framework was developed during this thesis to explore possible optimizations and possible algorithms in order to train the tested models as fast as possible.

If you are interested, you can:

I hope this will interest a few of you! As always, if you have any question, don't hesitate to leave me a comment ;)

As for the current projects, I'm still working on the next version of budgetwarrior, but I don't have any expected release date. It will depend on how much time I'm able to put into the project.


Expression Templates Library 1.2.1: Faster GPU and new features

Happy new year to all my dear readers!

It has been a while since I've posted on this blog. I've had to serve three weeks in the army and then I had two weeks vacation. I've been actively working on budgetwarrior with a brand new web interface! More on that later ;)

Today, I'm happy to release the version 1.2.1 of my Expression Templates Library (ETL) project. This is a minor version but with significantly better GPU support and a few new features and bug fixes so I decided to release it now.

Faster GPU support

Last year, I implemented the support for the detection of advanced GPU patterns in ETL.

This will significantly reduce the number of CUDA kernel calls that are being launched. For instance, each of the following expressions will be evaluated using a single GPU kernel:

yy = 1.1 * x + y
yy = x + 1.1 * y
yy = 1.1 * y + 1.2 * y
yy = 1.1 * x * y
yy = x / (1.1 * y)

This makes some operations significantly faster.

Moreover, I've greatly reduced the number of device synchronizations in the library. In particular, I've removed almost all synchronization from the etl-gpu-blas sub library. This means that synchronization is mostly only done when data needs to go back to the CPU. For machine learning, this means at the end of the epoch to compute the final error. This makes a HUGE difference in time, I didn't realize before that I was doing way too much synchronization.

With these two changes, I've been able to attain state of the art training performance on GPU with my Deep Learning Library (DLL) project!

Moreover, I've now added support for random number generation on the GPU and for shuffle operations as well.

New Features

I've also added a few new features recently. They were especially added to support new features in DLL.

Matrices and vectors can now be normalized in order to have zero-mean and unit-variance distribution. You can also merge matrices together. For now, there is no GPU support, so this will use CPU anyway. I plan to fix that later.
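As a reminder of what this normalization does, here is a small standalone sketch on a std::vector. It is not ETL's implementation and does not use ETL's API; the normalize function below is a hypothetical helper, and I assume the population variance (division by N) is used.

```cpp
#include <cmath>
#include <vector>

// Normalize the values in place to zero mean and unit variance.
// Assumption: population variance; a constant vector is only centered.
void normalize(std::vector<double>& values) {
    if (values.empty()) {
        return;
    }

    double mean = 0.0;
    for (double x : values) {
        mean += x;
    }
    mean /= values.size();

    double variance = 0.0;
    for (double x : values) {
        variance += (x - mean) * (x - mean);
    }
    variance /= values.size();

    const double stddev = std::sqrt(variance);
    for (double& x : values) {
        x -= mean;
        if (stddev > 0.0) {
            x /= stddev;
        }
    }
}
```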

In addition to bias_batch_mean that I added before, I also added bias_batch_var now with the variance in place of the mean. This is mainly used for Batch Normalization in machine learning, but it may have some other usages. The GPU support has been added as well directly.

And the last feature is the support for embedding and the gradients of embedding. Again, this is totally related to machine learning, but it can be very useful as well. I haven't had the time to develop the GPU version so far, but this will come as well.


Nothing fancy on the CPU performance side, I only added vectorization for hyperbolic functions. This makes tanh much faster on CPU.

Bug Fixes

I fixed quite a few bugs in this version, which is one of the main reasons I released it:

1. When using a large fast_matrix and aliasing was detected, there was a big chance of a stack overflow occurring. This is now fixed by using a dynamic temporary.
2. Some assignables such as sub_view did not perform any detection for aliasing. This is now fixed and aliasing is detected everywhere.
3. fast_dyn_matrix can now be correctly used with bool.
4. The use of iterators was not always ensuring correct CPU/GPU consistency. This is now correctly handled.
5. The 4D convolutions on GPU were not using the correct flipping.
6. Fix small compilation bug with sub_matrix and GPU.
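For readers not familiar with the aliasing problem, here is a simplified sketch of the idea. It is not ETL's code: assign_reversed is a hypothetical function, and the aliasing test is reduced to comparing addresses. When the destination is also the source, the result is first computed into a dynamically allocated temporary, so that values are not overwritten before being read.

```cpp
#include <utility>
#include <vector>

// dst[i] = src[n - 1 - i]
// Writing directly into dst while reading from the same storage would
// overwrite values that are still needed, so aliasing is detected first
void assign_reversed(std::vector<double>& dst, const std::vector<double>& src) {
    if (&dst == &src) {
        // Aliasing: evaluate into a heap-allocated temporary, then move it
        std::vector<double> tmp(src.rbegin(), src.rend());
        dst = std::move(tmp);
        return;
    }

    // No aliasing: evaluate directly into the destination
    dst.assign(src.rbegin(), src.rend());
}
```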

What's next ?

I don't really know what will be in the next release. This should be the release 1.3. One possible idea would be to improve and review the support for sparse matrix which is more than poor as of now. But I'm not really motivated to work on that :P Moreover, I'm now actively working on the next release of budgetwarrior which will probably still come this month.

I'm also still hesitating about switching to C++17 for the library, to make it faster to compile and to clean up some parts of the code. I would be able to remove quite some SFINAE with the new if constexpr, but I'm afraid this will make the library too difficult to use since it would need at least GCC 7 or clang 3.9.

Download ETL

You can download ETL on Github. If you are only interested in the 1.2.1 version, you can look at the Releases pages or clone the tag 1.2.1. There are several branches:

  • master Is the eternal development branch, may not always be stable
  • stable Is a branch always pointing to the last tag, no development here

For future releases, there will always be tags pointing to the corresponding commits. You can also have access to previous releases on Github or via the release tags.

The documentation is still a bit sparse. There are a few examples and the Wiki, but there still is work to be done. If you have questions on how to use or configure the library, please don't hesitate to ask.

Don't hesitate to comment on this post if you have any comment on this library or any question. You can also open an Issue on Github if you have a problem using this library or propose a Pull Request if you have any contribution you'd like to make to the library.

Hope this may be useful to some of you :)


Advanced GPU Patterns Optimization in ETL

The GPU performance of my Expression Templates Library (ETL) is pretty good when most of the time is spent inside expensive operations such as Matrix-Matrix Multiplication or convolutions. However, when most of the time is spent in linear kernels, performance is not great because this will invoke a lot of CUDA kernels. Indeed, each sub expression computes its result in a temporary GPU vector (or matrix) and these temporaries are passed through the expressions. For instance, this expression:

yy = 1.1 * x + 1.2 * y

will be executed on the GPU as something like this:

t1 = 1.1 * x
t2 = 1.2 * y
yy = t1 + t2

which results in three GPU kernels being invoked. In the CPU case, the complete expression will be executed as one CPU kernel that is constructed with Expression Templates. Unfortunately, a CUDA kernel cannot be constructed in the same way since the CUDA compiler does not support general template metaprogramming. That's why I've implemented the GPU version with one small kernel for each expression.

Fortunately, we can do better with a bit more meta-programming. Indeed, there are some patterns that are repeated a lot and that can easily be implemented in CUDA kernels. I've started detecting a few of these patterns and for each of them a single CUDA kernel is executed. For instance, each of the following expressions can be executed with a single kernel:

yy = 1.1 * x + y
yy = x + 1.1 * y
yy = 1.1 * y + 1.2 * y
yy = 1.1 * x * y
yy = x / (1.1 * y)

This results in significant performance improvements for these expressions!

I have tested these new improvements in my Deep Learning Library (DLL) project (not yet merged) and it resulted in 25% faster momentum computation and 17% faster Nesterov Adam (NADAM).

I'm going to continue to investigate which kernels need to be made faster for DLL and try to improve the overall performance. Currently, the GPU performance of DLL is very good for large convolutional networks, but could be improved for small fully-connected networks. Indeed, in that case, quite some time is spent outside the matrix-matrix multiplication and inside serial expressions for which GPU could be improved. Once I'm done with my optimizations, I'll probably post again on the blog with the latest results.

All these new optimizations are now in the master branch of the ETL project if you want to check it out. You can access the project on Github.


Initial support for Long Short Term Memory (LSTM) in DLL

I'm really happy to announce that I just merged support for Long Short Term Memory (LSTM) cells into my Deep Learning Library (DLL) machine learning framework. Two weeks ago, I already merged support for Recurrent Neural Networks (RNN).

It's nothing fancy yet, but forward propagation of LSTM and basic Backpropagation Through Time (BPTT) are now supported. It was not really complicated to implement the forward pass, but the backward pass is much more complicated for an LSTM than for an RNN. It took me quite a long time to figure out all the gradient formulas and the documentation on that is quite scarce.

For now, only the existing classification loss is supported for RNN and LSTM. As I said last time, I still plan to add support for sequence-to-sequence loss in order to be able to train models able to generate characters. However, I don't know when I'll be able to work on that. Now that I've got the code for LSTM, I should be able to implement a GRU cell and a NAS cell quite easily, I believe.
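To show what the forward pass computes, here is a standalone sketch of one time step of a standard LSTM cell (without peephole connections). This is a textbook formulation written with plain std::vector, not the code of DLL, whose implementation may differ; the gate and lstm_cell structures and the lstm_step function are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Weights of one gate, stored row-major:
// W is [hidden x input], U is [hidden x hidden], b is [hidden]
struct gate {
    std::vector<double> W, U, b;
};

// The input, forget and output gates, and the candidate (g)
struct lstm_cell {
    gate i, f, o, g;
};

inline double sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Pre-activation of unit k: (W x + U h + b)[k]
inline double pre_activation(const gate& gt, std::size_t k,
                             const std::vector<double>& x, const std::vector<double>& h) {
    double sum = gt.b[k];
    for (std::size_t j = 0; j < x.size(); ++j) {
        sum += gt.W[k * x.size() + j] * x[j];
    }
    for (std::size_t j = 0; j < h.size(); ++j) {
        sum += gt.U[k * h.size() + j] * h[j];
    }
    return sum;
}

// One forward time step: updates the hidden state h and the cell state s
void lstm_step(const lstm_cell& cell, const std::vector<double>& x,
               std::vector<double>& h, std::vector<double>& s) {
    std::vector<double> h_next(h.size());
    std::vector<double> s_next(s.size());

    for (std::size_t k = 0; k < h.size(); ++k) {
        const double i = sigmoid(pre_activation(cell.i, k, x, h));
        const double f = sigmoid(pre_activation(cell.f, k, x, h));
        const double o = sigmoid(pre_activation(cell.o, k, x, h));
        const double g = std::tanh(pre_activation(cell.g, k, x, h));

        s_next[k] = f * s[k] + i * g;         // new cell state
        h_next[k] = o * std::tanh(s_next[k]); // new hidden state
    }

    h = h_next;
    s = s_next;
}
```

The backward pass has to propagate the error through each of these four gates and through both states, which is why it is so much more involved than for a simple RNN.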

For example, here is a simple LSTM used on MNIST for classification:

#include <memory>

#include "dll/neural/dense_layer.hpp"
#include "dll/neural/lstm_layer.hpp"
#include "dll/neural/recurrent_last_layer.hpp"
#include "dll/network.hpp"
#include "dll/datasets.hpp"

int main(int /*argc*/, char* /*argv*/ []) {
    // Load the dataset
    auto dataset = dll::make_mnist_dataset_nc(dll::batch_size<100>{}, dll::scale_pre<255>{});

    constexpr size_t time_steps      = 28;
    constexpr size_t sequence_length = 28;
    constexpr size_t hidden_units    = 100;

    // Build the network

    using network_t = dll::dyn_network_desc<
        dll::network_layers<
            dll::lstm_layer<time_steps, sequence_length, hidden_units, dll::last_only>,
            dll::recurrent_last_layer<time_steps, hidden_units>,
            dll::dense_layer<hidden_units, 10, dll::softmax>
        >
        , dll::updater<dll::updater_type::ADAM>      // Adam
        , dll::batch_size<100>                       // The mini-batch size
    >::network_t;

    auto net = std::make_unique<network_t>();

    // Display the network and dataset
    net->display();
    dataset.display();

    // Train the network for performance sake
    net->fine_tune(dataset.train(), 50);

    // Test the network on test set
    net->evaluate(dataset.test());

    return 0;
}

The network is quite similar to the one used previously with an RNN, just replace rnn with lstm and that's it. It starts with an LSTM layer, followed by a layer extracting the last time step and finally a dense layer with a softmax function. The network is trained with Adam for 50 epochs. You can change the activation function, the initializer for the weights and the biases and the number of steps for BPTT truncation.

Here is the result I got on my last run:

| Index | Layer                | Parameters | Output Shape |
| 0     | LSTM (TANH) (dyn)    |      51200 | [Bx28x100]   |
| 1     | RNN(last)            |          0 | [Bx100]      |
| 2     | Dense(SOFTMAX) (dyn) |       1000 | [Bx10]       |
              Total Parameters:      52200

| mnist | Size  | Batches | Augmented Size |
| train | 60000 | 600     | 60000          |
| test  | 10000 | 100     | 10000          |

Network with 3 layers
    LSTM(dyn): 28x28 -> TANH -> 28x100
    RNN(last): 28x100 -> 100
    Dense(dyn): 100 -> SOFTMAX -> 10
Total parameters: 52200
Training: In-Memory Data Generator
              Size: 60000
           Batches: 600
Testing: In-Memory Data Generator
              Size: 10000
           Batches: 100

Train the network with "Stochastic Gradient Descent"
    Updater: ADAM
 Early Stop: Goal(error)

With parameters:

epoch   0/50 batch  600/ 600 - error: 0.07943 loss: 0.28504 time 20910ms
epoch   1/50 batch  600/ 600 - error: 0.06683 loss: 0.24021 time 20889ms
epoch   2/50 batch  600/ 600 - error: 0.04828 loss: 0.18233 time 21061ms
epoch   3/50 batch  600/ 600 - error: 0.04407 loss: 0.16665 time 20839ms
epoch   4/50 batch  600/ 600 - error: 0.03515 loss: 0.13290 time 22108ms
epoch   5/50 batch  600/ 600 - error: 0.03207 loss: 0.12019 time 21393ms
epoch   6/50 batch  600/ 600 - error: 0.02973 loss: 0.11239 time 28199ms
epoch   7/50 batch  600/ 600 - error: 0.02653 loss: 0.10455 time 37039ms
epoch   8/50 batch  600/ 600 - error: 0.02482 loss: 0.09657 time 23127ms
epoch   9/50 batch  600/ 600 - error: 0.02177 loss: 0.08422 time 41766ms
epoch  10/50 batch  600/ 600 - error: 0.02453 loss: 0.09382 time 29765ms
epoch  11/50 batch  600/ 600 - error: 0.02575 loss: 0.09796 time 21449ms
epoch  12/50 batch  600/ 600 - error: 0.02107 loss: 0.07833 time 42056ms
epoch  13/50 batch  600/ 600 - error: 0.01877 loss: 0.07171 time 24673ms
epoch  14/50 batch  600/ 600 - error: 0.02095 loss: 0.08481 time 20878ms
epoch  15/50 batch  600/ 600 - error: 0.02040 loss: 0.07578 time 41515ms
epoch  16/50 batch  600/ 600 - error: 0.01580 loss: 0.06083 time 25705ms
epoch  17/50 batch  600/ 600 - error: 0.01945 loss: 0.07046 time 20903ms
epoch  18/50 batch  600/ 600 - error: 0.01728 loss: 0.06683 time 41828ms
epoch  19/50 batch  600/ 600 - error: 0.01577 loss: 0.05947 time 27810ms
epoch  20/50 batch  600/ 600 - error: 0.01528 loss: 0.05883 time 21477ms
epoch  21/50 batch  600/ 600 - error: 0.01345 loss: 0.05127 time 44718ms
epoch  22/50 batch  600/ 600 - error: 0.01410 loss: 0.05357 time 25174ms
epoch  23/50 batch  600/ 600 - error: 0.01268 loss: 0.04765 time 23827ms
epoch  24/50 batch  600/ 600 - error: 0.01342 loss: 0.05004 time 47232ms
epoch  25/50 batch  600/ 600 - error: 0.01730 loss: 0.06872 time 22532ms
epoch  26/50 batch  600/ 600 - error: 0.01337 loss: 0.05016 time 30114ms
epoch  27/50 batch  600/ 600 - error: 0.01842 loss: 0.07049 time 40136ms
epoch  28/50 batch  600/ 600 - error: 0.01262 loss: 0.04639 time 21793ms
epoch  29/50 batch  600/ 600 - error: 0.01403 loss: 0.05292 time 34096ms
epoch  30/50 batch  600/ 600 - error: 0.01185 loss: 0.04456 time 35420ms
epoch  31/50 batch  600/ 600 - error: 0.01098 loss: 0.04180 time 20909ms
epoch  32/50 batch  600/ 600 - error: 0.01337 loss: 0.04687 time 30113ms
epoch  33/50 batch  600/ 600 - error: 0.01415 loss: 0.05292 time 37393ms
epoch  34/50 batch  600/ 600 - error: 0.00982 loss: 0.03615 time 20962ms
epoch  35/50 batch  600/ 600 - error: 0.01178 loss: 0.04830 time 29305ms
epoch  36/50 batch  600/ 600 - error: 0.00882 loss: 0.03408 time 38293ms
epoch  37/50 batch  600/ 600 - error: 0.01148 loss: 0.04341 time 20841ms
epoch  38/50 batch  600/ 600 - error: 0.00960 loss: 0.03701 time 29204ms
epoch  39/50 batch  600/ 600 - error: 0.00850 loss: 0.03094 time 39802ms
epoch  40/50 batch  600/ 600 - error: 0.01473 loss: 0.05136 time 20831ms
epoch  41/50 batch  600/ 600 - error: 0.01007 loss: 0.03579 time 29856ms
epoch  42/50 batch  600/ 600 - error: 0.00943 loss: 0.03370 time 38200ms
epoch  43/50 batch  600/ 600 - error: 0.01205 loss: 0.04409 time 21162ms
epoch  44/50 batch  600/ 600 - error: 0.00980 loss: 0.03674 time 32279ms
epoch  45/50 batch  600/ 600 - error: 0.01068 loss: 0.04133 time 38448ms
epoch  46/50 batch  600/ 600 - error: 0.00913 loss: 0.03478 time 20797ms
epoch  47/50 batch  600/ 600 - error: 0.00985 loss: 0.03759 time 28885ms
epoch  48/50 batch  600/ 600 - error: 0.00912 loss: 0.03295 time 41120ms
epoch  49/50 batch  600/ 600 - error: 0.00930 loss: 0.03438 time 21282ms
Restore the best (error) weights from epoch 39
Training took 1460s

Evaluation Results
   error: 0.02440
    loss: 0.11315
evaluation took 1000ms

Again, nothing fancy yet, but this example has not been optimized for performance nor for accuracy.

I also made a few changes to the RNN layer. I added support for biases and improved the code as well for performance and readability.

All this support is now in the master branch of the DLL project if you want to check it out. You can also check out the example online: mnist_lstm.cpp

You can access the project on Github.

Currently, I'm working on the GPU performance again. The performance of some operations is still not as good as I want it to be, especially complex operations like those used in Adam and Nadam. Currently, there are many calls to GPU BLAS libraries and I want to try to extract some more optimized patterns. Once it's done, I'll post more about that on the blog.


DLL: Pretty printing and live output

I've greatly improved the display of my Deep Learning Library (DLL). I know this is generally not the most important point in a machine learning framework, but the first impression is important. Therefore, I decided it was time to get a nicer output in the console for training networks.

A network or a dataset can be displayed using the display() function. I've added a display_pretty() function to them to display it more nicely. I've also added the dll::dump_timers_nice() function to do the same for dll::dump_timers().

I've also improved the display for the results of the batches during training. Now, the display is updated every 100ms and it also displays the current estimated time until the end of the epoch. With that, the user should have a much better idea of what's going on during training, especially when the epochs are taking a long time to complete.

Here is a full output of the training of a fully-connected network on MNIST (mnist_mlp.cpp):
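Such an estimate can be computed very simply from the batches already processed. Here is a minimal sketch of one possible way to do it, assuming that the remaining batches take as long on average as the previous ones; estimate_remaining is a hypothetical function and not necessarily how DLL computes it.

```cpp
#include <chrono>
#include <cstddef>

// Estimated time until the end of the epoch, assuming the remaining batches
// take as long on average as the batches already processed
std::chrono::milliseconds estimate_remaining(std::chrono::milliseconds elapsed,
                                             std::size_t done, std::size_t total) {
    if (done == 0 || done >= total) {
        return std::chrono::milliseconds(0);
    }

    const auto remaining = static_cast<long long>(total - done);
    const auto processed = static_cast<long long>(done);

    return std::chrono::milliseconds(elapsed.count() * remaining / processed);
}
```

For instance, if 100 batches out of 600 took one second, about five more seconds are expected for the rest of the epoch.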

 | Index | Layer                | Parameters | Output Shape |
 | 0     | Dense(SIGMOID) (dyn) |     392000 | [Bx500]      |
 | 1     | Dropout(0.50)(dyn)   |          0 | [Bx500]      |
 | 2     | Dense(SIGMOID) (dyn) |     125000 | [Bx250]      |
 | 3     | Dropout(0.50)(dyn)   |          0 | [Bx250]      |
 | 4     | Dense(SOFTMAX) (dyn) |       2500 | [Bx10]       |
                Total Parameters:     519500

 | mnist | Size  | Batches | Augmented Size |
 | train | 60000 | 600     | 60000          |
 | test  | 10000 | 100     | 10000          |

Train the network with "Stochastic Gradient Descent"
    Updater: NADAM
 Early Stop: Goal(error)

With parameters:

epoch   0/50 batch  600/ 600 - error: 0.04623 loss: 0.15097 time 3230ms
epoch   1/50 batch  600/ 600 - error: 0.03013 loss: 0.09947 time 3188ms
epoch   2/50 batch  600/ 600 - error: 0.02048 loss: 0.06565 time 3102ms
epoch   3/50 batch  600/ 600 - error: 0.01593 loss: 0.05258 time 3189ms
epoch   4/50 batch  600/ 600 - error: 0.01422 loss: 0.04623 time 3160ms
epoch   5/50 batch  600/ 600 - error: 0.01112 loss: 0.03660 time 3131ms
epoch   6/50 batch  600/ 600 - error: 0.01078 loss: 0.03546 time 3200ms
epoch   7/50 batch  600/ 600 - error: 0.01003 loss: 0.03184 time 3246ms
epoch   8/50 batch  600/ 600 - error: 0.00778 loss: 0.02550 time 3222ms
epoch   9/50 batch  600/ 600 - error: 0.00782 loss: 0.02505 time 3119ms
epoch  10/50 batch  600/ 600 - error: 0.00578 loss: 0.02056 time 3284ms
epoch  11/50 batch  600/ 600 - error: 0.00618 loss: 0.02045 time 3220ms
epoch  12/50 batch  600/ 600 - error: 0.00538 loss: 0.01775 time 3444ms
epoch  13/50 batch  600/ 600 - error: 0.00563 loss: 0.01803 time 3304ms
epoch  14/50 batch  600/ 600 - error: 0.00458 loss: 0.01598 time 3577ms
epoch  15/50 batch  600/ 600 - error: 0.00437 loss: 0.01436 time 3228ms
epoch  16/50 batch  600/ 600 - error: 0.00360 loss: 0.01214 time 3180ms
epoch  17/50 batch  600/ 600 - error: 0.00405 loss: 0.01309 time 3090ms
epoch  18/50 batch  600/ 600 - error: 0.00408 loss: 0.01346 time 3045ms
epoch  19/50 batch  600/ 600 - error: 0.00337 loss: 0.01153 time 3071ms
epoch  20/50 batch  600/ 600 - error: 0.00297 loss: 0.01021 time 3131ms
epoch  21/50 batch  600/ 600 - error: 0.00318 loss: 0.01103 time 3076ms
epoch  22/50 batch  600/ 600 - error: 0.00277 loss: 0.00909 time 3090ms
epoch  23/50 batch  600/ 600 - error: 0.00242 loss: 0.00818 time 3163ms
epoch  24/50 batch  600/ 600 - error: 0.00267 loss: 0.00913 time 3229ms
epoch  25/50 batch  600/ 600 - error: 0.00295 loss: 0.00947 time 3156ms
epoch  26/50 batch  600/ 600 - error: 0.00252 loss: 0.00809 time 3066ms
epoch  27/50 batch  600/ 600 - error: 0.00227 loss: 0.00773 time 3156ms
epoch  28/50 batch  600/ 600 - error: 0.00203 loss: 0.00728 time 3158ms
epoch  29/50 batch  600/ 600 - error: 0.00240 loss: 0.00753 time 3114ms
epoch  30/50 batch  600/ 600 - error: 0.00263 loss: 0.00864 time 3099ms
epoch  31/50 batch  600/ 600 - error: 0.00210 loss: 0.00675 time 3096ms
epoch  32/50 batch  600/ 600 - error: 0.00163 loss: 0.00628 time 3120ms
epoch  33/50 batch  600/ 600 - error: 0.00182 loss: 0.00611 time 3045ms
epoch  34/50 batch  600/ 600 - error: 0.00125 loss: 0.00468 time 3140ms
epoch  35/50 batch  600/ 600 - error: 0.00183 loss: 0.00598 time 3093ms
epoch  36/50 batch  600/ 600 - error: 0.00232 loss: 0.00711 time 3068ms
epoch  37/50 batch  600/ 600 - error: 0.00170 loss: 0.00571 time 3057ms
epoch  38/50 batch  600/ 600 - error: 0.00162 loss: 0.00530 time 3115ms
epoch  39/50 batch  600/ 600 - error: 0.00155 loss: 0.00513 time 3226ms
epoch  40/50 batch  600/ 600 - error: 0.00150 loss: 0.00501 time 2987ms
epoch  41/50 batch  600/ 600 - error: 0.00122 loss: 0.00425 time 3117ms
epoch  42/50 batch  600/ 600 - error: 0.00108 loss: 0.00383 time 3102ms
epoch  43/50 batch  600/ 600 - error: 0.00165 loss: 0.00533 time 2977ms
epoch  44/50 batch  600/ 600 - error: 0.00142 loss: 0.00469 time 3009ms
epoch  45/50 batch  600/ 600 - error: 0.00098 loss: 0.00356 time 3055ms
epoch  46/50 batch  600/ 600 - error: 0.00127 loss: 0.00409 time 3076ms
epoch  47/50 batch  600/ 600 - error: 0.00132 loss: 0.00438 time 3068ms
epoch  48/50 batch  600/ 600 - error: 0.00130 loss: 0.00459 time 3045ms
epoch  49/50 batch  600/ 600 - error: 0.00107 loss: 0.00365 time 3103ms
Restore the best (error) weights from epoch 45
Training took 160s

Evaluation Results
   error: 0.01740
    loss: 0.07861
evaluation took 67ms

 | %        | Timer                         | Count  | Total     | Average   |
 | 100.000% | net:train:ft                  | 1      | 160.183s  | 160.183s  |
 | 100.000% | net:trainer:train             | 1      | 160.183s  | 160.183s  |
 |  99.997% | net:trainer:train:epoch       | 50     | 160.178s  | 3.20356s  |
 |  84.422% | net:trainer:train:epoch:batch | 30000  | 135.229s  | 4.50764ms |
 |  84.261% | sgd::train_batch              | 30000  | 134.971s  | 4.49904ms |
 |  44.404% | sgd::grad                     | 30000  | 71.1271s  | 2.3709ms  |
 |  35.453% | sgd::forward                  | 30000  | 56.7893s  | 1.89298ms |
 |  32.245% | sgd::update_weights           | 90000  | 51.6505s  | 573.894us |
 |  32.226% | sgd::apply_grad:nadam         | 180000 | 51.6211s  | 286.783us |
 |  28.399% | dense:dyn:forward             | 180300 | 45.4903s  | 252.303us |
 |  17.642% | dropout:train:forward         | 60000  | 28.2595s  | 470.99us  |
 |  13.707% | net:trainer:train:epoch:error | 50     | 21.957s   | 439.14ms  |
 |  12.148% | dense:dyn:gradients           | 90000  | 19.4587s  | 216.207us |
 |   4.299% | sgd::backward                 | 30000  | 6.88546s  | 229.515us |
 |   3.301% | dense:dyn:backward            | 60000  | 5.28729s  | 88.121us  |
 |   0.560% | dense:dyn:errors              | 60000  | 896.471ms | 14.941us  |
 |   0.407% | dropout:backward              | 60000  | 651.523ms | 10.858us  |
 |   0.339% | dropout:test:forward          | 60000  | 542.799ms | 9.046us   |
 |   0.161% | net:compute_loss:CCE          | 60100  | 257.915ms | 4.291us   |
 |   0.099% | sgd::error                    | 30000  | 158.33ms  | 5.277us   |

I hope this will make the output of the machine learning framework more useful.

All this support is now in the master branch of the DLL project if you want to check it out. You can also check out the example online: mnist_mlp.cpp

You can access the project on Github.