Use templight and Templar to debug C++ templates
   Posted:


C++ has some very good tools to debug, profile and analyze source files and executables. This all works well for standard runtime programs. But when you are using templates, you sometimes want these tools to act at compile time, and at that point the support is much scarcer.

templight and Templar are two tools that try to fix this issue.

From the templight site:

Templight is a Clang-based tool to profile the time and memory consumption of template instantiations and to perform interactive debugging sessions to gain introspection into the template instantiation process.

and Templar is a visualization tool for the traces generated by templight.

Installation

Unfortunately, the templight installation is not user-friendly at all. You need to clone the complete LLVM/Clang tree and add templight inside it before compiling the complete clang toolchain. But that is the case for all clang-based tools... You also need to patch clang but that may not be necessary in the future. The complete instructions are available here.

The installation of Templar is much more convenient:

git clone https://github.com/schulmar/Templar.git
cd Templar
git checkout feature/templight2
qmake .
make
sudo make install

The feature/templight2 branch has many more features than master and should support both Qt4 and Qt5, but I have only tested it with Qt4.

Example

Let's use the classic compile-time Fibonacci computation as an example:

#include <iostream>

template <std::size_t N>
struct Fibonacci {
    static constexpr const std::size_t value = Fibonacci<N-1>::value + Fibonacci<N-2>::value;
};

template <>
struct Fibonacci<1> {
    static constexpr const std::size_t value = 1;
};

template <>
struct Fibonacci<0> {
    static constexpr const std::size_t value = 0;
};

int main(){
    std::cout << "Fibonacci<5>:" << Fibonacci<5>::value << std::endl;
}

Nothing fancy here; we're simply printing the fifth Fibonacci number to the console.

You can compile it with templight++:

templight++ -Xtemplight -profiler -Xtemplight -memory -Xtemplight -ignore-system -std=c++14 main.cpp

All the templight options start with -Xtemplight; after them, you can use any clang++ options. This will generate an a.memory.trace.pbf file in the current directory. You can then run Templar and use File > Open Trace to open the trace file. This should open a window of this sort:

/images/templar.png

The top-left panel contains the source code of the application, automatically refreshed whenever you move in the template tree. In the top right, there is the template instantiation graph. In the bottom left, you'll see the list of files so that you can filter them, and in the bottom right, the list of template events. You can sort the list of template events by duration, which is really convenient. You can then select Fibonacci<5> by double-clicking it in the list (once sorted, it should be near the top). This should give you a tree looking something like this:

/images/templar_tree.png

The square nodes are template instantiations and the round nodes are template memoizations (instantiations reused from a previous one). We can directly see that each instantiation was only done once. I think this graph view is really helpful if you need to debug computations done at compile time. Not all nodes are displayed; this is because there is a limit on the displayed depth. Simply click on Fibonacci<3> and the remaining nodes will be shown.

I have already used this tool to find the most time-consuming templates in ETL and DLL. It is a great way to find out where to focus when improving template compile times. I have also been able to find some unnecessary instantiations that could be avoided (either with SFINAE or with refactoring).

templight also contains a fully-fledged debugger for template programs, but I haven't tested it.

Conclusion

In conclusion, I would say that templight and Templar are really helping with template debugging and profiling. There is a real lack of tools in this domain and I hope to see more tools of this kind in the future. I hope this will help you develop template-heavy programs or template metaprograms.


Improve DLL and ETL Compile Time further
   Posted:


For a while now, the compilation time of my matrix/vector computation library (ETL), based on Expression Templates, has been getting more and more problematic. I've already worked on this problem here and there, using some general techniques (pragmas, precompiled headers, header removal and so on). In this post, I'll talk about two major improvements I have been able to make directly in the code.

Use of static_if

Remember static_if ? I was able to use it to really reduce the compile time of DLL.

I wrote a script to time each test case of the DLL project and find the test cases that took the longest to compile. Once I had found the best candidate, I isolated the functions that took the longest to compile. It was quite tedious and I did it by hand, primarily by commenting out parts of the code and going deeper and deeper into it. I was quite surprised to find that a single function call (a template function, of course ;) ) was responsible for 60% of the compilation time of my candidate test case. The function was instantiating a whole bunch of expression templates (to compute the free energy of several models). The function itself was not really optimizable, but what was really interesting is that it was only used in some very rare cases and that these cases were known at compile time :) This was a perfect case for a static_if. And once the call was inside the static_if, the test case indeed compiled about 60% faster. This reduced the overall compilation time of DLL by about 30%!

This could of course also have been achieved by using two functions, one with the call and one empty, selected by SFINAE (Substitution Failure Is Not An Error). I prefer the static_if version since it really shows the intent and hides SFINAE behind a nicer syntax.
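As a minimal sketch of the pattern (using the static_if emulation described in a previous post; the trait and function names here are hypothetical, not the actual DLL code):

template <typename RBM>
void update_model(RBM& rbm){
    // The expensive expression templates behind compute_free_energy()
    // are only instantiated when the compile-time flag is set
    static_if<rbm_traits<RBM>::has_free_energy()>([&](auto f){
        f(rbm).compute_free_energy();
    });
}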

I was also able to use static_if in other places in the DLL code to avoid instantiating some templates, but the improvements were much less dramatic (about 1% of the total compilation time). I was very lucky to find a single function that accounted for so much compile time. After some more tests, I concluded that much of the compilation time of DLL was spent compiling the Expression Templates from my ETL library, so I decided to delve into the ETL code directly.

Removal of std::async

The second improvement was very surprising. I was working on improving the compilation of ETL and found out that the sum and average reductions of matrices were dramatically slow to compile, about an order of magnitude slower than standard operations on matrices. In parallel (and the two facts are linked), I also found another weird fact when splitting a file into 10 parts (the file was comprised of 10 test cases): compiling the 10 parts separately (and sequentially, not in multiple threads) was about 40% faster than compiling the complete file. There was no swapping, so it was not a memory issue. This is not expected: generally, it is faster to compile a big file than to compile its parts separately. The advantage of smaller files is that you can compile them in parallel and that incremental builds are faster (you only recompile a small part).

By elimination, I found out that most of the time was spent inside the function that was dispatching in parallel the work for accumulating the sum of a matrix. Here is the function:

template <typename T, typename Functor, typename AccFunctor>
inline void dispatch_1d_acc(bool p, Functor&& functor, AccFunctor&& acc_functor, std::size_t first, std::size_t last){
    if(p){
        std::vector<std::future<T>> futures(threads - 1);

        auto n = last - first;
        auto batch = n / threads;

        for(std::size_t t = 0; t < threads - 1; ++t){
            futures[t] = std::async(std::launch::async, functor, first + t * batch, first + (t+1) * batch);
        }

        acc_functor(functor(first + (threads - 1) * batch, last));

        for(auto& fut : futures){
            acc_functor(fut.get());
        }
    } else {
        acc_functor(functor(first, last));
    }
}

There isn't anything really fancy about this function. It takes one functor that will be run in parallel and one functor for accumulation. It dispatches all the work in batches and then accumulates the results. I tried several things to optimize the compilation time of this function, but nothing worked. The line that was consuming all the time was the std::async line. The function was using std::async because the thread pool that I generally use does not support returning values from parallel functors. I decided to work around that with my thread pool and came up with this version:

template <typename T, typename Functor, typename AccFunctor>
inline void dispatch_1d_acc(bool p, Functor&& functor, AccFunctor&& acc_functor, std::size_t first, std::size_t last){
    if(p){
        std::vector<T> futures(threads - 1);
        cpp::default_thread_pool<> pool(threads - 1);

        auto n = last - first;
        auto batch = n / threads;

        auto sub_functor = [&futures, &functor](std::size_t t, std::size_t first, std::size_t last){
            futures[t] = functor(first, last);
        };

        for(std::size_t t = 0; t < threads - 1; ++t){
            pool.do_task(sub_functor, t, first + t * batch, first + (t+1) * batch);
        }

        acc_functor(functor(first + (threads - 1) * batch, last));

        pool.wait();

        for(auto fut : futures){
            acc_functor(fut);
        }
    } else {
        acc_functor(functor(first, last));
    }
}

I simply preallocate space for the results of all the threads and create a new functor that calls the input functor and saves its result inside the vector. It is less nice, but it works well. And it compiles MUCH faster. This reduced the compilation time of my biggest test case by a factor of 8 (from 344 seconds to 44 seconds). This is really crazy. It also fixed the problem where splitting the test case was faster than compiling the big file (it is now twice as fast to compile the big file as to compile all the small files separately). Overall, this made the total compilation of DLL about four times faster.

As of now, I still have no idea why this makes such a big difference. I have looked at the std::async code, but I haven't found a valid reason for this slowdown. If someone has any idea, I'd be very glad to discuss in the comments below.

Improving the template instantiation tree

I recently discovered templight, a profiler for templates (pretty cool). After some time, I was able to build it and use it on ETL. For now, I haven't been able to reduce the compile time a lot, but I have been able to shrink the template instantiation tree considerably: I saw that some instantiations were completely useless and optimized the code to remove them.

I won't go into much detail here because I plan to write a post on this subject in the coming days.

Conclusion

In conclusion, I would say that it is pretty hard to improve the compile time of complex C++ programs once you have gone through all the standard methods. However, I was very happy to find that two optimizations in the source code made the overall compilation of DLL almost five times faster. I will continue working on this, but for now, the compilation time is much more reasonable.

I hope the two main findings in this article were interesting. If you have similar experiences, comments or ideas for further improvements, I'd be glad to discuss them with you in the comments :)


Detect overflows and more in Java with COJAC
   Posted:


Back at school, I worked on the COJAC project to detect numeric overflows in Java programs automatically. Since then, the project has evolved a lot and now has more features:

  • Detect integer overflows
  • Detect smearing and cancellation with the float and double types
  • Detect NaN and infinite results from computations
  • Detect offending type casts

Moreover, all these features are available without any recompilation of your program. You simply add an argument to the invocation of the Java virtual machine and all these errors will be detected for you automatically!

Frédéric Bapst, the person in charge of the project, has recently published two videos about it; don't hesitate to check them out:

The first video presents the automatic analysis features of the tool:

And the second presents the numeric wrappers of the tool, which enable even more features:

If you have any questions related to the project, you can add a comment to this page or contact me directly by email.

If you want more information on the project, you can also check out its repository on Github: https://github.com/Cojac/Cojac


Aggregator Plugin : Display global metrics in Sonarqube
   Posted:


Recently, I wanted to know how many lines of code I had on my Sonar server across all my C++ projects. SonarSource offers a commercial plugin (Views) that can do that (and much more...), but I didn't want to pay thousands of dollars simply to get a total of my lines of code, so I wrote a very simple Sonar plugin to compute some global metrics.

The plugin is very simple: it only provides a global widget that aggregates some stats over all your projects. For instance, here is the result on my Sonar server:

/images/aggregator_widget.png

The plugin is freely available on Github: https://github.com/wichtounet/aggregator-plugin. However, it has only been tested on my Sonar server (4.5.2) and it is my first Sonar plugin, so it may not work everywhere. If you experience issues, don't hesitate to open an issue on Github or to propose a Pull Request.

You can install the plugin by putting the .jar file (from the Github Releases page) into your sonar/extensions/plugins directory and restarting Sonar. You should then have access to a new global widget that you can add to a dashboard.

I hope this plugin helps some of you.


Upgrade to Nikola 7
   Posted:


I've finally taken the time to upgrade the website to Nikola 7 (it is about time, I know...).

The migration worked flawlessly: I simply had to update the configuration to migrate deprecated and renamed settings, and it worked really well. I also had to add a comma to the COMPILERS list because of the move to Python 3.3.

As you may have seen, I haven't posted in a while. I had quite some work for my thesis as well as for the courses I give at my school, and I started playing Path of Exile, which took quite a bit of my free time :) I'll try to post some updates on the projects I'm working on to bring this blog back to life.


Simulate static_if with C++11/C++14
   Posted:


If you are doing a lot of template metaprogramming and other template magic, you are likely to miss a static_if in the language. Unfortunately, it didn't make the cut for C++11 and it seems unlikely that it will make it into C++17.

static_if

As its name indicates, static_if is an if statement, but one that is resolved at compile time. At first, it could seem that the main point is performance, but that is not the case. With recent compilers, if you have an if statement on a compile-time constant, it will never be evaluated at runtime and only the correct branch will be included in the final executable code. However, even if the compiler knows that a branch will never be executed, it still has to ensure that this branch compiles. This is not the case with static_if: only the taken branch is compiled, the other can contain invalid code. The most common reason to use a static_if is inside a template, where you perform a test on a template argument and execute code based on this test. static_if has another advantage over a standard if: since only one branch is instantiated, it may save quite a lot of compile time.

Let's say we have to write a template function that, if the template argument is a string, removes the last character of the string argument, and otherwise decrements the argument (I know, a stupid example, but a simple one). With static_if, you could write it like this:

template<typename T>
void decrement_kindof(T& value){
    static_if(std::is_same<std::string, T>::value){
        value.pop_back();
    } else {
        --value;
    }
}

I think it is quite elegant.

The problem

Some may think that we could do the same with a standard C++ if statement:

template<typename T>
void decrement_kindof(T& value){
    if(std::is_same<std::string, T>::value){
        value.pop_back();
    } else {
        --value;
    }
}

However, this won't work: the template cannot be instantiated for std::string since std::string doesn't have an operator--, and it cannot be instantiated for int since int doesn't have a pop_back() member function.

There are two solutions in plain C++: specialization and SFINAE. Let's start with specialization:

template<typename T>
void decrement_kindof(T& value){
    --value;
}

template<>
void decrement_kindof(std::string& value){
    value.pop_back();
}

We write a specialization for the std::string case, so that the general case uses -- and the std::string case uses pop_back(). And here is the SFINAE version:

template<typename T, std::enable_if_t<!std::is_same<std::string, T>::value, int> = 42>
void decrement_kindof(T& value){
    --value;
}

template<typename T, std::enable_if_t<std::is_same<std::string, T>::value, int> = 42>
void decrement_kindof(T& value){
    value.pop_back();
}

The first function is enabled when the type is not a std::string and the second function is enabled when the type is a std::string.

Both solutions need two functions to make this work. In this particular case, specialization is easier since the condition names exactly one type. If the condition were more complex, for instance testing that a constant inside the type is equal to some value, we could only do it with SFINAE.
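For instance, here is a sketch of such a condition, assuming the type exposes a static dimensions constant:

template<typename T, std::enable_if_t<(T::dimensions == 2), int> = 42>
void process(T& value){
    // Only selected for two-dimensional types
}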

Even if both solutions work, they are more complicated than the static_if version and they create more functions than should be necessary.

One solution

There is a way to emulate a kind of static_if with C++14 generic lambdas. It essentially uses anonymous template functions to emulate what we did with the previous solutions, but does it behind the scenes. Here is the code I'm using for this emulation:

namespace static_if_detail {

struct identity {
    template<typename T>
    T operator()(T&& x) const {
        return std::forward<T>(x);
    }
};

template<bool Cond>
struct statement {
    template<typename F>
    void then(const F& f){
        f(identity());
    }

    template<typename F>
    void else_(const F&){}
};

template<>
struct statement<false> {
    template<typename F>
    void then(const F&){}

    template<typename F>
    void else_(const F& f){
        f(identity());
    }
};

} //end of namespace static_if_detail

template<bool Cond, typename F>
static_if_detail::statement<Cond> static_if(F const& f){
    static_if_detail::statement<Cond> if_;
    if_.then(f);
    return if_;
}

Note: I got the idea (and most of the code) from the Boost Mailing List.

The condition is passed as a non-type template parameter and the code for the branch is passed as a generic lambda. The static_if function returns a statement structure. We could avoid returning a struct and directly execute, or not, the functor based on the condition, but using a structure allows for the else_ part, which may be practical. The statement structure is specialized on the condition: when the condition is true, then executes the functor and else_ does nothing; the specialization for a false condition does the contrary. A special point here is the use of the identity functor, which is passed to the lambda. The user can use it to make non-dependent expressions dependent. This is necessary if we want to call functions on non-dependent types when these functions may not exist. For instance, you may want to call a function on this, which is not a dependent expression.

Here is how the code will look using this solution:

template<typename T>
void decrement_kindof(T& value){
    static_if<std::is_same<std::string, T>::value>([&](auto f){
        f(value).pop_back();
    }).else_([&](auto f){
        --f(value);
    });
}

It is not as elegant as the "real" static_if version, but it is closer than the other solutions.

If you don't use the lazy identity function (f), it still works on g++, but not on clang, for some reason.
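To illustrate the point about non-dependent expressions, here is a sketch (a hypothetical class, not code from a real library) where this is not dependent; wrapping it with f delays the lookup of pop_back() until the branch is actually instantiated:

template<typename T>
struct container {
    T value;

    void shrink(){
        static_if<std::is_same<std::string, T>::value>([&](auto f){
            // f(this) makes the expression dependent, so pop_back()
            // is only looked up if this branch is instantiated
            f(this)->value.pop_back();
        });
    }
};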

Conclusion

We saw that there are ways to emulate static_if in C++ that you may use to make the code easier to read. I'm personally using this trick for branches with few lines of code and when I don't have to use the identity function too much; otherwise, it is cleaner to use standard SFINAE functions to do the job. When you only have an if and no else, this trick is even better, because that is where it saves the most code.

I hope this can be useful to some of you ;)

You can find my implementation on Github.


Improve ETL compile-time with Precompiled Headers
   Posted:


Very recently, I started trying to improve the compile time of the ETL test suite. While not critical, it is always better to have tests that compile as fast as possible. In a previous post, I was able to improve the time a bit by improving the makefile, using #pragma once and avoiding <iostream> headers. With these techniques, I reduced the compile time from 87.5 to 84.1 seconds, which is not bad, but not as good as I would have expected.

In that previous attempt, I had not tried to use Precompiled Headers (PCH) to improve the compile time, so I thought it would be a good time to do it.

Precompiled Headers

Precompiled headers are a compiler feature where a header itself gets compiled. Normally, you only compile source files into object files, but you can also compile headers, although it is not the same thing. When a compiler compiles a header, it can do a lot of the work up front (macros, includes, AST, symbols) and store all the results in a precompiled header file. When you then compile the source files, the compiler will try to use the precompiled header file instead of the real header file. Of course, this could break C++ semantics, since a precompiled header cannot behave differently based on macros defined before its inclusion, for instance. For these reasons (and probably implementation reasons as well), precompiled headers are really limited.

If we take the case of G++, it will consider the precompiled header file instead of the standard header only if (for a complete list, take a look at the GCC docs):

  • The compilation flags are the same between the two compilations
  • The same compiler binary is used for both compilations
  • Only one precompiled header is used in each compilation
  • The same macros are defined
  • The #include of the header appears before every other C/C++ token

If all these conditions are met, you #include "header.hpp", and there is a header.hpp.gch file (the precompiled version) available in the search path, then the precompiled header will be used instead of the standard one.

With clang, it is a bit different: the precompiled header cannot be included automatically, but has to be included explicitly in the source code, meaning you have to modify your code for this technique to work. This is a bad thing in my opinion: you should never have to modify your code to benefit from a compiler feature. This is why I haven't used, and don't plan to use, precompiled headers with clang.

How-to

Once you know all the conditions for a precompiled header to be automatically included, it is quite straightforward to use them.

Generating a PCH file is easy:

g++ options header.hpp

This will generate header.hpp.gch. When you compile a source file using header.hpp, you don't have anything special to do; you just compile it as usual and, if all the conditions are met, the PCH file will be used instead of the plain header.
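For instance, a minimal session could look like this (hypothetical file names; remember that the flags must match exactly between the two compilations):

g++ -std=c++14 -O2 common.hpp        # produces common.hpp.gch
g++ -std=c++14 -O2 -c main.cpp       # picks up common.hpp.gch automatically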

Results and conclusion

I added precompiled header support to my make-utils collection of Makefile utilities and tested it on ETL. I precompiled a header that itself includes Catch and ETL; almost all test files include this header. With this change, I went from 84 seconds to 78 seconds, with the header taking 1.5 seconds to precompile. This is a nice result, I think. If your application is not as template-heavy as mine or if you have more source files, you can expect better improvements.

To conclude, even if precompiled headers are a sound way to reduce compile time, they are really limited to certain cases. I'm not a fan of the feature overall: it is not portable between compilers and not standard. Anyway, if you really need to save some time, you should not hesitate too much ;)


How I improved (a bit) the compile time of ETL
   Posted:


Recently, I read several articles about C++ and compile times and wondered if I could improve the compile time of my Expression Template Library (ETL) project. ETL is a header-only and template-heavy library. I'm not going to change the design completely or use type erasure techniques to reduce the compile time: ETL is all about performance.

As a disclaimer, don't expect fancy results from this post, I haven't been able to reduce compile time a lot, but I still wanted to share my experience.

I've used g++-4.9.2 to perform these tests.

I'm compiling the complete test suite (around 6900 source lines of code in 36 files) in release mode. Each test file includes ETL (around 10K SLOC). Each build is run with 8 threads (make -j8). For each result, I have run a complete build 5 times and taken the best time as the final result. Everything runs on an SSD and I have more than enough RAM to handle all the compilations in parallel.

The reference build time was 87.5 seconds.

Compile and generate dependency files at the same time

To help write my makefiles, I'm using a set of functions that I have written. This includes automatic dependency generation using the -MM and -MT options of the compiler. Until now, I had two targets: one to compile the cpp file into the object file and another to generate the dependency file. I recently saw that compilers are able to do both at the same time! Clang, G++ and the Intel compiler all have -MD and -MF options that let you generate the dependency file at the same time as you compile the file, saving you at least one read of the file.

My compilation rule in my makefile has now become:

release/$(1)/%.cpp.o: $(1)/%.cpp
    @ mkdir -p release/$(1)/
    $(CXX) $(CXX_FLAGS) $(RELEASE_FLAGS) $(2) -MD -MF release/$(1)/$$*.cpp.d -o release/$(1)/$$*.cpp.o -c $(1)/$$*.cpp
    @ sed -i -e 's@^\(.*\)\.o:@\1.d \1.o:@' release/$(1)/$$*.cpp.d

This reduced the compilation time to 86.8 seconds. Not that much of a reduction, but it is still nice to know about. I would have expected this to reduce the compile time more.

Use #pragma once

Normally, I'm not a fan of #pragma since it is not standard, but for now ETL only supports three compilers, and only very recent versions of them, so I have the guarantee that #pragma once is available. So what the hell!

I've replaced all the include guards by single #pragma once directives.
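That is, each header went from a classic include guard (the guard name here is hypothetical) to a single directive:

// Before
#ifndef ETL_SOME_HEADER_HPP
#define ETL_SOME_HEADER_HPP
// ...
#endif

// After
#pragma once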

Again, the results are not impressive: this reduced the compile time to 86.2 seconds. I would only advise using this if you are sure of the compilers you want to support and you need the extra time.

Avoid <iostream>

I've read that the <iostream> header is one of the slowest of the standard library to compile. It is included several times in my headers, only for stream operators, and it turns out that there is an <iosfwd> header that forward declares a lot of things from <iostream> and the other I/O headers.

By replacing all <iostream> includes with <iosfwd>, the compile time went down to 84.1 seconds.
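This works because a header that only declares stream operators does not need the full definitions. A sketch with a hypothetical class (not actual ETL code):

#include <iosfwd> // forward declares std::ostream, much cheaper than <iostream>

template <typename T>
struct matrix;

// The declaration only needs std::ostream as an incomplete type; the
// definition, in a cpp file or another header, includes <ostream>
template <typename T>
std::ostream& operator<<(std::ostream& os, const matrix<T>& m);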

Conclusion

By using the three techniques, I've reduced the compile time from 87.5 to 84.1 seconds. I would have honestly hoped for more improvement, but this is already a good start.

As a side note, clang's compile time is 45.2 seconds under the same conditions (it was 46.2 seconds before the optimizations). It is really much faster :) I'm still using GCC a lot since, in several cases, it generates much better code, and on average the generated code is faster (on my benchmarks at least). I don't have the numbers for icc, but icc is definitely the slowest of the three. When I have it available (at work), I use it for release builds before running something. The generated executables are generally faster (I only use Intel processors) and sometimes the difference can be quite significant.

If you have ideas to further reduce the compile time of this test case, I'd be glad to hear them and put them to the test.

I hope that this small experience would be helpful to some of you :)

Other techniques

There are several other techniques that you can use to reduce compile time:

  1. Precompiled Headers are supported by both Clang and GCC, although not in a compatible way. I haven't tested this in a while, but it is quite effective and a very interesting technique. The main problem is that it is not standard and not compatible between compilers. But it is probably the most efficient technique when you have lots of headers and lots of templates, as in my case.
  2. Unity builds can make full rebuilds much faster. I personally don't like unity builds, especially because they are only really good for full builds and you generally don't do full rebuilds that much (I know, I know, that is also the test done in this article :) ). Moreover, they also suck at parallel builds.
  3. The Pimpl idiom and other type erasure techniques can reduce compile time a lot. If done well, they can be implemented without much overhead.
  4. Explicit instantiation of templates can also help, but only in the case of a user program; for the library itself, you cannot do much.
  5. Reduce inclusions and use forward declarations, obviously...
  6. Use tools like distcc (I very rarely use it) and ccache (I generally use it).
  7. Update your compiler
  8. Upgrade your computer ;)
  9. ...

Continuous Performance Management with CPM for C++
   Posted:


For some time, I have wanted a tool to monitor the performance of some of my projects. There are plenty of tools for Continuous Integration, and Sonar is really great for continuous monitoring of code quality, but I haven't found anything satisfying for monitoring the performance of C++ code. So I decided to write my own. Continuous Performance Monitor (CPM) is a simple C++ tool that helps you run benchmarks for your C++ programs and generates web reports based on the results. In this article, I will present this tool. CPM is especially made to benchmark several subparts of libraries, but it can perfectly well be used to benchmark a whole program.

The idea is to couple it with a Continuous Integration tool (I use Jenkins, for instance) and run the benchmarks for every new push to a repository. With that, you can check for performance regressions, or simply see if your changes really improved performance as much as you thought.

It is made of two separate parts:

  1. A header-only library that you can use to benchmark your program and that will give you the performance results. It will also generate a JSON report of the collected data.
  2. A program that will generate a web version of the reports with analysis over time, over different compilers or over different configurations.

CPM is especially made to benchmark functions that take input data and whose runtime depends on the dimensions of that data. CPM will execute each benchmark with several different input sizes. There are different ways to define a benchmark:

  • two_pass: The benchmark is made of two parts: the initialization part is called once for each input size, and then the benchmark part is repeated several times for the measurement. This is the most general version.
  • global: The benchmark will be run with different input sizes, but uses global data that is randomized before each measurement
  • simple: The benchmark will be run with different input sizes; the data is not randomized
  • once: The benchmark will be run with no input size.

Note: The randomization of the data can be disabled.

You can run independent benchmarks or sections of benchmarks. A section is used to compare different implementations of the same thing. For instance, I use them to compare different implementations of convolution, or to see how ETL competes with other Expression Templates libraries.

/images/cpm_large.png

Examples

I've uploaded three generated reports so that you can have a look at some results:

Run benchmarks

There are two ways of running CPM: you can use the library directly to run the benchmarks, or you can use the macro facilities to make it easier. I recommend the second way since it is easier and I'm gonna try to keep it stable, while the library itself can change. If you want an example of using the library directly, you can take a look at this example. In this chapter, I'm gonna focus on the macro way.

The library is available here; you can either include it as a submodule of your project or install it globally to have access to its headers.

The first thing to do is to include the CPM header:

#define CPM_BENCHMARK "Example Benchmarks"
#include "cpm/cpm.hpp"

You have to name your benchmark. This will automatically create a main function that runs all the declared benchmarks.

Define benchmarks

Benchmarks can be defined either in a CPM_BENCH functor or in the global scope with CPM_DIRECT_BENCH.

  1. simple
CPM_DIRECT_BENCH_SIMPLE("bench_name", [](std::size_t d){ std::this_thread::sleep_for((factor * d) * 2_ns ); })

The first argument is the name of the benchmark and the second is the function that will be benchmarked; this function takes the input size as its argument.

  2. global
CPM_BENCH() {
    test a{3};
    CPM_GLOBAL("bench_name", [&a](std::size_t d){ std::this_thread::sleep_for((factor * d * a.d) * 1_ns ); }, a);
}

The first argument is the name of the benchmark, the second is the function being benchmarked and the following arguments must be references to global data that will be randomized by CPM.

  3. two_pass
CPM_DIRECT_BENCH_TWO_PASS("bench_name",
    [](std::size_t d){ return std::make_tuple(test{d}); },
    [](std::size_t d, test& d2){ std::this_thread::sleep_for((factor * 3 * (d + d2.d)) * 1_ns ); }
)

Again, the first argument is the name. The second argument is the initialization functor. This functor must return a tuple with all the data that will be passed (unpacked) to the third argument (the benchmark functor). Everything returned by the initialization functor will be randomized.

Select the input sizes

By default, CPM will invoke your benchmarks with sizes from 10 to 1000000, multiplying by 10 at each step. This can be tuned for each benchmark and section independently. Each benchmark macro has a variant with a _P suffix that allows you to set the size policy:

CPM_SIMPLE_P(
    VALUES_POLICY(1,2,3,4,5,6),
    "simple_a_n",
    [](std::size_t d){ std::this_thread::sleep_for((factor * d) * 1_ns ); });

You can also have several sizes (for multidimensional data structures or algorithms):

CPM_DIRECT_BENCH_TWO_PASS_P(
    NARY_POLICY(VALUES_POLICY(16, 16, 32, 32, 64, 64), VALUES_POLICY(4, 8, 8, 16, 16, 24)),
    "convmtx2",
    [](std::size_t d1, std::size_t d2){ return std::make_tuple(dmat(d1, d1), dmat((d1 + d2 - 1)*(d1 + d2 - 1), d2 * d2)); },
    [](std::size_t /*d1*/, std::size_t d2, dmat& a, dmat& b){ b = etl::convmtx2(a, d2, d2); }
)

Configure benchmarks

By default, each benchmark is run 10 times for warmup and then repeated 50 times, but you can define your own values:

#define CPM_WARMUP 3
#define CPM_REPEAT 10

This must be done before the inclusion of the header.
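In other words, the order matters (a small sketch):

// Overridden before cpm.hpp is included, so the defaults do not apply
#define CPM_WARMUP 3
#define CPM_REPEAT 10

#define CPM_BENCHMARK "Example Benchmarks"
#include "cpm/cpm.hpp"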

Define sections

Sections are simply groups of benchmarks: instead of putting several benchmarks inside a CPM_BENCH, you can put them inside a CPM_SECTION. For instance:

CPM_SECTION("mmul")
    CPM_SIMPLE("std", [](std::size_t d){ std::this_thread::sleep_for((factor * d) * 9_ns ); });
    CPM_SIMPLE("fast", [](std::size_t d){ std::this_thread::sleep_for((factor * (d / 3)) * 1_ns ); });
    CPM_SIMPLE("common", [](std::size_t d){ std::this_thread::sleep_for((factor * (d / 2)) * 3_ns ); });
}
CPM_SECTION("conv")
    CPM_TWO_PASS("std",
        [](std::size_t d){ return std::make_tuple(test{d}); },
        [](std::size_t d, test& d2){ std::this_thread::sleep_for((factor * 5 * (d + d2.d)) * 1_ns ); }
        );
    CPM_TWO_PASS("fast",
        [](std::size_t d){ return std::make_tuple(test{d}); },
        [](std::size_t d, test& d2){ std::this_thread::sleep_for((factor * 3 * (d + d2.d)) * 1_ns ); }
        );
}

You can also set different warmup and repeat values for each section by using CPM_SECTION_O:

CPM_SECTION_O("fft",11,51)
    test a{3};
    test b{5};
    CPM_GLOBAL("std", [&a](std::size_t d){ std::this_thread::sleep_for((factor * d * (d % a.d)) * 1_ns ); }, a);
    CPM_GLOBAL("mkl", [&b](std::size_t d){ std::this_thread::sleep_for((factor * d * (d % b.d)) * 1_ns ); }, b);
}

This section will be warmed up 11 times and repeated 51 times.

The size policy can also be changed for a complete section (it cannot be changed independently for the benchmarks inside the section):

CPM_SECTION_P("mmul",
    NARY_POLICY(STD_STOP_POLICY, VALUES_POLICY(1,2,3,4,5,6), VALUES_POLICY(2,4,8,16,32,64)))
    test a{3};
    test b{5};
    CPM_GLOBAL("std", [&a](std::size_t d1,std::size_t d2, std::size_t d3){ /* Something */ }, a);
    CPM_GLOBAL("mkl", [&a](std::size_t d1,std::size_t d2, std::size_t d3){ /* Something */ }, a);
    CPM_GLOBAL("bla", [&a](std::size_t d1,std::size_t d2, std::size_t d3){ /* Something */ }, a);
}

Run

Once your benchmarks and sections are defined, you can build your program like a normal C++ executable and run it. You can pass several options:

./debug/bin/full -h
Usage:
  ./debug/bin/full [OPTION...]

  -n, --name arg           Benchmark name
  -t, --tag arg            Tag name
  -c, --configuration arg  Configuration
  -o, --output arg         Output folder
  -h, --help               Print help

The tag is used to distinguish between runs; I recommend using an SCM identifier for it. If you want to run your program with different configurations (compiler options, for instance), you'll have to set the configuration with the --configuration option.

Here is a possible output:

 Start CPM benchmarks
    Results will be automatically saved in /home/wichtounet/dev/cpm/results/10.cpm
    Each test is warmed-up 10 times
    Each test is repeated 50 times
    Time Sun Jun 14 15:33:51 2015

    Tag: 10
    Configuration:
    Compiler: clang-3.5.0
    Operating System: Linux x86_64 3.16.5-gentoo


 simple_a(10) : mean: 52.5us (52.3us,52.7us) stddev: 675ns min: 48.5us max: 53.3us througput: 190KEs
 simple_a(100) : mean: 50.1us (48us,52.2us) stddev: 7.53us min: 7.61us max: 52.3us througput: 2MEs
 simple_a(1000) : mean: 52.7us (52.7us,52.7us) stddev: 48.7ns min: 52.7us max: 53us througput: 19MEs
 simple_a(10000) : mean: 62.6us (62.6us,62.7us) stddev: 124ns min: 62.6us max: 63.5us througput: 160MEs
 simple_a(100000) : mean: 161us (159us,162us) stddev: 5.41us min: 132us max: 163us througput: 622MEs
 simple_a(1000000) : mean: 1.16ms (1.16ms,1.17ms) stddev: 7.66us min: 1.15ms max: 1.18ms througput: 859MEs

-----------------------------------------
|            gemm |       std |     mkl |
-----------------------------------------
|           10x10 | 51.7189us | 64.64ns |
|         100x100 | 52.4336us | 63.42ns |
|       1000x1000 | 56.0097us |  63.2ns |
|     10000x10000 | 95.6123us | 63.52ns |
|   100000x100000 | 493.795us | 63.48ns |
| 1000000x1000000 | 4.46646ms |  63.8ns |
-----------------------------------------

For each benchmark, the program gives you the mean duration (with a confidence interval), the standard deviation of the samples, the min and max durations and an estimated throughput. The throughput is simply computed from the size and the mean duration. Each section is directly compared in a table-like output. Once the benchmark has run, a JSON report is generated inside the output folder.

Continuous Monitoring

Once you have run the benchmark, you can use the CPM program to generate the web reports. It will generate:

  • one performance graph for each benchmark and section
  • one graph comparing the performance of your benchmarks and sections over time, if you have run the benchmark several times
  • one graph comparing different compilers, if you have compiled your program with several compilers
  • one graph comparing different configurations, if you have run the benchmark with several configurations
  • one summary table for each benchmark/section

First, you have to build and install the CPM program (you can have a look at the Readme for more information).

Several options are available:

Usage:
  cpm [OPTION...]  results_folder

      --time-sizes             Display multiple sizes in the time graphs
  -t, --theme arg              Theme name [raw,bootstrap,boostrap-tabs] (default:bootstrap)
  -c, --hctheme theme_name     Highcharts Theme name [std,dark_unica] (default:dark_unica)
  -o, --output output_folder   Output folder (default:reports)
      --input arg              Input results
  -s, --sort-by-tag            Sort by tag instaed of time
  -p, --pages                  General several HTML pages (one per bench/section)
  -d, --disable-time           Disable time graphs
      --disable-compiler       Disable compiler graphs
      --disable-configuration  Disable configuration graphs
      --disable-summary        Disable summary table
  -h, --help                   Print help

There are 3 themes:

  • bootstrap: The default theme, using Bootstrap to make a responsive interface.
  • bootstrap-tabs: Similar to the bootstrap theme, except that only one result is displayed at a time for each benchmark, using tabs.
  • raw: A very basic theme, only using the Highcharts library for graphs. It is very minimalistic.

For instance, here is how the reports are generated for the ETL benchmark:

cpm -p -s -t bootstrap -c dark_unica -o reports results

Here is the graph generated for the "R = A + B + C" benchmark and different compilers:

/images/cpm_etl_compiler.png

and its summary:

/images/cpm_etl_summary.png

Here is the graph for a 2D convolution with ETL:

/images/cpm_etl_section.png

And the graph for different configurations of ETL and the dense matrix matrix multiplication:

/images/cpm_etl_configuration.png

Conclusion and Future Work

Although CPM is already working, there are several things that could be done to improve it further:

  • The generated web report could benefit from a global summary.
  • The throughput should be evaluated more carefully.
  • The tool should automatically determine the number of times each test should be run to get a good result, instead of using global warmup and repeat constants.
  • A better bootstrapping procedure should be used to determine the quality of the results and compute the confidence intervals.
  • The performances of the website with lots of graphs should be improved.
  • Make CPM more general-purpose to support larger needs.

Here it is: I have summed up most of the features of the CPM Continuous Performance Monitor tool. I hope that it will be helpful to some of you as well.

If you have other ideas or want to contribute something to the project, you can directly open an issue or a pull request on Github. Or contact me via this site or Github.


C++17 Fold Expressions
   Posted:


Variadic Templates

C++11 introduced variadic templates to the language. This feature allows writing template functions and classes that take an arbitrary number of template parameters. It is a feature I really like and I have already used it quite a lot in my different libraries. Here is a very simple example computing the sum of its parameters:

auto old_sum(){
    return 0;
}

template<typename T1, typename... T>
auto old_sum(T1 s, T... ts){
    return s + old_sum(ts...);
}

What can be seen here is a typical use of variadic templates. Almost all the time, it is necessary to use recursion and several functions to unpack the parameters and process them. There is only one way to unpack the arguments: the ... operator, which simply puts commas between the expanded arguments. Even if it works well, it is a bit heavy on the code. This will likely be completely optimized into a series of additions by the compiler, but in more complicated functions, it may happen that it is not. Moreover, the intent is not always clear this way.

That is why C++17 introduces an extension to variadic templates: fold expressions.

Fold expressions

Fold expressions are a new way to unpack variadic parameters with operators. For now, only Clang 3.6 supports C++17 fold expressions, with the -std=c++1z flag. That is the compiler I used to validate the examples in this post.

The syntax is a bit disturbing at first, but quite logical:

( pack op ... )             //(1)
( ... op pack )             //(2)
( pack op ... op init )     //(3)
( init op ... op pack )     //(4)

where pack is an unexpanded parameter pack, op an operator and init a value. Version (1) is a right fold, which is expanded as (P1 op (P2 op (P3 ... (PN-1 op PN)))). Version (2) is a left fold, where the expansion starts from the left. Versions (3) and (4) are almost the same, except that they take an init value. Only some operators (+, *, &, |, &&, ||, and the comma) have a defined default init value and can be used with versions (1) and (2); the other operators can only be used with an explicit init value.

For instance, here is how we could write the sum functions with fold expressions:

template<typename... T>
auto fold_sum_1(T... s){
    return (... + s);
}

I personally think it is much better: it clearly states our intent and does not need recursion. By default, the init value used for addition is 0, but you can change it:

template<typename... T>
auto fold_sum_2(T... s){
    return (1 + ... + s);
}

This will yield the sum of the elements plus one.
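For instance, a quick check of both versions (a sketch, compiled with the same -std=c++1z flag):

#include <iostream>

int main(){
    std::cout << fold_sum_1(1, 2, 3) << '\n'; // prints 6
    std::cout << fold_sum_2(1, 2, 3) << '\n'; // prints 7 (the init value adds 1)
}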

This can also be very practical, for instance to print a list of elements:

template<typename ...Args>
void print_1(Args&&... args) {
    (std::cout << ... << args) << '\n';
}

And this can even be used for Template Metaprogramming; for instance, here is a TMP version of the and operator:

template<bool... B>
struct fold_and : std::integral_constant<bool, (B && ...)> {};
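A quick sanity check of this version (a sketch; these asserts compile with the same -std=c++1z flag):

static_assert(fold_and<true, true, true>::value, "all true");
static_assert(!fold_and<true, false, true>::value, "one false");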

Conclusion

C++17 fold expressions are a really nice addition to the language that makes working with variadic templates much easier. This already makes me wish for the C++17 release :)

The source code for the examples is available on Github: https://github.com/wichtounet/articles/blob/master/src/fold_expressions.cpp
