<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog blog("Baptiste Wicht"); (Posts about gcc)</title><link>https://baptiste-wicht.com/</link><description></description><atom:link href="https://baptiste-wicht.com/categories/gcc.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><lastBuildDate>Sun, 15 Feb 2026 06:57:40 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Decrease DLL neural network compilation time with C++17</title><link>https://baptiste-wicht.com/posts/2018/02/decrease-dll-neural-network-compilation-time-with-c%2B%2B17.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Just last week, &lt;a class="reference external" href="https://baptiste-wicht.com/posts/2018/02/c%2B%2B17-migration-of-expression-templates-library-etl.html"&gt;I've migrated my Expression Templates Library (ETL) library to C++17&lt;/a&gt;,
and the same is now done for my Deep Learning Library (DLL). In ETL, this
resulted in &lt;em&gt;much nicer code overall&lt;/em&gt;, but no real improvement in compilation
time.&lt;/p&gt;
&lt;p&gt;The objective of the migration of DLL was two-fold. First, I wanted to
simplify some code, especially with &lt;code&gt;if constexpr&lt;/code&gt;. Second, I especially
wanted to try to reduce the compilation time. In the past,
&lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html"&gt;I've already tried a few changes with C++17&lt;/a&gt;, with good results on the compilation of the entire test suite.
While this is very good, it is not very representative of how users use the library.
Indeed, you'll normally have only one network in your source file, not several.
The new changes will especially help in the case of many networks, but less in
the case of a single network per source file.&lt;/p&gt;
&lt;p&gt;This time, I decided to test the compilation on the examples. I've tested the
nine official examples from the DLL library:&lt;/p&gt;
&lt;ol class="arabic simple" start="0"&gt;
&lt;li&gt;&lt;p&gt;mnist_dbn: A fully-connected Deep Belief Network (DBN) on the MNIST data set
with three layers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;char_cnn: A special CNN with embeddings and merge and group layers for text
recognition&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;imagenet_cnn: A 12-layer Convolutional Neural Network (CNN) for Imagenet&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_ae: A simple two-layer auto-encoder for MNIST&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_cnn: A simple 6-layer CNN for MNIST&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_deep_ae: A deep auto-encoder for MNIST, only fully-connected&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_lstm: A Recurrent Neural Network (RNN) with Long Short-Term Memory
(LSTM) cells&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_mlp: A simple fully-connected network for MNIST, with dropout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_rnn: A simple RNN with simple cells for MNIST&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is really representative of what users can do with the library and I think
it's a much better benchmark for compilation time.&lt;/p&gt;
&lt;p&gt;For reference, you can find &lt;a class="reference external" href="https://github.com/wichtounet/dll/tree/master/examples/src"&gt;the source code of all the examples online&lt;/a&gt;.&lt;/p&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;Let's start with the results. I've tested this at different stages of the
migration with clang 5 and GCC 7.2. I tested the following steps:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;The original C++14 version&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simply compiling in C++17 mode (-std=c++17)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using the C++17 version of the ETL library&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upgrading DLL to C++17 (without ETL)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ETL and DLL in C++17 versions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I've compiled each example independently in release_debug mode. Here are the
results for G++ 7.2:&lt;/p&gt;
&lt;table class="align-center"&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Example&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;0&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;7&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;C++14&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.818&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.944&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.511&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.403&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.998&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.911&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.745&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.974&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.006&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;-std=c++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.358&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.409&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.707&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.810&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.042&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.896&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.635&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.134&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.027&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;ETL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.045&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.000&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.942&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.322&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.840&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.747&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.151&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.208&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.939&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;DLL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.251&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.577&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.854&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.653&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.758&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.851&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.606&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.098&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.146&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.289&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.133&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.939&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.232&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.753&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.526&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.326&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.116&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.819&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final Improvement&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.62%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.49%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.67%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.11%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.15%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.27%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.24%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The difference from just enabling C++17 is not significant. On the other hand,
some significant gains can be obtained by using the C++17 version of ETL,
especially for the DBN version and for the CNN versions. Except for the DBN
case, the migration of DLL to C++17 did not bring any significant advantage.
When everything is combined, the gains are larger :) In the best case,
the example is 14.6% faster to compile.&lt;/p&gt;
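&lt;p&gt;The &lt;em&gt;Final Improvement&lt;/em&gt; row is consistent with the relative reduction in
compilation time between the C++14 baseline and the final C++17 version. As
a worked example for example 0 (the symbols are only notation for this sketch):&lt;/p&gt;

```latex
\text{improvement}
  = \frac{t_{\text{C++14}} - t_{\text{final}}}{t_{\text{C++14}}}
  = \frac{37.818 - 32.289}{37.818}
  \approx 14.62\%
```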
&lt;p&gt;Let's see if it's the same with clang++ 5.0:&lt;/p&gt;
&lt;table class="align-center"&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Example&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;0&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;7&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;C++14&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.690&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.753&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.488&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.146&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.926&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.708&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.806&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.207&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.858&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;-std=c++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.502&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.664&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.990&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.027&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.510&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.630&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.465&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.161&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.860&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;ETL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.386&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.008&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.896&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.519&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.269&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.995&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.897&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.383&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.809&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;DLL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.252&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.592&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.250&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.131&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.782&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.606&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.595&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.126&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.782&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.470&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.154&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.881&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.415&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.279&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.078&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.808&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.497&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.761&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final Improvement&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.28%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.60%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.15%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.55%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.34%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.69%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.25%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;First of all, as I have seen time after time, clang is still slower than GCC.
It's not a big difference, but still significant. Overall, the gains are a bit
higher on clang than on GCC, but not by much. Interestingly, the migration of
DLL to C++17 is less beneficial in terms of compilation time for clang. It
even seems to slow down compilation on some examples. On the other hand, the
migration of ETL matters more than on GCC.&lt;/p&gt;
&lt;p&gt;Overall, every example is faster to compile using both libraries in C++17, but
we don't have spectacular speed-ups. With clang, we have speedups from 3.3% to
15.3%. With GCC, we have speedups from 1.1% to 14.6%. It's not very high, but
I'm already satisfied with these results.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="c-17-in-dll"&gt;
&lt;h2&gt;C++17 in DLL&lt;/h2&gt;
&lt;p&gt;Overall, the migration of DLL to C++17 was quite similar to that of ETL. You can
take a look at my &lt;a class="reference external" href="https://baptiste-wicht.com/posts/2018/02/c%2B%2B17-migration-of-expression-templates-library-etl.html"&gt;previous article&lt;/a&gt;
if you want more details on C++17 features I've used.&lt;/p&gt;
&lt;p&gt;I've &lt;em&gt;replaced a lot of SFINAE functions&lt;/em&gt; with &lt;code&gt;if constexpr&lt;/code&gt;. I've also
replaced a lot of &lt;code&gt;static_if&lt;/code&gt; with &lt;code&gt;if constexpr&lt;/code&gt;. There was a large
number of these in DLL's code. I also enabled all the &lt;code&gt;constexpr&lt;/code&gt; that
were commented out, waiting for this exact moment :)&lt;/p&gt;
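&lt;p&gt;To give an idea of what this kind of replacement looks like, here is a minimal
sketch. The function names (&lt;code&gt;describe_sfinae&lt;/code&gt; and &lt;code&gt;describe&lt;/code&gt;) are
hypothetical and are not taken from DLL; they only illustrate the pattern:&lt;/p&gt;
<![CDATA[
```cpp
#include <string>
#include <type_traits>

// C++14 style: two overloads, one of which is removed by SFINAE.
template <typename T, std::enable_if_t<std::is_integral<T>::value, int> = 0>
std::string describe_sfinae(T) {
    return "integral";
}

template <typename T, std::enable_if_t<!std::is_integral<T>::value, int> = 0>
std::string describe_sfinae(T) {
    return "other";
}

// C++17 style: a single function; the branch that is not taken
// is discarded at compile time for each instantiation.
template <typename T>
std::string describe(T) {
    if constexpr (std::is_integral_v<T>) {
        return "integral";
    } else {
        return "other";
    }
}
```
]]>
&lt;p&gt;Both versions select the behaviour at compile time, but the second one needs
a single function template instead of one overload per condition.&lt;/p&gt;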
&lt;p&gt;I was also thinking that I could replace a lot of meta-programming code with
&lt;em&gt;fold expressions&lt;/em&gt;. While I was able to replace a few of them, most of them were
harder to replace with fold expressions. Indeed, the variadic pack is often
hidden behind another class and therefore the pack is not directly usable from
the network class or the group and merge layer classes. I didn't want to start
a big refactoring just to use a C++17 feature; the current state of this code is
fine.&lt;/p&gt;
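&lt;p&gt;For the cases where the pack is directly available, the change looks roughly
like the following sketch. The names &lt;code&gt;sum_recursive&lt;/code&gt; and &lt;code&gt;sum_fold&lt;/code&gt;
are hypothetical and only serve as an illustration, they do not come from DLL:&lt;/p&gt;
<![CDATA[
```cpp
#include <cstddef>

// C++14 style: recursion over the parameter pack, with a base case.
constexpr std::size_t sum_recursive() {
    return 0;
}

template <typename First, typename... Rest>
constexpr std::size_t sum_recursive(First first, Rest... rest) {
    return first + sum_recursive(rest...);
}

// C++17 style: a binary fold over operator+, with 0 as the initial
// value so that an empty pack is also valid.
template <typename... Values>
constexpr std::size_t sum_fold(Values... values) {
    return (values + ... + std::size_t{0});
}
```
]]>
&lt;p&gt;This only works when the function sees the pack itself. When the pack is wrapped
in another class, it first has to be extracted again, which is where the
refactoring cost mentioned above comes from.&lt;/p&gt;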
&lt;p&gt;I made some use of structured bindings as well, but again not as much as I was
expecting. In fact, a lot of the time, I'm assigning the elements of a pair or tuple
to existing variables, not declaring new variables, and unfortunately, you can
only use structured bindings with an &lt;code&gt;auto&lt;/code&gt; declaration.&lt;/p&gt;
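&lt;p&gt;The difference between the two situations can be sketched as follows. The helper
&lt;code&gt;min_max&lt;/code&gt; and the function &lt;code&gt;spread&lt;/code&gt; are hypothetical names used
only for this illustration:&lt;/p&gt;
<![CDATA[
```cpp
#include <tuple>
#include <utility>

// Hypothetical helper returning two values at once.
std::pair<int, int> min_max(int a, int b) {
    return (a < b) ? std::make_pair(a, b) : std::make_pair(b, a);
}

int spread() {
    // Structured bindings declare two NEW variables.
    auto [low, high] = min_max(7, 3);

    // Existing variables cannot be the target of a structured binding,
    // so std::tie is still needed to assign to them.
    int existing_low  = 0;
    int existing_high = 0;
    std::tie(existing_low, existing_high) = min_max(10, 2);

    return (high - low) + (existing_high - existing_low);
}
```
]]>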
&lt;p&gt;Overall, the &lt;em&gt;code is significantly better now&lt;/em&gt;, but there was less impact than
there was on ETL. It's also a smaller code base, so maybe this is normal and my
expectations were too high ;)&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The trunk of DLL is now a C++17 library :) I think this improves the quality of
the code by a nice margin! Even though there is still some work to be done to
improve the code, especially for the DBN pretraining code, the quality is quite
good now. Moreover, the switch to C++17 made neural networks
using the DLL library &lt;em&gt;faster to compile&lt;/em&gt;, from 1.1% in the worst case to 15.3% in
the best case! I don't know when I will release the next version of DLL, but it
will take some time. In particular, I'll have to polish the RNN support and add
a sequence-to-sequence loss before I release the 1.1 version of DLL.&lt;/p&gt;
&lt;p&gt;I'm quite satisfied with C++17 even if I would have liked a few more features to
play with! I'm already a big fan of &lt;code&gt;if constexpr&lt;/code&gt;, which can make the code
much nicer, and fold expressions are much more intuitive than their previous
recursive template counterparts.&lt;/p&gt;
&lt;p&gt;I may also consider migrating some parts of the cpp-utils library, but if I do,
it will only be through the use of conditionals in order not to break the other
projects that are based on the library.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>C++17</category><category>clang</category><category>Compilers</category><category>Deep Learning</category><category>dll</category><category>etl</category><category>gcc</category><category>Machine Learning</category><category>Performance</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2018/02/decrease-dll-neural-network-compilation-time-with-c%2B%2B17.html</guid><pubDate>Wed, 07 Feb 2018 10:39:02 GMT</pubDate></item><item><title>How I made my Deep Learning Library 38% faster to compile (Optimization and C++17 if constexpr)</title><link>https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;div&gt;&lt;p&gt;My Deep Learning Library (DLL) project is a C++ library for training and using
artificial neural networks (you can take a look at
&lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/07/update-on-deep-learning-library-dll-dropout-batch-normalization-adaptive-learning-rates.html"&gt;this post about DLL&lt;/a&gt;
if you want more information).&lt;/p&gt;
&lt;p&gt;While I made a lot of effort to make it as fast as possible to train and run
neural networks, the compilation time has been steadily going up and is becoming
quite annoying. This library is heavily templated and all the matrix operations
are done using my Expression Templates Library (ETL) which is more than
template-heavy itself.&lt;/p&gt;
&lt;p&gt;In this post, I'll present two techniques with which I've been able to reduce
the total compilation of the DLL unit tests by up to 38%.&lt;/p&gt;
&lt;p class="more"&gt;&lt;a href="https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html"&gt;Read more…&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><category>C++</category><category>C++17</category><category>clang</category><category>Compilers</category><category>dll</category><category>etl</category><category>gcc</category><category>Performance</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html</guid><pubDate>Thu, 21 Sep 2017 17:44:34 GMT</pubDate></item><item><title>Compiler benchmark GCC and Clang on C++ library (ETL)</title><link>https://baptiste-wicht.com/posts/2017/08/compiler-benchmark-gcc-clang-cpp-library-etl.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;It's been a while since I've done a benchmark of different compilers on C++
code. Since I've recently
&lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/08/expression-templates-library-etl-11.html"&gt;released the version 1.1 of my ETL project&lt;/a&gt;
(an optimized matrix/vector computation library with expression templates), I've
decided to use it as the base of my benchmark. It's a C++14 library with a lot
of templates. I'm going to compile the full test suite (124 test cases). This is
done directly on the last release (1.1) code. I'm going to compile once in debug
mode and once in release_debug (release plus debug symbols and assertions) and
record the times for each compiler. The tests were compiled with support for
every option in ETL to account for the maximal compilation time. Each compilation was
made using four threads (make -j4). I'm also going to test a few of the
benchmarks to see the difference in runtime performance between the code
generated by each compiler. The benchmark will be compiled in release mode and
its compilation time recorded as well.&lt;/p&gt;
&lt;p&gt;I'm going to test the following compilers:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;GCC-4.9.4&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCC-5.4.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCC-6.3.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCC-7.1.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;clang-3.9.1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;clang-4.0.1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;zapcc-1.0 (commercial, based on clang-5.0 trunk)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All have been installed directly using Portage (the Gentoo package manager), except
for clang-4.0.1, which has been installed from source, and zapcc, since it does not
have a Gentoo package. Since the clang package on Gentoo does not support
multislotting, I had to install one version from source and the other from the
package manager. This is also the reason I'm testing fewer versions of clang:
it is simply less practical.&lt;/p&gt;
&lt;p&gt;For the purpose of these tests, the exact same options have been used across
all the compilers. Normally, I use different options for clang than for GCC
(mainly more aggressive vectorization options on clang). This may not lead to
the best performance for each compiler, but allows for comparison between the
results with the default optimization levels. Here are the main options used:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;In debug mode: -g&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In release_debug mode: -g -O2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In each case, a lot of warnings are enabled and the ETL options are the same.&lt;/p&gt;
&lt;p&gt;All the results have been gathered on a Gentoo machine running on an Intel Core
i7-2600 (Sandy Bridge...) @3.4GHz with 4 cores and 8 threads, 12GB of RAM and
an SSD. Although I do my best to isolate the benchmark from perturbations as much as
possible and my benchmark code is quite sound, it may well be that
some results are not totally accurate. Moreover, some of the benchmarks are
using multithreading, which may add some noise and unpredictability. When I was
not sure about the results, I ran the benchmarks several times to confirm them
and overall I'm confident in the results.&lt;/p&gt;
&lt;section id="compilation-time"&gt;
&lt;h2&gt;Compilation Time&lt;/h2&gt;
&lt;p&gt;Let's start with the results of the performance of the compilers themselves:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Debug&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Release_Debug&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Benchmark&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;402s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;616s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;100s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;403s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;642s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;95s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;399s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;683s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;102s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;371s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;650s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;380s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;807s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;106s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;260s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;718s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;92s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;221s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;649s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;108s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Note: For Release_Debug and Benchmark, I only use three threads with zapcc,
because 12GB of RAM is not enough memory for four threads.&lt;/p&gt;
&lt;p&gt;There are some very significant differences between the different compilers.
Overall, clang-4.0.1 is by far the fastest free compiler in debug mode. When
the tests are compiled with optimizations, however, clang falls behind.
It's quite impressive how clang-4.0.1 manages to be so much faster than
clang-3.9.1 both in debug mode and release mode. Really great work by the clang
team here! With these improvements, clang-4.0.1 is almost on par with gcc-7.1
in release mode. For GCC, it seems that the cost of optimization has been going
up quite significantly. However, GCC 7.1 seems to have made optimization faster
and standard compilation much faster as well. If we take zapcc into account,
it's the fastest compiler in debug mode, but it's slower than several gcc
versions in release mode.&lt;/p&gt;
&lt;p&gt;Overall, I'm quite impressed by the performance of clang-4.0.1 which seems
really fast! I'll definitely run more tests with this new version of the
compiler in the near future. It's also good to see that g++-7.1 made
the build faster than gcc-6.3. However, the fastest gcc version for optimization
is still gcc-4.9.4 which is already an old branch with low C++ standard support.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="runtime-performance"&gt;
&lt;h2&gt;Runtime Performance&lt;/h2&gt;
&lt;p&gt;Let's now take a look at the quality of the generated code. For some of the
benchmarks, I've included two versions of the algorithm. &lt;em&gt;std&lt;/em&gt; is the
simplest algorithm (the naive one) and &lt;em&gt;vec&lt;/em&gt; is the hand-crafted vectorized and
optimized implementation. All the tests were done on single-precision
floating-point numbers.&lt;/p&gt;
&lt;section id="dot-product"&gt;
&lt;h3&gt;Dot product&lt;/h3&gt;
&lt;p&gt;The first benchmark computes the dot product between two
vectors. Let's look first at the naive version:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;dot (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.96ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;97.12ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;126.07ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.91us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;326.49us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.92ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.22ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.36ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;72.96ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;101.62ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;127.89ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.90us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;357.63us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.57ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.32ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;73.31ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;102.88ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;130.16ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.314us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;339.13us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.47ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.95ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.70ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.69ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;70.20ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;104.09ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;130.98ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.90us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.96us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;281.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.58ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.33ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.69ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;98.69ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;128.60ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.33us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;272.71us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.56ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.37ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;60.31ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;96.34ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;128.90ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.87us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;270.21us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.18ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.35ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.14ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;96.92ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;125.95ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.84us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;285.80us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.92ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.34ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The differences are not very significant between the different compilers. The
clang-based compilers seem to be the compilers producing the fastest code.
Interestingly, there seems to have been a big regression in gcc-6.3 for large
containers, but that has been fixed in gcc-7.1.&lt;/p&gt;
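&lt;p&gt;For reference, a naive single-precision dot product of the kind measured by the
&lt;em&gt;std&lt;/em&gt; version can be sketched as follows. This is only an illustration under
the assumption of plain &lt;code&gt;std::vector&lt;/code&gt; inputs of equal size; the function name
&lt;code&gt;dot_naive&lt;/code&gt; is hypothetical and this is not ETL's actual implementation:&lt;/p&gt;
<![CDATA[
```cpp
#include <cstddef>
#include <vector>

// Naive dot product: one multiply and one add per element, leaving
// any vectorization entirely to the compiler.
// Assumes a and b have the same size.
float dot_naive(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```
]]>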
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;dot (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.34ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;80.53ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114.97ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.79us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;354.20us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.52ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.55ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;47.16ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;77.70ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;113.66ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.71us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;363.86us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.52ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.56ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;46.39ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;77.67ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;116.28ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.74us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;452.44us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.26ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.49ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.52ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;49.70ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;80.40ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;115.77ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.71us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.46us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;355.16us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.21ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.49ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.14ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.47ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;46.13ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;78.01ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114.70ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.66us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.82us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;359.42us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.50ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;45.59ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;74.90ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;111.29ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.57us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;351.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.49ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.12ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.45ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;45.11ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;75.04ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;111.28ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.59us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.46us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;357.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.25ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.15ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.47ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If we look at the optimized version, the differences are even smaller. Again, the
clang-based compilers are producing the fastest executables, but are closely
followed by gcc, except for gcc-6.3 in which we can still see the same
regression as before.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="logistic-sigmoid"&gt;
&lt;h3&gt;Logistic Sigmoid&lt;/h3&gt;
&lt;p&gt;The next test is to check the performance of the sigmoid operation. In that
case, the evaluator of the library will try to use parallelization and
vectorization to compute it. Let's see how the different compilers fare:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sigmoid&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.16us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.23us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.33us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.56us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;259.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.78ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.07us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.08us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.44us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;266.27us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.96ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.13us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.45us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.99us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;261.81us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.86ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.03us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.09us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.24us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.61us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;252.78us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.71ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.30us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.25us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.57us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.24us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;256.75us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.99ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.14us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.77us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.03us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;235.87us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.81ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.51us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.26us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.48us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.86us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;258.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.95ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Interestingly, we can see that gcc-7.1 is the fastest for small vectors while
clang-4.0 produces the best code for larger vectors. However, except for
the biggest vector size, the difference is not really significant. Apparently,
there is a regression in zapcc (or clang-5.0), since it is slower than clang-4.0
and at the same level as clang-3.9.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="y-alpha-x-y-axpy"&gt;
&lt;h3&gt;y = alpha * x + y (axpy)&lt;/h3&gt;
&lt;p&gt;The third benchmark is the well-known axpy (y = alpha * x + y). This is entirely
resolved by expression templates in the library; no specific algorithm is used.
Let's see the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;saxpy&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.1ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.6ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;374ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.65us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.8us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;518us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.0ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;58.1ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;383ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.87us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;43.2us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;479us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.3ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;59.4ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;371ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.57us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.4us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;452us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.8ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;59.7ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;399ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.78us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;43.1us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;547us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.3ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;53.8ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;297ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.21us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.3us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;466us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.4ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;59.8ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;296ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.2us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;475us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.0ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.0ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;333ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.7us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;447us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even on the biggest vector, this is a very fast operation, once vectorized and
parallelized. At this speed, some of the differences observed may not be highly
significant. Again clang-based versions are the fastest versions on this code,
but by a small margin.  There also seems to be a slight regression in gcc-7.1,
but again quite small.&lt;/p&gt;
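&lt;p&gt;As a point of reference, here is a minimal hand-written sketch of the loop that
the axpy expression boils down to once it has been evaluated. The name
&lt;code&gt;axpy_reference&lt;/code&gt; is hypothetical and this is not the library's
code: it is a plain scalar loop, without the vectorization and parallelization
that the benchmarked version uses.&lt;/p&gt;

```cpp
// Reference loop for axpy: y = alpha * x + y over n elements.
// Hypothetical helper for illustration only; not part of the library.
void axpy_reference(float alpha, const float* x, float* y, unsigned n) {
    for (unsigned i = 0; i != n; ++i) {
        // One multiply and one add per element, written back in place.
        y[i] = alpha * x[i] + y[i];
    }
}
```

&lt;p&gt;Called with alpha = 2, x = {1, 2, 3} and y = {10, 20, 30}, it leaves
y = {12, 24, 36}. With so little work per element, it is not surprising that all
the compilers end up close to each other.&lt;/p&gt;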
&lt;/section&gt;
&lt;section id="matrix-matrix-multiplication-gemm"&gt;
&lt;h3&gt;Matrix Matrix multiplication (GEMM)&lt;/h3&gt;
&lt;p&gt;The next benchmark is testing the performance of a Matrix-Matrix Multiplication,
an operation known as GEMM in the BLAS nomenclature. In that case, we test both
the naive and the optimized vectorized implementation. To save some horizontal
space, I've split the tables in two.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;20&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;40&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;80&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.04us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;50.15us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;356.42us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.18ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.41ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.56ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.14us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;74.77us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;513.64us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.92ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.03us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.78us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;504.41us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.87ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.95us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;65.00us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;508.84us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.84ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.58us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.59us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;222.36us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.73ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.77ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.41ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.00us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;190.56us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.61ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.80ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.00us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.38us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;189.98us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.43ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.81ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;200&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;300&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;400&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;600&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;700&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;800&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;900&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1200&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;148.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;455.81ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;687.96ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.47s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.98s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.81s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.00s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.91s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;9.52s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;63.17ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;213.01ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;504.83ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;984.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.70s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.70s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.03s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.74s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.87s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.90s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;212.12ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;502.95ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;981.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.69s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.13s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.85s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.10s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.08s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;62.57ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;210.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;499.68ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;974.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.68s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.67s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.99s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.68s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.85s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;13.49s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;27.48ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;90.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;219.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;419.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.72s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.18s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.90s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.44s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.36s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.84s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.01ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;73.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;175.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;340.70ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.58s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.93s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.40s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.98s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.79s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.69s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.33ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;75.80ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;181.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;359.13ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.63s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.02s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.52s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.24s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.21s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.62s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, the differences between the different compilers are very significant.
The clang compilers are leading the way by a large margin here, with clang-4.0
being the fastest of them (by another nice margin). Indeed, clang-4.0.1 is
producing code that is, on average, about twice as fast as the code generated
by the best GCC compiler. Very interestingly as well, we can see a huge
regression starting from GCC-5.4 that is still present in GCC-7.1. Indeed, the
best GCC version, among the tested versions, is again GCC-4.9.4. Clang is really
doing an excellent job of compiling the GEMM code.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;20&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;40&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;80&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;264.27ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.95us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.28us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.77us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.50us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;60.37us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;271.41ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.99us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.811us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.116us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.00us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;279.72ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.02us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.27us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.29us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.99us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;273.74ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.96us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.81us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.55us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.35us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;71.11us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;296.67ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.34us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.18us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.93us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.15us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;82.60us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;322.68ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.38us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.17us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.19us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.17us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;83.64us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;307.49ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.41us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.10us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;84.80us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;200&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;300&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;400&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;600&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;700&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;800&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;900&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1200&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;369.52us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.17ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;11.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.82ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;51.67ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.36ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;111.15ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;387.54us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.97ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.36ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;12.11ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;52.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;65.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;112.74ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;384.43us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.12ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;12.44ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.15ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;52.59ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;70.074ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;119.22ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;458.05us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.81ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.44ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.86ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;13.43ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.70ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;53.47ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;66.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;117.25ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;494.52us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.96ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.80ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;41.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;60.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;72.28ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;123.75ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;511.24us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.11ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;9.46ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;27.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;58.14ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;72.78ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;128.60ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;492.28us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.03ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;9.00ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.31ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.09ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;55.79ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;67.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;119.92ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As for the optimized version, it seems that the two families are reversed.
Indeed, GCC is doing a better job than clang here, and although the margin is
not as big as before, it's still significant. We can still observe a small
regression in GCC versions because the 4.9 version is again the fastest. As for
clang versions, it seems that clang-5.0 (used in zapcc) has had some performance
improvements for this case.&lt;/p&gt;
&lt;p&gt;For this case of matrix-matrix multiplication, it's very impressive that the
differences in the non-optimized code are so significant. And it's also
impressive that each family of compilers has its own strength, clang being
seemingly much better at handling unoptimized code while GCC is better at
handling vectorized code.&lt;/p&gt;
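&lt;p&gt;To make the naive case more concrete, here is a minimal sketch of a textbook
triple-loop GEMM. It assumes square matrices stored in row-major order, and the
name &lt;code&gt;sgemm_naive&lt;/code&gt; is hypothetical. This is an assumption about what
a naive version typically looks like, not the exact code that was
benchmarked.&lt;/p&gt;

```cpp
// Naive single-precision GEMM sketch: c = a * b for square n x n matrices
// stored in row-major order. Textbook triple loop, for illustration only;
// not the library's actual implementation.
void sgemm_naive(const float* a, const float* b, float* c, unsigned n) {
    for (unsigned i = 0; i != n; ++i) {
        for (unsigned j = 0; j != n; ++j) {
            float sum = 0.0f;
            // Dot product of row i of a with column j of b.
            for (unsigned k = 0; k != n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

&lt;p&gt;Such a loop nest performs n * n * n multiply-adds, which is 1,728,000,000 of
them for n = 1200, so the quality of the code generated for the innermost loop
dominates the running time.&lt;/p&gt;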
&lt;/section&gt;
&lt;section id="convolution-2d"&gt;
&lt;h3&gt;Convolution (2D)&lt;/h3&gt;
&lt;p&gt;The last benchmark that I considered is the case of the valid convolution on 2D
images. The code is quite similar to the GEMM code but more complicated to
optimize due to cache locality.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sconv2_valid (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;105x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;110x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;115x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;120x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;125x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;130x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;135x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;140x70&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;27.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.68ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;57.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;67.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;78.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;92.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105.08ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;76.63ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;89.75ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105.08ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;121.66ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;140.95ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.10ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44.99ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;76.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;89.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105.35ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;121.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;141.20ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;45.08ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.39ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.48ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;76.51ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;92.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;106.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;125.67ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;143.57ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.42ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.59ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.21ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.40ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.03ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.26ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;42.35ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;56.29ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.48ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.67ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.50ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.58ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;42.61ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;49.33ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;56.80ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.29ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.00ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.10ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.75ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.95ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;41.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.42ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;55.74ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In this case, we can observe the same thing as for the GEMM. The clang-based
versions are producing significantly faster code than the GCC versions. Moreover,
we can also observe the same large regression starting from GCC-5.4.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sconv2_valid (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;105x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;110x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;115x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;120x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;125x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;130x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;135x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;140x70&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;878.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.07ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.68ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.06ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.14ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;853.73us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.03ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.15ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.36ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.76ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.44ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.13ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;847.95us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.14ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.35ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.98ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.43ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.12ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;795.82us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.77ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.69ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.81ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;782.46us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.26ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.84ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.21ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.65ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.84ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;767.58us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.92ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.25ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.59ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.83ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.83ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;782.49us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.06ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.83ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.65ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.85ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, clang manages to produce excellent results. Indeed, all the
executables it produces are significantly faster than the versions produced by
GCC, except for GCC-7.1, which produces similar results. The other versions of
GCC seem to be falling behind. It seems that it was only for the GEMM that clang
was having a lot of trouble handling the optimized code.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Clang seems to have recently received a lot of optimizations regarding
compilation time. Indeed, clang-4.0.1 is much faster at compiling than clang-3.9.
Although GCC-7.1 is faster than GCC-6.3, all the more recent GCC versions are
slower than GCC-4.9.4, which is the fastest at compiling code with optimizations.
GCC-7.1 is the fastest GCC version for compiling code in debug mode.&lt;/p&gt;
&lt;p&gt;In some cases, there is almost no difference between different compilers in the
generated code. However, in more complex algorithms such as the matrix-matrix
multiplication or the two-dimensional convolution, the differences can be quite
significant. In my tests, Clang has proven to be much better at compiling
unoptimized code. However, and especially in the GEMM case, it seems to be worse
than GCC at handling hand-optimized code. I will investigate that case and try to
tailor the code so that clang has a better time with it.&lt;/p&gt;
&lt;p&gt;For me, it's really weird that the GCC regression, apparently starting from
GCC-5.4, has still not been fixed in GCC-7.1. I was thinking of dropping support
for GCC-4.9 in order to go for full C++14 support, but now I may have to
reconsider my position. However, seeing that GCC is generally the best at handling
optimized code (especially for GEMM), I may be able to do the transition, since
the optimized code will be used in most cases.&lt;/p&gt;
&lt;p&gt;As for zapcc, although it is still the fastest compiler in debug mode, with the
new speed of clang-4.0.1, its margin is quite small. Moreover, on optimized
builds, it's not as fast as GCC. If you use clang and have access to zapcc,
it's still quite a good option to save some time.&lt;/p&gt;
&lt;p&gt;Overall, I have been quite pleased by clang-4.0.1 and GCC-7.1, the most recent
versions I have been testing. It seems that they did quite some good work.
I will definitely run some more tests with them and try to adapt the code. I'm
still considering whether I will drop support for some older compilers.&lt;/p&gt;
&lt;p&gt;I hope this comparison was interesting :) My next post will probably be about
the difference in performance between my machine learning framework and other
frameworks to train neural networks.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>C++11</category><category>C++14</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>Performance</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2017/08/compiler-benchmark-gcc-clang-cpp-library-etl.html</guid><pubDate>Mon, 07 Aug 2017 07:16:21 GMT</pubDate></item><item><title>Partial type erasing in Deep Learning Library (DLL) to improve compilation time</title><link>https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;In a previous post, I compared the &lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/03/disappointing-zapcc-performance-on-deep-learning-library-dll.html"&gt;compilation time on my Deep Learning Library (DLL) project with different compilers&lt;/a&gt;. I realized that the compilation times were quickly becoming unreasonable for this library, especially for compiling the unit tests, which clearly hurts the development of the library. Indeed, you want to be able to run the unit tests reasonably quickly after you integrate new changes.&lt;/p&gt;
&lt;section id="reduce-the-compilation-time"&gt;
&lt;h2&gt;Reduce the compilation time&lt;/h2&gt;
&lt;p&gt;The first thing I did was to split the compilation into three executables: one
for the unit tests, one for the various performance tests and one for the other
miscellaneous tests. With this, it is much faster to compile only the unit test
cases.&lt;/p&gt;
&lt;p&gt;But this can be improved significantly further. In DLL, a network is a variadic
template containing the list of layers, in order. There are two main ways of
declaring a neural network. In the first version, the fast version, the layers
directly know their sizes:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-1" name="rest_code_7d60f8842b134ce4921751a494b3e333-1" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-2" name="rest_code_7d60f8842b134ce4921751a494b3e333-2" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-3" name="rest_code_7d60f8842b134ce4921751a494b3e333-3" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_layers&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-4" name="rest_code_7d60f8842b134ce4921751a494b3e333-4" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-5" name="rest_code_7d60f8842b134ce4921751a494b3e333-5" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-6" name="rest_code_7d60f8842b134ce4921751a494b3e333-6" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unit_type&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOFTMAX&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-7" name="rest_code_7d60f8842b134ce4921751a494b3e333-7" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sgd_trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-8" name="rest_code_7d60f8842b134ce4921751a494b3e333-8" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-9" name="rest_code_7d60f8842b134ce4921751a494b3e333-9" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-9"&gt;&lt;/a&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_unique&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-10" name="rest_code_7d60f8842b134ce4921751a494b3e333-10" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-10"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pretrain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-11" name="rest_code_7d60f8842b134ce4921751a494b3e333-11" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-11"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fine_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In my opinion, this is the best way to use DLL. This is the fastest and the
clearest. Moreover, the dimensions of the network can be validated at compile
time, which is always better than at runtime. However, the dimensions of the
network cannot be changed at runtime. For this, there is a different version,
the dynamic version:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-1" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-1" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-2" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-2" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-3" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-3" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_layers&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-4" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-4" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-5" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-5" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-6" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-6" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unit_type&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOFTMAX&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-7" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-7" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sgd_trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-8" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-8" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-9" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-9" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-9"&gt;&lt;/a&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_unique&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-10" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-10" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-10"&gt;&lt;/a&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-11" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-11" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-11"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;init_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-12" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-12" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-12"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;init_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-13" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-13" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-13"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;init_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-14" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-14" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-14"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-15" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-15" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-15"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-16" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-16" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-16"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-17" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-17" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-17"&gt;&lt;/a&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-18" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-18" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-18"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pretrain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-19" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-19" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-19"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fine_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is a bit more verbose, but the configuration can be changed at runtime with
this system. Moreover, this is also faster to compile. On the other hand, there
is some performance slowdown.&lt;/p&gt;
&lt;p&gt;There is also a third version that is a hybrid of the first two versions:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-1" name="rest_code_763ef33c181844ab925fab666a286e0b-1" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-2" name="rest_code_763ef33c181844ab925fab666a286e0b-2" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_dbn_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-3" name="rest_code_763ef33c181844ab925fab666a286e0b-3" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_layers&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-4" name="rest_code_763ef33c181844ab925fab666a286e0b-4" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-5" name="rest_code_763ef33c181844ab925fab666a286e0b-5" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-6" name="rest_code_763ef33c181844ab925fab666a286e0b-6" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unit_type&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOFTMAX&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-7" name="rest_code_763ef33c181844ab925fab666a286e0b-7" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sgd_trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-8" name="rest_code_763ef33c181844ab925fab666a286e0b-8" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-9" name="rest_code_763ef33c181844ab925fab666a286e0b-9" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-9"&gt;&lt;/a&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_unique&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-10" name="rest_code_763ef33c181844ab925fab666a286e0b-10" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-10"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pretrain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-11" name="rest_code_763ef33c181844ab925fab666a286e0b-11" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-11"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fine_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only one line was changed compared to the first version: &lt;code&gt;dbn_desc&lt;/code&gt;
becomes &lt;code&gt;dyn_dbn_desc&lt;/code&gt;. With this change, all the layers are
automatically transformed into their dynamic versions and all the parameters are
propagated at runtime. This is a form of type erasing since the sizes will not be
propagated at compilation time. But this is simple since the types are simply
transformed from one type to another directly. Behind the scenes, it's the
dynamic version using the front-end of the fast version. This is almost as fast
to compile as the dynamic version, but the code is much better. At runtime, it
behaves the same as the dynamic version.&lt;/p&gt;
&lt;p&gt;If we compare the compilation time of the three versions when compiling a single
network and 5 different networks with different architectures, we get the
following results (with clang):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Model&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Time [s]&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;1 Fast&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;1 Dynamic&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.6&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;1 Hybrid&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.6&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;5 Fast&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;5 Dynamic&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.6&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;5 Hybrid&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;21.9&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even with a single network, the compilation time is reduced by 44%. When five
different networks are compiled, the time is reduced by 85%. This can be
explained easily. Indeed, for the hybrid and dynamic versions, the layers will
have the same type and therefore a lot of template instantiations will only be
done once instead of five times. This makes a lot of difference since almost
everything is a template inside the library.&lt;/p&gt;
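&lt;p&gt;As a quick check, these percentages can be recomputed from the table above.
This is only a small sketch: the helper &lt;code&gt;reduction_percent&lt;/code&gt; is a
hypothetical name used for illustration and is not part of DLL. Note that the
85% figure corresponds to the dynamic version; with the hybrid version, the
reduction for five networks is about 81%.&lt;/p&gt;

```cpp
// Relative reduction in compilation time, in percent.
// Hypothetical helper for illustration only, not part of DLL.
constexpr double reduction_percent(double before, double after) {
    return (before - after) / before * 100.0;
}

// Times in seconds, taken from the table above (clang).
constexpr double single_network       = reduction_percent(30.0, 16.6);  // about 44.7
constexpr double five_fast_to_dynamic = reduction_percent(114.0, 16.6); // about 85.4
constexpr double five_fast_to_hybrid  = reduction_percent(114.0, 21.9); // about 80.8
```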
&lt;p&gt;Unfortunately, this also has an impact on the runtime of the network:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Model&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Pretrain [s]&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Train [s]&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Fast&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;195&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Dynamic&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;203&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;123&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Hybrid&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;204&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;122&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On average, for dense models, the slowdown is between 4% and 8%. For
convolutional models, it is between 10% and 25%. I will definitely work on
trying to make the dynamic and especially the hybrid version faster in the
future; most of the work should be on the matrix library (ETL) that is used.&lt;/p&gt;
&lt;p&gt;Since for test cases, a 20% increase in runtime is not really a problem, tests
being fast already, I decided to add an option to DLL so that everything can be
compiled by default in hybrid mode. By using a compilation flag, all the
&lt;code&gt;dbn_desc&lt;/code&gt; become &lt;code&gt;dyn_dbn_desc&lt;/code&gt; and therefore each
network becomes a hybrid network. Without a single change in the code, the
compilation time of the entire library can be significantly improved, as seen in
the next section. This can also be used in user code to improve compilation
time during debugging and experiments and can be turned off for the final
training.&lt;/p&gt;
&lt;p&gt;On my Continuous Integration system, I will build the system in both
configurations. This is not really an issue, since my personal machine at home
is more powerful than what I have available here.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;On a first experiment, I measured the difference before and after this change on
the three executables of the library, with gcc:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Model&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Unit [s]&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Perf [s]&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Misc [s]&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Before&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1029&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;192&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;937&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;After&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;617&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;143&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;619&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.03%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.93%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
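&lt;p&gt;The speedup row is the relative reduction of each column. Summing the three
executables gives the overall figure for the whole library. A minimal sketch of
that arithmetic follows; the helper and variable names are hypothetical and only
serve the illustration.&lt;/p&gt;

```cpp
// Relative reduction in percent; hypothetical helper, not part of DLL.
constexpr double reduction_percent(double before, double after) {
    return (before - after) / before * 100.0;
}

// Compilation times in seconds from the table above (gcc):
// unit tests + performance tests + miscellaneous tests.
constexpr double before_total = 1029.0 + 192.0 + 937.0; // 2158
constexpr double after_total  = 617.0 + 143.0 + 619.0;  // 1379

// Overall reduction across the three executables: about 36.1
constexpr double overall = reduction_percent(before_total, after_total);
```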
&lt;p&gt;It is clear that the speedups are very significant! The compilation is between
25% and 40% faster with the new option. Overall, this is a speedup of 36%!
I also noticed that the compilation takes significantly less memory than before.
Therefore, I decided to rerun the compiler benchmark on the library. In the
previous experiment, zapcc was taking so much memory that it was impossible to
use more than one thread. Let's see how it is faring now. The time to compile
the full unit tests is computed for each compiler. Let's start in debug mode:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Debug&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;527&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;268&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;182&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;150&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;591&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;303&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;211&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;176&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;588&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;302&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;209&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;175&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;375&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;187&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;126&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;121&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, zapcc is able to scale to four threads without problems. Moreover, it
is always the fastest compiler, by a significant margin, in this configuration.
It is followed by clang and then by gcc for which both versions are about the
same speed.&lt;/p&gt;
&lt;p&gt;If we compile again in release mode:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Release&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1201&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;615&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;421&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;356&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1041&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;541&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;385&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;321&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1114&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;579&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;412&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;348&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;897&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;457&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;306&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;em&gt;306&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The difference in compilation time is very large: it is about twice as slow to
compile with all optimizations enabled. It also takes significantly more memory.
Indeed, zapcc was not able to compile with 4 threads. Nevertheless, even the results
with three threads are better than the other compilers using four threads. zapcc
is clearly the winner again on this test, followed by gcc-4.9 which is faster
than gcc-5.3 which is itself faster than clang. It seems that while clang has
a faster front-end than gcc, it is slower for optimizations. Note that this may
also be an indication that clang performs more optimizations than gcc, rather
than being slower at them.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;By using some form of type erasing to simplify the template types at compile
time, I was able to reduce the overall compilation time of my Deep Learning
Library (DLL) by 36%. Moreover, this can be done by switching a simple
compilation flag. This also very significantly reduces the memory used during the
compilation, allowing zapcc to compile with up to three threads, compared
with only one before. This makes zapcc the fastest compiler again on this
benchmark. Overall, this will make debugging much easier on this library and
will save me a lot of time.&lt;/p&gt;
&lt;p&gt;In the future, I plan to try to improve compilation time even more. I have a few
ideas, especially in ETL that should significantly improve the compilation time
but that will require a lot of time to implement, so that will likely have to
wait a while. In the coming days, I plan to work on the performance of DLL,
especially for stochastic gradient descent.&lt;/p&gt;
&lt;p&gt;If you want more information on DLL, you can check out the
&lt;a class="reference external" href="https://github.com/wichtounet/dll"&gt;dll GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>C++11</category><category>clang</category><category>Compilers</category><category>dll</category><category>etl</category><category>gcc</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html</guid><pubDate>Wed, 15 Mar 2017 06:43:44 GMT</pubDate></item><item><title>Disappointing zapcc performance on Deep Learning Library (DLL)</title><link>https://baptiste-wicht.com/posts/2017/03/disappointing-zapcc-performance-on-deep-learning-library-dll.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;One week ago, zapcc 1.0 was released and I've observed it to be much faster than the other
compilers in terms of compile time. This can be seen when
&lt;a class="reference external" href="http://baptiste-wicht.com/posts/2017/03/release-zapcc-10-fast-cpp-compiler.html"&gt;I tested it on my Expression Templates Library (ETL)&lt;/a&gt;. It was almost four
times faster than clang 3.9 and about 2.5 times faster than GCC.&lt;/p&gt;
&lt;p&gt;The ETL library is quite heavy to compile, but still reasonable. This is not the
case for my Deep Learning Library (DLL) where compiling all the test cases takes
a very long time. I have to admit that I have been going overboard with
templates and such and I have now to pay the price. In practice, for the users
of the library, this is not a big problem since only one or two neural networks
will be compiled (and it will take hours to train), but in the test cases, there
are hundreds of them and this is a huge pain. Anyway, enough with the ramble,
I figured it would be very good to test zapcc on it and see what I can gain from
using it.&lt;/p&gt;
&lt;p&gt;In this article, when I speak of a compiler thread, I mean an instance of the
compiler, so it's really a process in the Linux world.&lt;/p&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;However, I soon realized that I would have more issues than I thought. The first
problem is the memory consumed by zapcc. Indeed, it is based on clang and
I always had problems with huge memory consumption from clang on this library, and
zapcc has even bigger memory consumption because some information is cached
between runs. The amount of memory that zapcc is able to cache can be configured
in the configuration file. By default, it can use 1.5GB of memory. When zapcc
goes over the memory limit, it simply wipes out its caches. This means that all
the gain for the next compilation will be lost, since the cache will have to be
rebuilt from scratch. This is not a hard limit for the compilation itself.
Indeed, if the compilation itself takes 3GB, it will still be able to complete
it, but it is likely that the cache will be wiped after the compilation.&lt;/p&gt;
&lt;p&gt;When I tried compiling using several threads, it soon used all my memory and
crashed. The same occurs with clang but I can still compile with 3 or 4 threads
without too many issues on this computer. The same also occurs with GCC but it
can still handle 4 or 5 threads (depending on the order of the compilation
units).&lt;/p&gt;
&lt;p&gt;The tests are performed on my desktop computer at work, which is not really
good... I have 12GB of RAM (I had to ask for extra...) and an old Sandy Bridge
processor, but at least I have an SSD (also had to ask for extra).&lt;/p&gt;
&lt;p&gt;I started by testing with only one compiler thread. For zapcc, I set the
maximum memory limit to 8GB. Even with such a limit, the zapcc server restarted
more than 10 times during the compilation of the 84 test cases. After this first
experiment, I increased the number of threads to 2 for each compiler, using a 4GB
limit for zapcc. The limit is for each server and each parallel thread will
spawn a new server, so the effective limit is the number of threads times the
limit. Even with two threads, I was unable to finish a compilation with zapcc.
This is quite disappointing for me since clang is able to run with 4 threads in
parallel. Moreover, a big problem is that the servers are not always
killed when there is no more memory; they just hang and use all the memory of
the computer, which is evidently really inconvenient for service processes. When
this happens with clang or gcc, the compiler simply crashes, the memory is
released and make is interrupted. Since zapcc is not able to work with more than
one thread on this computer, its results in every column are the ones with one
thread. I was also surprised to be able to compile the library with clang and
four threads; this was not possible before clang-3.9.&lt;/p&gt;
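&lt;p&gt;The cache budget described above can be written down as a one-line
calculation. This is only a sketch of the arithmetic, with hypothetical names; it
does not use any zapcc API or configuration syntax. Keep in mind that this is
the cache limit only: as noted earlier, a single compilation can use more memory
than its server's limit.&lt;/p&gt;

```cpp
// Total cache budget when each parallel compiler process spawns its own
// zapcc server, each with its own limit. Hypothetical helper for illustration.
constexpr double effective_limit_gb(int threads, double per_server_limit_gb) {
    return threads * per_server_limit_gb;
}

// The two configurations tried in this experiment, on a machine with 12 GB of RAM.
constexpr double machine_ram_gb = 12.0;
constexpr double one_thread  = effective_limit_gb(1, 8.0); // 8 GB
constexpr double two_threads = effective_limit_gb(2, 4.0); // 8 GB
```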
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2250.95&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1256.36&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;912.67&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;760.84&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2305.37&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1279.49&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;918.08&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;741.38&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2047.61&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;1102.93&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;899.13&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;730.42&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;1483.73&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1483.73&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1483.73&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1483.73&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Difference against Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-27.55%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+25.69%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+39.37%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+50.77%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Difference against GCC-5.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-35.66%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+13.75%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+38.09%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+50.03%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Difference against GCC-4.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-34.08%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+15.30%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+38.50%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+48.75%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If we look at the results with only one thread, we can see that there still are
some significant improvements when using zapcc, but nowhere near as good as what
was seen in the compilation of ETL. Here, the compilation time is reduced by 34%
compared to gcc and by 27% compared to clang. This is not bad, since it is
faster than the other compilers, but I would have expected better speedups. We
can see that g++-4.9 is slightly faster than g++-5.3, but this is not really
a significant difference. I'm actually very surprised to find that clang is
faster than g++ on this experiment. On ETL, it is always very significantly
slower and, before, it was also significantly slower on DLL. I was so used to
this that I stopped using clang on this project. I may have to reconsider that
position.&lt;/p&gt;
&lt;p&gt;Let's look at the results with two or more threads. Even with two threads,
every compiler is faster than zapcc. Indeed, zapcc is slower than Clang by 25%
and slower than GCC by about 15%. If we use more threads, the other compilers
become even faster and the slowdowns of zapcc get larger. When
using four threads, zapcc is about 48% slower than gcc and about 50% slower than
clang. This really shows one big downside of zapcc: its very large
memory consumption. When it is used to compile really heavy template code, it
very quickly becomes unable to use more processes. And even when there is enough memory,
the speedups are not as great as for relatively simpler code.&lt;/p&gt;
&lt;p&gt;One may argue that this is not a fair comparison since zapcc does not have the
same number of threads. However, considering that this is the best zapcc can do
on this machine, I would argue that this is a fair comparison in this limited
experimental setting. If we were to have a big machine for compilation, which
I don't have at work, the zapcc results would likely be more interesting, but in
this specific limited case, it shows that zapcc suffers from its high memory
consumption. It should also be taken into account that this experiment was done
with almost nothing else running on the machine (no browser for instance) to
have as much memory as possible available for the compilers. This is not
a common use case. Most days, when I compile something, I have my
browser open, which makes a large difference in available memory, along with several
other applications (but consoles and vim instances do not really consume memory
:D).&lt;/p&gt;
&lt;p&gt;This experiment made me realize that the compilation times for this library were
quickly becoming crazy. Most of the time, the complete test suite is only
compiled on my Continuous Integration machine at home which has a much faster
processor and much more RAM. Therefore, it is relatively fast since it uses more
threads to compile. Nevertheless, it is not good that the unit tests
take so much time to compile. I plan to split the test cases into several sets
because, currently, the real unit tests are compiled together with the performance tests
and various other tests. I'll probably end up generating three executables. This
will help greatly during development. Moreover, I also have a technique to
decrease the compilation time by erasing some template parameters at compilation
time. This is already implemented, but it currently has a runtime overhead that I will
try to remove and then use this technique everywhere to get back to reasonable
compilation times. I'll also try to see if I can find obvious compilation
bottlenecks in the code.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;To conclude, while zapcc brings some very interesting compilation speedups in
some cases like in my ETL library, it also has some downsides, namely
&lt;strong&gt;huge memory consumption&lt;/strong&gt;. This memory consumption may prevent the use of several
compiler threads and render zapcc much less interesting than other compilers.&lt;/p&gt;
&lt;p&gt;When trying to compile my DLL library on a machine with 12GB of RAM with two
zapcc threads, it was impossible for me to make the build complete. While zapcc was
faster with one thread than the other compilers, they were able to use up to
four threads and in the end &lt;strong&gt;zapcc was about twice as slow as clang&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I knew that zapcc memory consumption was very large, but I would not have
expected something so critical. Another feature that would be interesting in
zapcc would be to set a max memory hard limit for the server instead of simply
a limit on the cache they are able to keep in memory. This would prevent hanging
the complete computer when something goes wrong.&lt;/p&gt;
&lt;p&gt;I had a good surprise with clang that was actually faster than GCC and also able
to work with four threads in parallel. This was not the case with previous
versions of clang. On ETL, it is still significantly slower than GCC though.&lt;/p&gt;
&lt;p&gt;For now, I'll continue using clang on this DLL project and use zapcc only on my
ETL project. I'll also focus on improving the compilation time on this project
and make it reasonable again.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>dll</category><category>gcc</category><category>projects</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2017/03/disappointing-zapcc-performance-on-deep-learning-library-dll.html</guid><pubDate>Thu, 09 Mar 2017 12:41:06 GMT</pubDate></item><item><title>Release of zapcc 1.0 - Fast C++ compiler</title><link>https://baptiste-wicht.com/posts/2017/03/release-zapcc-10-fast-cpp-compiler.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;If you remember, I recently wrote about &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html"&gt;zapcc C++ compilation speed against gcc 5.4 and clang 3.9&lt;/a&gt; in which I was comparing the beta version of zapcc against gcc and clang.&lt;/p&gt;
&lt;p&gt;I have just been informed that zapcc was released in version 1.0. I thought it
was a good occasion to test it again. It will be compared against gcc-4.9,
gcc-5.3 and clang-3.9. This version of zapcc is based on the trunk of clang-5.0.&lt;/p&gt;
&lt;p&gt;Again, I will use my Expression Template Library (&lt;a class="reference external" href="https://github.com/wichtounet/etl/"&gt;ETL&lt;/a&gt;) project. This is a purely header-only
library with lots of templates. I'm going to compile the full test cases. This
is a perfect example for long compilation times.&lt;/p&gt;
&lt;p&gt;The current tests are made on the latest version of the library and with slightly
different parameters for compilation, therefore the absolute times are not
comparable, but the speedups should be comparable.&lt;/p&gt;
&lt;p&gt;Just like last time, I have configured zapcc to let it use 2GB of RAM per caching
server, which is the maximum allowed. Moreover, I killed the servers before each
test.&lt;/p&gt;
&lt;section id="debug-results"&gt;
&lt;h2&gt;Debug results&lt;/h2&gt;
&lt;p&gt;Let's start with a debug build, with no optimizations enabled. Every build will
use four threads. This is the equivalent of doing make -j4 debug/bin/etl_test
without the link step.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;190.09s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;200.92s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;313.85s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;81.25s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.86&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-5.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.47&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-4.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.33&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The speedups are even more impressive than last time! zapcc is &lt;strong&gt;almost four
times faster than clang-3.9&lt;/strong&gt; and around &lt;strong&gt;2.5 times faster than GCC-5.3&lt;/strong&gt;.
Interestingly, we can see that GCC-5.3 is slightly slower than GCC-4.9.&lt;/p&gt;
&lt;p&gt;It seems that they have made the compiler even faster!&lt;/p&gt;
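&lt;p&gt;As a quick cross-check, the speedup rows can be recomputed from the raw times
in the table. The following is only a small Python sketch, not part of the
benchmark itself: the &lt;code&gt;speedup&lt;/code&gt; helper and the dictionary names are
made up for illustration, and the times are copied from the table above.&lt;/p&gt;
&lt;pre&gt;
```python
# Compile times in seconds, copied from the debug table above (-j4, no link step).
debug_times = {
    "g++-4.9.3": 190.09,
    "g++-5.3.0": 200.92,
    "clang++-3.9": 313.85,
    "zapcc++": 81.25,
}


def speedup(baseline, candidate):
    """Return how many times faster the candidate is than the baseline."""
    return baseline / candidate


zapcc = debug_times["zapcc++"]

# Speedup of zapcc against every other compiler (hypothetical helper names).
speedups = {
    name: speedup(seconds, zapcc)
    for name, seconds in debug_times.items()
    if name != "zapcc++"
}

for name, ratio in speedups.items():
    print(f"zapcc vs {name}: {ratio:.2f}x")
```
&lt;/pre&gt;
&lt;p&gt;The printed ratios may differ from the table in the last digit, depending on
whether the values are rounded or truncated.&lt;/p&gt;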
&lt;/section&gt;
&lt;section id="release-results"&gt;
&lt;h2&gt;Release results&lt;/h2&gt;
&lt;p&gt;Let's now look at the results with optimizations enabled. Again,
every build will use four threads. This is the equivalent of doing make -j4
release_debug/bin/etl_test without the link step.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;252.99&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;264.96&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;361.65&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;237.96&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.51&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-5.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.11&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-4.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.06&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can see that this time the speedups are not as interesting as they were.
Very interestingly, zapcc is the compiler that suffers the most from the
optimization overhead. Indeed, zapcc is three times slower in release mode than
it was in debug mode. Nevertheless, it still manages to beat the three other
compilers, by about 10% against GCC and 50% against clang, which is already
interesting.&lt;/p&gt;
&lt;section id="conclusion"&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;To conclude, we have observed that zapcc is always faster than the three other
compilers tested in this experiment. Moreover, in debug mode, the speedups are
very significant: it was almost 4 times faster than clang and around 2.5 times faster
than gcc.&lt;/p&gt;
&lt;p&gt;I haven't seen any problem with the tool. It behaves like clang and should generate
code of the same performance, but just compile it much faster. One problem
I have with zapcc is that it is not based on an already released version of
clang but on the trunk. That means it is hard to compare it with the exact same
version of clang and there is also a risk of running into clang bugs.&lt;/p&gt;
&lt;p&gt;Although the prices have not been published yet, it is indicated on the website
that zapcc is free for non-commercial entities, which is really great.&lt;/p&gt;
&lt;p&gt;If you want more information, you can go to the
&lt;a class="reference external" href="https://www.zapcc.com/"&gt;official website of zapcc&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>projects</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2017/03/release-zapcc-10-fast-cpp-compiler.html</guid><pubDate>Thu, 02 Mar 2017 13:50:04 GMT</pubDate></item><item><title>C++ Compiler benchmark on Expression Templates Library (ETL)</title><link>https://baptiste-wicht.com/posts/2016/12/cpp-compiler-benchmark-on-expression-templates-library-etl.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;In my Expression Templates Library (ETL) project, I have a lot of template heavy
code that needs to run as fast as possible and that is quite intensive to
compile. In this post, I'm going to compare the performance of a few of the
kernels produced by different compilers. I've got GCC 5.4, GCC 6.2 and clang
3.9. I also included zapcc which is based on clang 4.0.&lt;/p&gt;
&lt;p&gt;These tests have been run on a Haswell processor. The automatic parallelization
of ETL has been turned off for these tests.&lt;/p&gt;
&lt;p&gt;Keep in mind that some of the diagrams are presented in logarithmic form.&lt;/p&gt;
&lt;section id="vector-multiplication"&gt;
&lt;h2&gt;Vector multiplication&lt;/h2&gt;
&lt;p&gt;The first kernel is a very simple one, simple element-wise multiplication of two
vectors. Nothing fancy here.&lt;/p&gt;
&lt;div id="mul_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('mul_container', {
        chart: { type: 'column' },
        title: { text: 'Element-wise Vector Multiplication' },
        xAxis: {
            categories: ['10', '100', '1000', '10000', '100000', '1000000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [0.021, 0.040, 0.215, 2.07, 32.1, 403]
        },
        {
            name: 'g++-6.2', data: [0.021, 0.037, 0.208, 2.17, 32.1, 376]
        },
        {
            name: 'clang-3.9', data: [0.027, 0.045, 0.243, 2.43, 32.7, 389]
        },
        {
            name: 'zapcc-4.0', data: [0.026, 0.047, 0.321, 2.5, 32.8, 411]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;For small vectors, clang is significantly slower than gcc-5.4 and gcc-6.2. For
vectors of 100,000 elements or more, the speed is comparable for each compiler
and is limited by the memory bandwidth. Overall, gcc-6.2 produces the fastest code
here. clang-4.0 is slightly slower than clang-3.9, but nothing dramatic.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="vector-exponentiation"&gt;
&lt;h2&gt;Vector exponentiation&lt;/h2&gt;
&lt;p&gt;The second kernel computes the exponential of each element of a vector and
stores the results in another vector.&lt;/p&gt;
&lt;div id="exp_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('exp_container', {
        chart: { type: 'column' },
        title: { text: 'Element-wise Vector Exponentiation' },
        xAxis: {
            categories: ['10', '100', '1000', '10000', '100000', '1000000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [0.0478, 0.137, 1.12, 9.79, 97.5, 959]
        },
        {
            name: 'g++-6.2', data: [0.0474, 0.132, 1.11, 9.71, 97, 1000]
        },
        {
            name: 'clang-3.9', data: [0.0492, 0.136, 0.959, 9.24, 92.9, 914]
        },
        {
            name: 'zapcc-4.0', data: [0.0488, 0.142, 0.952, 9.25, 91.9, 915]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;Interestingly, this time, clang versions are significantly faster for medium to
large vectors, from 1000 elements and higher, by about 5%. There are no
significant differences between the different versions of each compiler.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="matrix-matrix-multiplication"&gt;
&lt;h2&gt;Matrix-Matrix Multiplication&lt;/h2&gt;
&lt;p&gt;The next kernel I benchmarked is the matrix-matrix multiplication operation.
In that case, the kernel is hand-unrolled and vectorized.&lt;/p&gt;
&lt;div id="gemm_container_small" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;div id="gemm_container_large" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('gemm_container_small', {
        chart: { type: 'column' },
        title: { text: 'Matrix Matrix Multiplication (small)', },
        xAxis: {
            categories: ['10x10', '20x20', '40x40', '60x60', '80x80', '100x100']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [0.159, 0.815, 2.637, 13.849, 17.281, 78.903]
        },
        {
            name: 'g++-6.2', data: [0.162, 0.802, 2.431, 13.531, 17.274, 74.02]
        },
        {
            name: 'clang-3.9', data: [0.179, 1.218, 2.391, 14.981, 15.142, 61.548]
        },
        {
            name: 'zapcc-4.0', data: [0.159, 0.836, 2.712, 13.426, 15.114, 62.241]
        }
        ]
    });
    Highcharts.chart('gemm_container_large', {
        chart: { type: 'column' },
        title: { text: 'Matrix Matrix Multiplication (large)', },
        xAxis: {
            categories: ['200x200', '300x300', '400x400', '500x500', '600x600', '700x700', '800x800', '900x900', '1000x1000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [275.219, 1371, 1837, 5177, 6667, 14981, 17037, 31492, 32813]
        },
        {
            name: 'g++-6.2', data: [267.776, 1362, 1808, 5297, 6859, 15166, 15664, 30666, 33067]
        },
        {
            name: 'clang-3.9', data: [266.033, 1230, 1789, 4825, 6969, 14488, 15916, 30872, 33186]
        },
        {
            name: 'zapcc-4.0', data: [267.806, 1237, 1820, 4909, 7035, 15191, 18193, 33127, 37346]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;There are a few differences between the compilers. The first thing is that for
some sizes such as 80x80 and 100x100, clang is significantly faster than GCC, by
more than 10%. The other interesting fact is that for large matrices
zapcc-clang-4.0 is always slower than clang-3.9 which is itself on par with the
two GCC versions. In my opinion, it comes from a regression in clang trunk but
it could also come from zapcc itself.&lt;/p&gt;
&lt;div id="std_gemm_container_large" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('std_gemm_container_large', {
        chart: { type: 'column' },
        title: { text: 'Matrix Matrix Multiplication (naive)', },
        xAxis: {
            categories: ['200x200', '300x300', '400x400', '500x500', '600x600', '700x700', '800x800', '900x900', '1000x1000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [1.195, 4.891, 10.467, 22.400, 33.399,
            58.401, 77.150, 121.392, 148.469]
        },
        {
            name: 'g++-6.2', data: [1.109, 4.540, 9.964, 21.359, 31.904,
            55.282, 72.690, 113.52, 143.27]
        },
        {
            name: 'clang-3.9', data: [0.893, 3.710, 7.287, 16.244, 23.920,
            43.342, 56.771, 91.870, 112.309]
        },
        {
            name: 'zapcc-4.0', data: [5.088, 16.909, 39.632, 77.194, 133.15,
            214.539, 316.01, 447.715, 612.255]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;The results are much more interesting here! First, there is a huge regression in
clang-4.0 (or in zapcc for that matter). Indeed, it is up to 6 times slower than
clang-3.9. Moreover, clang-3.9 is always significantly faster than gcc-6.2.
Finally, there is a small improvement in gcc-6.2 compared to gcc 5.4.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="fast-fourrier-transform"&gt;
&lt;h2&gt;Fast Fourier Transform&lt;/h2&gt;
&lt;p&gt;The following kernel measures the performance of a hand-crafted Fast Fourier
Transform implementation.&lt;/p&gt;
&lt;div id="fft_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('fft_container', {
        chart: { type: 'column' },
        title: { text: 'Fast Fourier Transform', },
        xAxis: {
            categories: ['100', '1000', '10000', '100000', '1000000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [2.640, 27.515, 308.239, 3427.4, 41695.9]
        },
        {
            name: 'g++-6.2', data: [2.578, 26.194, 298.97, 3348.82, 40783.8]
        },
        {
            name: 'clang-3.9', data: [3.047, 30.514, 333.403, 3569.36,43860.6]
        },
        {
            name: 'zapcc-4.0', data: [3.199,33.304,317.135,4025.18,48445.3]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;On this benchmark, gcc-6.2 is the clear winner. It is significantly faster
than clang-3.9 and clang-4.0. Moreover, gcc-6.2 is also faster than gcc-5.4.
In contrast, clang-4.0 is significantly slower than clang-3.9 except on one
configuration (10000 elements).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="d-convolution"&gt;
&lt;h2&gt;1D Convolution&lt;/h2&gt;
&lt;p&gt;This kernel is about computing the 1D valid convolution of two vectors.&lt;/p&gt;
&lt;div id="conv1_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('conv1_container', {
        chart: { type: 'column' },
        title: { text: '1D convolution (optimized)', },
        xAxis: {
            categories: ['1000x500', '2000x1000', '3000x1500', '4000x2000',
            '5000x2500', '6000x3000', '7000x3500', '8000x4000', '9000x4500',
            '10000x5000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [11.710, 41.002, 91.201, 158.178,
            248.985, 353.695, 486.676, 634.53, 867.101, 1082.62]
        },
        {
            name: 'g++-6.2', data: [9.307, 40.921, 90.327, 158.734, 248.892,
            354.582, 488.38, 636.899, 869.637, 1084.86]
        },
        {
            name: 'clang-3.9', data: [13.404, 41.409, 95.094, 162.339,
            256.143, 362.34, 498.66, 651.352, 886.465, 1092.24]
        },
        {
            name: 'zapcc-4.0', data: [13.528, 40.886, 94.473, 159.917,
            252.992, 356.63, 493.653, 640.348, 872.282, 1091.36]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;While clang-4.0 is faster than clang-3.9, it is still slightly slower than both
gcc versions. On the GCC side, there is not a lot of difference except on the
1000x500 on which gcc-6.2 is 25% faster.&lt;/p&gt;
&lt;p&gt;And here are the results with the naive implementation:&lt;/p&gt;
&lt;div id="std_conv1_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('std_conv1_container', {
        chart: { type: 'column' },
        title: { text: '1D convolution (naive)', },
        xAxis: {
            categories: ['1000x500', '2000x1000', '3000x1500', '4000x2000',
            '5000x2500', '6000x3000', '7000x3500', '8000x4000', '9000x4500',
            '10000x5000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [0.350, 1.452, 3.260, 5.823, 9.116,
            13.155, 17.922, 23.438, 29.705, 36.683]
        },
        {
            name: 'g++-6.2', data: [0.350, 1.457, 3.262, 5.823, 9.120,
            13.152, 17.922, 23.436, 29.687, 36.665]
        },
        {
            name: 'clang-3.9', data: [0.216, 0.873, 1.974, 3.517, 5.501,
            7.921, 10.793, 14.11, 17.867, 22.068]
        },
        {
            name: 'zapcc-4.0', data: [0.215, 0.873, 1.972, 3.514, 5.501,
            7.928, 10.799, 14.11, 17.879, 22.065]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;Again, on the naive version, clang is much faster than GCC, by
about 65%. This is a really large speedup.&lt;/p&gt;
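&lt;p&gt;A figure such as "about 65% faster" can be read either as a gain in speed or
as a reduction in run time, and the two readings give different numbers. The
small Python sketch below computes both from the gcc-6.2 and clang-3.9 times of
the naive chart above. The helper names are hypothetical and only there for
illustration.&lt;/p&gt;
&lt;pre&gt;
```python
# Naive 1D convolution times in milliseconds, copied from the chart data above.
gcc_62 = [0.350, 1.457, 3.262, 5.823, 9.120, 13.152, 17.922, 23.436, 29.687, 36.665]
clang_39 = [0.216, 0.873, 1.974, 3.517, 5.501, 7.921, 10.793, 14.11, 17.867, 22.068]


def percent_faster(slow, fast):
    """Speed gain of the fast time over the slow time, in percent."""
    return (slow / fast - 1.0) * 100.0


def percent_time_saved(slow, fast):
    """Reduction in run time of the fast time relative to the slow one, in percent."""
    return (1.0 - fast / slow) * 100.0


# One value per problem size, in the same order as the chart categories.
gains = [percent_faster(g, c) for g, c in zip(gcc_62, clang_39)]
saved = [percent_time_saved(g, c) for g, c in zip(gcc_62, clang_39)]

print(f"speed gain: {min(gains):.0f}% to {max(gains):.0f}%")
print(f"time saved: {min(saved):.0f}% to {max(saved):.0f}%")
```
&lt;/pre&gt;
&lt;p&gt;With these numbers, the "about 65%" quoted above corresponds to the speed
gain reading, not to the reduction in run time.&lt;/p&gt;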
&lt;/section&gt;
&lt;section id="d-convolution-1"&gt;
&lt;h2&gt;2D Convolution&lt;/h2&gt;
&lt;p&gt;This next kernel computes the 2D valid convolution of two matrices.&lt;/p&gt;
&lt;div id="conv2_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('conv2_container', {
        chart: { type: 'column' },
        title: { text: '2D Convolution (optimized)', },
        xAxis: {
            categories: ['100x50', '105x50', '110x55', '115x55', '120x60',
            '125x60', '130x65', '135x65', '140x70']
        },
        yAxis: {
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [327.399, 367.389, 441.457, 576.021,
            762.268, 794, 994.06, 1261.71, 1360.57]
        },
        {
            name: 'g++-6.2', data: [327.764, 367.379, 441.993, 572.241,
            761.741, 784.605, 991.717, 1266.55, 1361.59]
        },
        {
            name: 'clang-3.9', data: [330.199, 364.253, 443.483, 580.676,
            763.772, 777.39, 1000.53, 1267.75, 1375.51]
        },
        {
            name: 'zapcc-4.0', data: [339.358, 364.756, 443.807, 575.917,
            761.248, 784.695, 992.29, 1265.04, 1367.33]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;There is no clear difference between the compilers in this code. Every compiler
here has its ups and downs.&lt;/p&gt;
&lt;p&gt;Let's look at the naive implementation of the 2D convolution (units are
milliseconds here not microseconds):&lt;/p&gt;
&lt;div id="std_conv2_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('std_conv2_container', {
        chart: { type: 'column' },
        title: { text: '2D Convolution (naive)', },
        xAxis: {
            categories: ['100x50', '105x50', '110x55', '115x55', '120x60',
            '125x60', '130x65', '135x65', '140x70']
        },
        yAxis: {
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [9.501,11.458,13.888, 16.489, 19.634,
            22.898, 27.012, 31.246, 36.269]
        },
        {
            name: 'g++-6.2', data: [9.502, 11.464, 13.903, 16.484, 19.642,
            22.994, 27.004, 31.248, 36.26]
        },
        {
            name: 'clang-3.9', data: [5.880, 7.136, 8.610, 10.226, 12.164,
            14.247, 17.024, 19.577, 22.510]
        },
        {
            name: 'zapcc-4.0', data: [5.875, 7.091, 8.661, 10.241, 12.218,
            14.302, 16.777, 19.424, 22.472]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;This time the difference is very large! Indeed, clang versions are about 60%
faster than the GCC versions! This is really impressive, even though it does
not come close to the optimized version. It seems the vectorizer of clang is much more
efficient than the one from GCC.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="d-convolution-2"&gt;
&lt;h2&gt;4D Convolution&lt;/h2&gt;
&lt;p&gt;The final kernel that I'm testing is the batched 4D convolution that is used a
lot in Deep Learning. This is not really a 4D convolution, but a large number
of 2D convolutions applied on 4D tensors.&lt;/p&gt;
&lt;div id="conv4_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('conv4_container', {
        chart: { type: 'column' },
        title: { text: '4D Convolution', },
        xAxis: {
            categories: ['2x6x3x28x16', '2x6x3x28x16', '2x6x3x28x16',
            '2x6x3x28x16', '2x6x3x28x16', '2x6x3x28x16', '2x6x3x28x16',
            '2x6x3x28x16', '2x6x3x28x16']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [0.095, 0.402, 1.083, 2.237, 3.988,
            6.474, 9.985, 14.132, 19.539]
        },
        {
            name: 'g++-6.2', data: [0.089, 0.413, 1.081, 2.224, 3.990,
            6.462, 9.815, 14.118, 19.612]
        },
        {
            name: 'clang-3.9', data: [0.090, 0.416, 1.108, 2.277, 4.077,
            6.587, 10.024, 14.359, 20.006]
        },
        {
            name: 'zapcc-4.0', data: [0.088, 0.406, 1.080, 2.237, 3.987,
            6.484, 9.827, 14.130, 19.569]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;Again, there are very small differences between each version. The best versions
are the most recent versions of the compilers, gcc-6.2 and clang-4.0, in a tie.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Overall, we can see two trends in these results. First, when working with
highly-optimized code, the choice of compiler will not make a huge difference.
On this kind of kernel, gcc-6.2 tends to perform faster than the other
compilers, but only by a very slight margin, except in some cases. On the other
hand, when working with naive implementations, clang versions really did perform
much better than GCC. The clang compiled versions of the 1D and 2D convolutions
are more than 60% faster than their GCC counterparts. This is really
impressive. Overall, clang-4.0 seems to have several performance regressions,
but since it's still a work in progress, I would not be surprised if these
regressions are not present in the final version. Since the clang-4.0 version is
in fact the clang version used by zapcc, it's also possible that zapcc is
introducing new performance regressions.&lt;/p&gt;
&lt;p&gt;Overall, my advice would be to use GCC-6.2 (or 5.4) on hand-optimized kernels
and clang when you have mostly naive implementations. However, keep in mind that
at least for the example shown here, the naive version optimized by the compiler
never comes close to the highly-optimized version.&lt;/p&gt;
&lt;p&gt;As ever, take this with a grain of salt: it's only been tested on one project
and one machine, and you may obtain very different results on other projects and on
other processors.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>gcc</category><category>Performance</category><category>templates</category><guid>https://baptiste-wicht.com/posts/2016/12/cpp-compiler-benchmark-on-expression-templates-library-etl.html</guid><pubDate>Sun, 11 Dec 2016 13:17:30 GMT</pubDate></item><item><title>zapcc C++ compilation speed against gcc 5.4 and clang 3.9</title><link>https://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;A week ago, I compared the &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html"&gt;compilation time performance of zapcc against gcc-4.9.3 and clang-3.7&lt;/a&gt;. On debug builds, zapcc was about 2 times faster than gcc and 3 times faster than clang. In this post, I'm going to try some more recent compilers, namely gcc 5.4 and clang 3.9 on the same project. If you want more information on zapcc, read the previous posts, this post will concentrate on results.&lt;/p&gt;
&lt;p&gt;Again, I use my Expression Template Library
(&lt;a class="reference external" href="https://github.com/wichtounet/etl/"&gt;ETL&lt;/a&gt;). This is a purely header-only
library with lots of templates. I'm going to compile the full test cases.&lt;/p&gt;
&lt;p&gt;The results of the two articles are not directly comparable, since they were
obtained on two different computers. The machine used for the present results
has a less powerful processor and only 16GB of RAM, compared to the 32GB of RAM of my
build machine. Also take into account that the present results were
obtained on a desktop machine, so there can be some perturbations from background
tasks.&lt;/p&gt;
&lt;p&gt;Just as in the previous results, using more threads than physical cores does
not help. Therefore, the results were only computed on up to 4 cores on
this machine.&lt;/p&gt;
&lt;p&gt;The link time is not taken into account in the results.&lt;/p&gt;
&lt;section id="debug-build"&gt;
&lt;h2&gt;Debug build&lt;/h2&gt;
&lt;p&gt;Let's start with the result of the debug build.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;469s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;230s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;130s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;710s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;371s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;218s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;214s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;112s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;66s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.3&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.19&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.05&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.96&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The results are almost the same as the previous test. zapcc is 3.3 times faster
to compile than Clang and around 2 times faster than GCC. It seems that GCC 5.4
is a bit faster than GCC 4.9.3 while clang 3.9 is a bit slower than clang 3.7,
but nothing terribly significant.&lt;/p&gt;
&lt;p&gt;Overall, for debug builds, zapcc can bring a very significant improvement to
your compile times.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="release-build"&gt;
&lt;h2&gt;Release build&lt;/h2&gt;
&lt;p&gt;Let's now look at release builds. Since the results are comparable
across the numbers of threads, the results here are just for one thread.&lt;/p&gt;
&lt;p&gt;This build is more time-consuming since a lot of optimizations are enabled and more
features from ETL are enabled as well.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;782s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;960s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;640s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.5&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.22&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On a release build, the speedups are much less interesting. Nevertheless, they
are still significant. zapcc is still 1.2 times faster than gcc and 1.5 times
faster than clang. The speedup against clang 3.9 is significantly higher than
it was in my experiment with clang 3.7; it's possible that clang 3.9 is slower
or simply has new optimization passes.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The previous conclusion still holds with modern versions of compilers: zapcc is
much faster than other compilers on debug builds of template-heavy code. It is more
than 3 times faster than clang-3.9 and about 2 times faster than gcc-5.4. Since
it's based on clang, there should not be any issue compiling projects that
already compile with a recent clang. Even though the speedups are less
interesting on a release build, they are still significant, especially compared
against clang.&lt;/p&gt;
&lt;p&gt;I'm really interested in finding out what will be the pricing for zapcc once
out of the beta or if they will be able to get even faster!&lt;/p&gt;
&lt;p&gt;For the comparison with gcc 4.9.3 and clang 3.7, you can have a look at
&lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html"&gt;this article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you want more information about zapcc, you can go to the
&lt;a class="reference external" href="https://www.zapcc.com/"&gt;official website of zapcc&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>meta</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html</guid><pubDate>Mon, 05 Dec 2016 17:46:09 GMT</pubDate></item><item><title>zapcc - a faster C++ compiler</title><link>https://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Update: For a comparison against more modern compiler versions, you can read: &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html"&gt;zapcc C++ compilation speed against gcc 5.4 and clang 3.9&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I just joined the private beta program of zapcc. Zapcc is a C++ compiler, based
on Clang, which aims at being much faster than other C++ compilers. It does this
by using a caching server that saves some of the compiler structures,
which should speed up compilation a lot. The private beta is free, but once the
compiler is ready, it will be a commercial compiler.&lt;/p&gt;
&lt;p&gt;Every C++ developer knows that compilation time can quickly be an issue when
programs are getting very big and especially when working with template-heavy
code.&lt;/p&gt;
&lt;p&gt;To benchmark this new compiler, I use my Expression Template Library
(&lt;a class="reference external" href="https://github.com/wichtounet/etl/"&gt;ETL&lt;/a&gt;). This is a purely header-only
library with lots of templates. There are lots of test cases which is what I'm
going to compile. I'm going to compare against Clang-3.7 and gcc-4.9.3.&lt;/p&gt;
&lt;p&gt;I have configured zapcc to let it use 2GB of RAM per caching server, which is the
maximum allowed. Moreover, I killed the servers before each test.&lt;/p&gt;
&lt;section id="debug-build"&gt;
&lt;h2&gt;Debug build&lt;/h2&gt;
&lt;p&gt;Let's start with a debug build. In that configuration, there is no optimization
going on and several of the features of the library (GPU, BLAS, ...) are
disabled. This is the fastest way to compile ETL. I gathered this result on
a 4 core, 8 threads, Intel processor, with an SSD.&lt;/p&gt;
&lt;p&gt;The following table presents the results with different number of threads and
the difference of zapcc compared to the other compilers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;350s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;185s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;104s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;94s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;91s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.7&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;513s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;271s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;153s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;145s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;138s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;158s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;87s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;47s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;42s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.24&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.103&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.25&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.29&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.28&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.21&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.12&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.21&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.13&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.16&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The result is pretty clear! zapcc is around &lt;strong&gt;three times faster than Clang&lt;/strong&gt; and around
&lt;strong&gt;two times faster than GCC&lt;/strong&gt;. This is pretty impressive!&lt;/p&gt;
&lt;p&gt;For those who think that Clang is always faster than GCC, keep in mind that
this is not the case for template-heavy code such as this library. In all my
tests, Clang has always been slower and much more memory-hungry than GCC on
template-heavy C++ code. And sometimes the difference is very significant.&lt;/p&gt;
&lt;p&gt;Interestingly, we can also see that going past the number of physical cores
does not really help on this computer. On some computers, the speedups are
interesting, but not on this one. Always benchmark!&lt;/p&gt;
&lt;/section&gt;
&lt;section id="release-build"&gt;
&lt;h2&gt;Release build&lt;/h2&gt;
&lt;p&gt;We have seen the results on a debug build, let's now compare on something a bit
more time-consuming, a release build with all options of ETL enabled (GPU, BLAS, ...),
which should make it significantly longer to compile.&lt;/p&gt;
&lt;p&gt;Again, the table:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;628s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;336s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;197s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;189s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;184s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.7&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;663s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;388s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;215s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;212s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;205s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;515s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;281s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;173s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;168s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;158s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.28&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.38&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.26&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.29&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.21&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.30&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.13&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.12&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.16&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, we can see that the difference is much lower. Zapcc is &lt;strong&gt;between 1.2
and 1.4 times faster than Clang&lt;/strong&gt; and &lt;strong&gt;between 1.1 and 1.3 times faster than
GCC&lt;/strong&gt;. This shows that most of the speedups from zapcc are in the front end of
the compiler. This is not a lot but still significant over long builds,
especially if you have few threads where the absolute difference would be
higher.&lt;/p&gt;
&lt;p&gt;We can also observe that Clang is now almost on par with GCC, which suggests that
optimization is faster in Clang while the front end and back end are faster in GCC.&lt;/p&gt;
&lt;p&gt;You also have to keep in mind that zapcc memory usage is higher than Clang
because of all the caching. Moreover, the servers are still up in between
compilations, so this memory usage stays between builds, which may not be what
you want.&lt;/p&gt;
&lt;p&gt;As for runtime, I have not seen any significant difference in performance
between code compiled with clang and with zapcc. According to the official benchmarks
and documentation, there should not be any difference in that regard between zapcc and
the version of clang on which zapcc is based.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="incremental-build"&gt;
&lt;h2&gt;Incremental build&lt;/h2&gt;
&lt;p&gt;Normally, zapcc should shine at incremental building, but I was unable to show
any speedup when changing a single file without killing the zapcc servers. Maybe
I did something wrong in my usage of zapcc.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, we can see that zapcc is always faster than both GCC and Clang
on my template-heavy library. Moreover, on debug builds, it is much faster than
either of the two compilers, being more than 2 times faster than GCC and more than
3 times faster than clang. This is really great. Moreover, I have not seen any
issue with the tool so far; it can seamlessly replace Clang without problem.&lt;/p&gt;
&lt;p&gt;It's a bit weird that you cannot allocate more than 2GB to the zapcc servers.&lt;/p&gt;
&lt;p&gt;For a program, that's really impressive. I hope that they are continuing the
good work and especially that this motivates other compilers to improve the
speed of compilation (especially of templates).&lt;/p&gt;
&lt;p&gt;If you want more information, you can go to the
&lt;a class="reference external" href="https://www.zapcc.com/"&gt;official website of zapcc&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>projects</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html</guid><pubDate>Sat, 26 Nov 2016 12:17:50 GMT</pubDate></item><item><title>Blazing fast unit test compilation with doctest 1.1</title><link>https://baptiste-wicht.com/posts/2016/09/blazing-fast-unit-test-compilation-with-doctest-11.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;You may remember &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/06/reduce-compilation-time-by-another-16-with-catch.html"&gt;my quest for faster compilation times&lt;/a&gt;. I had made several changes to the Catch test framework macros in order to save some compilation at the expense of my test code looking a bit less nice:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_99207774a09f4a598386048725bcc49e-1" name="rest_code_99207774a09f4a598386048725bcc49e-1" href="https://baptiste-wicht.com/posts/2016/09/blazing-fast-unit-test-compilation-with-doctest-11.html#rest_code_99207774a09f4a598386048725bcc49e-1"&gt;&lt;/a&gt;&lt;span class="n"&gt;REQUIRE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;//Before&lt;/span&gt;
&lt;a id="rest_code_99207774a09f4a598386048725bcc49e-2" name="rest_code_99207774a09f4a598386048725bcc49e-2" href="https://baptiste-wicht.com/posts/2016/09/blazing-fast-unit-test-compilation-with-doctest-11.html#rest_code_99207774a09f4a598386048725bcc49e-2"&gt;&lt;/a&gt;&lt;span class="n"&gt;REQUIRE_EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;//After&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first line reads a little better, but with several optimizations of this
kind, I was able to dramatically reduce the compilation time of the test cases of
ETL. In the end, I don't think that the difference between the two lines justifies the
high overhead in compilation times.&lt;/p&gt;
&lt;section id="doctest"&gt;
&lt;h2&gt;doctest&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/onqtam/doctest"&gt;doctest&lt;/a&gt; is a framework quite similar to
Catch but that claims to be much lighter. I tested doctest 1.0 early on, but at
that point it was actually slower than Catch and especially slower than my
versions of the macros.&lt;/p&gt;
&lt;p&gt;Today, doctest 1.1 was released with promises of being even lighter than before
and providing several new ways of speeding up compilation. If you want the
results directly, you can take a look at the next section.&lt;/p&gt;
&lt;p&gt;First of all, this new version improved the basic macros to make expression
decomposition faster. When you use the standard REQUIRE macro, the expression is
composed by using several template techniques and operator overloading. This is
really slow to compile. By removing the need for this decomposition, the fast
Catch macros are much faster to compile.&lt;/p&gt;
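&lt;p&gt;To make this cost more concrete, here is a minimal sketch of how expression
decomposition can be implemented. This is not the actual code of Catch or
doctest: the names &lt;code&gt;decomposer&lt;/code&gt;, &lt;code&gt;lhs_capture&lt;/code&gt;,
&lt;code&gt;check_result&lt;/code&gt;, &lt;code&gt;SKETCH_CHECK&lt;/code&gt; and &lt;code&gt;check_eq&lt;/code&gt; are
hypothetical and only serve the illustration. It shows the chain of templates
that a decomposing macro has to instantiate, next to a binary-style check that
receives both operands directly.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;#include &amp;lt;sstream&amp;gt;
#include &amp;lt;string&amp;gt;

// Hypothetical sketch, not the real Catch or doctest internals.
struct check_result {
    bool passed;      // outcome of the comparison
    std::string text; // expanded expression, for instance "9 == 9"
};

// Holds the left operand until the comparison operator is applied.
template &amp;lt;typename L&amp;gt;
struct lhs_capture {
    const L&amp;amp; lhs;

    // Instantiated for every distinct (L, R) pair used in the tests.
    template &amp;lt;typename R&amp;gt;
    check_result operator==(const R&amp;amp; rhs) const {
        std::ostringstream os;
        os &amp;lt;&amp;lt; lhs &amp;lt;&amp;lt; " == " &amp;lt;&amp;lt; rhs;
        return {lhs == rhs, os.str()};
    }
};

struct decomposer {
    // operator&amp;lt;&amp;lt; binds tighter than operator==, so it grabs the left operand first.
    template &amp;lt;typename L&amp;gt;
    lhs_capture&amp;lt;L&amp;gt; operator&amp;lt;&amp;lt;(const L&amp;amp; lhs) const {
        return {lhs};
    }
};

// SKETCH_CHECK(a == 9) expands to (decomposer() &amp;lt;&amp;lt; a == 9),
// which parses as ((decomposer() &amp;lt;&amp;lt; a) == 9).
#define SKETCH_CHECK(expr) (decomposer() &amp;lt;&amp;lt; expr)

// Binary-macro style: both operands are passed explicitly, so the
// capture object and its operator templates are not needed at all.
template &amp;lt;typename L, typename R&amp;gt;
check_result check_eq(const L&amp;amp; lhs, const R&amp;amp; rhs) {
    std::ostringstream os;
    os &amp;lt;&amp;lt; lhs &amp;lt;&amp;lt; " == " &amp;lt;&amp;lt; rhs;
    return {lhs == rhs, os.str()};
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With &lt;code&gt;int a = 9;&lt;/code&gt;, both &lt;code&gt;SKETCH_CHECK(a == 9)&lt;/code&gt; and
&lt;code&gt;check_eq(a, 9)&lt;/code&gt; produce a passing result with the text
&lt;code&gt;9 == 9&lt;/code&gt;, but the first form goes through two extra template
instantiations per assertion. A real framework supports every comparison
operator and many more types, which is where the compile time adds up.&lt;/p&gt;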
&lt;p&gt;Moreover, doctest 1.1 also introduces CHECK_EQ, which does not do any expression
decomposition. This is close to what I did in my macros, except that it is
directly integrated into the framework and preserves all its features. It is
also possible to bypass the expression checking code by using the FAST_CHECK_EQ
macro. In that case, the exceptions are not captured. Finally, a new
configuration option is introduced (DOCTEST_CONFIG_SUPER_FAST_ASSERTS) that
removes some features related to automatic debugger breaks. Since I don't use
the debugger features and I don't need to capture exceptions everywhere (it's
sufficient for me that the test fails completely if an exception is thrown), I'm
more than eager to use these new features.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;For evaluation, I have compiled the complete test suite of ETL, with 1 thread,
using gcc 4.9.3 with various different options, starting from Catch to doctest
1.1 with all compilation time features. Here are the results, in seconds:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Version&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Time&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;VS Catch&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;VS Fast Catch&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;VS doctest 1.0&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Catch&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;724.22&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Fast Catch&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;464.52&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-36%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;doctest 1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;871.54&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+20%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+87%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;doctest 1.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;614.67&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-16%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+32%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-30%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;REQUIRE_EQ&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;493.97&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-32%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+6%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-43%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;FAST_REQUIRE_EQ&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;439.09&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-39%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-6%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-50%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;SUPER_FAST_ASSERTS&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;411.11&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-43%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-12%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-53%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As you can see, doctest 1.1 is much faster to compile than doctest 1.0! This is
really great news. Moreover, it is already 16% faster than Catch. When all the
features are used, doctest is 12% faster than my stripped down versions of Catch
macros (and 43% faster than Catch standard macros). This is really cool! It
means that I don't have to do any change in the code (no need to strip macros
myself) and I can gain a lot of compilation time compared to the bare Catch
framework.&lt;/p&gt;
&lt;p&gt;I really think the author of doctest did a great job with the new version.
Although this was not of as much interest for me, there are also a lot of
other changes in the new version. You can consult the
&lt;a class="reference external" href="https://github.com/onqtam/doctest/blob/master/CHANGELOG.md"&gt;changelog&lt;/a&gt; if you want more information.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Overall, doctest 1.1 is much faster to compile than doctest 1.0. Moreover, it
offers very fast macros for test assertions that are much faster to compile
than Catch versions and even faster than the versions I created myself to reduce
compilation time. I really think this is a great advance for doctest. When
compiling with all the optimizations, doctest 1.1 saves me 50 seconds in
compilation time compared to the fast version of Catch macros and more than
5 minutes compared to the standard version of Catch macros.&lt;/p&gt;
&lt;p&gt;I'll probably start using doctest on my development machine. For now, I'll keep
Catch as well since I need it to generate the unit test reports in XML format
for Sonarqube. Once this feature appears in doctest, I'll probably drop Catch
from ETL and DLL.&lt;/p&gt;
&lt;p&gt;If you need blazing fast compilation times for your unit tests, doctest 1.1 is
probably the way to go.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>Catch</category><category>Compilers</category><category>doctest</category><category>etl</category><category>gcc</category><category>Performances</category><category>Tests</category><category>time</category><guid>https://baptiste-wicht.com/posts/2016/09/blazing-fast-unit-test-compilation-with-doctest-11.html</guid><pubDate>Wed, 21 Sep 2016 19:45:13 GMT</pubDate></item><item><title>Improve DLL and ETL Compile Time further</title><link>https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;For a while, the compilation time of my matrix/vector computation library (ETL), based on Expression Templates has become more and more problematic. I've already worked on this problem &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html"&gt;here&lt;/a&gt; and &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2015/06/improve-etl-compile-time-with-precompiled-headers.html"&gt;there&lt;/a&gt;, using some general techniques (pragmas, precompiled headers, header removals and so on). On this post, I'll talk about two major improvements I have been able to do directly in the code.&lt;/p&gt;
&lt;section id="use-of-static-if"&gt;
&lt;h2&gt;Use of static_if&lt;/h2&gt;
&lt;p&gt;Remember &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2015/07/simulate-static_if-with-c11c14.html"&gt;static_if&lt;/a&gt; ? I was able to use it to really reduce the compile time of DLL.&lt;/p&gt;
&lt;p&gt;I wrote a script to time each test case of the DLL project to find the test cases that took the longest to compile. Once I found the best candidate, I isolated the functions that took the longest to compile. It was quite tedious and I did it by hand, primarily by commenting out parts of the code and going deeper and deeper in the code. I was quite surprised to find that a single function call (template function of course ;) ) was responsible for 60% of the compilation time of my candidate test case. The function was instantiating a whole bunch of expression templates (to compute the free energy of several models). The function itself was not really optimizable, but what was really interesting is that this function was only used in some very rare cases and that these cases were known at compile-time :) This was a perfect case to use a static_if. And once the call was inside the static_if, the test case was indeed about 60% faster. &lt;strong&gt;This reduced the overall compilation time of DLL by about 30%&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;This could of course also have been achieved by using two functions, one with the call and one empty, selected by SFINAE (Substitution Failure Is Not An Error). I prefer the static_if version since it really shows the intent and hides SFINAE behind nicer syntax.&lt;/p&gt;
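&lt;p&gt;For readers who have not seen the SFINAE alternative, here is a minimal
sketch of the two-function approach. It is not the actual DLL code: the model
types, the &lt;code&gt;needs_free_energy&lt;/code&gt; flag and the function names are
hypothetical, and &lt;code&gt;free_energy&lt;/code&gt; is only a cheap stand-in for the
expensive template function. Only the first overload mentions the expensive
call, so it is instantiated only for the models that really need it.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;#include &amp;lt;type_traits&amp;gt;

// Hypothetical model types carrying a compile-time flag.
struct rare_model   { static constexpr bool needs_free_energy = true; };
struct common_model { static constexpr bool needs_free_energy = false; };

// Stand-in for the template function that is expensive to compile.
template &amp;lt;typename Model&amp;gt;
double free_energy(const Model&amp;amp;) {
    return 42.0;
}

// Selected only when the flag is true: the expensive call lives here.
template &amp;lt;typename Model,
          typename std::enable_if&amp;lt;Model::needs_free_energy, int&amp;gt;::type = 0&amp;gt;
double maybe_free_energy(const Model&amp;amp; model) {
    return free_energy(model);
}

// Selected when the flag is false: empty version, nothing expensive is instantiated.
template &amp;lt;typename Model,
          typename std::enable_if&amp;lt;!Model::needs_free_energy, int&amp;gt;::type = 0&amp;gt;
double maybe_free_energy(const Model&amp;amp;) {
    return 0.0;
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Calling &lt;code&gt;maybe_free_energy(common_model{})&lt;/code&gt; picks the empty
overload and never instantiates &lt;code&gt;free_energy&lt;/code&gt;, while
&lt;code&gt;maybe_free_energy(rare_model{})&lt;/code&gt; forwards to it. The drawback is
visible here: the condition has to be written twice and the intent is spread
over two functions.&lt;/p&gt;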
&lt;p&gt;I was also able to use static_if at other places in the DLL code to avoid instantiating some templates, but the improvements were much less dramatic (about 1% of the total compilation time). I was very lucky to find a single function that accounted for so much compile time. After some more tests, I concluded that much of the compilation time of DLL was spent compiling the Expression Templates from my ETL library so I decided to delve into ETL code directly.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="removal-of-std-async"&gt;
&lt;h2&gt;Removal of std::async&lt;/h2&gt;
&lt;p&gt;The second improvement was very surprising. I was working on improving the compilation of ETL and found out that the sum and average reductions of matrices were dramatically slow to compile, about an order of magnitude slower than standard operations on matrices. In parallel (but the two facts are linked), I also found another weird fact when splitting a file into 10 parts (the file was composed of 10 test cases). Compiling the 10 parts separately (and sequentially, not with multiple threads) was about 40% faster than compiling the complete file. There was no swapping so it was not a memory issue. This is not expected. Generally, it is faster to compile a big file than to compile its parts separately. The advantage of smaller files is that you can compile them in parallel and that incremental builds are faster (only a small part is recompiled).&lt;/p&gt;
&lt;p&gt;By elimination, I found out that most of the time was spent inside the function that was dispatching in parallel the work for accumulating the sum of a matrix. Here is the function:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-1" name="rest_code_f500b63f097d46e8afba7faf4ac56979-1" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;AccFunctor&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-2" name="rest_code_f500b63f097d46e8afba7faf4ac56979-2" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-2"&gt;&lt;/a&gt;&lt;span class="kr"&gt;inline&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dispatch_1d_acc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Functor&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AccFunctor&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-3" name="rest_code_f500b63f097d46e8afba7faf4ac56979-3" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-4" name="rest_code_f500b63f097d46e8afba7faf4ac56979-4" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-5" name="rest_code_f500b63f097d46e8afba7faf4ac56979-5" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-5"&gt;&lt;/a&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-6" name="rest_code_f500b63f097d46e8afba7faf4ac56979-6" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-7" name="rest_code_f500b63f097d46e8afba7faf4ac56979-7" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-8" name="rest_code_f500b63f097d46e8afba7faf4ac56979-8" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-9" name="rest_code_f500b63f097d46e8afba7faf4ac56979-9" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-9"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-10" name="rest_code_f500b63f097d46e8afba7faf4ac56979-10" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-10"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;launch&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;async&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span 
class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-11" name="rest_code_f500b63f097d46e8afba7faf4ac56979-11" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-11"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-12" name="rest_code_f500b63f097d46e8afba7faf4ac56979-12" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-12"&gt;&lt;/a&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-13" name="rest_code_f500b63f097d46e8afba7faf4ac56979-13" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-13"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-14" name="rest_code_f500b63f097d46e8afba7faf4ac56979-14" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-14"&gt;&lt;/a&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-15" name="rest_code_f500b63f097d46e8afba7faf4ac56979-15" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-15"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fut&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-16" name="rest_code_f500b63f097d46e8afba7faf4ac56979-16" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-16"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fut&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-17" name="rest_code_f500b63f097d46e8afba7faf4ac56979-17" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-17"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-18" name="rest_code_f500b63f097d46e8afba7faf4ac56979-18" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-18"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-19" name="rest_code_f500b63f097d46e8afba7faf4ac56979-19" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-19"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-20" name="rest_code_f500b63f097d46e8afba7faf4ac56979-20" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-20"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_f500b63f097d46e8afba7faf4ac56979-21" name="rest_code_f500b63f097d46e8afba7faf4ac56979-21" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f500b63f097d46e8afba7faf4ac56979-21"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There is nothing really fancy about this function. It takes one functor that is executed in parallel and one functor that accumulates the results. It splits the work into batches, dispatches them and then accumulates the partial results. I tried several things to reduce the compilation time of this function, but nothing worked. The line that was consuming all the time was the std::async line. The function was using std::async because the thread pool that I generally use does not support returning values from parallel functors. I decided to work around this limitation with my thread pool and came up with this version:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-1" name="rest_code_f25d473df56343699b3055f4ac6883a4-1" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;AccFunctor&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-2" name="rest_code_f25d473df56343699b3055f4ac6883a4-2" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-2"&gt;&lt;/a&gt;&lt;span class="kr"&gt;inline&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dispatch_1d_acc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Functor&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AccFunctor&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-3" name="rest_code_f25d473df56343699b3055f4ac6883a4-3" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-4" name="rest_code_f25d473df56343699b3055f4ac6883a4-4" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-5" name="rest_code_f25d473df56343699b3055f4ac6883a4-5" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;default_thread_pool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-6" name="rest_code_f25d473df56343699b3055f4ac6883a4-6" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-6"&gt;&lt;/a&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-7" name="rest_code_f25d473df56343699b3055f4ac6883a4-7" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-8" name="rest_code_f25d473df56343699b3055f4ac6883a4-8" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-8"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-9" name="rest_code_f25d473df56343699b3055f4ac6883a4-9" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-9"&gt;&lt;/a&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-10" name="rest_code_f25d473df56343699b3055f4ac6883a4-10" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-10"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sub_functor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-11" name="rest_code_f25d473df56343699b3055f4ac6883a4-11" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-11"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-12" name="rest_code_f25d473df56343699b3055f4ac6883a4-12" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-12"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-13" name="rest_code_f25d473df56343699b3055f4ac6883a4-13" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-13"&gt;&lt;/a&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-14" name="rest_code_f25d473df56343699b3055f4ac6883a4-14" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-14"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-15" name="rest_code_f25d473df56343699b3055f4ac6883a4-15" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-15"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;do_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sub_functor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-16" name="rest_code_f25d473df56343699b3055f4ac6883a4-16" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-16"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-17" name="rest_code_f25d473df56343699b3055f4ac6883a4-17" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-17"&gt;&lt;/a&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-18" name="rest_code_f25d473df56343699b3055f4ac6883a4-18" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-18"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-19" name="rest_code_f25d473df56343699b3055f4ac6883a4-19" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-19"&gt;&lt;/a&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-20" name="rest_code_f25d473df56343699b3055f4ac6883a4-20" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-20"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-21" name="rest_code_f25d473df56343699b3055f4ac6883a4-21" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-21"&gt;&lt;/a&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-22" name="rest_code_f25d473df56343699b3055f4ac6883a4-22" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-22"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fut&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-23" name="rest_code_f25d473df56343699b3055f4ac6883a4-23" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-23"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fut&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-24" name="rest_code_f25d473df56343699b3055f4ac6883a4-24" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-24"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-25" name="rest_code_f25d473df56343699b3055f4ac6883a4-25" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-25"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-26" name="rest_code_f25d473df56343699b3055f4ac6883a4-26" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-26"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;acc_functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-27" name="rest_code_f25d473df56343699b3055f4ac6883a4-27" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-27"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_f25d473df56343699b3055f4ac6883a4-28" name="rest_code_f25d473df56343699b3055f4ac6883a4-28" href="https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html#rest_code_f25d473df56343699b3055f4ac6883a4-28"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I simply preallocate space for the results of all the threads and create a new functor that calls the input functor and saves its result inside the vector. It is less elegant, but it works well, and it compiles MUCH faster. This &lt;strong&gt;reduced the compilation time&lt;/strong&gt; of my biggest test case &lt;strong&gt;by a factor of 8&lt;/strong&gt; (from 344 seconds to 44 seconds). This is really crazy. It also fixed the problem where compiling the split test cases was faster than compiling the big file (it is now twice as fast to compile the big files as to compile all the small files separately). &lt;strong&gt;This sped up the total compilation of DLL by about 400%&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;As of now, I still have no idea why this makes such a big difference. I have looked at the std::async code, but I haven't found a valid reason for this slowdown. If someone has any idea, I'd be very glad to discuss in the comments below.&lt;/p&gt;
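&lt;p&gt;To make the workaround concrete, here is a minimal, self-contained sketch of the same idea: every worker writes its partial result into a preallocated slot, so no std::future is needed. It is not the DLL implementation. As assumptions made only to get a runnable example, plain std::thread stands in for the thread pool, the number of threads is passed explicitly, and the names dispatch_1d_acc_sketch and sum_range are made up.&lt;/p&gt;
<![CDATA[
```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for dispatch_1d_acc: std::thread replaces the
// thread pool (an assumption), and `threads` is an explicit parameter.
template <typename T, typename Functor, typename AccFunctor>
void dispatch_1d_acc_sketch(std::size_t threads, Functor&& functor, AccFunctor&& acc_functor,
                            std::size_t first, std::size_t last) {
    // Serial fallback, also avoids computing threads - 1 when threads is 0.
    if (threads < 2) {
        acc_functor(functor(first, last));
        return;
    }

    // One preallocated slot per worker; no std::future is involved.
    std::vector<T> results(threads - 1);
    std::vector<std::thread> workers;
    workers.reserve(threads - 1);

    const std::size_t batch = (last - first) / threads;

    for (std::size_t t = 0; t < threads - 1; ++t) {
        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&results, &functor, t, first, batch] {
            results[t] = functor(first + t * batch, first + (t + 1) * batch);
        });
    }

    // The calling thread handles the last (possibly larger) batch.
    acc_functor(functor(first + (threads - 1) * batch, last));

    // The slots may only be read once every worker has finished.
    for (auto& worker : workers) {
        worker.join();
    }

    for (const auto& partial : results) {
        acc_functor(partial);
    }
}

// Example use: sums the integers in [first, last).
inline long sum_range(std::size_t first, std::size_t last, std::size_t threads) {
    long total = 0;
    dispatch_1d_acc_sketch<long>(
        threads,
        [](std::size_t a, std::size_t b) {
            long s = 0;
            for (std::size_t i = a; i < b; ++i) {
                s += static_cast<long>(i);
            }
            return s;
        },
        [&total](long partial) { total += partial; },
        first, last);
    return total;
}
```
]]>
&lt;p&gt;For instance, sum_range(0, 1000, 4) gives three batches of 250 elements to the workers and leaves the last batch to the calling thread. The accumulation functor only ever runs on the calling thread, which is why it can update a plain local variable.&lt;/p&gt;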
&lt;/section&gt;
&lt;section id="improving-the-template-instantiation-tree"&gt;
&lt;h2&gt;Improving the template instantiation tree&lt;/h2&gt;
&lt;p&gt;I recently discovered the templight tool, which is a profiler for templates (pretty cool). After some time, I was able to build it and use it on ETL. So far, it has not helped me reduce the compile time much, but it did let me shrink the template instantiation tree considerably: it showed that some instantiations were completely useless, and I changed the code to remove them.&lt;/p&gt;
&lt;p&gt;I won't go into much detail here because I plan to write a post on this subject in the coming days.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, I would say that it is pretty hard to improve the compile time of complex C++ programs once you have gone through all the standard methods. However, I was very happy to find that &lt;strong&gt;two optimizations in the source code sped up the overall compilation of DLL by almost 500%&lt;/strong&gt;. I will continue working on this, but for now, the compilation time is much more reasonable.&lt;/p&gt;
&lt;p&gt;I hope the two main facts in this article were interesting. If you have similar experience, comments or ideas for further improvements, I'd be glad to discuss them with you in the comments :)&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>Compilers</category><category>dll</category><category>etl</category><category>gcc</category><category>Performances</category><guid>https://baptiste-wicht.com/posts/2016/01/improve-dll-and-etl-compile-time-further.html</guid><pubDate>Fri, 29 Jan 2016 16:02:34 GMT</pubDate></item><item><title>Improve ETL compile-time with Precompiled Headers</title><link>https://baptiste-wicht.com/posts/2015/06/improve-etl-compile-time-with-precompiled-headers.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Very recently, I started trying to improve the compile-time of the ETL test suite. While not critical, it is always better to have tests that compile as fast as possible. In a &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html"&gt;previous post&lt;/a&gt;, I was able to improve the time a bit by improving the makefile, using pragma once and avoiding &lt;cite&gt;&amp;lt;iostream&amp;gt;&lt;/cite&gt; headers. With these techniques, I reduced the compile-time from 87.5 to 84.1 seconds, which is not bad, but not as good as I would have expected.&lt;/p&gt;
&lt;p&gt;In the previous post, I had not tried to use Precompiled Headers (PCH) to improve the compile time, so I thought it would be a good time to do it.&lt;/p&gt;
&lt;section id="precompiled-headers"&gt;
&lt;h2&gt;Precompiled Headers&lt;/h2&gt;
&lt;p&gt;Precompiled headers are a compiler feature that lets you compile a header on its own. Normally, you only compile source files into object files, but you can also compile headers, although the result is not the same thing. When a compiler compiles a header, it can do a lot of the work up front (macros, includes, AST, symbols) and then store all the results in a precompiled header file. When you later compile the source files, the compiler will try to use the precompiled header file instead of the real header file. Of course, this can break the C++ standard, since a precompiled header cannot behave differently depending on the macros defined where it is included, for instance. For these reasons (and probably implementation reasons as well), precompiled headers are really limited.&lt;/p&gt;
&lt;p&gt;In the case of G++, the precompiled header file will be considered instead of the standard header only if the following conditions hold (for a complete list, take a look at the GCC docs):&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The compilation flags are the same between the two compilations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The same compiler binary is used for the compilations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only one precompiled header can be used in each compilation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The same macros must be defined&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The include of the header must be before every possible C/C++ token&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If all these conditions are met and you &lt;cite&gt;#include "header.hpp"&lt;/cite&gt; while a header.hpp.gch (the precompiled file) is available in the search path, then the precompiled header will be used instead of the standard one.&lt;/p&gt;
&lt;p&gt;With clang, it is a bit different because the precompiled header cannot be included automatically, but has to be included explicitly in the source code, meaning you have to modify your code for this technique to work. This is a bad thing in my opinion: you should never have to modify your code to profit from a compiler feature. This is why I haven't used and don't plan to use precompiled headers with clang.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-to"&gt;
&lt;h2&gt;How-to&lt;/h2&gt;
&lt;p&gt;Once you know all the conditions for a precompiled header to be automatically included, it is quite straightforward to use them.&lt;/p&gt;
&lt;p&gt;Generating a PCH file is easy:&lt;/p&gt;
&lt;pre class="literal-block"&gt;g++ options header.hpp&lt;/pre&gt;
&lt;p&gt;This will generate header.hpp.gch. When you compile a source file that includes header.hpp, you don't have anything special to do: you just compile it as usual and, if all the conditions are met, the PCH file will be used instead of the original header.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="results-and-conclusion"&gt;
&lt;h2&gt;Results and conclusion&lt;/h2&gt;
&lt;p&gt;I added precompiled header support to my &lt;a class="reference external" href="https://github.com/wichtounet/make-utils"&gt;make-utils&lt;/a&gt; collection of Makefile utilities and tested it on ETL. I have precompiled a header that itself includes Catch and ETL. Almost all test files include this header. With this change, I went from 84 seconds to 78 seconds. The headers take 1.5 seconds to be precompiled. This is a nice result I think. If your application is not as template-heavy as mine or if you have more source files, you should expect better improvements.&lt;/p&gt;
&lt;p&gt;To conclude, even if precompiled headers are a sound way to reduce compile-time, they are really limited to some cases. I'm not a fan of the feature overall. It is not portable between compilers and not standard. Anyway, if you are really in need of saving some time, you should not hesitate too much ;)&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>Compilers</category><category>etl</category><category>gcc</category><category>Performances</category><guid>https://baptiste-wicht.com/posts/2015/06/improve-etl-compile-time-with-precompiled-headers.html</guid><pubDate>Sat, 20 Jun 2015 13:08:31 GMT</pubDate></item><item><title>How I improved (a bit) compile time of ETL ?</title><link>https://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Recently I read several articles about C++ and compile time and I wondered if I could improve the compile time of my Expression Template Library (ETL) project. ETL is a header-only and template-heavy library. I'm not going to change the design completely or use type erasure techniques to reduce the compile time, since ETL is all about performance.&lt;/p&gt;
&lt;p&gt;As a disclaimer, don't expect fancy results from this post, I haven't been able to reduce compile time a lot, but I still wanted to share my experience.&lt;/p&gt;
&lt;p&gt;I've used g++-4.9.2 to perform these tests.&lt;/p&gt;
&lt;p&gt;I'm compiling the complete test suite (around 6900 source lines of code in 36 files) in release mode. Each test file includes the ETL (around 10K SLOC). Each build is run with 8 threads (make -j8). For each result, I have run a complete build 5 times and taken the best result as the final result. Everything is run on an SSD and I have more than enough RAM to handle all the compilation in parallel.&lt;/p&gt;
&lt;p&gt;The reference build time was 87.5 seconds.&lt;/p&gt;
&lt;section id="compile-and-generate-dependency-files-at-the-same-time"&gt;
&lt;h2&gt;Compile and generate dependency files at the same time&lt;/h2&gt;
&lt;p&gt;To help write my makefiles, I'm using a set of functions that I have written. This includes automatic dependency generation using the -MM and -MT options of the compiler. Until now, I had two targets, one to compile the cpp file into the object file and another one to generate the dependency file. I recently saw that compilers were able to do both at the same time! Clang, G++ and the Intel compiler all have -MD and -MF options that let you generate the dependency file at the same time you compile your file, saving you at least one read of the file.&lt;/p&gt;
&lt;p&gt;My compilation rule in my makefile has now become:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code makefile"&gt;&lt;a id="rest_code_4330aaf031294e77b1583904915c57b7-1" name="rest_code_4330aaf031294e77b1583904915c57b7-1" href="https://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html#rest_code_4330aaf031294e77b1583904915c57b7-1"&gt;&lt;/a&gt;&lt;span class="nf"&gt;release/$(1)/%.cpp.o&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;/%.&lt;span class="n"&gt;cpp&lt;/span&gt;
&lt;a id="rest_code_4330aaf031294e77b1583904915c57b7-2" name="rest_code_4330aaf031294e77b1583904915c57b7-2" href="https://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html#rest_code_4330aaf031294e77b1583904915c57b7-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;@&lt;span class="w"&gt; &lt;/span&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;release/&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;/
&lt;a id="rest_code_4330aaf031294e77b1583904915c57b7-3" name="rest_code_4330aaf031294e77b1583904915c57b7-3" href="https://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html#rest_code_4330aaf031294e77b1583904915c57b7-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;CXX&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;CXX_FLAGS&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;RELEASE_FLAGS&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-MD&lt;span class="w"&gt; &lt;/span&gt;-MF&lt;span class="w"&gt; &lt;/span&gt;release/&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;/&lt;span class="nv"&gt;$$&lt;/span&gt;*.cpp.d&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;release/&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;/&lt;span class="nv"&gt;$$&lt;/span&gt;*.cpp.o&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;/&lt;span class="nv"&gt;$$&lt;/span&gt;*.cpp
&lt;a id="rest_code_4330aaf031294e77b1583904915c57b7-4" name="rest_code_4330aaf031294e77b1583904915c57b7-4" href="https://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html#rest_code_4330aaf031294e77b1583904915c57b7-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;@&lt;span class="w"&gt; &lt;/span&gt;sed&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'s@^\(.*\)\.o:@\1.d \1.o:@'&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;release/&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;/&lt;span class="nv"&gt;$$&lt;/span&gt;*.cpp.d
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This reduced the compilation time to 86.8 seconds. Not much of a reduction, but it is still quite nice to know. I would have expected this to reduce the compile time more.&lt;/p&gt;
&lt;section id="use-pragma-once"&gt;
&lt;h3&gt;Use #pragma once&lt;/h3&gt;
&lt;p&gt;Normally, I'm not a fan of #pragma since it is not standard, but for now ETL only supports three compilers, and only very recent versions of them, so I have the guarantee that #pragma once is available, so what the hell!&lt;/p&gt;
&lt;p&gt;I've replaced all the include guards by single #pragma once directives.&lt;/p&gt;
&lt;p&gt;Again, the results are not impressive: this reduced the compile time to 86.2 seconds. I would only advise using this if you are sure of the compilers you want to support and you need the extra time.&lt;/p&gt;
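&lt;p&gt;For illustration, here is what the change looks like on a made-up header. The file name, the guard macro and the &lt;code&gt;demo&lt;/code&gt; namespace are placeholders, not actual ETL code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;// vector_utils.hpp (hypothetical header)
//
// Before, with a classic include guard:
//   #ifndef VECTOR_UTILS_HPP
//   #define VECTOR_UTILS_HPP
//   ...
//   #endif
//
// After: one directive at the top of the file. It is not standard, but it
// is supported by GCC, Clang and the Intel compiler.
#pragma once

namespace demo {

// Placeholder content standing in for the real declarations of the header.
inline int twice(int x) {
    return 2 * x;
}

} // namespace demo
&lt;/pre&gt;&lt;/div&gt;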
&lt;/section&gt;
&lt;section id="avoid-iostream"&gt;
&lt;h3&gt;Avoid &amp;lt;iostream&amp;gt;&lt;/h3&gt;
&lt;p&gt;I've read that the &amp;lt;iostream&amp;gt; header is one of the slowest headers of the STL to compile. It is included several times in my headers, only for stream operators, and it turns out that there is an &amp;lt;iosfwd&amp;gt; header that forward declares a lot of things from &amp;lt;iostream&amp;gt; and other I/O headers.&lt;/p&gt;
&lt;p&gt;By replacing all &amp;lt;iostream&amp;gt; includes with &amp;lt;iosfwd&amp;gt;, compile time has gone down to 84.1 seconds.&lt;/p&gt;
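&lt;p&gt;To show the principle, here is a small sketch with a made-up &lt;code&gt;point&lt;/code&gt; type (the type, the &lt;code&gt;demo&lt;/code&gt; namespace and the file names are assumptions for the example, not ETL code). Declaring a stream operator only needs the forward declaration of &lt;code&gt;std::ostream&lt;/code&gt;; only the file that actually writes to the stream needs the full header. In a header-only library the split would look different, so take this as an illustration only:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;// point.hpp (hypothetical header): only forward declarations are needed.
#include &amp;lt;iosfwd&amp;gt;

namespace demo {

struct point {
    int x;
    int y;
};

// Declaring the operator only requires std::ostream to be declared,
// not defined, so &amp;lt;iosfwd&amp;gt; is enough here.
std::ostream&amp;amp; operator&amp;lt;&amp;lt;(std::ostream&amp;amp; os, const point&amp;amp; p);

} // namespace demo

// point.cpp (hypothetical source file): the definition needs the complete
// stream type, so this is where &amp;lt;ostream&amp;gt; gets included.
#include &amp;lt;ostream&amp;gt;

namespace demo {

std::ostream&amp;amp; operator&amp;lt;&amp;lt;(std::ostream&amp;amp; os, const point&amp;amp; p) {
    return os &amp;lt;&amp;lt; '(' &amp;lt;&amp;lt; p.x &amp;lt;&amp;lt; ", " &amp;lt;&amp;lt; p.y &amp;lt;&amp;lt; ')';
}

} // namespace demo
&lt;/pre&gt;&lt;/div&gt;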
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;By using the three techniques, I've reduced the compile time from 87.5 to 84.1 seconds. I would have honestly hoped for more improvements, but this is already a good start.&lt;/p&gt;
&lt;p&gt;As a side note, clang compile time is 45.2 seconds under the same conditions (it was 46.2 seconds before the optimizations). It is really much faster :) I'm still using GCC a lot since, in several cases, it generates much better code and, on average, the generated code is faster (on my benchmarks at least). I don't have the numbers for icc, but icc is definitely the slowest of the three. When I have it available (at work), I use it for release builds before running something. The generated executables are generally faster (I only use Intel processors) and sometimes the difference can be quite significant.&lt;/p&gt;
&lt;p&gt;If you have ideas to further reduce the compile time on this test case, I'd be glad to hear them and put them to the test.&lt;/p&gt;
&lt;p&gt;I hope that this small experiment will be helpful to some of you :)&lt;/p&gt;
&lt;/section&gt;
&lt;section id="other-techniques"&gt;
&lt;h3&gt;Other techniques&lt;/h3&gt;
&lt;p&gt;There are several other techniques that you can use to reduce compile time:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Precompiled Headers are supported by both Clang and GCC, although not in a compatible way. I haven't tested this in a while, but it is quite effective and a very interesting technique. The main problem with this is that it is not standard and not compatible between compilers. But it probably is the most efficient technique when you have lots of headers and lots of templates as in my case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unity builds can make full rebuilds much faster. I personally don't like unity builds, especially because they are only really good for full builds and you generally don't do full rebuilds that much (I know, I know, this is also the test done in this article :) ). Moreover, they also suck at doing parallel builds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pimpl idioms and other type erasure techniques can reduce compile time a lot. If it is well done, it can be implemented without so much overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explicit instantiation of templates can also help, but only in the case of a user program. In the case of a library itself, you cannot do anything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduce inclusions and use forward declarations, obviously...&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use tools like distcc (I very rarely use it) and ccache (I generally use it).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update your compiler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upgrade your computer ;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
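&lt;p&gt;As an example of the explicit instantiation technique from the list above, here is a minimal sketch. The function &lt;code&gt;sum3&lt;/code&gt;, the &lt;code&gt;demo&lt;/code&gt; namespace and the file names are made up for the illustration:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;// sum.hpp (hypothetical header)
namespace demo {

template &amp;lt;typename T&amp;gt;
T sum3(T a, T b, T c) {
    return a + b + c;
}

// Explicit instantiation declaration: asks the compiler not to instantiate
// sum3&amp;lt;double&amp;gt; implicitly in the files that include this header.
extern template double sum3&amp;lt;double&amp;gt;(double, double, double);

} // namespace demo

// sum.cpp (hypothetical source file): explicit instantiation definition,
// the single place where the code of sum3&amp;lt;double&amp;gt; is generated.
namespace demo {

template double sum3&amp;lt;double&amp;gt;(double, double, double);

} // namespace demo
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This only works when the set of template arguments is known in advance, which is why it helps a user program more than a generic library.&lt;/p&gt;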
&lt;/section&gt;
&lt;/section&gt;</description><category>C++</category><category>Compilers</category><category>gcc</category><category>Performance</category><guid>https://baptiste-wicht.com/posts/2015/06/how-i-improved-a-bit-compile-time-of-etl.html</guid><pubDate>Tue, 16 Jun 2015 20:00:21 GMT</pubDate></item><item><title>GCC 4.7 vs CLang 3.1 on eddic</title><link>https://baptiste-wicht.com/posts/2012/11/gcc-4-7-clang-3-1-eddic.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;&lt;a href="http://www.baptiste-wicht.com/2012/11/eddic-compiles-with-clang-3-1/" title="eddic compiles with CLang 3.1"&gt;Now that eddic can be compiled with CLang&lt;/a&gt;, I wanted to compare the differences in compilation time and in performance of the generated executable between those two compilers. The tests are done using GCC 4.7.2 and CLang 3.1 on Gentoo.&lt;/p&gt;
&lt;h3&gt;Compilation Time&lt;/h3&gt;

&lt;p&gt;The first thing that I tested was the time each compiler takes to compile eddic with different flags. I tested the compilation in debug mode and with -O2 and -O3.&lt;/p&gt;
&lt;div id="graph_0" style="width: 400px; height: 300px;"&gt;&lt;/div&gt;
&lt;p&gt;&lt;input id="button_graph_0" type="button" value="Logarithmic scale"&gt;
&lt;script type="text/javascript"&gt;function draw_graph_0(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_0'));var data=google.visualization.arrayToDataTable([['Options','GCC','CLang'],['-g',234.59,119.59],['-O2',273.02,178.22],['-O3',276.87,183.78],]);var options={title:"Compilation Time - Less is better",animation:{duration:1200,easing:"in"},width:'400px',height:'300px',hAxis:{title:"Options"},vAxis:{title:"Seconds",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_0');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}&lt;/script&gt;
&lt;/p&gt;
&lt;p&gt;The most interesting fact in these results is that CLang is much faster than GCC. It takes about half the time to compile eddic with CLang in debug mode compared to GCC. The impact of optimizations on compilation time is also larger for CLang than for GCC. For both compilers, -O3 does not seem to add a lot of overhead over -O2.&lt;/p&gt;
&lt;h3&gt;Runtime performance&lt;/h3&gt;

&lt;p&gt;Then, I tested the performance of the generated executable. I tested it on three things: the whole test suite and two test cases that I know are the slowest for the EDDI Compiler. For each case, I took the slowest value of 5 consecutive executions.&lt;/p&gt;
&lt;div id="graph_1" style="width: 600px; height: 400px;"&gt;&lt;/div&gt;
&lt;p&gt;&lt;input id="button_graph_1" type="button" value="Logarithmic scale"&gt;
&lt;script type="text/javascript"&gt;function draw_graph_1(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_1'));var data=google.visualization.arrayToDataTable([['Compiler','GCC -O2','GCC -O3','CLang -O2','CLang -O3'],['testsuite',6.58,6.59,6.74,6.58],['assembly',1.2,1.2,1.2,1.2],['linked_list',0.51,0.5,0.49,0.49],]);var options={title:"Runtime Performance - Less is better",animation:{duration:1200,easing:"in"},width:'600px',height:'400px',hAxis:{title:"Options"},vAxis:{title:"Seconds",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_1');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}&lt;/script&gt;
&lt;/p&gt;
&lt;p&gt;The differences are very small. With -O2, GCC performs a bit better on the test suite, but with -O3, the performance is equivalent. I was a bit disappointed by the results, because I thought that there would be bigger differences. It seems that CLang is not as far behind GCC as some people would like to say. It also certainly depends on the program being compiled.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;It is clear that CLang is much faster than GCC at compiling eddic. Moreover, the performance of the generated executables is almost identical.&lt;/p&gt;
&lt;p&gt;I will continue to use CLang as my development compiler and switch between the two when I'm doing performance benchmarking. I will try to update the benchmark once new versions of GCC / CLang are available.&lt;/p&gt;
&lt;script type="text/javascript"&gt;function draw_visualization(){draw_graph_0();draw_graph_1();}google.setOnLoadCallback(draw_visualization);&lt;/script&gt;</description><category>Benchmarks</category><category>clang</category><category>Compilers</category><category>EDDI</category><category>gcc</category><category>Performances</category><guid>https://baptiste-wicht.com/posts/2012/11/gcc-4-7-clang-3-1-eddic.html</guid><pubDate>Mon, 12 Nov 2012 08:28:44 GMT</pubDate></item><item><title>eddic compiles with CLang 3.1</title><link>https://baptiste-wicht.com/posts/2012/11/eddic-compiles-with-clang-3-1.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;I finally added support for compiling eddic with LLVM CLang 3.1 !&lt;/p&gt;
&lt;p&gt;The current development version can be completely compiled with CLang. Starting with version 1.1.4, all versions of eddic will support GCC and CLang. &lt;/p&gt;
&lt;p&gt;The changes have not been as painful as I first thought. &lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;The main problem that I had was about a static const variable of a class that had no user-defined constructor. GCC allows that, but it is not standard compliant and CLang was complaining. &lt;/li&gt;
    &lt;li&gt;Another problem that I encountered was about the use of bit flags and Template Meta Programming. I simplified that with a simple type trait and it worked. I don't really know why this did not work at first. &lt;/li&gt;
    &lt;li&gt;The remaining effort was to fix the several warnings that CLang had. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CLang also helped me fix a bug in my code with a warning on an assignment that was not supposed to be an assignment, thanks CLang. &lt;/p&gt;
&lt;p&gt;The most interesting fact about CLang is that &lt;strong&gt;it is twice as fast as GCC at building eddic&lt;/strong&gt;. I think I'm gonna use it during development to speed up compilation. Moreover, even if I only worked two days with it, it seems that the error messages are indeed better than GCC's. &lt;/p&gt;
&lt;p&gt;I haven't tried to compare the performances of eddic in both cases, but I will do that in the future, soon after the 1.1.4 version is released. &lt;/p&gt;
&lt;p&gt;I tried the CLang static analyzer on eddic but it didn't find any bugs. Moreover, it crashed on several of my files. I haven't found out why yet, but I will continue to investigate, perhaps I'm not using it correctly. &lt;/p&gt;
&lt;p&gt;I expect to publish the next version of eddic in the next two weeks. This version has many more improvements than I thought at first and I have less time to work on it now that &lt;a href="http://www.baptiste-wicht.com/2012/09/back-in-berkeley-california/" title="Back in Berkeley, California" target="_blank"&gt;I'm working on my Master thesis&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;More information on CLang: &lt;a href="http://clang.llvm.org/" title="CLang official site"&gt;The official site&lt;/a&gt;.&lt;/p&gt;</description><category>clang</category><category>Compilers</category><category>EDDI</category><category>gcc</category><category>Linux</category><guid>https://baptiste-wicht.com/posts/2012/11/eddic-compiles-with-clang-3-1.html</guid><pubDate>Thu, 01 Nov 2012 08:11:05 GMT</pubDate></item><item><title>Back in Berkeley, California</title><link>https://baptiste-wicht.com/posts/2012/09/back-in-berkeley-california.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;I arrived yesterday in Berkeley, California.&lt;/p&gt;
&lt;p&gt;Just like I did my Bachelor thesis at Lawrence Berkeley National Laboratory (LBNL), I will do my Master Thesis there too. The thesis will last a bit less than a semester.&lt;/p&gt;
&lt;p&gt;During my Master Thesis I will try to use profiling samples from the Linux perf tools in GCC or Clang to optimize processor cache usage (avoid cache misses and page faults).&lt;/p&gt;
&lt;p&gt;I will try to publish some posts about that during the semester if I have time.&lt;/p&gt;</description><category>Compilers</category><category>gcc</category><category>Others</category><category>Personal</category><category>The site</category><guid>https://baptiste-wicht.com/posts/2012/09/back-in-berkeley-california.html</guid><pubDate>Thu, 13 Sep 2012 08:35:43 GMT</pubDate></item><item><title>Install the Insight Debugger on Linux Mint (works for Ubuntu too)</title><link>https://baptiste-wicht.com/posts/2012/01/install-insight-debugger-linux-mint-ubuntu.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Insight is a very good debugger based on gdb. I prefer it over ddd or kdbg as I find it clearer and easier to use. Moreover, this debugger is also the one used in the book &lt;strong&gt;Assembly language Step by Step, for Linux&lt;/strong&gt;. However, Insight was removed from the Debian packages more than a year ago. &lt;/p&gt;
&lt;p&gt;But, thanks to SevenMachines, a PPA repository is available to install it on Linux Mint (works also on Ubuntu and Ubuntu-based Linux distributions). &lt;/p&gt;
&lt;p&gt;To add the repository to your apt sources, add the following lines to the /etc/apt/sources.list file:&lt;/p&gt;
&lt;pre&gt;deb http://ppa.launchpad.net/sevenmachines/dev/ubuntu natty main 
deb-src http://ppa.launchpad.net/sevenmachines/dev/ubuntu natty main &lt;/pre&gt;

&lt;p&gt;and update your apt sources: &lt;/p&gt;
&lt;pre&gt;sudo apt-get update&lt;/pre&gt;

&lt;p&gt;Then you can install insight: &lt;/p&gt;
&lt;pre&gt;sudo apt-get install insight&lt;/pre&gt;

&lt;p&gt;And now you are ready to use Insight as your debugger. &lt;/p&gt;
&lt;p&gt;If you don't trust this PPA repository, you can also try to install it from the sources (http://sources.redhat.com/insight/), but it doesn't seem to be very simple. I wasn't able to build it on my Linux Mint 12.&lt;/p&gt;</description><category>Assembly</category><category>C++</category><category>gcc</category><category>Linux</category><category>Mint</category><category>Tools</category><guid>https://baptiste-wicht.com/posts/2012/01/install-insight-debugger-linux-mint-ubuntu.html</guid><pubDate>Thu, 26 Jan 2012 08:28:41 GMT</pubDate></item><item><title>Diploma Thesis : Inlining Assistance for large-scale object-oriented applications</title><link>https://baptiste-wicht.com/posts/2011/10/diploma-thesis-inlining-assistance-for-large-scale-object-oriented-applications.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;div&gt;&lt;p&gt;One month ago, my diploma thesis was accepted and I got my Bachelor of Science in Computer Science.&lt;/p&gt;
&lt;p&gt;I did my diploma thesis at Lawrence Berkeley National Laboratory, Berkeley, California. I was in the team responsible for the development of the ATLAS software for the LHC at CERN. The title of my thesis is &lt;strong&gt;Inlining Assistance for large-scale object-oriented applications&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The goal of this project was to create a C++ analyzer to find the best functions and call sites to inline. The input of the analyzer is a call graph generated by CallGrind of the Valgrind project.&lt;/p&gt;
&lt;p&gt;The functions and call sites to inline are computed using a heuristic, called the temperature. This heuristic is based on the cost of calling the given function, the frequency of calls and the size of the function. The cost of calling a function is based on the number of parameters, the virtuality of the function and the shared object the function is located in.&lt;/p&gt;
&lt;p&gt;The analyzer is also able to find clusters of call sites. A cluster is a set of hot call sites related to each other. It can also find the functions that should be moved from one library to another, or the functions that should not be virtual, by testing the use of each function in a class hierarchy.&lt;/p&gt;
&lt;p&gt;To achieve this project, it has been necessary to study in detail how a function is called on the Linux platform. The inlining optimization has also been studied to understand the advantages and the problems of this technique.&lt;/p&gt;
&lt;p&gt;To retrieve the information about the sizes and the virtuality of the functions, it has been necessary to read the shared libraries and executable files. For that, we used &lt;em&gt;libelf&lt;/em&gt;. The virtuality of a function is determined by reading each virtual table and searching for the function in the virtual table's content.&lt;/p&gt;
&lt;p&gt;The graph manipulation is done with the &lt;em&gt;Boost Graph Library&lt;/em&gt;. As it is an advanced library, it helped me improve my skills in specific topics like templates, traits or Template Metaprogramming.&lt;/p&gt;
&lt;p&gt;The analyzer is able to run on the Linux platform on any program that has been compiled using gcc.&lt;/p&gt;
&lt;p class="more"&gt;&lt;a href="https://baptiste-wicht.com/posts/2011/10/diploma-thesis-inlining-assistance-for-large-scale-object-oriented-applications.html"&gt;Read more…&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><category>Boost</category><category>C++</category><category>Compilers</category><category>gcc</category><category>Linux</category><category>Optimization</category><category>Performances</category><category>Personal</category><guid>https://baptiste-wicht.com/posts/2011/10/diploma-thesis-inlining-assistance-for-large-scale-object-oriented-applications.html</guid><pubDate>Mon, 03 Oct 2011 06:44:17 GMT</pubDate></item><item><title>How to install a specific version of GCC on Ubuntu 11.04 (natty)</title><link>https://baptiste-wicht.com/posts/2011/06/install-specific-version-gcc-ubuntu.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;div&gt;&lt;p&gt;Sometimes you need to install a specific version of gcc for some reasons, for example when you need to have the same compiler version as the one used by your team. &lt;/p&gt;
&lt;p&gt;In that case, the package manager doesn't help because not every version of gcc is packaged in every version of Ubuntu. So you must install it by hand; it can take a little time and there are some things that have to be done for it to work. &lt;/p&gt;
&lt;p&gt;I'm talking here about Ubuntu 11.04 (natty), because this is the version of Ubuntu I did the installation on. This procedure will most likely work on other versions too, but some dependencies that are already installed in natty may be missing in your version or, on the contrary, a dependency I had to install may already be present. &lt;/p&gt;
&lt;p&gt;So this article will detail every step to install a specific version of gcc. &lt;/p&gt;
&lt;p class="more"&gt;&lt;a href="https://baptiste-wicht.com/posts/2011/06/install-specific-version-gcc-ubuntu.html"&gt;Read more…&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><category>C++</category><category>gcc</category><category>Linux</category><guid>https://baptiste-wicht.com/posts/2011/06/install-specific-version-gcc-ubuntu.html</guid><pubDate>Fri, 17 Jun 2011 06:18:29 GMT</pubDate></item></channel></rss>