<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog blog("Baptiste Wicht"); (Posts about clang)</title><link>https://baptiste-wicht.com/</link><description></description><atom:link href="https://baptiste-wicht.com/categories/clang.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><lastBuildDate>Sun, 15 Feb 2026 06:57:39 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Decrease DLL neural network compilation time with C++17</title><link>https://baptiste-wicht.com/posts/2018/02/decrease-dll-neural-network-compilation-time-with-c%2B%2B17.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Just last week, &lt;a class="reference external" href="https://baptiste-wicht.com/posts/2018/02/c%2B%2B17-migration-of-expression-templates-library-etl.html"&gt;I've migrated my Expression Templates Library (ETL) library to C++17&lt;/a&gt;,
and the same is now done for my Deep Learning Library (DLL). In ETL, this
resulted in &lt;em&gt;much nicer code overall&lt;/em&gt;, but no real improvement in compilation
time.&lt;/p&gt;
&lt;p&gt;The objective of the migration of DLL was two-fold. First, I wanted to
simplify some code, especially with &lt;code&gt;if constexpr&lt;/code&gt;. Second, I especially
wanted to try to reduce the compilation time. In the past,
&lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html"&gt;I've already tried a few changes with C++17&lt;/a&gt;, with good results on the compilation of the entire test suite.
While this is very good, the test suite is not very representative of how the library is used.
Indeed, you'll normally have only one network in your source file, not several.
The new changes will especially help in the case of many networks, but less in
the case of a single network per source file.&lt;/p&gt;
&lt;p&gt;This time, I decided to test the compilation on the examples. I've tested the
nine official examples from the DLL library:&lt;/p&gt;
&lt;ol class="arabic simple" start="0"&gt;
&lt;li&gt;&lt;p&gt;mnist_dbn: A fully-connected Deep Belief Network (DBN) on the MNIST data set
with three layers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;char_cnn: A special CNN with embeddings and merge and group layers for text
recognition&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;imagenet_cnn: A 12-layer Convolutional Neural Network (CNN) for Imagenet&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_ae: A simple two-layer auto-encoder for MNIST&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_cnn: A simple 6-layer CNN for MNIST&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_deep_ae: A deep auto-encoder for MNIST, only fully-connected&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_lstm: A Recurrent Neural Network (RNN) with Long Short Term Memory
(LSTM) cells&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_mlp: A simple fully-connected network for MNIST, with dropout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mnist_rnn: A simple RNN with simple cells for MNIST&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is really representative of what users can do with the library and I think
it's a much better benchmark for compilation time.&lt;/p&gt;
&lt;p&gt;For reference, you can find &lt;a class="reference external" href="https://github.com/wichtounet/dll/tree/master/examples/src"&gt;the source code of all the examples online&lt;/a&gt;.&lt;/p&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;Let's start with the results. I've tested this at different stages of the
migration with clang 5 and GCC 7.2. I tested the following steps:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;The original C++14 version&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simply compiling in C++17 mode (-std=c++17)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using the C++17 version of the ETL library&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upgrading DLL to C++17 (without ETL)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ETL and DLL in C++17 versions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I've compiled each example independently in release_debug mode. Here are the
results for G++ 7.2:&lt;/p&gt;
&lt;table class="align-center"&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Example&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;0&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;7&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;C++14&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.818&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.944&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.511&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.403&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.998&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.911&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.745&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.974&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.006&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;-std=c++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.358&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.409&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.707&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.810&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.042&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.896&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.635&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.134&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.027&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;ETL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.045&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.000&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.942&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.322&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.840&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.747&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.151&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.208&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.939&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;DLL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.251&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.577&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.854&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.653&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.758&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.851&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.606&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.098&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.146&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.289&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.133&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.939&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.232&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.753&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.526&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.326&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.116&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.819&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final Improvement&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.62%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.49%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.67%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.11%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.15%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.27%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.24%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The difference from just enabling C++17 is not significant. On the other hand,
some significant gains can be obtained by using the C++17 version of ETL,
especially for the DBN version and for the CNN versions. Except for the DBN
case, the migration of DLL to C++17 did not bring any significant advantage.
When everything is combined, the gains are larger :) In the best case,
the example is 14.6% faster to compile.&lt;/p&gt;
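&lt;p&gt;For reference, the "Final Improvement" row is consistent with the relative reduction between the C++14 row and the "Final C++17" row. A minimal sketch of that arithmetic is given here; the name &lt;code&gt;improvement_percent&lt;/code&gt; is made up for this illustration and is not part of DLL or ETL:&lt;/p&gt;

```cpp
// Hypothetical helper (not part of DLL or ETL): relative reduction in
// compilation time, in percent, between two timings given in seconds.
constexpr double improvement_percent(double before_seconds, double after_seconds) {
    return (before_seconds - after_seconds) / before_seconds * 100.0;
}

// mnist_dbn with G++ 7.2: 37.818 s in C++14, 32.289 s with both libraries in C++17.
constexpr double mnist_dbn_gcc = improvement_percent(37.818, 32.289);
```

&lt;p&gt;With the mnist_dbn timings, this evaluates to roughly 14.62, which matches the first column of the table.&lt;/p&gt;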
&lt;p&gt;Let's see if it's the same with clang++ 5.0:&lt;/p&gt;
&lt;table class="align-center"&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Example&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;0&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;7&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;C++14&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.690&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.753&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.488&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.146&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.926&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.708&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.806&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.207&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.858&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;-std=c++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.502&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.664&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.990&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.027&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.510&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.630&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.465&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.161&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.860&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;ETL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.386&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.008&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.896&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.519&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.269&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.995&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.897&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.383&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.809&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;DLL C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.252&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.592&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.250&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.131&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.782&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.606&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.595&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.126&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.782&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final C++17&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.470&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.154&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.881&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.415&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.279&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;17.078&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.808&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.497&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.761&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Final Improvement&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.28%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.60%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.15%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.55%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.34%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.69%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.25%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;First of all, as I have seen time after time, clang is still slower than GCC.
It's not a big difference, but still significant. Overall, the gains are a bit
higher on clang than on GCC, but not by much. Interestingly, the migration of
DLL to C++17 is less interesting in terms of compilation time for clang. It
even seems to slow down compilation on some examples. On the other hand, the
migration of ETL matters more than on GCC.&lt;/p&gt;
&lt;p&gt;Overall, every example is faster to compile using both libraries in C++17, but
we don't have spectacular speed-ups. With clang, we have speedups from 3.3% to
15.3%. With GCC, we have speedups from 1.1% to 14.6%. It's not very high, but
I'm already satisfied with these results.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="c-17-in-dll"&gt;
&lt;h2&gt;C++17 in DLL&lt;/h2&gt;
&lt;p&gt;Overall, the migration of DLL to C++17 was quite similar to that of ETL. You can
take a look at my &lt;a class="reference external" href="https://baptiste-wicht.com/posts/2018/02/c%2B%2B17-migration-of-expression-templates-library-etl.html"&gt;previous article&lt;/a&gt;
if you want more details on C++17 features I've used.&lt;/p&gt;
&lt;p&gt;I've &lt;em&gt;replaced a lot of SFINAE functions&lt;/em&gt; with &lt;code&gt;if constexpr&lt;/code&gt;. I've also
replaced a lot of &lt;code&gt;static_if&lt;/code&gt; with &lt;code&gt;if constexpr&lt;/code&gt;. There was a large
number of these in DLL's code. I also enabled all the &lt;code&gt;constexpr&lt;/code&gt; that
were commented out for this exact moment :)&lt;/p&gt;
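&lt;p&gt;As a rough illustration of the pattern (this is not DLL's actual code), the sketch below selects between two computations at compile time with &lt;code&gt;if constexpr&lt;/code&gt;. The types &lt;code&gt;dense_layer&lt;/code&gt; and &lt;code&gt;conv_layer&lt;/code&gt;, the flag &lt;code&gt;is_conv&lt;/code&gt; and the name &lt;code&gt;weight_count&lt;/code&gt; are all hypothetical, and a generic lambda is used only to keep the example short. It needs a C++17 compiler:&lt;/p&gt;

```cpp
// Hypothetical layer descriptors, invented for this sketch (not DLL types).
struct dense_layer {
    static constexpr bool is_conv = false;
    int units;
};

struct conv_layer {
    static constexpr bool is_conv = true;
    int filters;
    int kernel;
};

// A single generic function instead of two SFINAE overloads: the branch that
// does not apply to the layer type is discarded at compile time and never
// instantiated, so it may use members that the other layer type lacks.
constexpr auto weight_count = [](auto layer, int inputs) {
    if constexpr (decltype(layer)::is_conv) {
        return layer.filters * layer.kernel * layer.kernel * inputs;
    } else {
        return layer.units * inputs;
    }
};
```

&lt;p&gt;Calling &lt;code&gt;weight_count(dense_layer{10}, 5)&lt;/code&gt; only instantiates the second branch, so the missing &lt;code&gt;filters&lt;/code&gt; member on &lt;code&gt;dense_layer&lt;/code&gt; is never an error.&lt;/p&gt;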
&lt;p&gt;I was also thinking that I could replace a lot of meta-programming stuff with
&lt;em&gt;fold expressions&lt;/em&gt;. While I was able to replace a few of them, most of them were
harder to replace with fold expressions. Indeed, the variadic pack is often
hidden behind another class and therefore the pack is not directly usable from
the network class or the group and merge layer classes. I didn't want to start
a big refactoring just to use a C++17 feature; the current state of this code is
fine.&lt;/p&gt;
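&lt;p&gt;For readers unfamiliar with the feature, here is a minimal sketch of a fold expression in the easy case, where the variadic pack is directly in scope. The name &lt;code&gt;sum_all&lt;/code&gt; is hypothetical and the code is not taken from DLL; it needs a C++17 compiler:&lt;/p&gt;

```cpp
// Hypothetical helper, invented for this sketch: sums any number of values
// with a binary left fold, (0 + ... + counts), where older code would need
// a recursive template. The leading 0 makes the empty pack well-defined.
constexpr auto sum_all = [](auto... counts) {
    return (0 + ... + counts);
};
```

&lt;p&gt;The fold only works because &lt;code&gt;counts&lt;/code&gt; is a pack visible at the point of use. When the pack is hidden inside another class, as described above, it first has to be exposed again, which is where the refactoring cost comes from.&lt;/p&gt;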
&lt;p&gt;I made some use of structured bindings as well, but again not as much as I was
expecting. In fact, a lot of the time, I'm assigning the elements of a pair or tuple
to existing variables, not declaring new variables, and unfortunately, you can
only use structured bindings with an &lt;code&gt;auto&lt;/code&gt; declaration.&lt;/p&gt;
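&lt;p&gt;The limitation can be sketched as follows. The struct &lt;code&gt;batch_result&lt;/code&gt; and the functions are hypothetical names chosen for this example, not DLL code, and a plain struct stands in for the pair or tuple. It needs a C++17 compiler:&lt;/p&gt;

```cpp
// Hypothetical result type, invented for this sketch (not a DLL type).
struct batch_result {
    int errors;
    int samples;
};

constexpr batch_result evaluate_batch(int batch) {
    return {batch + 1, 10};
}

// Case 1: structured bindings declare two NEW variables in one statement.
constexpr int errors_plus_samples() {
    auto [errors, samples] = evaluate_batch(0);
    return errors + samples;
}

// Case 2: the targets already exist, so structured bindings cannot be used
// and the members are assigned one by one (std::tie is the usual
// alternative for pairs and tuples).
constexpr int last_errors_plus_samples() {
    int errors = 0;
    int samples = 0;
    for (int batch = 0; batch != 3; ++batch) {
        const batch_result result = evaluate_batch(batch);
        errors = result.errors;
        samples = result.samples;
    }
    return errors + samples;
}
```

&lt;p&gt;Only the first function benefits from the new syntax; the second one is the situation described above and looks the same as it would in C++14.&lt;/p&gt;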
&lt;p&gt;Overall, the &lt;em&gt;code is significantly better now&lt;/em&gt;, but there was less impact than
there was on ETL. It's also a smaller code base, so maybe this is normal and my
expectations were too high ;)&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The trunk of DLL is now a C++17 library :) I think this improves the quality of
the code by a nice margin! Even though there is still some work to be done to
improve the code, especially for the DBN pretraining code, the quality is quite
good now. Moreover, the switch to C++17 made neural networks
using the DLL library &lt;em&gt;faster to compile&lt;/em&gt;, from 1.1% in the worst case to 15.3% in
the best case! I don't know when I will release the next version of DLL, but it
will take some time. In particular, I'll have to polish the RNN support and add
a sequence-to-sequence loss before I release the 1.1 version of DLL.&lt;/p&gt;
&lt;p&gt;I'm quite satisfied with C++17 even if I would have liked a few more features to
play with! I'm already a big fan of &lt;code&gt;if constexpr&lt;/code&gt;, which can make the code
much nicer, and fold expressions are much more intuitive than their previous
recursive template counterparts.&lt;/p&gt;
&lt;p&gt;I may also consider migrating some parts of the cpp-utils library, but if I do,
it will only be through the use of conditionals in order not to break the other
projects that are based on the library.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>C++17</category><category>clang</category><category>Compilers</category><category>Deep Learning</category><category>dll</category><category>etl</category><category>gcc</category><category>Machine Learning</category><category>Performance</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2018/02/decrease-dll-neural-network-compilation-time-with-c%2B%2B17.html</guid><pubDate>Wed, 07 Feb 2018 10:39:02 GMT</pubDate></item><item><title>How I made my Deep Learning Library 38% faster to compile (Optimization and C++17 if constexpr)</title><link>https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;div&gt;&lt;p&gt;My Deep Learning Library (DLL) project is a C++ library for training and using
artificial neural networks (you can take a look at
&lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/07/update-on-deep-learning-library-dll-dropout-batch-normalization-adaptive-learning-rates.html"&gt;this post about DLL&lt;/a&gt;
if you want more information).&lt;/p&gt;
&lt;p&gt;While I made a lot of effort to make it as fast as possible to train and run
neural networks, the compilation time has been steadily going up and is becoming
quite annoying. This library is heavily templated and all the matrix operations
are done using my Expression Templates Library (ETL) which is more than
template-heavy itself.&lt;/p&gt;
&lt;p&gt;In this post, I'll present two techniques with which I've been able to reduce
the total compilation time of the DLL unit tests by up to 38%.&lt;/p&gt;
&lt;p class="more"&gt;&lt;a href="https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html"&gt;Read more…&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><category>C++</category><category>C++17</category><category>clang</category><category>Compilers</category><category>dll</category><category>etl</category><category>gcc</category><category>Performance</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2017/09/how-i-made-deep-learning-library-38-faster-to-compile-optimization-and-cpp17-if-constexpr.html</guid><pubDate>Thu, 21 Sep 2017 17:44:34 GMT</pubDate></item><item><title>Compiler benchmark GCC and Clang on C++ library (ETL)</title><link>https://baptiste-wicht.com/posts/2017/08/compiler-benchmark-gcc-clang-cpp-library-etl.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;It's been a while since I've done a benchmark of different compilers on C++
code. Since I've recently
&lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/08/expression-templates-library-etl-11.html"&gt;released the version 1.1 of my ETL project&lt;/a&gt;
(an optimized matrix/vector computation library with expression templates), I've
decided to use it as the base of my benchmark. It's a C++14 library with a lot
of templates. I'm going to compile the full test suite (124 test cases). This is
done directly on the last release (1.1) code. I'm going to compile once in debug
mode and once in release_debug (release plus debug symbols and assertions) and
record the times for each compiler. The tests were compiled with support for
every option in ETL to account for the maximal compilation time. Each compilation was
made using four threads (make -j4). I'm also going to test a few of the
benchmarks to see the difference in runtime performance between the code
generated by each compiler. The benchmark will be compiled in release mode and
its compilation time recorded as well.&lt;/p&gt;
&lt;p&gt;I'm going to test the following compilers:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;GCC-4.9.4&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCC-5.4.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCC-6.3.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCC-7.1.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;clang-3.9.1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;clang-4.0.1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;zapcc-1.0 (commercial, based on clang-5.0 trunk)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All have been installed directly using Portage (the Gentoo package manager) except
for clang-4.0.1, which has been installed from sources, and zapcc, since it does not
have a Gentoo package. Since the clang package on Gentoo does not support
multislotting, I had to install one version from source and the other from the
package manager. This is also the reason I'm testing fewer versions of clang:
it is simply less practical.&lt;/p&gt;
&lt;p&gt;For the purpose of these tests, the exact same options have been used for
all the compilers. Normally, I use different options for clang than for GCC
(mainly more aggressive vectorization options on clang). This may not lead to
the best performance for each compiler, but allows for comparison between the
results with the default optimization levels. Here are the main options used:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;In debug mode: -g&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In release_debug mode: -g -O2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In each case, a lot of warnings are enabled and the ETL options are the same.&lt;/p&gt;
&lt;p&gt;All the results have been gathered on a Gentoo machine running on an Intel Core
i7-2600 (Sandy Bridge...) @3.4GHz with 4 cores and 8 threads, 12GB of RAM and
an SSD. Although I do my best to isolate the benchmark as much as possible from
perturbations and my benchmark code is quite sound, it may well be that
some results are not totally accurate. Moreover, some of the benchmarks are
using multithreading, which may add some noise and unpredictability. When I was
not sure about the results, I ran the benchmarks several times to confirm them
and overall I'm confident in the results.&lt;/p&gt;
&lt;section id="compilation-time"&gt;
&lt;h2&gt;Compilation Time&lt;/h2&gt;
&lt;p&gt;Let's start with the results of the performance of the compilers themselves:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Debug&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Release_Debug&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Benchmark&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;402s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;616s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;100s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;403s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;642s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;95s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;399s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;683s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;102s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;371s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;650s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;380s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;807s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;106s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;260s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;718s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;92s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;221s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;649s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;108s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Note: For Release_Debug and Benchmark, I only use three threads with zapcc,
because 12GB of RAM is not enough memory for four threads.&lt;/p&gt;
&lt;p&gt;There are some very significant differences between the different compilers.
Overall, clang-4.0.1 is by far the fastest free compiler in debug mode. When
the tests are compiled with optimizations, however, clang falls behind.
It's quite impressive how clang-4.0.1 manages to be so much faster than
clang-3.9.1 both in debug mode and release mode. Really great work by the clang
team here! With these improvements, clang-4.0.1 is almost on par with gcc-7.1
in release mode. For GCC, it seems that the cost of optimization has been going
up quite significantly. However, GCC 7.1 seems to have made optimization faster
and standard compilation much faster as well. If we take zapcc into account,
it's the fastest compiler in debug mode, but it's slower than several gcc
versions in release mode.&lt;/p&gt;
&lt;p&gt;Overall, I'm quite impressed by the performance of clang-4.0.1, which seems
really fast! I'll definitely make more tests with this new version of the
compiler in the near future. It's also good to see that g++-7.1 made
the build faster than gcc-6.3. However, the fastest gcc version for optimization
is still gcc-4.9.4, which is already an old branch with limited C++ standard support.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="runtime-performance"&gt;
&lt;h2&gt;Runtime Performance&lt;/h2&gt;
&lt;p&gt;Let's now take a look at the quality of the generated code. For some of the
benchmarks, I've included two versions of the algorithm. &lt;em&gt;std&lt;/em&gt; is the
simplest algorithm (the naive one) and &lt;em&gt;vec&lt;/em&gt; is the hand-crafted vectorized and
optimized implementation. All the tests were done on single-precision floating-point
numbers.&lt;/p&gt;
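&lt;p&gt;To make the distinction concrete, here is a hedged sketch of what the two flavours can look like for a dot product. This is not ETL's implementation: the function names are hypothetical, and the second version only uses four independent accumulators as a portable stand-in for real SIMD code. Both assume a non-negative length &lt;code&gt;n&lt;/code&gt;:&lt;/p&gt;

```cpp
// Naive version, in the spirit of "std": one multiply-add per element.
constexpr float dot_naive(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i != n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Stand-in for a hand-optimized version: four independent accumulators
// processed per iteration, followed by a scalar loop for the remainder.
// A real "vec" implementation would use SIMD instructions instead.
constexpr float dot_unrolled(const float* a, const float* b, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    const int blocks = n / 4;
    for (int k = 0; k != blocks; ++k) {
        const int i = 4 * k;
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (int i = 4 * blocks; i != n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Small fixed input used to compare both versions.
constexpr float example_dot(bool unrolled) {
    const float a[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    const float b[] = {6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
    return unrolled ? dot_unrolled(a, b, 6) : dot_naive(a, b, 6);
}
```

&lt;p&gt;Note that the unrolled version adds the products in a different order, so on arbitrary floating-point data its result can differ from the naive one in the last bits.&lt;/p&gt;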
&lt;section id="dot-product"&gt;
&lt;h3&gt;Dot product&lt;/h3&gt;
&lt;p&gt;The first benchmark computes the dot product of two
vectors. Let's look first at the naive version:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;dot (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.96ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;97.12ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;126.07ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.91us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;326.49us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.92ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.22ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.36ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;72.96ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;101.62ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;127.89ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.90us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;357.63us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.57ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.32ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;73.31ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;102.88ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;130.16ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.314us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;339.13us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.47ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.95ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.70ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.69ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;70.20ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;104.09ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;130.98ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.90us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.96us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;281.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.58ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.33ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.69ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;98.69ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;128.60ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.33us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;272.71us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.56ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.37ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;60.31ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;96.34ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;128.90ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.87us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;270.21us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.18ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.35ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.14ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;96.92ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;125.95ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.84us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;285.80us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.92ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.34ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The differences are not very significant between the different compilers. The
clang-based compilers seem to be the compilers producing the fastest code.
Interestingly, there seems to have been a big regression in gcc-6.3 for large
containers, but that has been fixed in gcc-7.1.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;dot (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;2000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;3000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;4000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;5000000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.34ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;80.53ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114.97ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.79us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;354.20us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.52ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.55ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;47.16ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;77.70ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;113.66ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.71us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;363.86us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.52ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.19ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.56ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;46.39ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;77.67ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;116.28ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.74us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;452.44us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.26ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.49ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.52ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;49.70ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;80.40ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;115.77ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.71us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.46us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;355.16us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.21ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.49ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.14ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.47ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;46.13ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;78.01ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114.70ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.66us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.82us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;359.42us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.50ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;45.59ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;74.90ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;111.29ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.57us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;351.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.49ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.12ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.45ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;45.11ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;75.04ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;111.28ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.59us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.46us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;357.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.25ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.89ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.15ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.47ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If we look at the optimized version, the differences are even smaller. Again, the
clang-based compilers are producing the fastest executables, but are closely
followed by gcc, except for gcc-6.3 in which we can still see the same
regression as before.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="logistic-sigmoid"&gt;
&lt;h3&gt;Logistic Sigmoid&lt;/h3&gt;
&lt;p&gt;The next test is to check the performance of the sigmoid operation. In that
case, the evaluator of the library will try to use parallelization and
vectorization to compute it. Let's see how the different compilers fare:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sigmoid&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.16us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.23us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.33us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.56us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;259.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.78ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.07us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.08us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.44us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;266.27us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.96ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.13us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.45us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.99us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;261.81us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.86ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.03us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.09us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.24us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.61us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;252.78us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.71ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.30us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.25us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.57us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.24us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;256.75us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.99ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.14us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.77us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.03us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;235.87us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.81ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.51us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.26us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.48us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.86us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;258.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.95ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Interestingly, we can see that gcc-7.1 is the fastest for small vectors while
clang-4.0 produces the best code for larger vectors. However, except for
the biggest vector size, the difference is not really significant. Apparently,
there is a regression in zapcc (or clang-5.0), since it's slower than clang-4.0
and at the same level as clang-3.9.&lt;/p&gt;
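&lt;p&gt;For reference, the logistic sigmoid benchmarked here is sigma(x) = 1 / (1 + exp(-x)), applied element-wise. The following is a minimal scalar sketch in Python, only meant to pin down the operation. It is not the library's code, which is C++ and relies on its vectorized and parallel evaluator, and the names &lt;code&gt;sigmoid&lt;/code&gt; and &lt;code&gt;sigmoid_vector&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import math


def sigmoid(x):
    """Logistic sigmoid of one value: 1 / (1 + exp(-x))."""
    # Branch on the sign so that exp() never receives a large positive
    # argument, which would overflow for very negative x.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_vector(values):
    """Apply the sigmoid element-wise to a sequence of floats."""
    return [sigmoid(v) for v in values]
```

&lt;p&gt;Each element is independent of the others, which is what makes this operation a natural candidate for both vectorization and parallelization.&lt;/p&gt;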
&lt;/section&gt;
&lt;section id="y-alpha-x-y-axpy"&gt;
&lt;h3&gt;y = alpha * x + y (axpy)&lt;/h3&gt;
&lt;p&gt;The third benchmark is the well-known axpy (y = alpha * x + y). This is entirely
resolved by expression templates in the library; no specific algorithm is used.
Let's see the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;saxpy&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000000&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.1ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.6ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;374ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.65us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.8us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;518us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.0ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;58.1ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;383ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.87us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;43.2us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;479us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.3ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;59.4ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;371ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.57us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.4us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;452us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.8ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;59.7ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;399ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.78us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;43.1us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;547us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.3ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;53.8ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;297ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.21us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.3us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;466us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.4ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;59.8ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;296ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.2us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;475us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;32.0ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.0ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;333ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.7us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;447us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even on the biggest vector, this is a very fast operation once vectorized and
parallelized. At this speed, some of the differences observed may not be highly
significant. Again, the clang-based versions are the fastest on this code,
but by a small margin. There also seems to be a slight regression in gcc-7.1,
but again quite small.&lt;/p&gt;
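&lt;p&gt;To make the operation concrete, here is a minimal sketch of axpy as a plain loop in Python. It only illustrates the arithmetic being timed. In the library the same update is written as a C++ expression and evaluated by expression templates, and the function name &lt;code&gt;axpy&lt;/code&gt; below is a hypothetical helper, not the library's API.&lt;/p&gt;

```python
def axpy(alpha, x, y):
    """Update y in place with y[i] = alpha * x[i] + y[i] and return it."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(y)):
        # One multiply and one add per element, with no dependency
        # between iterations.
        y[i] = alpha * x[i] + y[i]
    return y
```

&lt;p&gt;With only two floating-point operations per element, the run time is mostly spent reading and writing memory, which may be one reason why the compilers end up so close to each other on the larger sizes.&lt;/p&gt;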
&lt;/section&gt;
&lt;section id="matrix-matrix-multiplication-gemm"&gt;
&lt;h3&gt;Matrix Matrix multiplication (GEMM)&lt;/h3&gt;
&lt;p&gt;The next benchmark is testing the performance of a Matrix-Matrix Multiplication,
an operation known as GEMM in the BLAS nomenclature. In that case, we test both
the naive and the optimized vectorized implementation. To save some horizontal
space, I've split the tables in two.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;20&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;40&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;80&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.04us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;50.15us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;356.42us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.18ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.41ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.56ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.14us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;74.77us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;513.64us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.92ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.03us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.78us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;504.41us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.87ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.95us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;65.00us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;508.84us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.84ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.58us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;28.59us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;222.36us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.73ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.77ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.41ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.00us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.47us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;190.56us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.61ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.80ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.00us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.38us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;189.98us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.43ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.81ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;200&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;300&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;400&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;600&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;700&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;800&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;900&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1200&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;148.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;455.81ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;687.96ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.47s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.98s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.81s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.00s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.91s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;9.52s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;63.17ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;213.01ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;504.83ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;984.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.70s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.70s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.03s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.74s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.87s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.90s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;212.12ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;502.95ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;981.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.69s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.69s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.13s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.85s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.10s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.08s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;62.57ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;210.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;499.68ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;974.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.68s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.67s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.99s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.68s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.85s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;13.49s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;27.48ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;90.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;219.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;419.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.72s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.18s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.90s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.44s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.36s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.84s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.01ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;73.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;175.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;340.70ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.58s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.93s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.40s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.98s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.79s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.69s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.33ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;75.80ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;181.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;359.13ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.63s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.02s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.52s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.24s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.21s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;5.62s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, the differences between the compilers are very significant.
The clang compilers are leading the way by a large margin here, with clang-4.0
being the fastest of them (by another nice margin). Indeed, clang-4.0.1 is
producing code that is, on average, about twice as fast as the code generated
by the best GCC compiler. Very interestingly as well, we can see a huge
regression starting from GCC-5.4 that is still present in GCC-7.1. Indeed, the
best GCC version among those tested is again GCC-4.9.4. Clang is really
doing an excellent job of compiling the GEMM code.&lt;/p&gt;
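&lt;p&gt;The "about twice" figure can be checked directly against the second sgemm (std) table. The short Python sketch below copies the g++-4.9.4 and clang++-4.0.1 rows for sizes 200 to 1200 and computes the per-size ratio. The variable and function names are only illustrative, and the values are exactly those printed in the table, converted to milliseconds.&lt;/p&gt;

```python
# Timings copied from the second sgemm (std) table, sizes 200 to 1200.
# Values are in milliseconds (entries given in seconds were multiplied by 1000).
gcc_494 = [44.16, 148.88, 455.81, 687.96, 1470.0,
           1980.0, 2810.0, 4000.0, 5910.0, 9520.0]
clang_401 = [22.01, 73.90, 175.02, 340.70, 580.0,
             930.0, 1400.0, 1980.0, 2790.0, 4690.0]


def speedups(slow, fast):
    """Per-size ratio slow / fast; 2.0 means 'fast' takes half the time."""
    return [s / f for s, f in zip(slow, fast)]


ratios = speedups(gcc_494, clang_401)
mean_ratio = sum(ratios) / len(ratios)
print("per-size ratios:", [round(r, 2) for r in ratios])
print("mean ratio:", round(mean_ratio, 2))
```

&lt;p&gt;Every ratio comes out at 2 or slightly above, with a couple of sizes closer to 2.5, so the mean lands a little over 2.&lt;/p&gt;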
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;10&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;20&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;40&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;80&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;264.27ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.95us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.28us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.77us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;23.50us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;60.37us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;271.41ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.99us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.811us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.116us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.00us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;279.72ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.02us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.27us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.39us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.29us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;61.99us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;273.74ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.96us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.81us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.55us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.35us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;71.11us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;296.67ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.34us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.18us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.93us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.15us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;82.60us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;322.68ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.38us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.17us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;20.19us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.17us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;83.64us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;307.49ns&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.41us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.10us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;19.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.72us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;84.80us&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sgemm (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;200&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;300&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;400&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;500&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;600&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;700&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;800&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;900&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1000&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;1200&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;369.52us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.17ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;11.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.82ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;51.67ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.36ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;111.15ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;387.54us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.97ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.36ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;12.11ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;52.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;65.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;112.74ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;384.43us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.12ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;12.44ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.15ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;34.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;52.59ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;70.074ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;119.22ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;458.05us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.81ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.44ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;7.86ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;13.43ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;24.70ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;53.47ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;66.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;117.25ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;494.52us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.96ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.80ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;8.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;29.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;41.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;60.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;72.28ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;123.75ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;511.24us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.11ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;9.46ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;27.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;38.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;58.14ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;72.78ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;128.60ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;492.28us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.03ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;9.00ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;14.31ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.72ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.09ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;55.79ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;67.88ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;119.92ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As for the optimized version, it seems that the two families are reversed.
Indeed, GCC is doing a better job than clang here, and although the margin is
not as big as before, it's still significant. We can still observe a small
regression in GCC versions because the 4.9 version is again the fastest. As for
clang versions, it seems that clang-5.0 (used in zapcc) has had some performance
improvements for this case.&lt;/p&gt;
&lt;p&gt;For this case of matrix-matrix multiplication, it's very impressive that the
differences in the non-optimized code are so significant. And it's also
impressive that each family of compilers has its own strength, clang being
seemingly much better at handling unoptimized code while GCC is better at
handling vectorized code.&lt;/p&gt;
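&lt;p&gt;For readers unfamiliar with the operation, a naive matrix-matrix multiplication is usually the textbook triple loop sketched below in Python. This is an assumption about the general shape of the "std" variant, not a copy of the library's C++ implementation, and the function name &lt;code&gt;gemm_naive&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def gemm_naive(a, b):
    """Return C = A * B for row-major lists of lists (A is m x k, B is k x n)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Dot product of row i of A with column j of B.
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c
```

&lt;p&gt;The innermost loop walks down a column of B, which is a strided access in row-major storage. How well a compiler reorders, unrolls, or auto-vectorizes such a loop nest is one plausible source of the large gap seen in the non-optimized tables.&lt;/p&gt;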
&lt;/section&gt;
&lt;section id="convolution-2d"&gt;
&lt;h3&gt;Convolution (2D)&lt;/h3&gt;
&lt;p&gt;The last benchmark that I considered is the case of the valid convolution on 2D
images. The code is quite similar to the GEMM code but more complicated to
optimize due to cache locality.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sconv2_valid (std)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;105x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;110x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;115x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;120x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;125x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;130x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;135x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;140x70&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;27.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.68ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.23ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;57.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;67.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;78.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;92.53ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105.08ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.45ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;76.63ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;89.75ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105.08ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;121.66ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;140.95ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.10ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44.99ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;76.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;89.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;105.35ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;121.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;141.20ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;37.55ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;45.08ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;54.39ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;64.48ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;76.51ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;92.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;106.16ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;125.67ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;143.57ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.42ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.59ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.21ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.40ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.03ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.26ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;42.35ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.87ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;56.29ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.48ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.67ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.34ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.50ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;31.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;36.58ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;42.61ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;49.33ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;56.80ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;15.29ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;18.37ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;22.00ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;26.10ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30.75ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;35.95ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;41.85ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;48.42ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;55.74ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In that case, we can observe the same as for the GEMM. The clang-based versions
are producing significantly faster code than the GCC versions (roughly 1.8 to 2.5
times faster in this table). Moreover, we can also observe the same large
regression starting from GCC-5.4, which is about 35% slower than GCC-4.9.4 here.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;sconv2_valid (vec)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;100x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;105x50&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;110x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;115x55&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;120x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;125x60&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;130x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;135x65&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;140x70&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.4&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;878.32us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.07ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.68ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.06ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.54ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.14ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;853.73us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.03ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.15ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.36ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.76ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.44ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.91ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.13ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-6.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;847.95us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.02ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.14ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.35ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.74ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.98ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.43ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.90ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.12ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-7.1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;795.82us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.77ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.69ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.81ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;782.46us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.93ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.05ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.26ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.60ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.84ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.21ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.65ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.84ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-4.0.1&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;767.58us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.92ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.04ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.25ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.59ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.83ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.20ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.83ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;782.49us&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;0.94ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.06ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.27ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.62ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.83ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.24ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.65ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.85ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, clang manages to produce excellent results. Indeed, all the
executables it produces are significantly faster than the versions produced by
GCC, except for GCC-7.1, which produces similar results. The other versions of
GCC seem to be falling behind. It seems that it was only for the GEMM that clang
had a lot of trouble handling the optimized code.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Clang seems to have recently received a lot of optimizations regarding
compilation time. Indeed, clang-4.0.1 is much faster at compiling than clang-3.9.
Although GCC-7.1 is faster than GCC-6.3, all the more recent GCC versions are
slower than GCC-4.9.4, which is the fastest at compiling code with optimizations.
GCC-7.1 is the fastest GCC version for compiling code in debug mode.&lt;/p&gt;
&lt;p&gt;In some cases, there is almost no difference between different compilers in the
generated code. However, in more complex algorithms such as the matrix-matrix
multiplication or the two-dimensional convolution, the differences can be quite
significant. In my tests, Clang has shown itself to be much better at compiling
unoptimized code. However, and especially in the GEMM case, it seems to be worse
than GCC at handling hand-optimized code. I will investigate that case and try to
tailor the code so that clang has a better time with it.&lt;/p&gt;
&lt;p&gt;For me, it's really weird that the GCC regression, apparently starting from
GCC-5.4, has still not been fixed in GCC-7.1. I was thinking of dropping support
for GCC-4.9 in order to move fully to C++14, but now I may have to reconsider
my position. However, seeing that GCC is generally the best at handling
optimized code (especially for GEMM), I may be able to do the transition, since
the optimized code will be used in most cases.&lt;/p&gt;
&lt;p&gt;As for zapcc, although it is still the fastest compiler in debug mode, with the
new speed of clang-4.0.1, its margin is quite small. Moreover, on optimized
builds, it's not as fast as GCC. If you use clang and have access to zapcc,
it's still quite a good option to save some time.&lt;/p&gt;
&lt;p&gt;Overall, I have been quite pleased by clang-4.0.1 and GCC-7.1, the most recent
versions I have been testing. It seems that they did quite some good work.
I will definitely run some more tests with them and try to adapt the code. I'm
still considering whether I will drop support for some older compilers.&lt;/p&gt;
&lt;p&gt;I hope this comparison was interesting :) My next post will probably be about
the difference in performance between my machine learning framework and other
frameworks to train neural networks.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>C++11</category><category>C++14</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>Performance</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2017/08/compiler-benchmark-gcc-clang-cpp-library-etl.html</guid><pubDate>Mon, 07 Aug 2017 07:16:21 GMT</pubDate></item><item><title>Partial type erasing in Deep Learning Library (DLL) to improve compilation time</title><link>https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;In a previous post, I compared the &lt;a class="reference external" href="https://baptiste-wicht.com/posts/2017/03/disappointing-zapcc-performance-on-deep-learning-library-dll.html"&gt;compilation time on my Deep Learning Library (DLL) project with different compilers&lt;/a&gt;. I realized that the compilation times were quickly going unreasonable for this library, especially for compiling the unit cases which clearly hurts the development of the library. Indeed, you want to be able to run the unit tests reasonably quickly after you integrated new changes.&lt;/p&gt;
&lt;section id="reduce-the-compilation-time"&gt;
&lt;h2&gt;Reduce the compilation time&lt;/h2&gt;
&lt;p&gt;The first thing I did was to split the compilation into three executables: one for
the unit tests, one for the various performance tests and one for the various other
miscellaneous tests. With this, it is much faster to compile only the unit test
cases.&lt;/p&gt;
&lt;p&gt;But this can be improved significantly more. In DLL, a network is a variadic
template containing the list of layers, in order. There are two main
ways of declaring a neural network. In the first version, the fast
version, the layers directly know their sizes:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-1" name="rest_code_7d60f8842b134ce4921751a494b3e333-1" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-2" name="rest_code_7d60f8842b134ce4921751a494b3e333-2" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-3" name="rest_code_7d60f8842b134ce4921751a494b3e333-3" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_layers&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-4" name="rest_code_7d60f8842b134ce4921751a494b3e333-4" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-5" name="rest_code_7d60f8842b134ce4921751a494b3e333-5" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-6" name="rest_code_7d60f8842b134ce4921751a494b3e333-6" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unit_type&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOFTMAX&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-7" name="rest_code_7d60f8842b134ce4921751a494b3e333-7" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sgd_trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-8" name="rest_code_7d60f8842b134ce4921751a494b3e333-8" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-9" name="rest_code_7d60f8842b134ce4921751a494b3e333-9" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-9"&gt;&lt;/a&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_unique&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-10" name="rest_code_7d60f8842b134ce4921751a494b3e333-10" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-10"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pretrain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_7d60f8842b134ce4921751a494b3e333-11" name="rest_code_7d60f8842b134ce4921751a494b3e333-11" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_7d60f8842b134ce4921751a494b3e333-11"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fine_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In my opinion, this is the best way to use DLL. This is the fastest and the
clearest. Moreover, the dimensions of the network can be validated at compile
time, which is always better than at runtime. However, the dimensions of the
network cannot be changed at runtime. For this, there is a different version,
the dynamic version:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-1" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-1" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-2" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-2" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-3" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-3" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_layers&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-4" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-4" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-5" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-5" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-6" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-6" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unit_type&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOFTMAX&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-7" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-7" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sgd_trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-8" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-8" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-9" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-9" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-9"&gt;&lt;/a&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_unique&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-10" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-10" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-10"&gt;&lt;/a&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-11" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-11" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-11"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;init_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-12" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-12" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-12"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;init_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-13" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-13" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-13"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;init_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-14" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-14" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-14"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-15" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-15" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-15"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-16" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-16" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-16"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layer_get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-17" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-17" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-17"&gt;&lt;/a&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-18" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-18" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-18"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pretrain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_a1f611be63fc41de86b7fa61258d3df1-19" name="rest_code_a1f611be63fc41de86b7fa61258d3df1-19" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_a1f611be63fc41de86b7fa61258d3df1-19"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fine_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is a bit more verbose, but the configuration can be changed at runtime with
this system. Moreover, this is also faster to compile. On the other hand, there
is some performance slowdown.&lt;/p&gt;
&lt;p&gt;There is also a third version that is a hybrid of the first two versions:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-1" name="rest_code_763ef33c181844ab925fab666a286e0b-1" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-2" name="rest_code_763ef33c181844ab925fab666a286e0b-2" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dyn_dbn_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-3" name="rest_code_763ef33c181844ab925fab666a286e0b-3" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_layers&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-4" name="rest_code_763ef33c181844ab925fab666a286e0b-4" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-5" name="rest_code_763ef33c181844ab925fab666a286e0b-5" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-6" name="rest_code_763ef33c181844ab925fab666a286e0b-6" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rbm_desc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unit_type&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOFTMAX&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;layer_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-7" name="rest_code_763ef33c181844ab925fab666a286e0b-7" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sgd_trainer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dll&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;dbn_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-8" name="rest_code_763ef33c181844ab925fab666a286e0b-8" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-9" name="rest_code_763ef33c181844ab925fab666a286e0b-9" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-9"&gt;&lt;/a&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_unique&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;network_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-10" name="rest_code_763ef33c181844ab925fab666a286e0b-10" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-10"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pretrain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_763ef33c181844ab925fab666a286e0b-11" name="rest_code_763ef33c181844ab925fab666a286e0b-11" href="https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html#rest_code_763ef33c181844ab925fab666a286e0b-11"&gt;&lt;/a&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fine_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only one line was changed compared to the first version: &lt;code&gt;dbn_desc&lt;/code&gt;
becomes &lt;code&gt;dyn_dbn_desc&lt;/code&gt;. With this change, all the layers are
automatically transformed into their dynamic versions and all the parameters are
propagated at runtime. This is a form of type erasing since the sizes will not be
propagated at compilation time. But it is simple, since the types are
transformed directly from one type to another. Behind the scenes, it's the
dynamic version using the front-end of the fast version. This is almost as fast
to compile as the dynamic version, but the code is much nicer. At runtime, it
behaves the same as the dynamic version.&lt;/p&gt;
&lt;p&gt;If we compare the compilation time of the three versions when compiling a single
network and 5 different networks with different architectures, we get the
following results (with clang):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Model&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Time [s]&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;1 Fast&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;30&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;1 Dynamic&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.6&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;1 Hybrid&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.6&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;5 Fast&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;5 Dynamic&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;16.6&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;5 Hybrid&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;21.9&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even with one single network, the compilation time is reduced by 44%. When five
different networks are compiled, the time is reduced by 85%. This can be
explained easily. Indeed, for the hybrid and dynamic versions, the layers will
have the same type and therefore a lot of template instantiations will only be
done once instead of five times. This makes a big difference since almost
everything is a template inside the library.&lt;/p&gt;
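&lt;p&gt;The percentages quoted above can be rechecked with a few lines of Python. This is only a sketch: the helper name &lt;code&gt;reduction_percent&lt;/code&gt; is made up for illustration, and the timings are simply copied from the table.&lt;/p&gt;

```python
def reduction_percent(before_s: float, after_s: float) -> float:
    """Return how much shorter after_s is than before_s, in percent."""
    return (before_s - after_s) / before_s * 100.0


# Compilation times in seconds, copied from the table above (clang).
times = {
    "1 Fast": 30.0, "1 Dynamic": 16.6, "1 Hybrid": 16.6,
    "5 Fast": 114.0, "5 Dynamic": 16.6, "5 Hybrid": 21.9,
}

# Compare each dynamic and hybrid build against the fast build
# with the same number of networks.
for count in ("1", "5"):
    fast = times[f"{count} Fast"]
    for kind in ("Dynamic", "Hybrid"):
        saved = reduction_percent(fast, times[f"{count} {kind}"])
        print(f"{count} {kind}: {saved:.1f}% less compilation time than {count} Fast")
```

&lt;p&gt;With these numbers, the 85% figure corresponds to the dynamic version; for five networks, the hybrid version saves about 81%.&lt;/p&gt;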
&lt;p&gt;Unfortunately, this also has an impact on the runtime of the network:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Model&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Pretrain [s]&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Train [s]&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Fast&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;195&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;114&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Dynamic&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;203&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;123&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Hybrid&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;204&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;122&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On average, for dense models, the slowdown is between 4% and 8%. For
convolutional models, it is between 10% and 25%. I will definitely work on
making the dynamic and especially the hybrid version faster in the
future. Most of the work should be in the matrix library (ETL) that is used.&lt;/p&gt;
&lt;p&gt;For test cases, a 20% increase in runtime is not really a problem, since the
tests are already fast. I therefore decided to add an option to DLL so that
everything can be compiled by default in hybrid mode. With a compilation flag, every
&lt;code&gt;dbn_desc&lt;/code&gt; becomes a &lt;code&gt;dyn_dbn_desc&lt;/code&gt; and therefore each
network in use becomes a hybrid network. Without a single change in the code, the
compilation time of the entire library can be significantly improved, as seen in
the next section. This can also be used in user code to improve compilation
time during debugging and experiments and can be turned off for the final
training.&lt;/p&gt;
&lt;p&gt;On my Continuous Integration system, I will build the system in both
configurations. This is not really an issue, since my personal machine at home
is more powerful than what I have available here.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;In a first experiment, I measured the difference before and after this change on
the three executables of the library, with gcc:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Model&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Unit [s]&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Perf [s]&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;Misc [s]&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Before&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1029&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;192&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;937&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;After&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;617&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;143&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;619&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;40.03%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;25.52%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;33.93%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
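&lt;p&gt;The Speedup row, and a single figure for the three executables together, can be recomputed from the Before and After rows. The snippet below is a small sketch; the function name &lt;code&gt;speedup_percent&lt;/code&gt; is an illustrative choice, and the overall value is assumed to be the saving on the summed times.&lt;/p&gt;

```python
# Compilation times in seconds from the table above (gcc), per executable.
before = {"Unit": 1029, "Perf": 192, "Misc": 937}
after = {"Unit": 617, "Perf": 143, "Misc": 619}


def speedup_percent(before_s: float, after_s: float) -> float:
    """Return the share of the original time that was saved, in percent."""
    return (before_s - after_s) / before_s * 100.0


# Per-executable saving.
for name in before:
    print(f"{name}: {speedup_percent(before[name], after[name]):.2f}%")

# Overall saving, computed on the total time of the three executables.
overall = speedup_percent(sum(before.values()), sum(after.values()))
print(f"Overall: {overall:.2f}%")
```

&lt;p&gt;The per-executable values agree with the table up to rounding of the last digit, and the total goes from 2158 to 1379 seconds.&lt;/p&gt;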
&lt;p&gt;It is clear that the speedups are very significant! The compilation is between
25% and 40% faster with the new option. Overall, this is a speedup of 36%!
I also noticed that the compilation takes significantly less memory than before.
Therefore, I decided to rerun the compiler benchmark on the library. In the
previous experiment, zapcc was taking so much memory that it was impossible to
use more than one thread. Let's see how it is faring now. The time to compile
the full unit tests is computed for each compiler. Let's start in debug mode:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Debug&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;527&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;268&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;182&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;150&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;591&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;303&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;211&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;176&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;588&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;302&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;209&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;175&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;375&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;187&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;126&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;121&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, zapcc is able to scale to four threads without problems. Moreover, it
is always the fastest compiler, by a significant margin, in this configuration.
It is followed by clang and then by gcc for which both versions are about the
same speed.&lt;/p&gt;
&lt;p&gt;If we compile again in release mode:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Release&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1201&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;615&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;421&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;356&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1041&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;541&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;385&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;321&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1114&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;579&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;412&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;348&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;897&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;457&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;306&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;em&gt;306&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The difference in compilation time is very large: it takes about twice as long
to compile with all optimizations enabled. It also takes significantly more memory. Indeed,
zapcc was not able to compile with four threads. Nevertheless, even its results
with three threads are better than those of the other compilers using four threads. zapcc
is clearly the winner again on this test, followed by gcc-4.9.3, which is faster
than gcc-5.3.0, which is itself faster than clang. It seems that while clang is
better than gcc at the frontend, it is slower for optimizations. Note that this may
also be an indication that clang performs more optimizations than gcc, rather
than that it is slower at them.&lt;/p&gt;
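&lt;p&gt;To put numbers on the comparison between the two tables, here is a short sketch that computes the release-to-debug ratio and the best release time per compiler. The helper names are hypothetical, and the data is copied from the tables above.&lt;/p&gt;

```python
# Time in seconds to compile the full unit tests, copied from the two tables
# above. Each list holds the -j1, -j2, -j3 and -j4 results in that order.
debug = {
    "clang-3.9": [527, 268, 182, 150],
    "gcc-4.9.3": [591, 303, 211, 176],
    "gcc-5.3.0": [588, 302, 209, 175],
    "zapcc-1.0": [375, 187, 126, 121],
}
release = {
    "clang-3.9": [1201, 615, 421, 356],
    "gcc-4.9.3": [1041, 541, 385, 321],
    "gcc-5.3.0": [1114, 579, 412, 348],
    # zapcc could not build with -j4; the table repeats the -j3 time there.
    "zapcc-1.0": [897, 457, 306, 306],
}


def release_to_debug_ratio(compiler: str, jobs: int = 1) -> float:
    """Return release time divided by debug time for the given -j value."""
    return release[compiler][jobs - 1] / debug[compiler][jobs - 1]


def best_time(table: dict, compiler: str) -> int:
    """Return the fastest time a compiler reached across all -j values."""
    return min(table[compiler])


for compiler in debug:
    ratio = release_to_debug_ratio(compiler)
    print(f"{compiler}: release takes {ratio:.2f}x the debug time at -j1, "
          f"best release time {best_time(release, compiler)} s")
```

&lt;p&gt;At -j1, the ratios range from about 1.76 (gcc-4.9.3) to about 2.39 (zapcc), and the best zapcc release time (306 s with three threads) is lower than the best time of every other compiler.&lt;/p&gt;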
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;By using some form of type erasing to simplify the template types at compile
time, I was able to reduce the overall compilation time of my Deep Learning
Library (DLL) by 36%. Moreover, this can be done by switching a simple
compilation flag. This also very significantly reduces the memory used during the
compilation, allowing zapcc to compile with up to three threads, compared
with only one before. This makes zapcc the fastest compiler again on this
benchmark. Overall, this will make debugging much easier on this library and
will save me a lot of time.&lt;/p&gt;
&lt;p&gt;In the future, I plan to try to improve compilation time even more. I have a few
ideas, especially in ETL that should significantly improve the compilation time
but that will require a lot of time to implement, so that will likely have to
wait a while. In the coming days, I plan to work on the performance of DLL,
especially for stochastic gradient descent.&lt;/p&gt;
&lt;p&gt;If you want more information on DLL, you can check out the
&lt;a class="reference external" href="https://github.com/wichtounet/dll"&gt;dll GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>C++11</category><category>clang</category><category>Compilers</category><category>dll</category><category>etl</category><category>gcc</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2017/03/partial-type-erasing-deep-learning-library-dll-improve-compilation-time.html</guid><pubDate>Wed, 15 Mar 2017 06:43:44 GMT</pubDate></item><item><title>Use clang-tidy for static analysis and integration in Sonarqube</title><link>https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;clang-tidy is an extensive linter for C++. It provides a complete framework for
analysis of C++ code. Some of the checks are very simple but some of them are
very complete and most of the checks from the clang-static-analyzer are
integrated into clang-tidy.&lt;/p&gt;
&lt;section id="usage"&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;p&gt;If you want to see the list of checks available in clang-tidy, you can use the
-list-checks option:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_79c9b5980f31494da3f5473c9794173d-1" name="rest_code_79c9b5980f31494da3f5473c9794173d-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_79c9b5980f31494da3f5473c9794173d-1"&gt;&lt;/a&gt;clang-tidy -list-checks
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can then choose the checks you are interested in and perform an analysis of
your code. For this, it is highly recommended to use a Clang compilation database;
you can have a look at Bear to generate this compilation database if you don't
have one yet. The usage of clang-tidy is pretty simple: you set the list of
checks you want, the headers for which you want to have warnings reported and the
list of source files to analyse:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_fac07a40bed04ecdaf3186f7a2a4f8b7-1" name="rest_code_fac07a40bed04ecdaf3186f7a2a4f8b7-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_fac07a40bed04ecdaf3186f7a2a4f8b7-1"&gt;&lt;/a&gt;clang-tidy -checks='*' -header-filter="^include" -p . src/*.cpp
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You'll very likely see a lot of warnings. And you will very likely see a lot of
false positives and a lot of warnings you don't agree with. For instance, there
are a lot of warnings from the C++ Core Guidelines and the Google Guidelines
that I don't follow in my coding. You should not take the complete list of checks
as a rule; you should devise your own list of what you really want to fix in your
code. If you want to disable one check X, you can use the - operator:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_4d9a227041f14cdb899028a3263db103-1" name="rest_code_4d9a227041f14cdb899028a3263db103-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_4d9a227041f14cdb899028a3263db103-1"&gt;&lt;/a&gt;clang-tidy -checks='*,-X' -header-filter="^include" -p . src/*.cpp
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can also enable the checks one by one, or enable groups of them with *:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_5f0de224e0c34215ad7a1414e6ccc2d4-1" name="rest_code_5f0de224e0c34215ad7a1414e6ccc2d4-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_5f0de224e0c34215ad7a1414e6ccc2d4-1"&gt;&lt;/a&gt;clang-tidy -checks='google-*' -header-filter="^include" -p . src/*.cpp
&lt;/pre&gt;&lt;/div&gt;
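&lt;p&gt;To make the semantics of the -checks value more concrete, here is a small Python approximation of how the comma-separated globs select a check: they are applied from left to right, and a glob prefixed with - removes the matching checks. This is a sketch, not clang-tidy's implementation. The function &lt;code&gt;is_check_enabled&lt;/code&gt; is hypothetical, it ignores the checks enabled by default or by a .clang-tidy file, and it uses &lt;code&gt;fnmatchcase&lt;/code&gt; as a stand-in matcher (which also understands ? and [ ], unlike the real option, where only * is a wildcard).&lt;/p&gt;

```python
from fnmatch import fnmatchcase


def is_check_enabled(checks: str, check_name: str) -> bool:
    """Approximate how a -checks value selects a single check name.

    Globs are applied left to right: a plain glob enables matching checks,
    a glob prefixed with '-' disables them, so later entries win.
    """
    enabled = False
    for raw_glob in checks.split(","):
        glob = raw_glob.strip()
        if not glob:
            continue
        if glob.startswith("-"):
            # Negative glob: remove the check if it matches.
            if fnmatchcase(check_name, glob[1:]):
                enabled = False
        elif fnmatchcase(check_name, glob):
            # Positive glob: add the check if it matches.
            enabled = True
    return enabled


print(is_check_enabled("*,-google-runtime-references", "modernize-use-using"))
print(is_check_enabled("*,-google-runtime-references", "google-runtime-references"))
print(is_check_enabled("google-*", "llvm-include-order"))
```

&lt;p&gt;The first call reports the check as enabled, while the other two report it as disabled: one is removed explicitly and the other is never matched by a positive glob.&lt;/p&gt;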
&lt;p&gt;One problem with the clang-tidy tool is that it is very slow, especially if
you enable the clang-static-analyzer checks. Moreover, if you use it as
shown above, it will only use one thread for the complete set of files. This may
not be an issue on small projects, but this will definitely be a big issue for
large projects and template-heavy code (like my ETL project). You could create
an implicit target in your Makefile to use it on each file independently and
then use the -j option of make to run them in parallel, but it is not really
practical.&lt;/p&gt;
&lt;p&gt;For this, I just discovered that clang provides a Python script,
run-clang-tidy.py, that does it all for us! On Gentoo, it is installed at
/usr/share/clang/run-clang-tidy.py.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_f234c722a3fd4ee2a6f459f353e709b9-1" name="rest_code_f234c722a3fd4ee2a6f459f353e709b9-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_f234c722a3fd4ee2a6f459f353e709b9-1"&gt;&lt;/a&gt;run-clang-tidy.py -checks='*' -header-filter="^include" -p . -j9
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will automatically run clang-tidy on each file from the compilation
database and use 9 threads to perform the checks. This is definitely much
faster. For me, this is the best way to run clang-tidy.&lt;/p&gt;
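&lt;p&gt;For illustration, the idea behind such a script can be sketched in a few lines of Python: read the file list from compile_commands.json, build one clang-tidy command per file and run the commands in a pool of workers. This is not the code of run-clang-tidy.py, only a minimal sketch; the function names &lt;code&gt;load_entries&lt;/code&gt;, &lt;code&gt;build_commands&lt;/code&gt; and &lt;code&gt;run_all&lt;/code&gt; are made up, and it assumes that clang-tidy is on the PATH.&lt;/p&gt;

```python
import json
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def load_entries(build_path="."):
    """Read the entries of compile_commands.json found in build_path."""
    with open(os.path.join(build_path, "compile_commands.json")) as handle:
        return json.load(handle)


def build_commands(entries, checks, header_filter, build_path="."):
    """Return one clang-tidy argument list per unique file in the database."""
    files = sorted({os.path.join(e["directory"], e["file"]) for e in entries})
    return [
        ["clang-tidy", f"-checks={checks}", f"-header-filter={header_filter}",
         "-p", build_path, path]
        for path in files
    ]


def run_all(commands, jobs=9):
    """Run the commands with `jobs` worker threads and return the exit codes."""
    def run_one(command):
        # check=False: a file with warnings should not abort the other files.
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, commands))
```

&lt;p&gt;Under these assumptions, &lt;code&gt;run_all(build_commands(load_entries("."), "*", "^include"), jobs=9)&lt;/code&gt; would roughly correspond to the command shown above. Threads are sufficient here because the actual work is done in the clang-tidy child processes.&lt;/p&gt;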
&lt;p&gt;One small point I don't like is that the script always prints the list of enabled
checks. To avoid this, I replaced this line in the script:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;&lt;a id="rest_code_134e53a7677a4cb181b8818e171f81b8-1" name="rest_code_134e53a7677a4cb181b8818e171f81b8-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_134e53a7677a4cb181b8818e171f81b8-1"&gt;&lt;/a&gt;&lt;span class="n"&gt;invocation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clang_tidy_binary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-list-checks'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;with:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;&lt;a id="rest_code_9735adffacb94560847b73b19a97a766-1" name="rest_code_9735adffacb94560847b73b19a97a766-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_9735adffacb94560847b73b19a97a766-1"&gt;&lt;/a&gt;&lt;span class="n"&gt;invocation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clang_tidy_binary&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This makes it quieter.&lt;/p&gt;
&lt;p&gt;One thing I didn't mention is that clang-tidy is able to fix some of the errors
directly if you use the -fix option. Personally, I don't like this, but for
a large code base and a carefully selected set of checks, this could be really
useful. Note that not all the checks are automatically fixable by clang-tidy.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;I have run clang-tidy on my cpp-utils library and here are some interesting results.
I have not run all the checks; here is the command I used:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_000e7d77ea2f4126a99aa298414e240e-1" name="rest_code_000e7d77ea2f4126a99aa298414e240e-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_000e7d77ea2f4126a99aa298414e240e-1"&gt;&lt;/a&gt;/usr/share/clang/run-clang-tidy.py -p . -header-filter '^include/cpp_utils' -checks='cert-*,cppcoreguidelines-*,google-*,llvm-*,misc-*,modernize-*,performance-*,readability-*,-cppcoreguidelines-pro-type-reinterpret-cast,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-google-readability-namespace-comments,-llvm-namespace-comment,-llvm-include-order,-google-runtime-references' -j9 2&amp;gt;/dev/null  | /usr/bin/zgrep -v "^clang-tidy"
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let's go over some warnings I got:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_254bb39497e7434e9ac0514efba6cc97-1" name="rest_code_254bb39497e7434e9ac0514efba6cc97-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_254bb39497e7434e9ac0514efba6cc97-1"&gt;&lt;/a&gt;include/cpp_utils/assert.hpp:91:103: warning: consider replacing 'long' with 'int64' [google-runtime-int]
&lt;a id="rest_code_254bb39497e7434e9ac0514efba6cc97-2" name="rest_code_254bb39497e7434e9ac0514efba6cc97-2" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_254bb39497e7434e9ac0514efba6cc97-2"&gt;&lt;/a&gt;void assertion_failed_msg(const CharT* expr, const char* msg, const char* function, const char* file, long line) {
&lt;a id="rest_code_254bb39497e7434e9ac0514efba6cc97-3" name="rest_code_254bb39497e7434e9ac0514efba6cc97-3" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_254bb39497e7434e9ac0514efba6cc97-3"&gt;&lt;/a&gt;                                                                                                      ^
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I got this one several times. It is indeed more portable to use &lt;code&gt;int64&lt;/code&gt; rather than &lt;code&gt;long&lt;/code&gt;.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_578f328265dc48259924068776f50c66-1" name="rest_code_578f328265dc48259924068776f50c66-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_578f328265dc48259924068776f50c66-1"&gt;&lt;/a&gt;include/cpp_utils/aligned_allocator.hpp:53:9: warning: use 'using' instead of 'typedef' [modernize-use-using]
&lt;a id="rest_code_578f328265dc48259924068776f50c66-2" name="rest_code_578f328265dc48259924068776f50c66-2" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_578f328265dc48259924068776f50c66-2"&gt;&lt;/a&gt;        typedef aligned_allocator&amp;lt;U, A&amp;gt; other;
&lt;a id="rest_code_578f328265dc48259924068776f50c66-3" name="rest_code_578f328265dc48259924068776f50c66-3" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_578f328265dc48259924068776f50c66-3"&gt;&lt;/a&gt;        ^
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This one is part of the modernize checks, indicating that one should use
&lt;code&gt;using&lt;/code&gt; rather than a &lt;code&gt;typedef&lt;/code&gt; and I completely agree.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_450acfd6718e477ba1875a755e0e1d42-1" name="rest_code_450acfd6718e477ba1875a755e0e1d42-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_450acfd6718e477ba1875a755e0e1d42-1"&gt;&lt;/a&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cpp_utils&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;aligned_allocator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hpp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;define&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;trivial&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;constructor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;modernize&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;a id="rest_code_450acfd6718e477ba1875a755e0e1d42-2" name="rest_code_450acfd6718e477ba1875a755e0e1d42-2" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_450acfd6718e477ba1875a755e0e1d42-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;aligned_allocator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;a id="rest_code_450acfd6718e477ba1875a755e0e1d42-3" name="rest_code_450acfd6718e477ba1875a755e0e1d42-3" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_450acfd6718e477ba1875a755e0e1d42-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;
&lt;a id="rest_code_450acfd6718e477ba1875a755e0e1d42-4" name="rest_code_450acfd6718e477ba1875a755e0e1d42-4" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_450acfd6718e477ba1875a755e0e1d42-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Another one from the modernize checks that I really like. This is completely
true.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;include/cpp_utils/maybe_parallel.hpp:33:5: warning: constructors that are callable with a single argument must be marked explicit to avoid unintentional implicit conversions [google-explicit-constructor]
    thread_pool(Args... /*args*/){
    ^
    explicit
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I don't agree that every constructor with one argument should be explicit;
sometimes you want implicit conversion. Nevertheless, this particular case is
very interesting: since the constructor is variadic, it can be called with
a single argument of any type, and thus the class can be implicitly converted
from anything, which is pretty bad I think.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_702844926a8b44fd907bed80ef8efd43-1" name="rest_code_702844926a8b44fd907bed80ef8efd43-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_702844926a8b44fd907bed80ef8efd43-1"&gt;&lt;/a&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;array_wrapper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;casts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;discouraged&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;readability&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;casting&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;a id="rest_code_702844926a8b44fd907bed80ef8efd43-2" name="rest_code_702844926a8b44fd907bed80ef8efd43-2" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_702844926a8b44fd907bed80ef8efd43-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_702844926a8b44fd907bed80ef8efd43-3" name="rest_code_702844926a8b44fd907bed80ef8efd43-3" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_702844926a8b44fd907bed80ef8efd43-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;
&lt;a id="rest_code_702844926a8b44fd907bed80ef8efd43-4" name="rest_code_702844926a8b44fd907bed80ef8efd43-4" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_702844926a8b44fd907bed80ef8efd43-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;On this one, I completely agree, C-style casts should be avoided and much
clearer C++ style casts should be preferred.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code cpp"&gt;&lt;a id="rest_code_6578c08477e94800b6e3b2f05e63c29e-1" name="rest_code_6578c08477e94800b6e3b2f05e63c29e-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_6578c08477e94800b6e3b2f05e63c29e-1"&gt;&lt;/a&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;wichtounet&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cpp_utils_test&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cpp_utils&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;aligned_allocator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hpp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;126&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;thrown&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;not&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nothrow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;constructible&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cert&lt;/span&gt;&lt;span 
class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;err60&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cpp&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;a id="rest_code_6578c08477e94800b6e3b2f05e63c29e-2" name="rest_code_6578c08477e94800b6e3b2f05e63c29e-2" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_6578c08477e94800b6e3b2f05e63c29e-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;length_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aligned_allocator&amp;lt;T&amp;gt;::allocate() - Integer overflow."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_6578c08477e94800b6e3b2f05e63c29e-3" name="rest_code_6578c08477e94800b6e3b2f05e63c29e-3" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_6578c08477e94800b6e3b2f05e63c29e-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is one of the checks I don't agree with. Even though it makes sense to
prefer exceptions that are nothrow copy constructible, they should be caught by
const reference anyway. Moreover, the exception here comes from the standard
library.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_754f710dd1fa4fdc80849ce0fd4c1558-1" name="rest_code_754f710dd1fa4fdc80849ce0fd4c1558-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_754f710dd1fa4fdc80849ce0fd4c1558-1"&gt;&lt;/a&gt;/home/wichtounet/dev/cpp_utils_test/include/cpp_utils/aligned_allocator.hpp:141:40: warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]
&lt;a id="rest_code_754f710dd1fa4fdc80849ce0fd4c1558-2" name="rest_code_754f710dd1fa4fdc80849ce0fd4c1558-2" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_754f710dd1fa4fdc80849ce0fd4c1558-2"&gt;&lt;/a&gt;        free((reinterpret_cast&amp;lt;void**&amp;gt;(const_cast&amp;lt;std::remove_const_t&amp;lt;T&amp;gt;*&amp;gt;(ptr)))[-1]);
&lt;a id="rest_code_754f710dd1fa4fdc80849ce0fd4c1558-3" name="rest_code_754f710dd1fa4fdc80849ce0fd4c1558-3" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_754f710dd1fa4fdc80849ce0fd4c1558-3"&gt;&lt;/a&gt;                                       ^
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In general, I agree that using const_cast should be avoided as much as possible.
But there are some cases where it makes sense. In this particular case, I don't
modify the object itself but some memory before the object that is unrelated and
that I initialize myself.&lt;/p&gt;
&lt;p&gt;I also had a few false positives, but overall nothing too bad. I'm quite
satisfied with the quality of the results. I'll fix these warnings in the coming
week.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="integration-in-sonarqube"&gt;
&lt;h2&gt;Integration in Sonarqube&lt;/h2&gt;
&lt;p&gt;The sonar-cxx plugin has just integrated support for clang-tidy in its main
branch. For now, you need to build the plugin yourself, from the 0.9.8-SNAPSHOT
version. You can then use something like this in your sonar-project.properties file:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code text"&gt;&lt;a id="rest_code_7a0cba05299c4de8ad611df67def1a8f-1" name="rest_code_7a0cba05299c4de8ad611df67def1a8f-1" href="https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html#rest_code_7a0cba05299c4de8ad611df67def1a8f-1"&gt;&lt;/a&gt;sonar.cxx.clangtidy.reportPath=clang-tidy-report
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;and sonar-cxx will parse the results and integrate the issues in your sonar
report.&lt;/p&gt;
&lt;p&gt;Here is an example:&lt;/p&gt;
&lt;img alt="/images/sonar-cxx-clang-tidy.png" src="https://baptiste-wicht.com/images/sonar-cxx-clang-tidy.png"&gt;
&lt;p&gt;You can see two of the warnings from clang-tidy :)&lt;/p&gt;
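&lt;p&gt;To give an idea of what parsing such a report involves, here is a minimal sketch that extracts the location, message and check name from diagnostic lines shaped like the warnings shown in the results. This is only an illustration of the line format; it is not the parser used by sonar-cxx, and the names &lt;code&gt;WARNING_LINE&lt;/code&gt; and &lt;code&gt;parse_diagnostic&lt;/code&gt; are made up for the example.&lt;/p&gt;

```python
import re

# file:line:column: severity: message [check-name]
WARNING_LINE = re.compile(r"^(.+?):(\d+):(\d+): (warning|error): (.*) \[([\w.,-]+)\]$")


def parse_diagnostic(line):
    """Return a dict describing one clang-tidy diagnostic line, or None."""
    match = WARNING_LINE.match(line)
    if match is None:
        # Source excerpts, caret lines and fix-it hints do not match.
        return None
    path, row, column, severity, message, check = match.groups()
    return {
        "file": path,
        "line": int(row),
        "column": int(column),
        "severity": severity,
        "message": message,
        "check": check,
    }


# One of the warnings from the results, followed by its excerpt and caret line.
report = [
    "include/cpp_utils/assert.hpp:91:103: warning: consider replacing 'long' with 'int64' [google-runtime-int]",
    "void assertion_failed_msg(const CharT* expr, const char* msg, const char* function, const char* file, long line) {",
    "    ^",
]

issues = [issue for issue in map(parse_diagnostic, report) if issue is not None]
for issue in issues:
    print(issue["file"], issue["line"], issue["check"])
```

&lt;p&gt;Only the first line of each warning carries the information needed for an issue; the excerpt and caret lines are simply skipped.&lt;/p&gt;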
&lt;p&gt;For now, I haven't integrated this in my Continuous Integration system because
I'm still having issues with clang-tidy and the compilation database. Because
the compilation database contains absolute paths to the files and to the current
directory, it cannot be shared directly between servers. I have to find a way to
fix that so that clang-tidy can use the database on the other computer. I'll probably wait
until the sonar-cxx 0.9.8 version is released before integrating all this in
Sonarqube, but this is great news for this plugin :)&lt;/p&gt;
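&lt;p&gt;One possible way around the absolute paths, sketched here under the assumption that the project has the same layout on both machines, is to rewrite the root directory in the compilation database before using it elsewhere. The functions &lt;code&gt;relocate_entry&lt;/code&gt; and &lt;code&gt;relocate_database&lt;/code&gt;, the example entry and the target directory &lt;code&gt;/var/ci/cpp_utils&lt;/code&gt; are all hypothetical; this is not something I have in place.&lt;/p&gt;

```python
import json


def relocate_entry(entry, old_root, new_root):
    """Return a copy of one compilation database entry with old_root replaced by new_root."""
    moved = dict(entry)
    # These keys hold plain strings in the JSON compilation database format.
    for key in ("directory", "file", "command", "output"):
        if key in moved:
            moved[key] = moved[key].replace(old_root, new_root)
    # "arguments" is the list form of "command".
    if "arguments" in moved:
        moved["arguments"] = [arg.replace(old_root, new_root) for arg in moved["arguments"]]
    return moved


def relocate_database(text, old_root, new_root):
    """Rewrite every entry of a compile_commands.json document given as text."""
    entries = json.loads(text)
    return json.dumps([relocate_entry(entry, old_root, new_root) for entry in entries], indent=2)


# Made-up entry; only the root directory is taken from the warnings above.
original = json.dumps([
    {
        "directory": "/home/wichtounet/dev/cpp_utils_test",
        "command": "clang++ -Iinclude -c test/array_wrapper.cpp -o array_wrapper.o",
        "file": "/home/wichtounet/dev/cpp_utils_test/test/array_wrapper.cpp",
    }
])

relocated = json.loads(
    relocate_database(original, "/home/wichtounet/dev/cpp_utils_test", "/var/ci/cpp_utils")
)
print(relocated[0]["directory"])
print(relocated[0]["file"])
```

&lt;p&gt;A plain string replacement like this is fragile: it would also rewrite the old root if it appeared inside an unrelated argument, and it does nothing for include paths that point outside the project.&lt;/p&gt;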
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;clang-tidy is a C++ linter that can analyze your code and check for hundreds of
problems in it. With it, I have found some very interesting problems in the code
of my cpp_utils library. Moreover, you can now integrate it into Sonarqube by using
the sonar-cxx plugin. Since it is a bit slow, I'll probably not integrate it in
my bigger projects, but I'll integrate it at least in the cpp_utils library when
sonar-cxx 0.9.8 is released.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>projects</category><category>Sonar</category><guid>https://baptiste-wicht.com/posts/2017/03/clang-tidy-static-analysis-integration-in-sonarqube.html</guid><pubDate>Sat, 11 Mar 2017 08:54:00 GMT</pubDate></item><item><title>Disappointing zapcc performance on Deep Learning Library (DLL)</title><link>https://baptiste-wicht.com/posts/2017/03/disappointing-zapcc-performance-on-deep-learning-library-dll.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;One week ago, zapcc 1.0 was released and I've observed it to be much faster than the other
compilers in terms of compile time. This can be seen when
&lt;a class="reference external" href="http://baptiste-wicht.com/posts/2017/03/release-zapcc-10-fast-cpp-compiler.html"&gt;I tested it on my Expression Templates Library (ETL)&lt;/a&gt;. It was almost four
times faster than clang 3.9 and about 2.5 times faster than GCC.&lt;/p&gt;
&lt;p&gt;The ETL library is quite heavy to compile, but still reasonable. This is not the
case for my Deep Learning Library (DLL) where compiling all the test cases takes
a very long time. I have to admit that I have been going overboard with
templates and such and I now have to pay the price. In practice, for the users
of the library, this is not a big problem since only one or two neural networks
will be compiled (and it will take hours to train), but in the test cases, there
are hundreds of them and this is a huge pain. Anyway, enough with the ramble,
I figured it would be very good to test zapcc on it and see what I can gain from
using it.&lt;/p&gt;
&lt;p&gt;In this article, when I speak of a compiler thread, I mean an instance of the
compiler, so it's really a process in the Linux world.&lt;/p&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;However, I soon realized that I would have more issues than I thought. The first
problem is the memory consumed by zapcc. Indeed, it is based on clang and
I have always had problems with huge memory consumption from clang on this library, and
zapcc has even bigger memory consumption because some information is cached
between runs. The amount of memory that zapcc is able to cache can be configured
in the configuration file. By default, it can use 1.5 GB of memory. When zapcc
goes over the memory limit, it simply wipes out its caches. This means that all
the gain for the next compilation will be lost, since the cache will have to be
rebuilt from scratch. This is not a hard limit for the compilation itself.
Indeed, if the compilation itself takes 3 GB, it will still be able to complete,
but it is likely that the cache will be wiped after the compilation.&lt;/p&gt;
&lt;p&gt;When I tried compiling using several threads, it soon used all my memory and
crashed. The same occurs with clang but I can still compile with 3 or 4 threads
without too many issues on this computer. The same also occurs with GCC but it
can still handle 4 or 5 threads (depending on the order of the compilation
units).&lt;/p&gt;
&lt;p&gt;The tests are performed on my desktop computer at work, which is not really
good... I have 12 GB of RAM (I had to ask for extra...) and an old Sandy Bridge
processor, but at least I have an SSD (also had to ask for extra).&lt;/p&gt;
&lt;p&gt;I started by testing with only one compiler thread. For zapcc, I set the
maximum memory limit to 8 GB. Even with such a limit, the zapcc server restarted
more than 10 times during the compilation of the 84 test cases. After this first
experiment, I increased the number of threads to 2 for each compiler, using a 4 GB
limit for zapcc. The limit is for each server and each parallel thread will
spawn a new server, so the effective limit is the number of threads times the
limit. Even with two threads, I was unable to finish a compilation with zapcc.
This is quite disappointing for me since clang is able to run with 4 threads in
parallel. Moreover, a big problem is that the servers are not always
killed when there is no more memory; they just hang and use all the memory of
the computer, which is obviously really inconvenient for service processes. When
this happens with clang or gcc, the compiler simply crashes, the memory is
released and make is interrupted. Since zapcc is not able to work with more than
one thread on this computer, its results in the table below are the ones with
one thread in every column. I was
also surprised to be able to compile the library with clang and four threads;
this was not possible before clang-3.9.&lt;/p&gt;
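&lt;p&gt;The arithmetic behind the two configurations can be written down as a tiny sketch. The function name &lt;code&gt;total_cache_limit_gb&lt;/code&gt; is made up for the example, and the result is only the combined cache limit, not a hard cap on what the servers can actually use.&lt;/p&gt;

```python
def total_cache_limit_gb(compiler_threads, limit_per_server_gb):
    """Combined cache limit when each parallel compiler thread spawns its own server."""
    return compiler_threads * limit_per_server_gb


RAM_GB = 12  # memory of the test machine

# The two configurations tried: one thread with 8 GB, two threads with 4 GB each.
for threads, limit in [(1, 8), (2, 4)]:
    total = total_cache_limit_gb(threads, limit)
    print(f"-j{threads}: {threads} x {limit} GB = {total} GB of cache allowed, "
          f"{RAM_GB - total} GB left for everything else")
```

&lt;p&gt;Both configurations allow the same total amount of cache, yet only the single-threaded one completed, which suggests that the memory used by the compilations themselves, on top of the caches, is what makes the difference.&lt;/p&gt;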
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j3&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2250.95&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1256.36&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;912.67&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;760.84&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;gcc-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2305.37&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1279.49&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;918.08&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;741.38&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2047.61&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;1102.93&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;899.13&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;730.42&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc-1.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;strong&gt;1483.73&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1483.73&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1483.73&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1483.73&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Difference against Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-27.55%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+25.69%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+39.37%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+50.77%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Difference against GCC-5.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-35.66%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+13.75%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+38.09%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+50.03%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Difference against GCC-4.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;-34.08%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+15.30%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+38.50%&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;+48.75%&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If we look at the results with only one thread, we can see that there still are
some significant improvements when using zapcc, but nowhere near as good as what
was seen in the compilation of ETL. Here, the compilation time is reduced by 34%
compared to gcc and by 27% compared to clang. This is not bad, since it is
faster than the other compilers, but I would have expected better speedups. We
can see that g++-4.9 is slightly faster than g++-5.3, but this is not really
a significant difference. I'm actually very surprised to find that clang is
faster than g++ on this experiment. On ETL, it is always very significantly
slower and before, it was also significantly slower on DLL. I was so used to
this, that I stopped using it on this project. I may have to reconsider my
position when working on this project.&lt;/p&gt;
&lt;p&gt;Let's look at the results with more than one thread. Even with two threads,
every compiler is faster than zapcc. Indeed, zapcc is slower than Clang by 25%
and slower than GCC by about 15%. If we use more threads, the other compilers
become even faster and the slowdowns of zapcc are more pronounced. When
using four threads, zapcc is about 48% slower than gcc and about 50% slower than
clang. This really shows one big downside of zapcc: its very large
memory consumption. When it is used to compile really heavy template code, it
very quickly becomes unable to use more processes. And even when there is enough memory,
the speedups are not as great as for relatively simpler code.&lt;/p&gt;
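&lt;p&gt;To make these percentages easier to read, here is a small sketch that recomputes two of them from the timings in the table. The positive percentages seem to be expressed relative to zapcc's own time; this is inferred from the numbers and not stated anywhere, and the recomputed values only match the table up to small rounding differences. The function names are made up for the example.&lt;/p&gt;

```python
def reduction_percent(baseline, candidate):
    """How much shorter candidate is than baseline, as a percentage of baseline."""
    return (baseline - candidate) / baseline * 100


def slowdown_ratio(slower, faster):
    """How many times longer the slower build takes."""
    return slower / faster


# Timings taken from the table above.
ZAPCC_J1 = 1483.73  # zapcc with one thread (its best result here)
CLANG_J1 = 2047.61  # clang-3.9 with one thread
CLANG_J4 = 730.42   # clang-3.9 with four threads

print(f"zapcc vs clang, one thread each: {reduction_percent(CLANG_J1, ZAPCC_J1):.1f}% less time")
print(f"clang -j4 vs zapcc: {reduction_percent(ZAPCC_J1, CLANG_J4):.1f}% less time")
print(f"zapcc takes {slowdown_ratio(ZAPCC_J1, CLANG_J4):.2f}x as long as clang -j4")
```

&lt;p&gt;Read this way, a difference of about 50% with four threads means that clang needs roughly half of zapcc's time, in other words that zapcc takes about twice as long.&lt;/p&gt;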
&lt;p&gt;One may argue that this is not a fair comparison since zapcc does not have the
same number of threads. However, considering that this is the best zapcc can do
on this machine, I would argue that this is a fair comparison in this limited
experimental setting. If we were to have a big machine for compilation, which
I don't have at work, the zapcc results would likely be more interesting, but in
this specific limited case, it shows that zapcc suffers from its high memory
consumption. It should also be taken into account that this experiment was done
with almost nothing else running on the machine (no browser for instance) to
have as much memory as possible available for the compilers. This is not
a common use case. Most days, when I compile something, I have my
browser open, which makes a large difference in memory available, and several
other applications (but consoles and vim instances do not really consume memory
:D).&lt;/p&gt;
&lt;p&gt;This experiment made me realize that the compilation times for this library were
quickly becoming crazy. Most of the time, the complete test suite is only
compiled on my Continuous Integration machine at home which has a much faster
processor and much more RAM. Therefore, it is relatively fast since it uses more
threads to compile. Nevertheless, it is not good that the unit tests
take so much time to compile. I plan to split the test cases into several sets
because, currently, the real unit tests are compiled with the performance tests
and various other tests. I'll probably end up generating three executables. This
will help greatly during development. Moreover, I also have a technique to
decrease the compilation time by erasing some template parameters at compilation
time. This is already implemented, but currently has a runtime overhead that I will
try to remove before using this technique everywhere to get back to reasonable
compilation times. I'll also try to see if I can find obvious compilation
bottlenecks in the code.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;To conclude, while zapcc brings some very interesting compilation speedups in
some cases like in my ETL library, it also has some downsides, namely
&lt;strong&gt;huge memory consumption&lt;/strong&gt;. This memory consumption may prevent the use of several
compiler threads and render zapcc much less interesting than other compilers.&lt;/p&gt;
&lt;p&gt;When trying to compile my DLL library on a machine with 12 GB of RAM with two
zapcc threads, it was impossible for me to make it complete. While zapcc was
faster with one thread than the other compilers, they were able to use up to
four threads and in the end &lt;strong&gt;zapcc was about two times slower than clang&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I knew that zapcc memory consumption was very large, but I would not have
expected something so critical. Another feature that would be interesting in
zapcc would be to set a hard maximum memory limit for the server instead of simply
a limit on the cache it is able to keep in memory. This would prevent the
complete computer from hanging when something goes wrong.&lt;/p&gt;
&lt;p&gt;I had a good surprise with clang, which was actually faster than GCC and also able
to work with four threads in parallel. This was not the case with previous
versions of clang. On ETL, it is still significantly slower than GCC though.&lt;/p&gt;
&lt;p&gt;For now, I'll continue using clang on this DLL project and use zapcc only on my
ETL project. I'll also focus on improving the compilation time on this project
and make it reasonable again.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>dll</category><category>gcc</category><category>projects</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2017/03/disappointing-zapcc-performance-on-deep-learning-library-dll.html</guid><pubDate>Thu, 09 Mar 2017 12:41:06 GMT</pubDate></item><item><title>Release of zapcc 1.0 - Fast C++ compiler</title><link>https://baptiste-wicht.com/posts/2017/03/release-zapcc-10-fast-cpp-compiler.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;If you remember, I recently wrote about &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html"&gt;zapcc C++ compilation speed against gcc 5.4 and clang 3.9&lt;/a&gt; in which I was comparing the beta version of zapcc against gcc and clang.&lt;/p&gt;
&lt;p&gt;I have just been informed that zapcc was released in version 1.0. I thought it
was a good occasion to test it again. It will be compared against gcc-4.9,
gcc-5.3 and clang-3.9. This version is based on the trunk of clang-5.0.&lt;/p&gt;
&lt;p&gt;Again, I will use my Expression Template Library (&lt;a class="reference external" href="https://github.com/wichtounet/etl/"&gt;ETL&lt;/a&gt;) project. This is a purely header-only
library with lots of templates. I'm going to compile the full test cases. This
is a perfect example for long compilation times.&lt;/p&gt;
&lt;p&gt;The current tests are made on the last version of the library and with slightly
different parameters for compilation, therefore the absolute times are not
comparable, but the speedups should be comparable.&lt;/p&gt;
&lt;p&gt;Just like last time, I have configured zapcc to let it use 2 GB of RAM per caching
server, which is the maximum allowed. Moreover, I killed the servers before each
test.&lt;/p&gt;
&lt;section id="debug-results"&gt;
&lt;h2&gt;Debug results&lt;/h2&gt;
&lt;p&gt;Let's start with a debug build, with no optimizations enabled. Every build will
use four threads. This is the equivalent of doing &lt;code&gt;make -j4 debug/bin/etl_test&lt;/code&gt;
without the link step.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;190.09s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;200.92s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;313.85s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;81.25s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.86&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-5.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.47&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-4.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.33&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The speedups are even more impressive than last time! zapcc is &lt;strong&gt;almost four
times faster than clang-3.9&lt;/strong&gt; and around &lt;strong&gt;2.5 times faster than GCC-5.3&lt;/strong&gt;.
Interestingly, we can see that GCC-5.3 is slightly slower than GCC-4.9.&lt;/p&gt;
&lt;p&gt;It seems that they have made the compiler even faster!&lt;/p&gt;
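&lt;p&gt;The speedup rows in the table are simply the time of the other compiler
divided by the time of zapcc. As a minimal sketch (written in Python for
brevity; the dictionary and the &lt;code&gt;speedup&lt;/code&gt; helper are illustrative
names, not part of any tool used here), they can be recomputed from the measured
times. The last digit may differ slightly from the table depending on rounding:&lt;/p&gt;
&lt;pre&gt;
```python
# Compilation times in seconds, copied from the debug table above (-j4).
debug_times = {
    "g++-4.9.3": 190.09,
    "g++-5.3.0": 200.92,
    "clang++-3.9": 313.85,
    "zapcc++": 81.25,
}


def speedup(times, baseline, candidate="zapcc++"):
    """Return how many times faster `candidate` compiles than `baseline`."""
    return times[baseline] / times[candidate]


for name in ("clang++-3.9", "g++-5.3.0", "g++-4.9.3"):
    print(f"zapcc++ vs {name}: {speedup(debug_times, name):.2f}x")
```
&lt;/pre&gt;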
&lt;/section&gt;
&lt;section id="release-results"&gt;
&lt;h2&gt;Release results&lt;/h2&gt;
&lt;p&gt;Let's now look at the results with optimizations enabled. Again,
every build will use four threads. This is the equivalent of doing
&lt;code&gt;make -j4 release_debug/bin/etl_test&lt;/code&gt; without the link step.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;252.99s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.3.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;264.96s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;361.65s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;237.96s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.51&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-5.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.11&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC-4.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.06&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can see that this time the speedups are not as interesting as they were.
Very interestingly, zapcc is the compiler that suffers the most from the
optimization overhead. Indeed, zapcc is about three times slower in release mode
than it was in debug mode. Nevertheless, it still manages to beat the three other
compilers, by about 10% against GCC and 50% against clang, which is already
interesting.&lt;/p&gt;
&lt;section id="conclusion"&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;To conclude, we have observed that zapcc is always faster than the three
compilers tested in this experiment. Moreover, in debug mode, the speedups are
very significant: it was almost 4 times faster than clang and around 2.5 times
faster than gcc.&lt;/p&gt;
&lt;p&gt;I haven't seen any problem with the tool. It behaves like clang and it should
generate code of the same performance, but just compile it much faster. One problem
I have with zapcc is that it is not based on an already released version of
clang but on the trunk. That means it is hard to compare it with the exact same
version of clang, and there is also a risk of running into clang bugs.&lt;/p&gt;
&lt;p&gt;Although the prices have not been published yet, it is indicated on the website
that zapcc is free for non-commercial entities, which is really great.&lt;/p&gt;
&lt;p&gt;If you want more information, you can go to the
&lt;a class="reference external" href="https://www.zapcc.com/"&gt;official website of zapcc&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>projects</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2017/03/release-zapcc-10-fast-cpp-compiler.html</guid><pubDate>Thu, 02 Mar 2017 13:50:04 GMT</pubDate></item><item><title>C++ Compiler benchmark on Expression Templates Library (ETL)</title><link>https://baptiste-wicht.com/posts/2016/12/cpp-compiler-benchmark-on-expression-templates-library-etl.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;In my Expression Templates Library (ETL) project, I have a lot of template heavy
code that needs to run as fast as possible and that is quite intensive to
compile. In this post, I'm going to compare the performance of a few of the
kernels produced by different compilers. I've got GCC 5.4, GCC 6.2 and clang
3.9. I also included zapcc, which is based on clang 4.0.&lt;/p&gt;
&lt;p&gt;These tests have been run on a Haswell processor. The automatic parallelization
of ETL has been turned off for these tests.&lt;/p&gt;
&lt;p&gt;Keep in mind that some of the diagrams are presented in logarithmic form.&lt;/p&gt;
&lt;section id="vector-multiplication"&gt;
&lt;h2&gt;Vector multiplication&lt;/h2&gt;
&lt;p&gt;The first kernel is a very simple one: the element-wise multiplication of two
vectors. Nothing fancy here.&lt;/p&gt;
&lt;div id="mul_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('mul_container', {
        chart: { type: 'column' },
        title: { text: 'Element-wise Vector Multiplication' },
        xAxis: {
            categories: ['10', '100', '1000', '10000', '100000', '1000000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [0.021, 0.040, 0.215, 2.07, 32.1, 403]
        },
        {
            name: 'g++-6.2', data: [0.021, 0.037, 0.208, 2.17, 32.1, 376]
        },
        {
            name: 'clang-3.9', data: [0.027, 0.045, 0.243, 2.43, 32.7, 389]
        },
        {
            name: 'zapcc-4.0', data: [0.026, 0.047, 0.321, 2.5, 32.8, 411]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;For small vectors, clang is significantly slower than gcc-5.4 and gcc-6.2. On
vectors of 100'000 elements and more, the speed is comparable for each compiler,
since it then depends mostly on the memory bandwidth. Overall, gcc-6.2 produces
the fastest code here. clang-4.0 is slightly slower than clang-3.9, but nothing dramatic.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="vector-exponentiation"&gt;
&lt;h2&gt;Vector exponentiation&lt;/h2&gt;
&lt;p&gt;The second kernel computes the exponential of each element of a vector and
stores the results in another vector.&lt;/p&gt;
&lt;div id="exp_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('exp_container', {
        chart: { type: 'column' },
        title: { text: 'Element-wise Vector Exponentiation' },
        xAxis: {
            categories: ['10', '100', '1000', '10000', '100000', '1000000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [0.0478, 0.137, 1.12, 9.79, 97.5, 959]
        },
        {
            name: 'g++-6.2', data: [0.0474, 0.132, 1.11, 9.71, 97, 1000]
        },
        {
            name: 'clang-3.9', data: [0.0492, 0.136, 0.959, 9.24, 92.9, 914]
        },
        {
            name: 'zapcc-4.0', data: [0.0488, 0.142, 0.952, 9.25, 91.9, 915]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;Interestingly, this time, clang versions are significantly faster for medium to
large vectors, from 1000 elements and higher, by about 5%. There are no
significant differences between the different versions of each compiler.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="matrix-matrix-multiplication"&gt;
&lt;h2&gt;Matrix-Matrix Multiplication&lt;/h2&gt;
&lt;p&gt;The next kernel I benchmarked is the matrix-matrix multiplication.
In that case, the kernel is hand-unrolled and vectorized.&lt;/p&gt;
&lt;div id="gemm_container_small" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;div id="gemm_container_large" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('gemm_container_small', {
        chart: { type: 'column' },
        title: { text: 'Matrix Matrix Multiplication (small)', },
        xAxis: {
            categories: ['10x10', '20x20', '40x40', '60x60', '80x80', '100x100']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [0.159, 0.815, 2.637, 13.849, 17.281, 78.903]
        },
        {
            name: 'g++-6.2', data: [0.162, 0.802, 2.431, 13.531, 17.274, 74.02]
        },
        {
            name: 'clang-3.9', data: [0.179, 1.218, 2.391, 14.981, 15.142, 61.548]
        },
        {
            name: 'zapcc-4.0', data: [0.159, 0.836, 2.712, 13.426, 15.114, 62.241]
        }
        ]
    });
    Highcharts.chart('gemm_container_large', {
        chart: { type: 'column' },
        title: { text: 'Matrix Matrix Multiplication (large)', },
        xAxis: {
            categories: ['200x200', '300x300', '400x400', '500x500', '600x600', '700x700', '800x800', '900x900', '1000x1000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [275.219, 1371, 1837, 5177, 6667, 14981, 17037, 31492, 32813]
        },
        {
            name: 'g++-6.2', data: [267.776, 1362, 1808, 5297, 6859, 15166, 15664, 30666, 33067]
        },
        {
            name: 'clang-3.9', data: [266.033, 1230, 1789, 4825, 6969, 14488, 15916, 30872, 33186]
        },
        {
            name: 'zapcc-4.0', data: [267.806, 1237, 1820, 4909, 7035, 15191, 18193, 33127, 37346]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;There are a few differences between the compilers. The first thing is that for
some sizes such as 80x80 and 100x100, clang is significantly faster than GCC, by
more than 10%. The other interesting fact is that for large matrices
zapcc-clang-4.0 is always slower than clang-3.9 which is itself on par with the
two GCC versions. In my opinion, it comes from a regression in clang trunk but
it could also come from zapcc itself.&lt;/p&gt;
&lt;div id="std_gemm_container_large" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('std_gemm_container_large', {
        chart: { type: 'column' },
        title: { text: 'Matrix Matrix Multiplication (naive)', },
        xAxis: {
            categories: ['200x200', '300x300', '400x400', '500x500', '600x600', '700x700', '800x800', '900x900', '1000x1000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [1.195, 4.891, 10.467, 22.400, 33.399,
            58.401, 77.150, 121.392, 148.469]
        },
        {
            name: 'g++-6.2', data: [1.109, 4.540, 9.964, 21.359, 31.904,
            55.282, 72.690, 113.52, 143.27]
        },
        {
            name: 'clang-3.9', data: [0.893, 3.710, 7.287, 16.244, 23.920,
            43.342, 56.771, 91.870, 112.309]
        },
        {
            name: 'zapcc-4.0', data: [5.088, 16.909, 39.632, 77.194, 133.15,
            214.539, 316.01, 447.715, 612.255]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;The results are much more interesting here! First, there is a huge regression in
clang-4.0 (or in zapcc for that matter). Indeed, it is up to 6 times slower than
clang-3.9. Moreover, clang-3.9 is always significantly faster than gcc-6.2.
Finally, there is a small improvement in gcc-6.2 compared to gcc-5.4.&lt;/p&gt;
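&lt;p&gt;For reference, a naive matrix-matrix multiplication is essentially a triple
loop whose optimization is left entirely to the compiler. The following is
a minimal sketch in Python, for brevity. It is not the implementation used in
ETL (which is C++ and not shown in this post), and &lt;code&gt;naive_matmul&lt;/code&gt; is
a hypothetical name:&lt;/p&gt;
&lt;pre&gt;
```python
def naive_matmul(a, b):
    """Multiply two matrices stored as lists of rows with three nested loops."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            a_ik = a[i][k]
            # Accumulate the contribution of a[i][k] to the whole row i of c.
            for j in range(cols):
                c[i][j] += a_ik * b[k][j]
    return c


print(naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```
&lt;/pre&gt;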
&lt;/section&gt;
&lt;section id="fast-fourrier-transform"&gt;
&lt;h2&gt;Fast Fourier Transform&lt;/h2&gt;
&lt;p&gt;The following kernel is a hand-crafted Fast Fourier Transform
implementation.&lt;/p&gt;
&lt;div id="fft_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('fft_container', {
        chart: { type: 'column' },
        title: { text: 'Fast Fourier Transform', },
        xAxis: {
            categories: ['100', '1000', '10000', '100000', '1000000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [2.640, 27.515, 308.239, 3427.4, 41695.9]
        },
        {
            name: 'g++-6.2', data: [2.578, 26.194, 298.97, 3348.82, 40783.8]
        },
        {
            name: 'clang-3.9', data: [3.047, 30.514, 333.403, 3569.36,43860.6]
        },
        {
            name: 'zapcc-4.0', data: [3.199,33.304,317.135,4025.18,48445.3]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;On this benchmark, gcc-6.2 is the clear winner. It is significantly faster
than clang-3.9 and clang-4.0. Moreover, gcc-6.2 is also faster than gcc-5.4.
In contrast, clang-4.0 is significantly slower than clang-3.9 except on one
configuration (10000 elements).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="d-convolution"&gt;
&lt;h2&gt;1D Convolution&lt;/h2&gt;
&lt;p&gt;This kernel is about computing the 1D valid convolution of two vectors.&lt;/p&gt;
&lt;div id="conv1_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('conv1_container', {
        chart: { type: 'column' },
        title: { text: '1D convolution (optimized)', },
        xAxis: {
            categories: ['1000x500', '2000x1000', '3000x1500', '4000x2000',
            '5000x2500', '6000x3000', '7000x3500', '8000x4000', '9000x4500',
            '10000x5000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [11.710, 41.002, 91.201, 158.178,
            248.985, 353.695, 486.676, 634.53, 867.101, 1082.62]
        },
        {
            name: 'g++-6.2', data: [9.307, 40.921, 90.327, 158.734, 248.892,
            354.582, 488.38, 636.899, 869.637, 1084.86]
        },
        {
            name: 'clang-3.9', data: [13.404, 41.409, 95.094, 162.339,
            256.143, 362.34, 498.66, 651.352, 886.465, 1092.24]
        },
        {
            name: 'zapcc-4.0', data: [13.528, 40.886, 94.473, 159.917,
            252.992, 356.63, 493.653, 640.348, 872.282, 1091.36]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;While clang-4.0 is faster than clang-3.9, it is still slightly slower than both
gcc versions. On the GCC side, there is not a lot of difference except on the
1000x500 configuration, on which gcc-6.2 is 25% faster.&lt;/p&gt;
&lt;p&gt;And here are the results with the naive implementation:&lt;/p&gt;
&lt;div id="std_conv1_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('std_conv1_container', {
        chart: { type: 'column' },
        title: { text: '1D convolution (naive)', },
        xAxis: {
            categories: ['1000x500', '2000x1000', '3000x1500', '4000x2000',
            '5000x2500', '6000x3000', '7000x3500', '8000x4000', '9000x4500',
            '10000x5000']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [0.350, 1.452, 3.260, 5.823, 9.116,
            13.155, 17.922, 23.438, 29.705, 36.683]
        },
        {
            name: 'g++-6.2', data: [0.350, 1.457, 3.262, 5.823, 9.120,
            13.152, 17.922, 23.436, 29.687, 36.665]
        },
        {
            name: 'clang-3.9', data: [0.216, 0.873, 1.974, 3.517, 5.501,
            7.921, 10.793, 14.11, 17.867, 22.068]
        },
        {
            name: 'zapcc-4.0', data: [0.215, 0.873, 1.972, 3.514, 5.501,
            7.928, 10.799, 14.11, 17.879, 22.065]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;Again, on the naive version, clang is much faster than GCC, by
about 65%. This is a really large speedup.&lt;/p&gt;
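&lt;p&gt;To make the operation concrete, a 1D valid convolution only computes the
outputs for which the kernel fits entirely inside the input, so an input of size
n and a kernel of size m give n - m + 1 values. Here is a minimal sketch of the
naive double loop, in Python for brevity. It is an illustration under these
assumptions, not the code benchmarked here, and &lt;code&gt;conv1_valid&lt;/code&gt; is
a hypothetical name:&lt;/p&gt;
&lt;pre&gt;
```python
def conv1_valid(signal, kernel):
    """Naive 1D valid convolution: keep only the fully overlapping positions."""
    n, m = len(signal), len(kernel)
    out = [0.0] * (n - m + 1)
    for i in range(len(out)):
        for j in range(m):
            # The kernel is flipped, which is what makes this a convolution
            # rather than a cross-correlation.
            out[i] += signal[i + j] * kernel[m - 1 - j]
    return out


print(conv1_valid([1, 2, 3, 4], [1, 0, -1]))
```
&lt;/pre&gt;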
&lt;/section&gt;
&lt;section id="d-convolution-1"&gt;
&lt;h2&gt;2D Convolution&lt;/h2&gt;
&lt;p&gt;This next kernel computes the 2D valid convolution of two matrices.&lt;/p&gt;
&lt;div id="conv2_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('conv2_container', {
        chart: { type: 'column' },
        title: { text: '2D Convolution (optimized)', },
        xAxis: {
            categories: ['100x50', '105x50', '110x55', '115x55', '120x60',
            '125x60', '130x65', '135x65', '140x70']
        },
        yAxis: {
            title: { text: 'Time (us)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'us'},
        series: [
        {
            name: 'g++-5.4', data: [327.399, 367.389, 441.457, 576.021,
            762.268, 794, 994.06, 1261.71, 1360.57]
        },
        {
            name: 'g++-6.2', data: [327.764, 367.379, 441.993, 572.241,
            761.741, 784.605, 991.717, 1266.55, 1361.59]
        },
        {
            name: 'clang-3.9', data: [330.199, 364.253, 443.483, 580.676,
            763.772, 777.39, 1000.53, 1267.75, 1375.51]
        },
        {
            name: 'zapcc-4.0', data: [339.358, 364.756, 443.807, 575.917,
            761.248, 784.695, 992.29, 1265.04, 1367.33]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;There is no clear difference between the compilers in this code. Every compiler
here has its ups and downs.&lt;/p&gt;
&lt;p&gt;Let's look at the naive implementation of the 2D convolution (units are
milliseconds here not microseconds):&lt;/p&gt;
&lt;div id="std_conv2_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('std_conv2_container', {
        chart: { type: 'column' },
        title: { text: '2D Convolution (naive)', },
        xAxis: {
            categories: ['100x50', '105x50', '110x55', '115x55', '120x60',
            '125x60', '130x65', '135x65', '140x70']
        },
        yAxis: {
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [9.501,11.458,13.888, 16.489, 19.634,
            22.898, 27.012, 31.246, 36.269]
        },
        {
            name: 'g++-6.2', data: [9.502, 11.464, 13.903, 16.484, 19.642,
            22.994, 27.004, 31.248, 36.26]
        },
        {
            name: 'clang-3.9', data: [5.880, 7.136, 8.610, 10.226, 12.164,
            14.247, 17.024, 19.577, 22.510]
        },
        {
            name: 'zapcc-4.0', data: [5.875, 7.091, 8.661, 10.241, 12.218,
            14.302, 16.777, 19.424, 22.472]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;This time the difference is very large! Indeed, clang versions are about 60%
faster than the GCC versions! This is really impressive. Even so, the naive
version does not come close to the optimized one. It seems the vectorizer of
clang is much more efficient than the one from GCC.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="d-convolution-2"&gt;
&lt;h2&gt;4D Convolution&lt;/h2&gt;
&lt;p&gt;The final kernel that I'm testing is the batched 4D convolution that is used a
lot in Deep Learning. This is not really a 4D convolution, but a large number
of 2D convolutions applied on 4D tensors.&lt;/p&gt;
&lt;div id="conv4_container" style="min-width: 310px; height:400px; margin: 0 auto; "&gt;&lt;/div&gt;
&lt;script&gt;
$(function () {
    Highcharts.chart('conv4_container', {
        chart: { type: 'column' },
        title: { text: '4D Convolution', },
        xAxis: {
            categories: ['2x6x3x28x16', '2x6x3x28x16', '2x6x3x28x16',
            '2x6x3x28x16', '2x6x3x28x16', '2x6x3x28x16', '2x6x3x28x16',
            '2x6x3x28x16', '2x6x3x28x16']
        },
        yAxis: {
            type: 'logarithmic',
            title: { text: 'Time (ms)' },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {valueSuffix: 'ms'},
        series: [
        {
            name: 'g++-5.4', data: [0.095, 0.402, 1.083, 2.237, 3.988,
            6.474, 9.985, 14.132, 19.539]
        },
        {
            name: 'g++-6.2', data: [0.089, 0.413, 1.081, 2.224, 3.990,
            6.462, 9.815, 14.118, 19.612]
        },
        {
            name: 'clang-3.9', data: [0.090, 0.416, 1.108, 2.277, 4.077,
            6.587, 10.024, 14.359, 20.006]
        },
        {
            name: 'zapcc-4.0', data: [0.088, 0.406, 1.080, 2.237, 3.987,
            6.484, 9.827, 14.130, 19.569]
        }
        ]
    });
});
&lt;/script&gt;&lt;p&gt;Again, there are very small differences between each version. The best results
come from the most recent versions of the compilers, gcc-6.2 and clang-4.0, which are tied.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Overall, we can see two trends in these results. First, when working with
highly-optimized code, the choice of compiler will not make a huge difference.
On this kind of kernel, gcc-6.2 tends to perform faster than the other
compilers, but only by a very slight margin, except in some cases. On the other
hand, when working with naive implementations, clang versions really did perform
much better than GCC. The clang compiled versions of the 1D and 2D convolutions
are more than 60% faster than their GCC counterparts. This is really
impressive. Overall, clang-4.0 seems to have several performance regressions,
but since it is still a work in progress, I would not be surprised if these
regressions are not present in the final version. Since the clang-4.0 version is
in fact the clang version used by zapcc, it's also possible that zapcc is
introducing new performance regressions.&lt;/p&gt;
&lt;p&gt;Overall, my advice would be to use GCC-6.2 (or 5.4) on hand-optimized kernels
and clang when you have mostly naive implementations. However, keep in mind that,
at least for the examples shown here, the naive version optimized by the compiler
never comes close to the highly-optimized version.&lt;/p&gt;
&lt;p&gt;As ever, take this with a grain of salt: it has only been tested on one project
and one machine, and you may obtain very different results on other projects and on
other processors.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>gcc</category><category>Performance</category><category>templates</category><guid>https://baptiste-wicht.com/posts/2016/12/cpp-compiler-benchmark-on-expression-templates-library-etl.html</guid><pubDate>Sun, 11 Dec 2016 13:17:30 GMT</pubDate></item><item><title>zapcc C++ compilation speed against gcc 5.4 and clang 3.9</title><link>https://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;A week ago, I compared the &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html"&gt;compilation time performance of zapcc against gcc-4.9.3 and clang-3.7&lt;/a&gt;. On debug builds, zapcc was about 2 times faster than gcc and 3 times faster than clang. In this post, I'm going to try some more recent compilers, namely gcc 5.4 and clang 3.9 on the same project. If you want more information on zapcc, read the previous posts, this post will concentrate on results.&lt;/p&gt;
&lt;p&gt;Again, I use my Expression Template Library
(&lt;a class="reference external" href="https://github.com/wichtounet/etl/"&gt;ETL&lt;/a&gt;). This is a purely header-only
library with lots of templates. I'm going to compile the full test cases.&lt;/p&gt;
&lt;p&gt;The results of the two articles are not directly comparable, since they were
obtained on two different computers. The one on which the present results were
obtained has a less powerful processor and only 16 GB of RAM compared to the
32 GB of RAM of my build machine. Also take into account that the present results
were obtained on a desktop machine, so there can be some perturbations from
background tasks.&lt;/p&gt;
&lt;p&gt;Just like in the previous results, using more threads than physical cores
does not help. Therefore, the results were only computed on up to 4 cores on
this machine.&lt;/p&gt;
&lt;p&gt;The link time is not taken into account in the results.&lt;/p&gt;
&lt;section id="debug-build"&gt;
&lt;h2&gt;Debug build&lt;/h2&gt;
&lt;p&gt;Let's start with the result of the debug build.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;469s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;230s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;130s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;710s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;371s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;218s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;214s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;112s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;66s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.31&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.3&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.19&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.05&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.96&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The results are almost the same as the previous test. zapcc is 3.3 times faster
to compile than Clang and around 2 times faster than GCC. It seems that GCC 5.4
is a bit faster than GCC 4.9.3 while clang 3.9 is a bit slower than clang 3.7,
but nothing terribly significant.&lt;/p&gt;
&lt;p&gt;Overall, for debug builds, zapcc can bring a very significant improvement to
your compile times.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="release-build"&gt;
&lt;h2&gt;Release build&lt;/h2&gt;
&lt;p&gt;Let's see how release builds fare. Since the results are comparable
between the numbers of threads, the results here are just for one thread.&lt;/p&gt;
&lt;p&gt;This is more time consuming since a lot of optimizations are enabled and more
features from ETL are enabled as well.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-5.4.0&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;782s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.9&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;960s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;640s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.5&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.22&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On a release build, the speedups are much less interesting. Nevertheless, they
are still significant. zapcc is still 1.2 times faster than gcc and 1.5 times
faster than clang. The speedup against clang 3.9 is significantly higher than
it was in my experiment with clang 3.7. It's possible that clang 3.9 is slower
or simply has new optimization passes.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The previous conclusion still holds with modern versions of the compilers: zapcc is
much faster than other compilers on debug builds of template-heavy code. It is more
than 3 times faster than clang-3.9 and about 2 times faster than gcc-5.4. Since
it's based on clang, there should not be any issue compiling projects that
already compile with a recent clang. Even though the speedups are less
interesting on a release build, they are still significant, especially compared
against clang.&lt;/p&gt;
&lt;p&gt;I'm really interested in finding out what will be the pricing for zapcc once
out of the beta or if they will be able to get even faster!&lt;/p&gt;
&lt;p&gt;For the comparison with gcc 4.9.3 and clang 3.7, you can have a look at
&lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html"&gt;this article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you want more information about zapcc, you can go to the
&lt;a class="reference external" href="https://www.zapcc.com/"&gt;official website of zapcc&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>meta</category><category>projects</category><guid>https://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html</guid><pubDate>Mon, 05 Dec 2016 17:46:09 GMT</pubDate></item><item><title>zapcc - a faster C++ compiler</title><link>https://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Update: For a comparison against more modern compiler versions, you can read: &lt;a class="reference external" href="http://baptiste-wicht.com/posts/2016/12/zapcc-cpp-compilation-speed-against-gcc-54-and-clang-39.html"&gt;zapcc C++ compilation speed against gcc 5.4 and clang 3.9&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I just joined the private beta program of zapcc. Zapcc is a C++ compiler based
on Clang that aims at being much faster than other C++ compilers. It does this
by using a caching server that saves some of the compiler structures,
which should speed up compilation a lot. The private beta is free, but once the
compiler is ready, it will be a commercial compiler.&lt;/p&gt;
&lt;p&gt;Every C++ developer knows that compilation time can quickly be an issue when
programs are getting very big and especially when working with template-heavy
code.&lt;/p&gt;
&lt;p&gt;To benchmark this new compiler, I use my Expression Template Library
(&lt;a class="reference external" href="https://github.com/wichtounet/etl/"&gt;ETL&lt;/a&gt;). This is a purely header-only
library with lots of templates. It has lots of test cases, which are what I'm
going to compile. I'm going to compare against Clang-3.7 and gcc-4.9.3.&lt;/p&gt;
&lt;p&gt;I have configured zapcc to let it use 2 GB of RAM per caching server, which is the
maximum allowed. Moreover, I killed the servers before each test.&lt;/p&gt;
&lt;section id="debug-build"&gt;
&lt;h2&gt;Debug build&lt;/h2&gt;
&lt;p&gt;Let's start with a debug build. In that configuration, there is no optimization
going on and several of the features of the library (GPU, BLAS, ...) are
disabled. This is the fastest way to compile ETL. I gathered these results on
a 4-core, 8-thread Intel processor with an SSD.&lt;/p&gt;
&lt;p&gt;The following table presents the results with different numbers of threads and
the speedup of zapcc compared to the other compilers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;350s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;185s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;104s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;94s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;91s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.7&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;513s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;271s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;153s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;145s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;138s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;158s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;87s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;47s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;44s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;42s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.24&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.11&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.25&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.29&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;3.28&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.21&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.12&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.21&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.13&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;2.16&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The result is pretty clear! zapcc is around &lt;strong&gt;three times faster than Clang&lt;/strong&gt; and around
&lt;strong&gt;two times faster than GCC&lt;/strong&gt;. This is pretty impressive!&lt;/p&gt;
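&lt;p&gt;For reference, the speedup rows are the time of the other compiler divided by
the time of zapcc. Here is a small sketch of that arithmetic. The helper
&lt;code&gt;speedup_x100&lt;/code&gt; is a hypothetical name made up for this illustration; it
works in truncated hundredths to stay in integer arithmetic, which appears to
match how the table values were obtained:&lt;/p&gt;

```cpp
// Hypothetical helper (not part of ETL or zapcc): speedup of the fast
// compiler over the slow one, expressed in hundredths and truncated.
constexpr unsigned speedup_x100(unsigned slow_seconds, unsigned fast_seconds) {
    return slow_seconds * 100 / fast_seconds;
}

// -j1 column of the debug table
static_assert(speedup_x100(513, 158) == 324, "clang++-3.7 vs zapcc++: 3.24");
static_assert(speedup_x100(350, 158) == 221, "g++-4.9.3 vs zapcc++: 2.21");

// -j8 column of the debug table
static_assert(speedup_x100(138, 42) == 328, "clang++-3.7 vs zapcc++: 3.28");
static_assert(speedup_x100(91, 42) == 216, "g++-4.9.3 vs zapcc++: 2.16");
```

&lt;p&gt;The ratio barely moves between -j1 and -j8, which suggests that the gain
comes from each compiler invocation being cheaper and not from better
parallelism.&lt;/p&gt;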
&lt;p&gt;For those who think that Clang is always faster than GCC, keep in mind that
this is not the case for template-heavy code such as this library. In all my
tests, Clang has always been slower and much more memory-hungry than GCC on
template-heavy C++ code. And sometimes the difference is very significant.&lt;/p&gt;
&lt;p&gt;Interestingly, we can also see that going past the number of physical cores does
not bring much on this computer. On some computers, the speedups are worthwhile,
but not on this one. Always benchmark!&lt;/p&gt;
&lt;/section&gt;
&lt;section id="release-build"&gt;
&lt;h2&gt;Release build&lt;/h2&gt;
&lt;p&gt;We have seen the results on a debug build. Let's now compare on something a bit
more time-consuming: a release build with all options of ETL enabled (GPU, BLAS, ...),
which should make it significantly longer to compile.&lt;/p&gt;
&lt;p&gt;Again, the table:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th class="head"&gt;&lt;p&gt;Compiler&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j1&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j2&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j4&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j6&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;-j8&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;g++-4.9.3&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;628s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;336s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;197s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;189s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;184s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;clang++-3.7&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;663s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;388s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;215s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;212s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;205s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;zapcc++&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;515s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;281s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;173s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;168s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;158s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS Clang&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.28&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.38&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.24&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.26&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.29&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;Speedup VS GCC&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.21&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.30&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.13&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.12&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1.16&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, we can see that the difference is much lower. Zapcc is &lt;strong&gt;between 1.2
and 1.4 times faster than Clang&lt;/strong&gt; and &lt;strong&gt;between 1.1 and 1.3 times faster than
GCC&lt;/strong&gt;. This shows that most of the speedups from zapcc are in the front end of
the compiler. The gain is modest but still significant over long builds,
especially if you have few threads, where the absolute difference is
higher: against Clang, zapcc saves 148s with -j1 but only 47s with -j8.&lt;/p&gt;
&lt;p&gt;We can also observe that Clang is now almost on par with GCC, which shows that
optimization is faster in Clang while the front end and back end are faster in GCC.&lt;/p&gt;
&lt;p&gt;You also have to keep in mind that zapcc memory usage is higher than Clang's
because of all the caching. Moreover, the servers stay up between
compilations, so this memory remains in use between builds, which may not be what
you want.&lt;/p&gt;
&lt;p&gt;As for runtime, I have not seen any significant difference in performance
between the code generated by Clang and by zapcc. According to the official benchmarks
and documentation, there should not be any difference in that regard between zapcc and
the version of Clang on which zapcc is based.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="incremental-build"&gt;
&lt;h2&gt;Incremental build&lt;/h2&gt;
&lt;p&gt;Normally, zapcc should shine at incremental building, but I was unable to show
any speedup when changing a single file without killing the zapcc servers. Maybe
I did something wrong in my usage of zapcc.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, we can see that zapcc is always faster than both GCC and Clang
on my template-heavy library. Moreover, on debug builds, it is much faster than
either of the two compilers, being more than 2 times faster than GCC and more than
3 times faster than Clang. This is really great. Moreover, I have not seen any
issue with the tool so far; it can seamlessly replace Clang.&lt;/p&gt;
&lt;p&gt;It's a bit weird that you cannot allocate more than 2 GB to the zapcc servers.&lt;/p&gt;
&lt;p&gt;For a program, that's really impressive. I hope that they are continuing the
good work and especially that this motivates other compilers to improve the
speed of compilation (especially of templates).&lt;/p&gt;
&lt;p&gt;If you want more information, you can go to the
&lt;a class="reference external" href="https://www.zapcc.com/"&gt;official website of zapcc&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>clang</category><category>Compilers</category><category>etl</category><category>gcc</category><category>projects</category><category>zapcc</category><guid>https://baptiste-wicht.com/posts/2016/11/zapcc-a-faster-cpp-compiler.html</guid><pubDate>Sat, 26 Nov 2016 12:17:50 GMT</pubDate></item><item><title>Compile integer Square Roots at compile-time in C++</title><link>https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;For one of my projects, I needed to evaluate a square root at compile-time.
There are several ways to implement it and some are better than others.&lt;/p&gt;
&lt;p&gt;In this post, I'll show several versions, both with Template Metaprogramming
(TMP) and constexpr functions.&lt;/p&gt;
&lt;section id="naive-version"&gt;
&lt;h2&gt;Naive version&lt;/h2&gt;
&lt;p&gt;The easiest way to implement it is to enumerate the integers until we find one
whose square is equal to our number. This can easily be
implemented in C++ with a class template and partial specialization:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c++"&gt;&lt;a id="rest_code_3f6087b45cf247c2b14c49fc974fc1de-1" name="rest_code_3f6087b45cf247c2b14c49fc974fc1de-1" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_3f6087b45cf247c2b14c49fc974fc1de-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;a id="rest_code_3f6087b45cf247c2b14c49fc974fc1de-2" name="rest_code_3f6087b45cf247c2b14c49fc974fc1de-2" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_3f6087b45cf247c2b14c49fc974fc1de-2"&gt;&lt;/a&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ct_sqrt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;integral_constant&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;a id="rest_code_3f6087b45cf247c2b14c49fc974fc1de-3" name="rest_code_3f6087b45cf247c2b14c49fc974fc1de-3" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_3f6087b45cf247c2b14c49fc974fc1de-3"&gt;&lt;/a&gt;
&lt;a id="rest_code_3f6087b45cf247c2b14c49fc974fc1de-4" name="rest_code_3f6087b45cf247c2b14c49fc974fc1de-4" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_3f6087b45cf247c2b14c49fc974fc1de-4"&gt;&lt;/a&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;a id="rest_code_3f6087b45cf247c2b14c49fc974fc1de-5" name="rest_code_3f6087b45cf247c2b14c49fc974fc1de-5" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_3f6087b45cf247c2b14c49fc974fc1de-5"&gt;&lt;/a&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ct_sqrt&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;integral_constant&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Really easy, isn't it? If we test it with 100, it gives 10. But if we try with
higher values, we are going to run into problems. For instance, when compiled
with 289, here is what clang++ gives me:&lt;/p&gt;
&lt;pre class="literal-block"&gt;src/sqrt/tmp.cpp:5:64: fatal error: recursive template instantiation exceeded maximum depth of 256
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 257&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 256&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 255&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 254&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 253&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: (skipping 247 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 5&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 4&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 3&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:5:64: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 2&amp;gt;' requested here
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^
src/sqrt/tmp.cpp:11:18: note: in instantiation of template class 'ct_sqrt&amp;lt;289, 1&amp;gt;' requested here
    std::cout &amp;lt;&amp;lt; ct_sqrt&amp;lt;289&amp;gt;::value &amp;lt;&amp;lt; std::endl;
                 ^
src/sqrt/tmp.cpp:5:64: note: use -ftemplate-depth=N to increase recursive template instantiation depth
struct ct_sqrt : std::integral_constant&amp;lt;std::size_t, (I*I&amp;lt;N) ? ct_sqrt&amp;lt;N,I+1&amp;gt;::value : I &amp;gt; {};
                                                               ^&lt;/pre&gt;
&lt;p&gt;And that is only to compute the square root of 289, not a big number. Note that
the recursion does not stop at 17: the conditional operator does not prevent the
instantiation of ct_sqrt&amp;lt;N,I+1&amp;gt;, so the instantiations continue until the
ct_sqrt&amp;lt;N,N&amp;gt; specialization is reached. We could of
course increase the template depth limit (-ftemplate-depth=X), but that would
only get us a bit farther. If you try with g++, you should see that this works;
that is because g++ has a higher template depth limit (900 for 4.8.2 on my
machine), whereas clang has a default limit of 256. Note also that
g++ does not skip any context, therefore its error messages are quite long.&lt;/p&gt;
&lt;p&gt;Now that C++11 gives us constexpr functions, we can rewrite it more cleanly:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c++"&gt;&lt;a id="rest_code_dbdf994016d843c89e42bdadd52cb8b4-1" name="rest_code_dbdf994016d843c89e42bdadd52cb8b4-1" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_dbdf994016d843c89e42bdadd52cb8b4-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;constexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_dbdf994016d843c89e42bdadd52cb8b4-2" name="rest_code_dbdf994016d843c89e42bdadd52cb8b4-2" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_dbdf994016d843c89e42bdadd52cb8b4-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_dbdf994016d843c89e42bdadd52cb8b4-3" name="rest_code_dbdf994016d843c89e42bdadd52cb8b4-3" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_dbdf994016d843c89e42bdadd52cb8b4-3"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Much nicer :) And it works perfectly with 289. And it works quite well up to
fairly large numbers. But it still fails once we get to larger numbers. For instance, here
is what clang++ gives me with 302500 (550*550):&lt;/p&gt;
&lt;pre class="literal-block"&gt;src/sqrt/constexpr.cpp:8:36: error: constexpr variable 'result' must be initialized by a constant expression
static constexpr const std::size_t result = ct_sqrt(SQRT_VALUE);
                                   ^        ~~~~~~~~~~~~~~~~~~~
src/sqrt/constexpr.cpp:5:38: note: constexpr evaluation exceeded maximum depth of 512 calls
    return n == i ? n : (i * i &amp;lt; n ? ct_sqrt(n, i + 1) : i);
                                     ^
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 512)'
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 511)'
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 510)'
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 509)'
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 508)'
src/sqrt/constexpr.cpp:5:38: note: (skipping 502 calls in backtrace; use -fconstexpr-backtrace-limit=0 to see all)
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 5)'
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 4)'
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 3)'
src/sqrt/constexpr.cpp:5:38: note: in call to 'ct_sqrt(302500, 2)'
src/sqrt/constexpr.cpp:8:45: note: in call to 'ct_sqrt(302500, 1)'
static constexpr const std::size_t result = ct_sqrt(SQRT_VALUE);
                                            ^&lt;/pre&gt;
&lt;p&gt;Again, we run into the limits of the compiler. And again, the limit can be
changed, this time with -fconstexpr-depth=X (-fconstexpr-backtrace-limit=X only
controls how much of the backtrace is printed). With g++, the result is the same
(without the skipped part, which makes the error horribly long), and the option
to change the depth is also -fconstexpr-depth=X.&lt;/p&gt;
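&lt;p&gt;As an aside, C++14 relaxes the rules for constexpr functions and allows loops
in them. The following is a minimal sketch under that assumption (the name
&lt;code&gt;ct_sqrt_loop&lt;/code&gt; is made up for this illustration). Since it does not
recurse, it is not subject to the call depth limit, although compilers still cap
the total amount of work done in a constant expression:&lt;/p&gt;

```cpp
// Hypothetical C++14 variant of the naive version: a loop, not recursion.
// Requires a compiler with C++14 relaxed constexpr support.
constexpr unsigned long long ct_sqrt_loop(unsigned long long n) {
    unsigned long long i = 1;
    // Advance until the square of i reaches n. For a perfect square this is
    // the exact root; otherwise it is the first integer above the real root.
    while (n > i * i) {
        ++i;
    }
    return i;
}

static_assert(ct_sqrt_loop(289) == 17, "17 * 17 == 289");
static_assert(ct_sqrt_loop(302500) == 550, "550 * 550 == 302500");
```

&lt;p&gt;This still performs a linear number of steps, so it only pushes the limit
further away.&lt;/p&gt;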
&lt;p&gt;So, if we need to compute higher square roots at compile-time, we need a better
version.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="binary-search-version"&gt;
&lt;h2&gt;Binary Search version&lt;/h2&gt;
&lt;p&gt;To find the right square root, you don't need to iterate through all the numbers
from 1 to N; you can perform a binary search to find the numbers to test. I
found a very nice implementation by John Khvatov (&lt;a class="reference external" href="http://jkhvatov.blogspot.ch/2009/11/c-compile-time-square-root-sqrt-using.html"&gt;source&lt;/a&gt;).&lt;/p&gt;
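&lt;p&gt;To give an idea of how this works before looking at the template version, here
is a minimal sketch of a binary search written as C++11 constexpr functions. The
names &lt;code&gt;sqrt_search&lt;/code&gt; and &lt;code&gt;ct_sqrt_bs&lt;/code&gt; are hypothetical and not
taken from the linked implementation:&lt;/p&gt;

```cpp
// Hypothetical helper: smallest value in [lo, hi] whose square is >= n.
// Each call halves the interval, so the depth grows with log2(n).
// Caveat: the intermediate square can overflow for very large n.
constexpr unsigned long long sqrt_search(unsigned long long n,
                                         unsigned long long lo,
                                         unsigned long long hi) {
    return lo == hi
        ? hi
        : (((lo + hi) / 2) * ((lo + hi) / 2) >= n
            ? sqrt_search(n, lo, (lo + hi) / 2)        // root is in the lower half
            : sqrt_search(n, (lo + hi) / 2 + 1, hi));  // root is in the upper half
}

// Entry point: 0 is handled separately because the interval [1, 0] is empty.
constexpr unsigned long long ct_sqrt_bs(unsigned long long n) {
    return n == 0 ? 0 : sqrt_search(n, 1, n);
}

static_assert(ct_sqrt_bs(289) == 17, "17 * 17 == 289");
static_assert(ct_sqrt_bs(302500) == 550, "550 * 550 == 302500");
```

&lt;p&gt;For 302500, the search needs only around twenty nested calls, far below the
default limit of 512.&lt;/p&gt;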
&lt;p&gt;Here is an adaptation of John Khvatov's code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c++"&gt;&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-1" name="rest_code_d2d58040a5b541b391bfaab28bc10103-1" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-1"&gt;&lt;/a&gt;&lt;span class="cp"&gt;#define MID(a, b) ((a+b)/2)&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-2" name="rest_code_d2d58040a5b541b391bfaab28bc10103-2" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-2"&gt;&lt;/a&gt;&lt;span class="cp"&gt;#define POW(a) (a*a)&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-3" name="rest_code_d2d58040a5b541b391bfaab28bc10103-3" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-3"&gt;&lt;/a&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-4" name="rest_code_d2d58040a5b541b391bfaab28bc10103-4" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-4"&gt;&lt;/a&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-5" name="rest_code_d2d58040a5b541b391bfaab28bc10103-5" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-5"&gt;&lt;/a&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-6" name="rest_code_d2d58040a5b541b391bfaab28bc10103-6" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-6"&gt;&lt;/a&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-7" name="rest_code_d2d58040a5b541b391bfaab28bc10103-7" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-7"&gt;&lt;/a&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-8" name="rest_code_d2d58040a5b541b391bfaab28bc10103-8" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-8"&gt;&lt;/a&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ct_sqrt&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;integral_constant&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-9" name="rest_code_d2d58040a5b541b391bfaab28bc10103-9" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-9"&gt;&lt;/a&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-10" name="rest_code_d2d58040a5b541b391bfaab28bc10103-10" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-10"&gt;&lt;/a&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-11" name="rest_code_d2d58040a5b541b391bfaab28bc10103-11" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-11"&gt;&lt;/a&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ct_sqrt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;integral_constant&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-12" name="rest_code_d2d58040a5b541b391bfaab28bc10103-12" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-12"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;a id="rest_code_d2d58040a5b541b391bfaab28bc10103-13" name="rest_code_d2d58040a5b541b391bfaab28bc10103-13" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_d2d58040a5b541b391bfaab28bc10103-13"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With a binary search, each step halves the range, which greatly reduces the number
of values that need to be tested in order to find the answer. It very easily found the
answer for 302500. It can find the square root of almost all integers, until it fails
due to overflows. I think it is really great :)&lt;/p&gt;
&lt;p&gt;Of course, we can also do the constexpr version:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c++"&gt;&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-1" name="rest_code_ae231352cc68427fa398207e2eeec0d4-1" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;constexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ct_mid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-2" name="rest_code_ae231352cc68427fa398207e2eeec0d4-2" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-3" name="rest_code_ae231352cc68427fa398207e2eeec0d4-3" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-3"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-4" name="rest_code_ae231352cc68427fa398207e2eeec0d4-4" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-4"&gt;&lt;/a&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-5" name="rest_code_ae231352cc68427fa398207e2eeec0d4-5" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-5"&gt;&lt;/a&gt;&lt;span class="k"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;constexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ct_pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-6" name="rest_code_ae231352cc68427fa398207e2eeec0d4-6" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-6"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-7" name="rest_code_ae231352cc68427fa398207e2eeec0d4-7" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-7"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-8" name="rest_code_ae231352cc68427fa398207e2eeec0d4-8" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-8"&gt;&lt;/a&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-9" name="rest_code_ae231352cc68427fa398207e2eeec0d4-9" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-9"&gt;&lt;/a&gt;&lt;span class="k"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;constexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-10" name="rest_code_ae231352cc68427fa398207e2eeec0d4-10" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-10"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-11" name="rest_code_ae231352cc68427fa398207e2eeec0d4-11" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-11"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-12" name="rest_code_ae231352cc68427fa398207e2eeec0d4-12" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-12"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-13" name="rest_code_ae231352cc68427fa398207e2eeec0d4-13" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-13"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;ct_mid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_mid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-14" name="rest_code_ae231352cc68427fa398207e2eeec0d4-14" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-14"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;ct_pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct_mid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_mid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-15" name="rest_code_ae231352cc68427fa398207e2eeec0d4-15" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-15"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-16" name="rest_code_ae231352cc68427fa398207e2eeec0d4-16" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-16"&gt;&lt;/a&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-17" name="rest_code_ae231352cc68427fa398207e2eeec0d4-17" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-17"&gt;&lt;/a&gt;&lt;span class="k"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;constexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-18" name="rest_code_ae231352cc68427fa398207e2eeec0d4-18" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-18"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_ae231352cc68427fa398207e2eeec0d4-19" name="rest_code_ae231352cc68427fa398207e2eeec0d4-19" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ae231352cc68427fa398207e2eeec0d4-19"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is a bit more understandable. It works the same way as the previous one
and is only limited by numeric overflow.&lt;/p&gt;
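&lt;p&gt;The overflow comes from two places in the code above: the sum in
&lt;code&gt;ct_mid&lt;/code&gt; and the product in &lt;code&gt;ct_pow&lt;/code&gt;. As a rough sketch
(not part of the original sources), both can probably be sidestepped by computing
the midpoint from the width of the range and by comparing the midpoint against a
quotient instead of squaring it. The names &lt;code&gt;square_reaches&lt;/code&gt; and
&lt;code&gt;safe_sqrt&lt;/code&gt; are made up for this example, it uses
&lt;code&gt;unsigned long long&lt;/code&gt; rather than &lt;code&gt;std::size_t&lt;/code&gt; so that no
header is needed, and the special case for 0 is an addition of the sketch:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c++"&gt;
```cpp
// Hypothetical helper: true when mid * mid >= res, without ever forming
// the product mid * mid. Assumes mid is not zero.
constexpr bool square_reaches(unsigned long long mid, unsigned long long res) {
    return mid > res / mid    ? true
         : mid == res / mid   ? res % mid == 0
         : false;
}

// Same binary search as above (still a single return statement, so valid
// C++11), with the midpoint computed as l + (r - l) / 2 instead of (l + r) / 2.
constexpr unsigned long long safe_sqrt(unsigned long long res,
                                       unsigned long long l,
                                       unsigned long long r) {
    return l == r ? r
         : square_reaches(l + (r - l) / 2, res)
             ? safe_sqrt(res, l, l + (r - l) / 2)
             : safe_sqrt(res, l + (r - l) / 2 + 1, r);
}

// Entry point. 0 is handled separately because the search starts at 1.
constexpr unsigned long long safe_sqrt(unsigned long long res) {
    return res == 0 ? 0 : safe_sqrt(res, 1, res);
}

static_assert(safe_sqrt(302500) == 550, "perfect square");
static_assert(safe_sqrt(10) == 4, "rounded up for non-perfect squares");
static_assert(safe_sqrt(18446744073709551615ULL) == 4294967296ULL,
              "largest 64-bit value");
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Like the version above, this returns the root rounded up when the input is not
a perfect square.&lt;/p&gt;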
&lt;/section&gt;
&lt;section id="c-14-fun"&gt;
&lt;h2&gt;C++14 Fun&lt;/h2&gt;
&lt;p&gt;In C++14, the constraints on constexpr functions have been greatly relaxed: we
can now use variables, if/then/else statements, loops and so on in constexpr
functions, making them much more readable. Here is the C++14 version of the
previous code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c++"&gt;&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-1" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-1" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-1"&gt;&lt;/a&gt;&lt;span class="k"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;constexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-2" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-2" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-2"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-3" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-3" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-4" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-4" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-4"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-5" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-5" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-5"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-6" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-6" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-6"&gt;&lt;/a&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-7" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-7" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-7"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-8" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-8" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-8"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-9" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-9" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-9"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-10" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-10" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-10"&gt;&lt;/a&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-11" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-11" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-11"&gt;&lt;/a&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-12" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-12" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-12"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-13" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-13" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-13"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-14" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-14" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-14"&gt;&lt;/a&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-15" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-15" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-15"&gt;&lt;/a&gt;&lt;span class="k"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;constexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-16" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-16" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-16"&gt;&lt;/a&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ct_sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;a id="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-17" name="rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-17" href="https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html#rest_code_ce2e17fe1e1f459786a0e0000e79ff9e-17"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I think this version is far superior to the previous version. Don't you
think?&lt;/p&gt;
&lt;p&gt;It performs exactly the same as the previous one. This can only be done in clang for
now, but that will eventually come to gcc too.&lt;/p&gt;
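&lt;p&gt;The relaxed rules also allow loops, so the recursion can be dropped
entirely. Here is a minimal sketch of an iterative variant. The name
&lt;code&gt;ct_sqrt_loop&lt;/code&gt; is chosen for this example and is not in the original
sources, the explicit check for 0 is an addition of the sketch, and
&lt;code&gt;unsigned long long&lt;/code&gt; stands in for &lt;code&gt;std::size_t&lt;/code&gt; so that
no header is needed. It keeps the same &lt;code&gt;mid * mid&lt;/code&gt; test, so it has
the same overflow limit:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c++"&gt;
```cpp
// Iterative binary search in a C++14 constexpr function.
// Returns the smallest value whose square is at least res.
constexpr unsigned long long ct_sqrt_loop(unsigned long long res) {
    if (res == 0) {
        return 0;  // the search below starts at 1
    }

    unsigned long long l = 1;
    unsigned long long r = res;

    while (l != r) {
        const unsigned long long mid = l + (r - l) / 2;

        if (mid * mid >= res) {
            r = mid;      // the answer is mid or lower
        } else {
            l = mid + 1;  // the answer is above mid
        }
    }

    return r;
}

static_assert(ct_sqrt_loop(302500) == 550, "perfect square");
static_assert(ct_sqrt_loop(10) == 4, "rounded up for non-perfect squares");
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since there is no recursion left, the recursion depth limit of the compiler
no longer matters for this variant.&lt;/p&gt;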
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;As you saw, there are several ways to compute a square root at compile-time in
C++. The constexpr versions are much more readable and generally more scalable
than the template metaprogramming version. Moreover, now, with C++14, we can
write constexpr functions almost like standard functions, which is really great.&lt;/p&gt;
&lt;p&gt;I hope that this is helpful to some of you :)&lt;/p&gt;
&lt;p&gt;All the sources are available on Github: &lt;a class="reference external" href="https://github.com/wichtounet/articles/tree/master/src/sqrt"&gt;https://github.com/wichtounet/articles/tree/master/src/sqrt&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;</description><category>C++</category><category>C++11</category><category>C++14</category><category>clang</category><category>Programming</category><guid>https://baptiste-wicht.com/posts/2014/07/compile-integer-square-roots-at-compile-time-in-cpp.html</guid><pubDate>Wed, 02 Jul 2014 19:05:11 GMT</pubDate></item><item><title>Build OpenCV with libc++ on Gentoo and static-libs</title><link>https://baptiste-wicht.com/posts/2014/06/build-opencv-with-libcxx-on-gentoo-and-static-libs.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;When you build C++ projects with CLang, you have the choice between using the
libstdc++ that is provided along with G++ and the new libc++ that is provided by
CLang.&lt;/p&gt;
&lt;p&gt;libc++ is another implementation of the C++ Standard Library. This
implementation is dual-licensed under the MIT license and the UIUC license. It
especially targets C++11 and already has 100% support for C++14. This
last point is the reason that I use libc++ on several of my projects. Moreover,
it is also the default on Mac OS X.&lt;/p&gt;
&lt;p&gt;The problem with linking against another standard library is that you can only work with
libraries that have been compiled with libc++ support. For instance, if you want
to use Boost dynamic libraries, you'll have to compile Boost from sources with
libc++.&lt;/p&gt;
&lt;p&gt;For one of my projects, I'm using OpenCV and libc++. To simplify the installation
of OpenCV, I created a new ebuild with a &lt;em&gt;libcxx&lt;/em&gt; USE flag to selectively build the
library with libc++. This requires LLVM/CLang on the build machine. Moreover, by
default, the Gentoo ebuild does not have support for building the static
libraries. The reason for that is that the OpenCV build is not able to build dynamic
and static libraries at the same time. I added a &lt;em&gt;static-libs&lt;/em&gt; USE flag that builds the static
libraries by building OpenCV a second time after the first. That will likely
double the compile time (unless ccache is used). Anyhow, it is simply easier
than building that by hand on several machines.&lt;/p&gt;
&lt;p&gt;The ebuild is available on &lt;a class="reference external" href="https://github.com/wichtounet/wichtounet-overlay"&gt;my overlay&lt;/a&gt;. You can add the overlay to
your machine by modifying &lt;em&gt;/etc/layman/layman.cfg&lt;/em&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;overlays: http://www.gentoo.org/proj/en/overlays/repositories.xml
          http://github.com/wichtounet/wichtounet-overlay/raw/master/repository.xml&lt;/pre&gt;
&lt;p&gt;Then, you can add it to layman:&lt;/p&gt;
&lt;pre class="literal-block"&gt;layman -S
layman -a wichtounet&lt;/pre&gt;
&lt;p&gt;For now, I have created an ebuild for &lt;em&gt;opencv-2.4.8-r1&lt;/em&gt;. If someone is
interested in other versions, I'd be glad to create new ebuilds.&lt;/p&gt;
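&lt;p&gt;As a rough sketch, enabling both USE flags for this ebuild could look like the following. The package atom &lt;em&gt;media-libs/opencv&lt;/em&gt; is an assumption (it is the category used by the main Gentoo tree), so check the overlay for the exact name; &lt;em&gt;--pretend&lt;/em&gt; only previews what would be built:&lt;/p&gt;
&lt;pre class="literal-block"&gt;# Assumed package atom; verify the category and version in the overlay
ATOM="=media-libs/opencv-2.4.8-r1"

# Only run on a machine where Portage is available
# (command -v prints the path of emerge when it is installed)
if command -v emerge; then
    # Preview the build with both USE flags enabled for this one invocation
    USE="libcxx static-libs" emerge --pretend --verbose "$ATOM"
fi&lt;/pre&gt;
&lt;p&gt;Setting USE on the command line only affects that invocation; to keep the flags across updates, they would normally be recorded in the Portage configuration instead.&lt;/p&gt;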
&lt;p&gt;I hope that this ebuild will be helpful.&lt;/p&gt;</description><category>clang</category><category>Gentoo</category><category>libc++</category><category>opencv</category><guid>https://baptiste-wicht.com/posts/2014/06/build-opencv-with-libcxx-on-gentoo-and-static-libs.html</guid><pubDate>Tue, 10 Jun 2014 12:09:29 GMT</pubDate></item><item><title>Software Reliability Presentation</title><link>https://baptiste-wicht.com/posts/2014/06/software-reliability-presentation.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;section id="software-reliability"&gt;
&lt;h2&gt;Software Reliability&lt;/h2&gt;
&lt;p&gt;On behalf of my school (College of Engineering and Architecture of Fribourg), I
gave a short presentation about Software Reliability. In this presentation,
I outline the main issues surrounding the subject and propose some solutions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Software Validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Defensive Programming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Software Analysis Tools&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the Software Analysis Tools part, I present three tools: cppcheck, Valgrind and
the Clang Static Analyzer. Several examples are presented for each tool, as well
as some recommendations for using them. A short presentation of SonarQube is
also included.&lt;/p&gt;
&lt;p&gt;I thought that it could be of some interest to some of the readers, so here it
is:&lt;/p&gt;
&lt;div style="text-align:center;"&gt;&lt;iframe src="http://www.slideshare.net/slideshow/embed_code/35576524" width="597" height="486" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px 1px 0; margin-bottom:5px; max-width: 100%;" allowfullscreen&gt; &lt;/iframe&gt;&lt;/div&gt;&lt;p&gt;Don't hesitate if you have any comments or questions about the presentation ;)&lt;/p&gt;
&lt;p&gt;The source code for the examples is available &lt;a class="reference external" href="https://github.com/wichtounet/analysis-examples"&gt;on Github&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;</description><category>clang</category><category>Gentoo</category><category>Linux</category><category>Programming</category><category>Reliability</category><category>Tools</category><guid>https://baptiste-wicht.com/posts/2014/06/software-reliability-presentation.html</guid><pubDate>Sat, 07 Jun 2014 14:45:33 GMT</pubDate></item><item><title>Install and Use CLang Static Analyzer on a CMake project</title><link>https://baptiste-wicht.com/posts/2014/04/install-use-clang-static-analyzer-cmake.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;I recently started a bit of work on my compiler (eddic) again. I started by adapting it to build on CLang with libc++. There were some minor adaptations to make it compile, but nothing really fancy. It now compiles and runs fine on LLVM/Clang 3.4 with the latest version of libc++. I'm gonna use some features of C++14 in it and I plan to refactor some parts to make it more &lt;em&gt;STL-correct&lt;/em&gt;. I also plan to use only CLang on eddic for now, since C++14 support in GCC has not been released yet. &lt;/p&gt;
&lt;p&gt;I decided it was a good time to try the CLang static analyzer again. &lt;/p&gt;
&lt;h3&gt;Installation&lt;/h3&gt;
&lt;p&gt;If, like me, you're using Gentoo, the static analyzer is directly installed with the &lt;em&gt;sys-devel/clang&lt;/em&gt; package, unless you disabled the &lt;em&gt;static-analyzer&lt;/em&gt; USE flag. &lt;/p&gt;
&lt;p&gt;If your distribution does not ship the static analyzer directly with CLang, you'll have to install it manually. To install it from sources, I advise you to follow the &lt;a href="http://clang-analyzer.llvm.org/installation.html"&gt;official installation instructions&lt;/a&gt;. &lt;/p&gt;
&lt;h3&gt;Usage&lt;/h3&gt;
&lt;p&gt;The usage of the CLang static analyzer can be a bit confusing at first. Most static analysis tools take the sources directly and do their work. But that is not how the Clang Static Analyzer works. It works as a kind of monitor on top of the build of the program, using &lt;em&gt;scan-build&lt;/em&gt;. When you are analyzing a program, you are also building the program. &lt;/p&gt;
&lt;p&gt;For instance, if you are compiling a source file like this: &lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;clang [clang-options] source_file.cpp
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;you can perform static analysis like this: &lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;scan-build [scan-build-options] clang [clang-options] source_file.cpp
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;scan-build works by replacing calls to the compiler with calls to &lt;em&gt;ccc-analyzer&lt;/em&gt;. This generally works well, but there are some cases where things get a bit more complicated. That is the case with CMake, where the paths to the compiler are hardcoded in the generated makefiles. &lt;/p&gt;
&lt;p&gt;For that, you have to run &lt;em&gt;cmake&lt;/em&gt; and &lt;em&gt;make&lt;/em&gt; with &lt;em&gt;scan-build&lt;/em&gt;: &lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CCC_CC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clang&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CCC_CXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clang&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;DCMAKE_CXX_COMPILER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clang&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;DCMAKE_C_COMPILER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clang&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;make&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This can take a very long time. On eddic, it is about three times slower than a normal compilation. An important point to note about performance is that you can run compilations in parallel (-j option of make) and that this is supported quite well by scan-build. &lt;/p&gt;
&lt;p&gt;Once the analysis is performed, the bugs found are put into an HTML report. By default, the HTML report is created in &lt;em&gt;/tmp/&lt;/em&gt;, but you can specify the folder with the -o option of scan-build. &lt;/p&gt;
&lt;p&gt;You can enable or disable checkers with the -enable-checker and -disable-checker options of scan-build. &lt;/p&gt;
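&lt;p&gt;To make this more concrete, here is a minimal sketch that combines these options. The report directory, the number of jobs and the disabled checker (&lt;em&gt;deadcode.DeadStores&lt;/em&gt;) are only illustrative choices, not the settings used on eddic: &lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Illustrative values: adjust the report directory and the job count
REPORT_DIR=./scan-report
JOBS=4

export CCC_CC=clang
export CCC_CXX=clang++

# Only run when scan-build is installed
# (command -v prints the path of scan-build when it is found)
if command -v scan-build; then
    # Same configure step as above, with the report written to REPORT_DIR
    scan-build -o "$REPORT_DIR" cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang .

    # Parallel build with one checker disabled (example checker only)
    scan-build -o "$REPORT_DIR" -disable-checker deadcode.DeadStores make -j"$JOBS"
fi
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keeping the reports in a folder of the project rather than in &lt;em&gt;/tmp/&lt;/em&gt; makes it easier to compare the results of two runs. &lt;/p&gt;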
&lt;h3&gt;Results on eddic&lt;/h3&gt;
&lt;p&gt;Several versions of Clang ago, I tried the static analyzer on eddic, but it failed on several source files without producing any results. Moreover, I don't think there was any nice HTML report at that time. &lt;/p&gt;
&lt;p&gt;I ran it again on eddic with the latest version. Here is a picture of the generated report: &lt;/p&gt;
&lt;p&gt;&lt;img alt="CLang Static Analyzer eddic results" src="https://baptiste-wicht.com/images/eddic_results.png"&gt;&lt;/p&gt;
&lt;p&gt;As you can see, 14 bugs have been found. Unfortunately, none of them is a real bug in my code, but they are not all false positives either. For instance, here is an unreachable code report: &lt;/p&gt;
&lt;p&gt;&lt;img alt="CLang Static Analyzer eddic bug" src="https://baptiste-wicht.com/images/eddic_results_bug.png"&gt;&lt;/p&gt;
&lt;p&gt;It is indeed an unreachable statement, but it is expected, since it is an assert to ensure that the code is unreachable. But that proves that the analysis works ;) &lt;/p&gt;
&lt;p&gt;Even if it didn't find any real bug, this time it worked much better than the last time I checked and the HTML results are really good. &lt;/p&gt;
&lt;p&gt;I hope you found this article interesting. If you happen to have interesting results on your codebase with the CLang static analyzer, I'd be glad to hear about them ;)&lt;/p&gt;</description><category>C++</category><category>C++11</category><category>C++14</category><category>clang</category><category>eddic</category><category>llvm</category><category>Tools</category><guid>https://baptiste-wicht.com/posts/2014/04/install-use-clang-static-analyzer-cmake.html</guid><pubDate>Wed, 09 Apr 2014 14:39:11 GMT</pubDate></item><item><title>GCC 4.7 vs CLang 3.1 on eddic</title><link>https://baptiste-wicht.com/posts/2012/11/gcc-4-7-clang-3-1-eddic.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;&lt;a href="http://www.baptiste-wicht.com/2012/11/eddic-compiles-with-clang-3-1/" title="eddic compiles with CLang 3.1"&gt;Now that eddic can be compiled with CLang&lt;/a&gt;, I wanted to compare the differences in compilation time and in performance of the generated executable between those two compilers. The tests are done using GCC 4.7.2 and CLang 3.1 on Gentoo.&lt;/p&gt;
&lt;h3&gt;Compilation Time&lt;/h3&gt;

&lt;p&gt;The first thing that I tested was the time each compiler takes to compile eddic with different flags. I tested the compilation in debug mode and with -O2 and -O3.&lt;/p&gt;
&lt;div id="graph_0" style="width: 400px; height: 300px;"&gt;&lt;/div&gt;
&lt;p&gt;&lt;input id="button_graph_0" type="button" value="Logarithmic scale"&gt;
&lt;script type="text/javascript"&gt;function draw_graph_0(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_0'));var data=google.visualization.arrayToDataTable([['Options','GCC','CLang'],['-g',234.59,119.59],['-O2',273.02,178.22],['-O3',276.87,183.78],]);var options={title:"Compilation Time - Less is better",animation:{duration:1200,easing:"in"},width:'400px',height:'300px',hAxis:{title:"Options"},vAxis:{title:"Seconds",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_0');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}&lt;/script&gt;
&lt;/p&gt;
&lt;p&gt;The most interesting fact in these results is that CLang is much faster than GCC. In debug mode, compiling eddic with CLang takes about half the time it takes with GCC (119.59 seconds against 234.59 seconds). The impact of optimizations on compilation time is also larger with CLang than with GCC: going from -g to -O2 adds about 49% for CLang against about 16% for GCC. For both compilers, -O3 does not seem to add a lot of overhead compared to -O2.&lt;/p&gt;
&lt;h3&gt;Runtime performance&lt;/h3&gt;

&lt;p&gt;Then, I tested the performance of the generated executable. I tested it on three things: the whole test suite and two test cases that I know are the slowest for the EDDI Compiler. For each case, I took the slowest value of 5 consecutive executions.&lt;/p&gt;
&lt;div id="graph_1" style="width: 600px; height: 400px;"&gt;&lt;/div&gt;
&lt;p&gt;&lt;input id="button_graph_1" type="button" value="Logarithmic scale"&gt;
&lt;script type="text/javascript"&gt;function draw_graph_1(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_1'));var data=google.visualization.arrayToDataTable([['Compiler','GCC -O2','GCC -O3','CLang -O2','CLang -O3'],['testsuite',6.58,6.59,6.74,6.58],['assembly',1.2,1.2,1.2,1.2],['linked_list',0.51,0.5,0.49,0.49],]);var options={title:"Runtime Performance - Less is better",animation:{duration:1200,easing:"in"},width:'600px',height:'400px',hAxis:{title:"Options"},vAxis:{title:"Seconds",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_1');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}&lt;/script&gt;
&lt;/p&gt;
&lt;p&gt;The differences are very small. With -O2, GCC performs a bit better on the test suite, but with -O3, the performance is equivalent. I was a bit disappointed by the results, because I thought that there would be larger differences. It seems that CLang is not as far from GCC as some people like to say. It also certainly depends on the program being compiled.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;It is clear that CLang is much faster than GCC at compiling eddic. Moreover, the performance of the generated executables is almost the same.&lt;/p&gt;
&lt;p&gt;I will continue to use CLang as my development compiler and switch between the two when I'm doing performance benchmarking. I will try to update the benchmark once new versions of GCC / CLang are available.&lt;/p&gt;
&lt;script type="text/javascript"&gt;function draw_visualization(){draw_graph_0();draw_graph_1();}google.setOnLoadCallback(draw_visualization);&lt;/script&gt;</description><category>Benchmarks</category><category>clang</category><category>Compilers</category><category>EDDI</category><category>gcc</category><category>Performances</category><guid>https://baptiste-wicht.com/posts/2012/11/gcc-4-7-clang-3-1-eddic.html</guid><pubDate>Mon, 12 Nov 2012 08:28:44 GMT</pubDate></item><item><title>eddic compiles with CLang 3.1</title><link>https://baptiste-wicht.com/posts/2012/11/eddic-compiles-with-clang-3-1.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;I finally added support for compiling eddic with LLVM CLang 3.1 !&lt;/p&gt;
&lt;p&gt;The current development version can be completely compiled with CLang. Starting with version 1.1.4, all versions of eddic will support both GCC and CLang. &lt;/p&gt;
&lt;p&gt;The changes have not been as painful as I first thought. &lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;The main problem that I had was about a static const variable of a class that had no user-defined constructor. GCC allows that, but it is not standard compliant and CLang was complaining. &lt;/li&gt;
    &lt;li&gt;Another problem that I encountered was about the use of bit flags and Template Meta Programming. I simplified that by using a simple type trait and it worked. I don't really know why it did not work at first. &lt;/li&gt;
    &lt;li&gt;The remaining effort was to fix the several warnings that CLang had. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CLang also helped me fix a bug in my code with a warning on an assignment that was not supposed to be an assignment, thanks CLang. &lt;/p&gt;
&lt;p&gt;The most interesting fact about CLang is that &lt;strong&gt;it is twice as fast as GCC at building eddic&lt;/strong&gt;. I think I'm gonna use it during development to speed up compilation. Moreover, even if I only worked two days with it, it seems that the error messages are indeed better than GCC's. &lt;/p&gt;
&lt;p&gt;I haven't tried to compare the performance of eddic in both cases, but I will do that in the future, soon after the 1.1.4 version is released. &lt;/p&gt;
&lt;p&gt;I tried the CLang static analyzer on eddic but it didn't find any bugs. Moreover, it crashed on several of my files. I haven't found out why yet, but I will continue to investigate; perhaps I'm not using it correctly. &lt;/p&gt;
&lt;p&gt;I expect to publish the next version of eddic in the next two weeks. This version has many more improvements than I thought at first and I have less time to work on it now that &lt;a href="http://www.baptiste-wicht.com/2012/09/back-in-berkeley-california/" title="Back in Berkeley, California" target="_blank"&gt;I'm working on my Master thesis&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;More information on CLang: &lt;a href="http://clang.llvm.org/" title="CLang official site"&gt;The official site&lt;/a&gt;.&lt;/p&gt;</description><category>clang</category><category>Compilers</category><category>EDDI</category><category>gcc</category><category>Linux</category><guid>https://baptiste-wicht.com/posts/2012/11/eddic-compiles-with-clang-3-1.html</guid><pubDate>Thu, 01 Nov 2012 08:11:05 GMT</pubDate></item></channel></rss>