<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog blog("Baptiste Wicht"); (Posts about dbn)</title><link>https://baptiste-wicht.com/</link><description></description><atom:link href="https://baptiste-wicht.com/categories/dbn.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><lastBuildDate>Sun, 15 Feb 2026 06:57:39 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Publication: CPU Performance Optimizations for RBM and CRBM</title><link>https://baptiste-wicht.com/posts/2017/02/publication-cpu-performance-optimizations-rbm-crbm.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;Recently, we have published a paper about performance optimizations that may
interest you.&lt;/p&gt;
&lt;p&gt;The paper is &lt;a class="reference external" href="https://www.researchgate.net/publication/307908790_On_CPU_Performance_Optimization_of_Restricted_Boltzmann_Machine_and_Convolutional_RBM"&gt;On CPU Performance Optimizations for Restricted Boltzmann Machine and Convolutional RBM&lt;/a&gt;, published in the Proceedings of the Artificial Neural Networks and Pattern Recognition workshop (ANNPR-2016). I presented this paper in Ulm, Germany.&lt;/p&gt;
&lt;p&gt;Although most of the performance research going on is focused on GPU, there are
still many research laboratories that are only equipped with CPUs, and it remains
important to be as fast as possible on CPU. Moreover, this is something
I really like.&lt;/p&gt;
&lt;p&gt;For this publication, I have tried to make my Restricted Boltzmann Machine (RBM)
and Convolutional RBM (CRBM) implementations in my DLL library as fast as
possible.&lt;/p&gt;
&lt;p&gt;The first part of the article is about Restricted Boltzmann Machines (RBMs), which
are a form of dense Artificial Neural Network (ANN). Their training is very
similar to that of an ANN with Gradient Descent. Four different network
configurations are tested.&lt;/p&gt;
&lt;p&gt;First, mini-batch training is shown to be much faster than online training, even
when online training is performed in parallel. Once mini-batch training is used,
BLAS operations are used in order to get as much performance as possible on the
different operations, mainly the Matrix Matrix Multiplication with the use of
the GEMM operation from the Intel Math Kernel Library (MKL). Moreover, the
parallel version of the MKL is also used to get even more performance. When all
these optimizations are performed, speedups of 11 to 30 times are obtained compared to
online training, depending on the network configuration. This final version
is able to perform one epoch of Contrastive Divergence in 4 to 15 seconds
depending on the network, for 60000 images.&lt;/p&gt;
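&lt;p&gt;To illustrate why mini-batches map so well to BLAS, here is a minimal sketch
in plain Python (not the DLL code, and not using the MKL). It assumes the usual
RBM hidden activation, a sigmoid of the bias plus the weighted visible units. The
names hidden_online and hidden_batch are hypothetical. One sample needs
a matrix-vector product, while a whole mini-batch becomes a single matrix-matrix
product, which is the shape of work that a GEMM routine handles.&lt;/p&gt;
&lt;pre class="code python"&gt;
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_online(v, w, c):
    """Hidden activations for one sample: a matrix-vector product."""
    return [
        sigmoid(c[j] + sum(v[i] * w[i][j] for i in range(len(v))))
        for j in range(len(c))
    ]


def hidden_batch(batch, w, c):
    """Hidden activations for a mini-batch: one matrix-matrix product.

    The nested comprehension is a naive stand-in for a BLAS GEMM call.
    """
    product = [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*w)]
        for row in batch
    ]
    return [[sigmoid(c[j] + row[j]) for j in range(len(c))] for row in product]


# Illustrative values only: 3 visible units, 2 hidden units, 2 samples
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
c = [0.0, 0.1]
batch = [[1, 0, 1], [0, 1, 1]]
activations = hidden_batch(batch, w, c)
```
&lt;/pre&gt;
&lt;p&gt;Both functions give the same activations. The batched form simply exposes
a larger product that an optimized library can block and parallelize.&lt;/p&gt;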
&lt;p&gt;The second part of the article is about Convolutional Restricted Boltzmann
Machine (CRBM). This is almost the equivalent of a Convolutional Neural Network
(CNN). Again, four different networks are evaluated.&lt;/p&gt;
&lt;p&gt;The main problem with CRBM is that there is no standard implementation of the
convolution operation that is really fast. Therefore, it is not possible to
simply use a BLAS library to make the computation as fast as possible. The first
optimization that was tried is to vectorize the convolutions. With this, the
speedups have been between 1.1 and 1.9. I'm not really satisfied
with these results since the speedups per convolution are in fact much better.
Moreover, I have since been able to obtain better speedups but the deadline was
too short to include them in this paper. I'll try to talk about these
improvements in more detail on this blog. What is more interesting is to
parallelize the different convolutions since they are mostly independent. This
can bring a speedup of up to the number of cores available on the machine. Since
convolutions are extremely memory hungry, virtual cores with Hyper Threading
generally do not help. An interesting optimization is to use a Matrix
Multiplication to compute several valid convolutions at once.  This can give an
additional speedup between 1.6 and 2.2 compared to the vectorized version. While
it is possible to use the FFT to speed up the full convolution as well, in our
experiments the images were not big enough for this to be worthwhile. The final
speedups are about 10 times with these optimizations.&lt;/p&gt;
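&lt;p&gt;The idea of computing several valid convolutions with one matrix
multiplication can be sketched as follows. This is only an illustration in plain
Python, assuming the common approach of unrolling image patches into rows (often
called im2col). It is not the ETL implementation, and the function names are
hypothetical. The naive matmul stands in for a BLAS GEMM call.&lt;/p&gt;
&lt;pre class="code python"&gt;
```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row."""
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(rows)
        for j in range(cols)
    ]


def matmul(a, b):
    """Naive matrix product; a real implementation would call GEMM."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def valid_convs_as_gemm(image, kernels):
    """Valid convolution of one image with several k x k kernels at once."""
    k = len(kernels[0])
    patches = im2col(image, k)  # (rows * cols) x (k * k)
    # Flip each kernel so that the product is a true convolution
    flipped = [
        [kern[k - 1 - di][k - 1 - dj] for di in range(k) for dj in range(k)]
        for kern in kernels
    ]
    weights = [list(col) for col in zip(*flipped)]  # (k * k) x n_kernels
    return matmul(patches, weights)  # (rows * cols) x n_kernels


def valid_conv_direct(image, kernel):
    """Reference: direct valid convolution, flattened row by row."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        sum(
            image[i + di][j + dj] * kernel[k - 1 - di][k - 1 - dj]
            for di in range(k)
            for dj in range(k)
        )
        for i in range(rows)
        for j in range(cols)
    ]


# Illustrative values only: one 3 x 3 image and two 2 x 2 kernels
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
result = valid_convs_as_gemm(image, kernels)
```
&lt;/pre&gt;
&lt;p&gt;Each column of the result holds the flattened output of one kernel, so all
kernels applied to the same image share a single product.&lt;/p&gt;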
&lt;p&gt;We have obtained pretty good results and I'm happy that we have been published. However, I'm
not very satisfied with these results since I've been able to get even faster
since then. When compared with other frameworks, DLL is actually quite
competitive. I'll try to publish something new in the future.&lt;/p&gt;
&lt;p&gt;If you want more information, you can have a look at the paper. If you want to
look at the code, you can have a look at my projects:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/wichtounet/etl"&gt;Expression Templates Library (ETL)&lt;/a&gt;: For
the Matrix Multiplication and Convolutions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/wichtounet/dll"&gt;Deep Learning Library (DLL)&lt;/a&gt;: For the RBM
and CRBM implementations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Don't hesitate to ask any questions if you want more information :)&lt;/p&gt;</description><category>C++</category><category>CPU</category><category>crbm</category><category>dbn</category><category>Deep Learning</category><category>dll</category><category>etl</category><category>Intel</category><category>Performances</category><category>publications</category><category>rbm</category><category>thesis</category><guid>https://baptiste-wicht.com/posts/2017/02/publication-cpu-performance-optimizations-rbm-crbm.html</guid><pubDate>Tue, 07 Feb 2017 16:33:33 GMT</pubDate></item><item><title>Short introduction to deep learning</title><link>https://baptiste-wicht.com/posts/2014/09/short-introduction-to-deep-learning.html</link><dc:creator>Baptiste Wicht</dc:creator><description>&lt;p&gt;At my school, I gave a short presentation about Deep Learning and the
implementation I made in C++.&lt;/p&gt;
&lt;p&gt;It is nothing fancy, but it could be interesting to someone.&lt;/p&gt;
&lt;div style="text-align:center;"&gt;&lt;iframe src="//www.slideshare.net/slideshow/embed_code/39024941" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;p&gt;Don't hesitate if you have any comments or questions about the presentation ;)&lt;/p&gt;
&lt;p&gt;The implementation is here: &lt;a class="reference external" href="https://github.com/wichtounet/dll"&gt;https://github.com/wichtounet/dll&lt;/a&gt;&lt;/p&gt;</description><category>dbn</category><category>Deep Learning</category><category>dll</category><category>rbm</category><guid>https://baptiste-wicht.com/posts/2014/09/short-introduction-to-deep-learning.html</guid><pubDate>Fri, 12 Sep 2014 18:41:58 GMT</pubDate></item></channel></rss>