#
Advanced GPU Patterns Optimization in ETL

Posted:

The GPU performance of my Expression Templates Library (ETL) is pretty good when most of the time is spent inside expensive operations such as Matrix-Matrix Multiplication or convolutions. However, when most of the time is spent in linear kernels, performance is not great because this will invoke a lot of CUDA kernels. Indeed, the way it is done is that each sub expressions compute its result in a temporary GPU vector (or matrix) and these temporaries are passed through the expressions. For instance, this expression:

yy = 1.1 * x + 1.2 * y

will be executed on the GPU as something like this:

t1 = 1.1 * x t2 = 1.2 * y yy = t1 + t2

that will results in three GPU kernels being invoked. In the CPU case, the complete expression will be executed as one CPU kernel, that is constructed with Expression Templates. Unfortunately, a CUDA kernel cannot be constructed in the same way since the CUDA compiler does not support general template metaprogramming. That's why I've implemented by using small kernels for each expression.

Fortunately, we can do better with a bit more meta-programming. Indeed, there are some patterns that are repeated a lot and that easily be implemented in CUDA kernels. I've started detecting a few of these patterns and for each of them a single CUDA kernel is executed. For instance, each of the following expressions can be executed with a single kernel:

yy = 1.1 * x + y yy = x + 1.1 * y yy = 1.1 * y + 1.2 * y yy = 1.1 * x * y yy = x / (1.1 * y)

This results in significantly performance improvement for these expressions!

I have tested these new improvements in my Deep Learning Library (DLL) project
(not yet merged) and it resulted in **25% faster momentum computation** and
**17% faster Nesterov Adam (NADAM)**.

I'm going to continue to investigate which kernels need to be made faster for DLL and try to improve the overall performance. Currently, the GPU performance of DLL is very good for large convolutional networks, but could be improved for small fully-connected networks. Indeed, in that case, quite some time is spent outside the matrix-matrix multiplication and inside serial expressions for which GPU could be improved. Once I'm done with my optimizations, I'll probably post again on the blog with the latest results.

All these new optimizations are now in the **master** branch of the ETL
project if you want to check it out. You can access the project
on Github.