The site is now running WordPress 3.4

I just updated the site to run WordPress 3.4.

Normally, you should not see any differences, as most of the new features of this version are on the admin side, but if you see something that doesn't work, don't hesitate to contact me.

EDDI Compiler 1.0 - Structures and Global Optimizations

I have the pleasure of announcing the availability of version 1.0 of the EDDI Compiler (eddic).

This release adds one big enhancement to the language: structures.

Structures are used much as in the C programming language:

struct Complex {
    int imag;
    int real;
}

void main(){
    Complex c;
    c.imag = 222;
    c.real = 666;

    println(c.imag);
    println(c.real);

    c.imag += 111;
    c.real += 222;

    println(c.imag);
    println(c.real);

    test(c);

    if(c.imag > c.real){
        println(c.imag);
    } else {
        println(c.real);
    }
}

void test(Complex a){
    println(a.imag);
    println(a.real);
}

From now on, you can declare structures, use local variables of a struct type and pass them as parameters. But the usage of structures is still limited: there is no way to return a structure from a function and no way to pass a structure by reference. Another limitation is that a member of a struct cannot itself be of a struct type. At least this last limitation will be addressed in the next version of eddic.

Another main change is the use of a new low-level intermediate representation (LTAC). This change is described in more detail in this article.

The other main change is the use of a data-flow framework for global optimization in the optimization engine. An optimization is global if it takes into account all the basic blocks of the function being optimized. For that, it takes a control-flow graph of the function and follows the logical flow of the function to determine what can be optimized. Two old optimizations have been transformed from local to global: Constant Propagation and Copy Propagation. They have also been merged to be more efficient, so both are done in one pass over the flow. I also implemented a new technique: Common Subexpression Elimination. This optimization makes sure that no computation is repeated when its result is still available. The control-flow graph is handled with the Boost Graph Library.
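To make the idea concrete, here is a hedged source-level sketch of what Common Subexpression Elimination does (the real transformation happens on the intermediate representation, not on source code):

// Before: b * c is computed twice.
int f(int b, int c, int d){
    int a = b * c + d;
    int e = b * c - d;
    return a + e;
}

// After: the common subexpression is computed once and reused.
int f_optimized(int b, int c, int d){
    int t = b * c;
    int a = t + d;
    int e = t - d;
    return a + e;
}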

I also fixed a performance issue in the optimization engine. Before, the optimizations were performed on the whole program, and if one optimization was successful, all the optimization techniques were tried again on the whole program. Now, they are run one function at a time and restarted only for that function. It should prove faster on programs with a lot of functions.
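As an illustration, here is a minimal sketch of this scheme with hypothetical types (the real engine is more involved): each function is optimized to a fixpoint on its own, and a successful pass restarts the pipeline only for that function.

#include <memory>
#include <vector>

// Hypothetical stand-ins for the real types.
struct Function { /* basic blocks, statements... */ };
struct Program { std::vector<Function> functions; };

// A pass returns true if it changed the function.
struct Pass {
    virtual ~Pass() = default;
    virtual bool run(Function& function) = 0;
};

void optimize(Program& program, const std::vector<std::unique_ptr<Pass>>& passes){
    for(auto& function : program.functions){
        bool changed;
        do {
            changed = false;
            for(auto& pass : passes){
                changed |= pass->run(function);
            }
        } while(changed); // restart only for this function
    }
}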

On the assembly generation side, I changed the way float constants are handled. Before, a general-purpose register was used to load the constant, which was then moved into the SSE register. To avoid using a GP register, I now use a constant float pool and load the float directly from memory into the SSE register.

On the compiler side, I added several new unit tests and fixed the old ones. There were lots of bugs in the tests themselves that prevented them from working at all. The test suite is now much more robust and has revealed lots of other bugs.

I removed the dependency on Boost Chrono by relying on the new std::chrono library.
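As a minimal sketch of the kind of timing code std::chrono makes possible without any external dependency (the actual uses in eddic differ):

#include <chrono>
#include <iostream>

int main(){
    auto start = std::chrono::steady_clock::now();

    // ... the work to measure ...

    auto end = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Elapsed: " << elapsed.count() << "ms" << std::endl;

    return 0;
}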

Future work

The next version will be version 1.0.1. There will be several changes in this version.

I will improve the support for structures. I will add support for structs inside structs and perhaps passing structs by reference (which would also mean adding support for references for other types as well).

I will also make more improvements to the optimization engine. I will add at least one new data-flow optimization and I will try to make the optimization passes faster.

Finally, as ever, I will certainly do some refactoring on some parts of the compiler, but it is starting to look good.

Download

You can find the compiler sources on the Github repository: https://github.com/wichtounet/eddic

The exact version I refer to is v1.0, available in the GitHub tags or directly as the release branch.

Advanced Compiler Design and Implementation - Book Review

After having read my first book about compilers, I decided to try another one more focused on optimizations. For that, I chose "Advanced Compiler Design and Implementation", by Steven S. Muchnick.

This book covers several subjects about compilers, but more than 60% of the text is about compiler optimizations.

The first chapter introduces the main concepts of compiler design. It also explains why optimization is so important in a compiler.

The algorithms of this book are presented in ICAN (Informal Compiler Algorithm Notation). Chapter two provides both a brief and a full description of this notation. In my case, the brief description was enough to understand the algorithms presented in the following chapters, but reading the full description can be useful for a deep understanding of the notation.

The next chapter covers symbol tables. It also includes a way to generate load and store instructions directly based on the information contained in the symbol table. Then, the fourth chapter presents the intermediate representations used in the book. The book uses three different intermediate languages: a high-level one, a medium-level one and a low-level one. This chapter covers each of them in detail. The importance of the design of an intermediate representation is also discussed here. Two more intermediate forms, static single-assignment (SSA) form and program dependence graphs, are discussed later in the book.

Chapter five gives some information about the runtime support provided by different architectures. It is very useful to know how high-level languages are handled at runtime. The next chapter is about producing code generators automatically from machine descriptions. Three approaches are covered.

With the seventh chapter, the optimization techniques start. This chapter covers control-flow analysis. It introduces several techniques that can be used to perform this kind of analysis, namely depth-first search and dominators, interval analysis and structural analysis. These analyses can be used to identify structures like loops and branches in the intermediate representation. Chapter eight covers data-flow analysis. This chapter introduces a lot of mathematical concepts, like lattices and flow functions. It takes some time to fully understand the concepts of this chapter, but the explanations are very good. Again, three methods of performing this analysis are studied: iterative data-flow analysis, control-tree-based techniques and slotwise analysis. Other techniques are also introduced, but not covered in detail. Chapter nine covers dependence analysis. This analysis is vital for optimizations on arrays and loops and for the instruction scheduling techniques studied later. Finally, chapter ten introduces alias-analysis techniques.

Once the analysis techniques have been covered, the remaining chapters are about the optimizations themselves. Chapter 11 introduces them. It explains which optimizations should be performed at which level and in which order. It also briefly describes the optimizations that are covered in the next chapters. You will see that the following chapters are very rich, each of them containing a lot of optimizations that can be performed.

The first optimizations covered (in chapter 12) are the so-called early optimizations: scalar replacement of aggregates, value numbering, copy propagation and sparse conditional constant propagation. The chapter also covers constant folding and algebraic simplifications. After that, the optimizations that reduce redundancy are covered. Again, several techniques are presented: common subexpression elimination, forward substitution, loop-invariant code motion, partial-redundancy elimination and code hoisting. Then, the loop optimizations are introduced. This chapter first presents a way to identify induction variables in a loop and then covers some optimizations that can be applied to them, for example strength reduction and elimination of unnecessary bounds checking, as illustrated below.
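As a quick illustration of strength reduction (my own hedged example, not one taken from the book), a multiplication by the loop induction variable can be replaced by an addition carried across iterations:

// Before: a multiplication on every iteration.
void fill(int* a, int n){
    for(int i = 0; i < n; ++i){
        a[i] = i * 4;
    }
}

// After strength reduction: the multiplication becomes a running addition.
void fill_reduced(int* a, int n){
    int t = 0;
    for(int i = 0; i < n; ++i){
        a[i] = t;
        t += 4;
    }
}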

The next two chapters are more related to low-level concerns. Chapter 15 covers optimizations that can be applied to reduce the cost of procedures. It discusses tail-call optimization, procedure integration, in-line expansion, leaf-routine optimization and shrink wrapping. Then, chapter 16 covers a very important subject: register allocation. It presents several techniques, like cost-based methods and global graph coloring.

Chapter 17 deals with code scheduling, a technique that reorders instructions to take best advantage of the pipelines built into modern processors. First, local approaches (within a basic block) are discussed, and then scheduling across basic-block boundaries is covered. For both subjects, several techniques are discussed. Chapter 18 covers low-level optimizations like unreachable-code elimination, loop inversion, dead-code elimination, etc. This chapter is very broad and very interesting too.

Chapter 19 covers a more complex topic: interprocedural optimizations. Several techniques for performing interprocedural analysis are covered in detail, as well as several optimizations depending on these analyses, like interprocedural constant propagation. This chapter is not very simple to understand, and even less simple to apply, but it is very interesting. Chapter 20 is the last one about optimizations. It covers techniques to improve the use of the memory hierarchy. The first optimizations concern the instruction cache: instruction prefetching, procedure sorting and procedure splitting, for example. Then, data-cache optimizations are covered, including data prefetching and scalar replacement of array elements in detail, with an outline of some other optimizations.

Finally, chapter 21 studies four different compilers to see which optimizations they apply and in which order. Their intermediate forms are also studied. It is very interesting to see how this is done in real-world compilers.

To conclude, I think that this book is really great. It covers a lot of optimizations that can be implemented in a compiler. All the optimizations are covered in detail, with code samples and examples of applying the optimization to some code. However, it has to be said that this book is not easy to read, and sometimes it is hard to understand exactly what a specific optimization means and how it differs from some closely related technique. If you want to write an aggressively optimizing compiler, or just add some optimizations to an existing one, you should consider taking a look at this book.

If you know of another good book on compilers, I will be glad to hear about it.

Compiler Architecture refinements for eddic

The next version of eddic will see an improved compiler architecture. There are two main changes in this version:

  1. A better separation between the front end and the back end
  2. A new intermediate representation to improve and ease code generation

Front end and Back End

First, the front and back ends have been clearly separated. The general compiler architecture currently looks something like this:

EDDI Compiler General Architecture

The first part didn't change, but the compiler part is now clearly separated into front and back ends:

EDDI Compiler Architecture

The back end has no information about the source language. It only sees the intermediate representation provided by the front end: Medium-Level Three Address Code (MTAC).

There are several advantages to this model. The main one is that it is easy to add support for a new programming language to the compiler: only the front end needs to be changed. The same holds if a new output is necessary, for example emitting ARM assembly instead of Intel assembly.

New intermediate representation

In the previous versions of the compiler, the code generators were fairly complex. Indeed, they had to transform the MTAC intermediate representation directly into assembly. This process involves several things:

  • instruction selection
  • register allocation
  • low-level optimization (replacing a mov rax, 0 with xor rax, rax, for example)
  • basic block management

In this version, I decided to move to a better architecture. This architecture uses a new intermediate representation: Low-Level Three Address Code (LTAC). As its name states, it is a low-level representation, close to assembly. In this representation there are addresses, registers and abstracted instructions. This representation is platform independent (the differences between 32 and 64 bits are moved into the code generators). There are no basic blocks anymore at this level, only functions containing statements.
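To give an idea of what such a representation can look like, here is a hypothetical sketch (these are not eddic's actual types): abstracted instructions whose operands are registers, addresses or immediates, grouped into functions without basic blocks.

#include <string>
#include <vector>

enum class Operation { MOV, XOR, ADD, PUSH, CALL, RET /* ... */ };

struct Register { unsigned id; };  // symbolic; mapped to a physical register later

struct Operand {
    enum class Type { REGISTER, ADDRESS, IMMEDIATE } type;
    Register reg;   // used for REGISTER operands and as the base of an ADDRESS
    int value;      // immediate value or address offset
};

struct Instruction {
    Operation op;
    std::vector<Operand> operands;
};

struct Function {
    std::string name;
    std::vector<Instruction> statements;  // plain statements, no basic blocks
};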

The next figure presents the structure of the backend:

EDDI Compiler Backend architecture

The compiler is responsible for transforming the MTAC representation into the LTAC representation. It does not do any low-level optimization, and the instruction selection is easier as it is platform independent. The peephole optimizer is responsible for the low-level optimizations. In the 1.0 release, only a few things will be done at this level; in the future, I will try to invest some time to complete it so that it generates better assembly code. These optimizations are far simpler than the ones done in the MTAC optimization engine. Indeed, a peephole optimizer generally works only on a small window of instructions, like three or four instructions at a time. Finally, the code generator performs the instruction selection and address resolution. It also has to translate symbolic registers into physical ones.
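Reusing the hypothetical types from the sketch above, here is the kind of rewrite such an optimizer performs, in this case with a window of a single instruction:

// Replace "mov reg, 0" with the equivalent but shorter "xor reg, reg".
void peephole(Function& function){
    for(auto& instruction : function.statements){
        if(instruction.op == Operation::MOV
                && instruction.operands.size() == 2
                && instruction.operands[0].type == Operand::Type::REGISTER
                && instruction.operands[1].type == Operand::Type::IMMEDIATE
                && instruction.operands[1].value == 0){
            instruction.op = Operation::XOR;
            instruction.operands[1] = instruction.operands[0]; // xor reg, reg
        }
    }
}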

Conclusion

I hope that these refinements in the compiler architecture will allow the compiler to produce better code.

The 1.0 version of the compiler will include other new features:

  • Basic support for custom structures
  • Global optimizations
  • Some bug fixes found with the new set of unit tests

As always, feel free to comment on the new architecture, the compiler itself, the project or whatever.

Install Valgrind on Gentoo Linux

Valgrind is a very powerful suite of tools for dynamic analysis of binary programs. Valgrind is available as an ebuild in the Gentoo Portage tree, but if you want to install Valgrind on your Gentoo distribution, there is a problem with the standard library. On Gentoo, the standard C library (glibc) is stripped, and Valgrind needs the debug symbols to work. If you try to launch Valgrind without the debug symbols, you will get the following error:

valgrind:  Fatal error at startup: a function redirection
valgrind:  which is mandatory for this platform-tool combination
valgrind:  cannot be set up.  Details of the redirection are:
valgrind:  
valgrind:  A must-be-redirected function
valgrind:  whose name matches the pattern:      strlen
valgrind:  in an object with soname matching:   ld-linux-x86-64.so.2
valgrind:  was not found whilst processing
valgrind:  symbols from the object with soname: ld-linux-x86-64.so.2
valgrind:  
valgrind:  Possible fixes: (1, short term): install glibc's debuginfo
valgrind:  package on this machine.  (2, longer term): ask the packagers
valgrind:  for your Linux distribution to please in future ship a non-
valgrind:  stripped ld.so (or whatever the dynamic linker .so is called)
valgrind:  that exports the above-named function using the standard
valgrind:  calling conventions for this platform.
valgrind:  
valgrind:  Cannot continue -- exiting now.  Sorry.

So first, you have to activate debug symbols for the libraries in your /etc/make.conf:

FEATURES="splitdebug"

Then, you can re-emerge glibc:

sudo emerge glibc

If you had already emerged Valgrind, there is no need to emerge it again; it should work now.

And finally, you can emerge valgrind:

sudo emerge valgrind

And everything will work fine.

Switching to Gentoo Linux

After having switched to Mint from Ubuntu, I'm on the verge of switching to Gentoo Linux.

Gentoo is a powerful operating system based on Linux. It provides extreme configurability and performance. Gentoo is very lightweight on its own: by default, there is not even a window manager installed. A big advantage of this system is that you can customize it to your exact needs. You can use it as a server, a desktop distribution or whatever you need, and you install only the programs you need. This advantage comes with a drawback: you will need advanced knowledge of Linux to install the system. Indeed, you will have to configure your kernel, choose compilation flags, choose your packages carefully and know your hardware as well.

Gentoo is based on a very powerful software distribution system, Portage. Portage is used to install new packages, get the latest software for Gentoo and upgrade your system. Except for some proprietary software, all the packages are built from source. This allows deep customization of your software, but the installation of some packages can take a long time. Count at least several hours to install a system based on GNOME Shell, for example.

If you plan to do a full installation of Gentoo, reserve some days for it. I spent several days working on my installation before getting something fully working.

Here is my current configuration:

  • Gentoo operating system
  • Linux Kernel 3.3
  • Gnome Shell 3.2.1
  • Google Chrome 18
  • NVidia Drivers 295.33
  • ...

As I've stripped my kernel and my init scripts down to the minimum, my boot time is much faster and my installation takes much less space than my Mint installation.

I said that I'm on the verge of switching because I still have some applications that are not installed on my new Gentoo distribution. For example, I have no multimedia support for now, but I already spend most of my time on my new distribution.

I will try to write some posts on Gentoo in the future.

C++11 Concurrency Tutorial - Part 3: Advanced locking and condition variables

In the previous article, we saw how to use mutexes to fix concurrency problems. In this post, we will continue to work on mutexes with more advanced techniques. We will also study another concurrency technique of the C++11 Concurrency Library: condition variables.

Recursive locking

Let's imagine that you have a simple class like this one:

#include <mutex>

struct Complex {
    std::mutex mutex;
    int i;

    Complex() : i(0) {}

    void mul(int x){
        std::lock_guard<std::mutex> lock(mutex);
        i *= x;
    }

    void div(int x){
        std::lock_guard<std::mutex> lock(mutex);
        i /= x;
    }
};

Now you want an operation that performs both safely, so you add a new function:

void both(int x, int y){
    std::lock_guard<std::mutex> lock(mutex);
    mul(x);
    div(y);
}

Now, it's time to test this function:

int main(){
    Complex complex;
    complex.both(32, 23);

    return 0;
}

If you launch this application, you'll see that the program never terminates. The problem is very simple: in the both() function, the thread acquires the lock and then calls the mul() function. In this function, the thread tries to acquire the lock again, but the lock is already taken. This is a case of deadlock. By default, a thread cannot acquire the same mutex twice (doing so is undefined behavior and typically deadlocks).

There is a simple solution to this problem: std::recursive_mutex. This mutex can be acquired several times by the same thread. Here is the correct version of the Complex struct:

struct Complex {
    std::recursive_mutex mutex;
    int i;

    Complex() : i(0) {}

    void mul(int x){
        std::lock_guard<std::recursive_mutex> lock(mutex);
        i *= x;
    }

    void div(int x){
        std::lock_guard<std::recursive_mutex> lock(mutex);
        i /= x;
    }

    void both(int x, int y){
        std::lock_guard<std::recursive_mutex> lock(mutex);
        mul(x);
        div(y);
    }
};

This time, the application works correctly.

Timed locking

Sometimes, you don't want a thread to wait indefinitely for a mutex, for example if the thread can do something else while waiting. For this purpose, the standard library has a solution: std::timed_mutex and std::recursive_timed_mutex (if you need the recursive properties of the mutex). You have access to the same functions as with std::mutex, lock() and unlock(), plus two new functions: try_lock_for() and try_lock_until().

The first one is also the most useful. It allows you to set a timeout after which the function returns even if the lock has not been acquired. The function returns true if the lock has been acquired, false otherwise. Let's try it with a simple example:

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::timed_mutex mutex;

void work(){
    std::chrono::milliseconds timeout(100);

    while(true){
        if(mutex.try_lock_for(timeout)){
            std::cout << std::this_thread::get_id() << ": do work with the mutex" << std::endl;

            std::chrono::milliseconds sleepDuration(250);
            std::this_thread::sleep_for(sleepDuration);

            mutex.unlock();

            std::this_thread::sleep_for(sleepDuration);
        } else {
            std::cout << std::this_thread::get_id() << ": do work without mutex" << std::endl;

            std::chrono::milliseconds sleepDuration(100);
            std::this_thread::sleep_for(sleepDuration);
        }
    }
}

int main(){
    std::thread t1(work);
    std::thread t2(work);

    t1.join();
    t2.join();

    return 0;
}

(The example is completely useless in practice)

The first interesting thing in this example is the declaration of the duration with std::chrono::milliseconds. This is also a new feature of the C++11 standard. You have access to several time units: nanoseconds, microseconds, milliseconds, seconds, minutes and hours. We use a variable of this kind to set the timeout of the try_lock_for function. We also use one to make a thread sleep with std::this_thread::sleep_for(duration). The rest of the example has nothing exciting in it, just some prints to see the results visually. Note that the program never stops; you have to kill it.
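For example, converting between these units is straightforward (a tiny illustration of the API):

std::chrono::seconds two_seconds(2);
auto in_millis = std::chrono::duration_cast<std::chrono::milliseconds>(two_seconds);
// in_millis.count() == 2000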

Call once

Sometimes you want a function to be called only once, no matter how many threads are running. Imagine a function with two parts: the first part has to be called only once and the second has to be executed every time the function gets called. We can use the std::call_once function to solve this problem very easily. Here is an example using this mechanism:

#include <iostream>
#include <mutex>
#include <thread>

std::once_flag flag;

void do_something(){
    std::call_once(flag, [](){std::cout << "Called once" << std::endl;});

    std::cout << "Called each time" << std::endl;
}

int main(){
    std::thread t1(do_something);
    std::thread t2(do_something);
    std::thread t3(do_something);
    std::thread t4(do_something);

    t1.join();
    t2.join();
    t3.join();
    t4.join();

    return 0;
}

Each std::call_once is matched to an std::once_flag variable. Here I passed a closure to be executed only once, but a function pointer or an std::function would also do the trick.
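For example, the function-pointer variant looks like this (a hedged sketch with a second, hypothetical flag):

std::once_flag flag2;

void init(){
    std::cout << "Called once" << std::endl;
}

void do_something_else(){
    std::call_once(flag2, init); // a plain function pointer works as well
}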

Condition variables

A condition variable manages a list of threads waiting until another thread notifies them. Each thread that wants to wait on the condition variable has to acquire a lock first. The lock is released when the thread starts to wait on the condition and reacquired when the thread is woken up.

A very good example is a concurrent bounded buffer: a cyclic buffer with a certain capacity, a start and an end. Here is our implementation of a bounded buffer using condition variables:

#include <condition_variable>
#include <mutex>

struct BoundedBuffer {
    int* buffer;
    int capacity;

    int front;
    int rear;
    int count;

    std::mutex lock;

    std::condition_variable not_full;
    std::condition_variable not_empty;

    BoundedBuffer(int capacity) : capacity(capacity), front(0), rear(0), count(0) {
        buffer = new int[capacity];
    }

    ~BoundedBuffer(){
        delete[] buffer;
    }

    void deposit(int data){
        std::unique_lock<std::mutex> l(lock);

        not_full.wait(l, [this](){return count != capacity; });

        buffer[rear] = data;
        rear = (rear + 1) % capacity;
        ++count;

        l.unlock();
        not_empty.notify_one();
    }

    int fetch(){
        std::unique_lock<std::mutex> l(lock);

        not_empty.wait(l, [this](){return count != 0; });

        int result = buffer[front];
        front = (front + 1) % capacity;
        --count;

        l.unlock();
        not_full.notify_one();

        return result;
    }
};

The mutex is managed by an std::unique_lock, a wrapper that manages a lock. This wrapper is required for use with condition variables. To wake up a thread that is waiting on a condition variable, the notify_one() function is used. The unlock before the notify_one() is not strictly necessary; if you omit it, it is done automatically by the destructor of the unique_lock. But then it is possible that notify_one() wakes a waiting thread that immediately blocks again, since the lock is still held by the notifying thread. Unlocking first lets the notified thread acquire the lock directly. It is a slight optimization, but it won't make a lot of difference. The wait function is a bit special: it takes the unique lock as its first argument and a predicate as its second. The predicate must return false while the waiting must continue (it is equivalent to while(!pred()){ cv.wait(l); }). The rest of the example has nothing special.

We can use this structure to solve the multiple consumers / multiple producers problem. This problem is very common in concurrent programming: several threads (consumers) wait for data produced by several other threads (producers). Here is an example with several threads using the structure:

void consumer(int id, BoundedBuffer& buffer){
    for(int i = 0; i < 50; ++i){
        int value = buffer.fetch();
        std::cout << "Consumer " << id << " fetched " << value << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }
}

void producer(int id, BoundedBuffer& buffer){
    for(int i = 0; i < 75; ++i){
        buffer.deposit(i);
        std::cout << "Produced " << id << " produced " << i << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

int main(){
    BoundedBuffer buffer(200);

    std::thread c1(consumer, 0, std::ref(buffer));
    std::thread c2(consumer, 1, std::ref(buffer));
    std::thread c3(consumer, 2, std::ref(buffer));
    std::thread p1(producer, 0, std::ref(buffer));
    std::thread p2(producer, 1, std::ref(buffer));

    c1.join();
    c2.join();
    c3.join();
    p1.join();
    p2.join();

    return 0;
}

Three consumer threads and two producer threads are created and query the structure constantly. An interesting thing about this example is the use of std::ref to pass the buffer by reference; it is necessary because std::thread copies its arguments by default, and the buffer must not be copied.

Wrap-Up

In this article we saw several things. First, we saw how to use a recursive_mutex to allow a thread to acquire a mutex more than once. Then, we saw how to acquire a mutex with a timeout. After that, a method to call a function only once was studied. And finally, condition variables were used to solve the multiple consumers / multiple producers problem.

The source code for this article can be found on Github.

Next

In the next post of this series, we will see another technique of the new C++11 Concurrency Library: atomics.

Enhanced Code Snippets with SyntaxHighlighter Evolved

After a long time of service, I decided to replace the old WP-Syntax plugin with a more modern one: SyntaxHighlighter Evolved.

I thought that the style of the code snippets of WP-Syntax was starting to look a bit old. The new plugin has several advantages:

  • The code snippets look better
  • A useful toolbar is available on each snippet allowing you to copy the code, to print it or to view the source only
  • The configuration is more complete

This plugin uses the SyntaxHighlighter JavaScript package by Alex Gorbatchev.

Here is an example using this new plugin:

ExecutorService threadPool = Executors.newFixedThreadPool(4);

CompletionService<String> pool = new ExecutorCompletionService<String>(threadPool);

for(int i = 0; i < 10; i++){
   pool.submit(new StringTask());
}

for(int i = 0; i < 10; i++){
   String result = pool.take().get();

   //Compute the result
}

threadPool.shutdown();

For now, only two posts are using this new plugin; the other ones are still using the old one. I will convert the remaining posts when I find some time.

I hope that this change will make the site better for you.

C++11 Concurrency Tutorial - Part 2: Protect shared data

In the previous article, we saw how to start threads to execute some code in parallel. All the code executed in the threads was independent. In the general case, though, you often share objects between threads, and when you do, you face another problem: synchronization.

We will see what this problem is with some simple code.

Synchronization issues

As an example, we will take a simple Counter structure. This structure has a value and a method to increment it. Here is the structure:

struct Counter {
    int value;

    Counter() : value(0){}

    void increment(){
        ++value;
    }
};

There is nothing new here. Now, let's start some threads and make some increments:

#include <iostream>
#include <thread>
#include <vector>

int main(){
    Counter counter;

    std::vector<std::thread> threads;
    for(int i = 0; i < 5; ++i){
        threads.push_back(std::thread([&counter](){
            for(int i = 0; i < 100; ++i){
                counter.increment();
            }
        }));
    }

    for(auto& thread : threads){
        thread.join();
    }

    std::cout << counter.value << std::endl;

    return 0;
}

Again, nothing new there. We launch 5 threads and each one increments the counter a hundred times. After all the threads have finished their work, we print the value of the counter.

If we launch this program, we might expect it to print 500. But this is not the case. Nobody can say what this program will print. Here are some results I obtained on my computer:

442
500
477
400
422
487

The problem is that the increment is not an atomic operation. As a matter of fact, an increment is made of three operations, as the sketch after this list shows:

  • Read the current value of value
  • Add one to the current value
  • Write that new value to value
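Written out in code, the increment behaves as if it were the following (a hedged equivalence; the compiler really emits machine instructions, not C++):

int tmp = value;  // 1. read the current value of value
tmp = tmp + 1;    // 2. add one to the current value
value = tmp;      // 3. write that new value to value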

When you run this code using a single thread, there are no problems; each part of the operation executes one after another. But when you have several threads, you can start having trouble. Imagine this situation:

  1. Thread 1: read the value, get 0, add 1, so value = 1
  2. Thread 2: read the value, get 0, add 1, so value = 1
  3. Thread 1: write 1 to the field value and return 1
  4. Thread 2: write 1 to the field value and return 1

These situations come from what we call interleaving. Interleaving describes the possible orders in which several threads execute some statements. Even for three operations and two threads, there are a lot of possible interleavings. When you have more threads and more operations, it is almost impossible to enumerate them. The problem can also occur when a thread gets preempted between the instructions of the operation.

There are several solutions to fix this problem:

  • Semaphores
  • Atomic references
  • Monitors
  • Condition variables
  • Compare and swap
  • etc.

In this blog post, we will learn how to use semaphores to fix this problem. As a matter of fact, we will use a special kind of semaphore called a mutex. A mutex is a very simple object: only one thread can hold the lock on a mutex at a time. This simple (and powerful) property allows us to use it to fix synchronization problems.

Use a mutex to make our Counter thread-safe

In the C++11 threading library, the mutexes are in the <mutex> header and the class representing a mutex is std::mutex. There are two important methods on a mutex: lock() and unlock(). As their names indicate, the first one lets a thread obtain the lock and the second releases it. The lock() method is blocking: the thread only returns from lock() once the lock has been obtained.

To make our Counter struct thread-safe, we have to add an std::mutex member to it and then lock()/unlock() the mutex in every function of the object:

#include <mutex>

struct Counter {
    std::mutex mutex;
    int value;

    Counter() : value(0) {}

    void increment(){
        mutex.lock();
        ++value;
        mutex.unlock();
    }
};

If we now test this implementation with the same code as before for starting the threads, the program will always display 500.

Exceptions and locks

Now, let's see what happens in another case. Imagine that the Counter has a decrement operation that throws an exception if the value is 0:

struct Counter {
    int value;

    Counter() : value(0) {}

    void increment(){
        ++value;
    }

    void decrement(){
        if(value == 0){
            throw "Value cannot be less than 0";
        }

        --value;
    }
};

You want to access this structure concurrently without modifying the class, so you create a wrapper with locks around it:

struct ConcurrentCounter {
    std::mutex mutex;
    Counter counter;

    void increment(){
        mutex.lock();
        counter.increment();
        mutex.unlock();
    }

    void decrement(){
        mutex.lock();
        counter.decrement();        
        mutex.unlock();
    }
};

This wrapper works well in most cases, but when an exception occurs in the decrement method, you have a big problem. Indeed, if an exception occurs, the unlock() function is not called, and the mutex is left locked. Then, your program is completely blocked. To fix this problem, you have to use a try/catch block to unlock the mutex before rethrowing the exception:

void decrement(){
    mutex.lock();
    try {
        counter.decrement();
    } catch (...){
        mutex.unlock();
        throw; // rethrow the original exception
    }
    mutex.unlock();
}

The code is not difficult, but it starts looking ugly. Now imagine a function with 10 different exit points: you would have to call unlock() at each of them, and the probability that you will forget one is high. Even higher is the risk that you won't add a call to unlock() when you add a new exit point to the function.

The next section gives a very nice solution to this problem.

Automatic management of locks

When you want to protect a whole block of code (a function in our case, but it can be the inside of a loop or another control structure), there is a good solution to avoid forgetting to release the lock: std::lock_guard.

This class is a simple smart manager for a lock. When the std::lock_guard is created, it automatically calls lock() on the mutex. When the guard is destroyed, it releases the lock. You can use it like this:

struct ConcurrentSafeCounter {
    std::mutex mutex;
    Counter counter;

    void increment(){
        std::lock_guard<std::mutex> guard(mutex);
        counter.increment();
    }

    void decrement(){
        std::lock_guard<std::mutex> guard(mutex);
        counter.decrement();
    }
};

Much nicer, isn't it? :)

With this solution, you do not have to handle all the exit cases of the function; they are all handled by the destructor of the std::lock_guard instance.

Conclusion

We are now done with mutexes. In this article, you learned how to protect shared data using mutexes from the C++11 threading library.

Keep in mind that locks are slow: when you use locks, you make sections of the code sequential. If you want a highly parallel application, there are solutions other than locks that perform much better, but they are out of the scope of this article.
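For instance, C++11 provides std::atomic, which will be covered later in this series. With it, the counter needs no mutex at all (a minimal sketch):

#include <atomic>

struct AtomicCounter {
    std::atomic<int> value;

    AtomicCounter() : value(0) {}

    void increment(){
        ++value; // an atomic read-modify-write, no lock needed
    }
};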

Next

In the next blog post of this series, I will talk about advanced concepts for mutexes and how to use condition variables to fix common concurrent programming problems.

The source code for each sample is available on Github.