C++ Benchmark - std::list VS boost::intrusive::list

Recently, we saw that std::list performance is not very good when it comes to searching or iterating. In this post, we will look at an alternative to std::list: the boost::intrusive::list from the Boost C++ libraries. It is not a well-known library, but it can be useful in some specific cases. I will focus on how this implementation performs compared to a std::list.

An intrusive list does not store copies of the objects but the objects themselves. Because it stores the objects, information must be added directly to the object's data structure to link the elements together, which is why it is called an intrusive list. This has a big advantage: the list does not have to allocate any memory at all to insert objects. In a std::list, inserting an object creates a node containing a copy of the object; in an intrusive list, only the pointers to the next and previous elements of the inserted object are updated. Another advantage is that if you have a reference to an object, you can directly obtain an iterator to it in O(1), an operation that would be O(n) with a std::list. Iteration is also faster because it needs fewer memory accesses.

On the other hand, an intrusive list also has several drawbacks. First of all, it is intrusive: you have to change the definition of the type that is stored in the list. Then, and this can be very complicated, it is up to the developer to manage the lifetime of the objects. It is up to you to allocate and deallocate memory for each object that you want to put in the collection. For instance, if you store an object in an intrusive list and later delete that object without removing it from the list, you will have broken your list. It is also less safe, because the container can be modified from the outside simply by modifying the pointers inside the objects directly.
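To make the "intrusive" part concrete, here is a minimal sketch of what storing a type in a boost::intrusive::list looks like (the names are mine, not the ones used in the benchmark):

#include <boost/intrusive/list.hpp>

// The stored type embeds the hook (here via a base class) that holds the
// next/previous pointers used to link the object into the list.
struct element : public boost::intrusive::list_base_hook<> {
    long value;
    explicit element(long v) : value(v) {}
};

using element_list = boost::intrusive::list<element>;

int main(){
    element a(1), b(2);   // the objects live outside the container
    element_list list;
    list.push_back(a);    // no allocation: only a's hook is linked
    list.push_back(b);
    list.clear();         // unlink everything before a and b are destroyed
    return 0;
}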

This article is not a tutorial for Boost intrusive containers; I will focus only on the performance aspect. If you want to learn how to use them, you can consult the official documentation.

A boost::intrusive::list can be configured in three modes:

  1. Normal mode: No special features
  2. Safe mode: The hook is initialized to a default safe state and the container checks this state before inserting a value. The state of a removed node is also updated correctly. It can be used to detect programming errors. It implies a small performance overhead. This is the default mode.
  3. Auto unlink mode: When an object gets destructed it is automatically removed from the container. This mode has also the properties of the safe mode.

The mode is chosen at compile time by configuring the hook of the data type (a sketch of the corresponding declarations follows the list below). In this article, all three modes will be tested. Here are the four types that will be tested:

  • list : std::list
  • normal_ilist : boost::intrusive::list in normal mode
  • safe_ilist : boost::intrusive::list in safe mode
  • auto_unlink_ilist : boost::intrusive::list in auto unlink mode
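As a rough sketch, this is how the three intrusive variants above can be declared, assuming a base hook (the type names are mine, not the ones from the benchmark code):

#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

// Normal mode: no checks, the lightest hook
struct normal_element : bi::list_base_hook<bi::link_mode<bi::normal_link>> {
    long value;
};

// Safe mode (the default): the hook state is checked on insertion and removal
struct safe_element : bi::list_base_hook<bi::link_mode<bi::safe_link>> {
    long value;
};

// Auto unlink mode: the destructor unlinks the element from its list
struct auto_unlink_element : bi::list_base_hook<bi::link_mode<bi::auto_unlink>> {
    long value;
};

using normal_ilist = bi::list<normal_element>;
using safe_ilist   = bi::list<safe_element>;
// auto_unlink hooks require a container without constant-time size()
using auto_unlink_ilist = bi::list<auto_unlink_element, bi::constant_time_size<false>>;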

The data types vary in size: they hold an array of longs and the size of the array varies to change the size of the data type. In each graph, the size of the data type is indicated. It is the size of the normal data type; the intrusive data types are always 16 bytes bigger than the normal ones.

In the graphs and in the text, n is used to refer to the number of elements of the collection.

All the tests have been performed on an Intel Core i7 Q 820 @ 1.73GHz. The code has been compiled in 64 bits with GCC 4.7.2 with the following options: -std=c++11 -O2 -fomit-frame-pointer -march=native

For each graph, the vertical axis represents the time necessary to perform the operations, so lower values are better. The horizontal axis is always the number of elements in the collection. For some graphs, a logarithmic scale can be clearer; a button is available after each graph to switch the vertical scale to a logarithmic one.

So let's see these data structures in practice.

Fill list

The first test measures how long it takes to fill each data structure. For the std::list, each value is inserted directly. For the intrusive list variations, the objects are inserted into a vector and then pushed back into the intrusive list.
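Roughly, the two fill procedures look like this (a simplified sketch, not the exact benchmark code; the element type is the one with the base hook shown earlier):

#include <cstddef>
#include <list>
#include <vector>
#include <boost/intrusive/list.hpp>

struct element : boost::intrusive::list_base_hook<> {
    long value;
    explicit element(long v) : value(v) {}
};

void fill(std::size_t n){
    // std::list: each push_back allocates a node holding a copy of the value
    std::list<element> slist;
    for(std::size_t i = 0; i < n; ++i){
        slist.push_back(element(i));
    }

    // intrusive list: the objects are owned by a vector, the list only links them
    std::vector<element> storage;
    storage.reserve(n); // no reallocation, so the linked hooks stay valid
    for(std::size_t i = 0; i < n; ++i){
        storage.emplace_back(i);
    }

    boost::intrusive::list<element> ilist;
    for(auto& e : storage){
        ilist.push_back(e); // no allocation, only pointer updates
    }

    ilist.clear(); // unlink before the vector (and its elements) is destroyed
}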

So, let's test them:

We can see that filling a list is about three times slower than filling an intrusive version. This is quite logical because there are far fewer memory allocations in the case of the intrusive lists. The differences between the intrusive versions are not very big: the normal version is the fastest, then the auto unlink version, and the safe version is the slowest.

If we increase the size of the data type a bit:

The results remain more or less the same, but this time there is less difference between the list and the intrusive lists.

This time, the results are really different. The intrusive versions are twenty times faster than a standard list. This comes from the dynamic allocation of large blocks, which is very expensive. In the case of an intrusive list, there are no memory allocations, just pointer modifications, so it is normal that the difference is higher for big blocks.

We can see that for push_back operations, the intrusive lists are clearly faster. For small data types, they can be up to three times faster, and the difference can be much higher with very big data types. There are no big differences between the different versions of the intrusive lists; the normal mode is about 10% faster than the safe mode.

Destruction of list

The second test measures the time necessary to destruct a collection. For a std::list, the list is simply allocated on the heap and destructed with the delete operator. For an intrusive list, both the vector and the list are allocated on the heap, and the measured time is the time to delete both of them.

Let's take a look at the results:

The impressive result is that the normal mode is almost free. This is expected because the destructor of the objects does not do anything, and the list does nothing about the state of an object after it has been removed. The other two intrusive versions perform the same and are twice as fast as a list.

Let's increase the size a bit:

This time, the std::list version is getting closer to the auto unlink and safe versions. The auto unlink version is a bit slower than the safe version. The normal mode is still free.

Increasing it a bit again:

We can see that the list is slightly faster than the other two versions.

If we push the memory to its limit:

This time, the list is slower again. I'm not sure why this happens, but it is probably because the memory allocator has to allocate too many big blocks, which tends to be more costly than many small ones.

For destruction, the normal mode proves its strength, being totally free to destroy. The safe and auto unlink modes are much more expensive during destruction, but still quite a bit faster than a standard list.

It is also worth keeping in mind that destruction is generally not a common operation and is about 4 times faster than insertion. In practice, neither push_back nor destruction is critical when choosing a data structure.

Linear Search in a list

The next operation that we will benchmark is the linear search. The container is filled with all the numbers in [0, n] and shuffled. Then, each number in [0, n] is searched in the container with std::find, which performs a simple linear search.

How do the different data structures perform:

As expected, the intrusive list versions are faster than the standard list. The margin is about 40%. The intrusive versions have a better locality than the standard list because there is one less indirection.

Increasing the data type size:

The margin has decreased a bit, to about 15%. As the data objects do not fit in the cache, we get a higher cache miss rate.

If we increase it to the maximum:

Again, the margin decreased, to 3%.

For linear searching, the intrusive versions are clearly faster, but not by a large margin, and this advantage tends to shrink with bigger data types. I would really have expected a more interesting result here. We will see with the next results whether it gets better.

Iteration over a list

This time, we will test iteration over the whole collection. The iteration is done with the C++11 range-based for loop (taking a reference) and the data is used to increment a counter (to make sure the loop is not optimized away).
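Something along these lines (a simplified sketch; it assumes the stored type has a value field, as in the earlier sketches):

#include <cstddef>

// Works unchanged for a std::list<element> and a boost::intrusive::list<element>
template<typename Container>
std::size_t iterate(const Container& container){
    std::size_t counter = 0;
    for(auto& e : container){
        counter += e.value; // touch the data so the loop is not optimized away
    }
    return counter;
}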

Let's start with our small data type:

The standard list is indeed slower than the other versions (by about 20%), which is expected given the better data locality of the intrusive versions. However, the results are not very stable (the operation is probably too fast, so many things can interfere). I was expecting better results.

Let's go with a bigger data type:

This time, the results look better, but there is less difference between the standard list and the intrusive versions. The intrusive versions are faster by about 10%.

If we take a bigger data type:

This time, there is no more difference between the different versions.

Just like the results for linear search, the intrusive versions are faster but the difference is not huge. For very small data types, there is a gain of about 15 to 20 percent, but for very big data types, there is no longer any gain in performance. Again, I would have expected better results for the intrusive versions.

Write to elements of the list

This test is almost the same as the previous one, but this time each element of the collection is modified by incrementing one of its fields.

Let's see if the results are different this time.

The results are more stable than before. We can see that the normal mode leads, by a bit less than 30%. Just like for iteration, there is no real difference between the different modes.

Let's increase the data size by a bit:

The results are the same, still 30% better for the intrusive version.

Bigger data type again:

This time, the intrusive versions are no longer faster than the standard list.

We have seen that when writes are made to the data, intrusive lists do better than a standard list. The margin is higher than when doing only iteration.

Reverse the list

Let's test something more useful with a reverse operation. The reverse member function is used to reverse all the containers.

Let's see how they perform:

The intrusive versions are about 25% faster than the standard list. Even if reversal does not need to access the values, the pointers of the intrusive lists have better locality than those of a std::list, which can be scattered through memory.

The advantage improves a bit, to about 30% for an intrusive list.

Let's see if this continues:

It did, the intrusive list is more than 40% faster than the standard list !

What happens with a bigger one:

The lines have swapped places! This time the standard list is about 25% faster than the intrusive versions: the better locality of the intrusive versions is no longer a gain but a loss.

It is logical that the margin decreases with very big objects during reversal. Indeed, the elements themselves are close to one another, but the pairs of pointers are separated by the size of the data type: the bigger the data type, the greater the distance between the pointers, and the worse the spatial locality for the pointers. However, I cannot explain why the difference is this big...

The performance of intrusive lists is clearly interesting for reversing collections of small data types. However, for large data types, the standard list seems to be faster.

Sort the list

Let's continue with an even more interesting operation, sorting the list. All the versions are sorted with the sort member function.

The intrusive versions are really interesting, being twice as fast as a standard list.

Let's see with a bigger data type:

The difference is decreasing to about 20%.

Increasing it again:

It decreased again to 18%.

For sort operations, the intrusive versions are clearly interesting, especially for small data types.

Random insert into the list

The last operation that we will test is random insert. I have to say that this test is not fair: in an intrusive list, we can directly get an iterator from an object and insert at a random position, whereas for a standard list we have to find the iterator by linear search. I still think it is important because it is one of the advantages of an intrusive container.
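This is where boost::intrusive::list::iterator_to comes in. A rough sketch of the difference (the names are mine, not the benchmark's):

#include <algorithm>
#include <list>
#include <boost/intrusive/list.hpp>

struct element : boost::intrusive::list_base_hook<> {
    long value;
    explicit element(long v) : value(v) {}
    bool operator==(const element& rhs) const { return value == rhs.value; }
};

// Intrusive list: an iterator to a known object is obtained in O(1)
void insert_before(boost::intrusive::list<element>& ilist, element& known, element& to_insert){
    auto it = ilist.iterator_to(known); // O(1), no traversal
    ilist.insert(it, to_insert);        // only pointer updates
}

// std::list: the position must first be found by linear search, in O(n)
void insert_before(std::list<element>& slist, const element& known, const element& to_insert){
    auto it = std::find(slist.begin(), slist.end(), known);
    slist.insert(it, to_insert);        // copies to_insert into a new node
}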

It is very clear that the performance is much better, but that is logical because we are comparing an O(1) operation with an O(n) one.

Conclusion

In conclusion, intrusive lists are almost always faster than a standard list. However, on my computer and with GCC, the difference is not always very important. It can bring about 20%-30% on some workloads, but most often around 15%. Even if that is not negligible, it is not a huge improvement; I would have thought that intrusive lists were faster by a higher margin. On some operations like sort, they are clearly more interesting, thanks to better data locality than a standard list.

Filling the collection is also more interesting, because no memory allocation is involved. But of course, you need to take care of memory allocation yourself, for example with a vector like here, or by dynamically allocating the objects one by one. This also has a cost that is not counted in the benchmark.

If you really have to use a linked list and performance is critical, I advise you to use Boost intrusive lists. If performance is not really critical, they are perhaps not worth the increased complexity.

There are situations where only intrusive lists can work. If you want the same object to be present in two different lists, you can use a Boost intrusive list with several member hooks, which is not possible with a standard list because only a copy is stored, not the object itself. The same is true for objects that are non-copyable: only intrusive lists can handle them. And finally, with intrusive lists you can directly get an iterator to an object in O(1) if you have a reference to it, whereas with a standard list you have to iterate through the list to find the object. Sometimes this can be very useful.
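For instance, here is a minimal sketch of the member-hook approach, with one object linked into two lists at once (the task type and hook names are mine, purely illustrative):

#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

struct task {
    long id;
    bi::list_member_hook<> ready_hook;   // hook for the first list
    bi::list_member_hook<> waiting_hook; // hook for the second list
};

using ready_list   = bi::list<task, bi::member_hook<task, bi::list_member_hook<>, &task::ready_hook>>;
using waiting_list = bi::list<task, bi::member_hook<task, bi::list_member_hook<>, &task::waiting_hook>>;

int main(){
    task t;
    t.id = 42;

    ready_list ready;
    waiting_list waiting;
    ready.push_back(t);   // the same object...
    waiting.push_back(t); // ...is linked into two lists at the same time

    ready.clear();
    waiting.clear();
    return 0;
}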

If you are interested, the Boost documentation also provides a performance benchmark for intrusive lists, but it is very old (GCC 4.1.2). It is interesting to see that their results are better for intrusive lists than in my benchmark. I do not know whether that comes from the processor, the memory or the compiler.

I hope you found this benchmark interesting. If you have questions, comments, ideas or whatever to say about this benchmark, don't hesitate to comment. I would be glad to answer you :) The same if you find errors in this article. If you have different results, don't hesitate to comment as well.

The code source of the benchmark is available online: https://github.com/wichtounet/articles/blob/master/src/intrusive_list/bench.cpp

New WordPress Plugin - Google Visualization Charts

As I started writing some big benchmarks, I discovered that there were no really good plugins to generate graphs in WordPress (at least no free ones). Then, I discovered the Google Visualization API, which generates awesome charts. I decided to create a new WordPress plugin to help generate these charts.

The Google Visualization API is very powerful. There are a lot of different charts that are available. The charts can be customized easily. The charts are interactive, the user can get the values of any points in the graphs. The charts are rendered using HTML5/SVG technology to provide cross-browser compatibility (including VML for older IE versions) and cross platform portability to iPhones, iPads and Android. No plugins are needed, only the inclusion of the JavaScript library.

The plugin supports two different graphs: line charts and bar charts. For each graph, the title of the graph and the titles of the axes can be configured. The height and width of the graph can also be configured. I also added an option to change the scale of the graph to a logarithmic scale.

Examples

Here is an example of line chart:

The chart is generated using the following code:

[line_chart title="fill_back - 8 bytes" h_title="Number of elements" v_title="us" scale_button="true" width="600px" height="400px"]
['x', 'list', 'vector', 'deque', 'vector_pre'],
['100000', 2545, 271, 2012, 317],
['200000', 4927, 552, 998, 334],
['300000', 7310, 944, 1707, 595],
['400000', 9463, 936, 2056, 1099],
['500000', 12591, 1140, 2642, 1058],
['600000', 14351, 1894, 3125, 1237],
['700000', 16561, 1995, 3686, 1208],
['800000', 18820, 2648, 4291, 1365],
['900000', 20832, 2777, 4962, 2268],
['1000000', 23430, 3015, 5396, 2585],
[/line_chart]

And here is another example, a bar chart:

Here is the code of the chart:

[bar_chart width="600px" height="400px" title="Runtime Performance - Less is better" h_title="Options" v_title="Seconds" scale_button="true"]
['Compiler', 'testsuite', 'assembly', 'linked_list'],
['GCC -O2',  6.58,   1.2, 0.51],
['GCC -O3',  6.59,   1.2, 0.5],
['CLang -O2',  6.74,   1.2, 0.49],
['CLang -O3',  6.58,   1.2, 0.49],
[/bar_chart]

For more examples, you can consult this article or this other one.

Downloads

The plugin is available on the WordPress Plugin Directory: Google Visualization Charts

I hope that this plugin will be useful. I will try to add support for new charts in the future.

Don't hesitate to comment if you find a bug or if you have an idea of improvement :) As it is my very first WordPress plugin, I'm welcoming all comments.

CMake Testing - Rerun the last failed tests with CTest

Some time ago, we saw how to use CMake to run Boost tests in parallel; now it is time for another tip.

A feature that I think is lacking in CMake/CTest is a way to launch only the last failed tests. As it is not possible to do that directly, I posted the question on StackOverflow and got a great answer from Fraser. I wanted to share his answer.

CTest has a -I option to select a list of tests to run. The idea here is to convert the CTest log into a format readable by CTest itself. What I think is great about his answer is that the solution is a CMake script:

set(FailedFileName FailedTests.log)
if(EXISTS "Testing/Temporary/LastTestsFailed.log")
  # Read the list of failed tests written by the last CTest run
  file(STRINGS "Testing/Temporary/LastTestsFailed.log" FailedTests)
  # Keep only the test numbers ("42:test_name" -> "42")
  string(REGEX REPLACE "([0-9]+):[^;]*" "\\1" FailedTests "${FailedTests}")
  list(SORT FailedTests)
  list(GET FailedTests 0 FirstTest)
  # Build the "Start,End,Stride,test#,test#,..." line expected by ctest -I
  set(FailedTests "${FirstTest};${FirstTest};;${FailedTests};")
  string(REPLACE ";" "," FailedTests "${FailedTests}")
  file(WRITE ${FailedFileName} ${FailedTests})
else()
  # No failed tests recorded: write an empty file
  file(WRITE ${FailedFileName} "")
endif()

This script just transforms one log file into another.

You can then run the last failing tests using:

cmake -P <name_of_the_script>.cmake
ctest -I FailedTests.log

Very easy, isn't it ?

There is a limitation to this solution. It won't work when CTest is running in dashboard mode, but it wouldn't take too long to adapt it for that.

Hope you found that tip useful.

C++ benchmark – std::vector VS std::list VS std::deque

Last week, I wrote a benchmark comparing the performance of std::vector and std::list on different workloads. That article received a lot of comments and several suggestions to improve it. The present article is an improvement over the previous one.

In this article, I will compare the performance of std::vector, std::list and std::deque on several different workloads and with different data types. In this article, a list refers to std::list, a vector to std::vector and a deque to std::deque.

It is generally said that a list should be used when random inserts and removes are performed (they are O(1) for a list versus O(n) for a vector or a deque). If we look only at the complexity, linear search should be equivalent in all these data structures, the complexity being O(n). When random insert/remove operations are performed on a vector or a deque, all the subsequent data needs to be moved, so each of those elements is copied. That is why the size of the data type is an important factor when comparing these data structures: it plays an important role in the cost of copying an element.

However, in practice, there is a huge difference: the usage of the memory caches. All the data in a vector is contiguous, whereas a std::list allocates memory separately for each element. How does that change the results in practice? The deque is a data structure aiming at having the advantages of both without their drawbacks; we will see how it performs in practice. Complexity analysis does not take the memory hierarchy into account, and I believe that in practice memory hierarchy usage is as important as complexity analysis.

Keep in mind that all the tests performed are made on vector, list and deque even if other data structures could be better suited to the given workload.

In the graphs and in the text, n is used to refer to the number of elements of the collection.

All the tests have been performed on an Intel Core i7 Q 820 @ 1.73GHz. The code has been compiled in 64 bits with GCC 4.7.2 with -O2 and -march=native. The code has been compiled with C++11 support (-std=c++11).

For each graph, the vertical axis represents the time necessary to perform the operations, so lower values are better. The horizontal axis is always the number of elements in the collection. For some graphs, a logarithmic scale can be clearer; a button is available after each graph to switch the vertical scale to a logarithmic one.

The data types vary in size: they hold an array of longs and the size of the array varies to change the size of the data type. The non-trivial data type is made of two longs and has a deliberately stupid assignment operator and copy constructor that just do some maths (totally meaningless but costly). One may argue that these are not common copy and assignment operators, and one would be right; however, the important point here is that they are costly, which is enough for this benchmark.
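Something in this spirit (a sketch only, not the exact type from the benchmark):

struct non_trivial {
    long a = 0;
    long b = 0;

    non_trivial() = default;

    // Deliberately expensive copy: meaningless maths, just to make copies costly
    non_trivial(const non_trivial& rhs) : a(rhs.a), b(rhs.b) {
        for(long i = 0; i < 100; ++i){
            a += i * rhs.b;
            b += i * rhs.a;
        }
    }

    non_trivial& operator=(const non_trivial& rhs){
        a = rhs.a;
        b = rhs.b;
        for(long i = 0; i < 100; ++i){
            a += i * rhs.b;
            b += i * rhs.a;
        }
        return *this;
    }
};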

Fill

The first test fills the data structures by adding elements to the back of the container (using push_back). Two variations of vector are used, vector_pre being a std::vector that calls vector::reserve at the beginning, resulting in only one memory allocation.
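In other words, vector_pre is filled roughly like this (a simplified sketch):

#include <cstddef>
#include <vector>

template<typename T>
std::vector<T> fill_pre(std::size_t n){
    std::vector<T> vector_pre;
    vector_pre.reserve(n); // a single allocation up front
    for(std::size_t i = 0; i < n; ++i){
        vector_pre.push_back(T(i)); // no reallocation during the push_backs
    }
    return vector_pre;
}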

Let's see the results with a very small data type:

The pre-allocated vector is the fastest by a small margin, and the list is 3 times slower than a deque and a vector.

If we consider a bigger data type:

This time vector and deque perform at about the same speed. The pre-allocated vector is clearly the winner here. The variations in the results of deque and vector probably come from my system, which doesn't like allocating so much memory back and forth at this speed.

Finally, if we use a non-trivial data type:

All data structures are performing more or less the same, with vector_pre being the fastest.

For push_back operations, a pre-allocated vector is a very good choice if the size is known in advance. The others perform more or less the same.

I would have expected a better result for the pre-allocated vector. If someone finds an explanation for such a small margin, I'm interested.

Linear Search

The first operation that is tested is search. The container is filled with all the numbers in [0, N] and shuffled. Then, each number in [0, N] is searched in the container with std::find, which performs a simple linear search. In theory, all the data structures should perform the same if we consider only their complexity.

It is clear from the graph that the list has very poor performance for searching. The growth is much worse for a list than for a vector or a deque.

The only reason is the usage of cache lines. When data is accessed, it is fetched from main memory into the cache, and not just the accessed data: a whole cache line is fetched. As the elements of a vector are contiguous, when you access an element, the next elements are automatically in the cache. As main memory is orders of magnitude slower than the cache, this makes a huge difference. In the list case, the processor spends its whole time waiting for data to be fetched from memory into the cache, and each fetch brings in a lot of data that is almost always useless.

The deque is a bit slower than the vector, which is logical because there are more cache misses due to its segmented parts.

If we take a bigger data type:

The list is still much slower than the others, but what is interesting is that the gap between the deque and the vector is decreasing. Let's try with a 4KB data type:

The performance of the list is still poor, but the gap is decreasing. The interesting point is that deque is now faster than vector. I'm not really sure of the reason for this result; it is possible that it only holds for this particular size. One thing is sure: the bigger the data size, the more cache misses the processor will get, because elements no longer fit in cache lines.

For search, the list is clearly slow, whereas deque and vector have about the same performance. It seems that deque is faster than a vector for very large data sizes.

Random Insert (+Linear Search)

In the case of random insert, in theory, the list should be much faster, its insert operation being in O(1) versus O(n) for a vector or a deque.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are inserted at a random position in the container. The random position is found by linear search. In all cases, the complexity of the search is O(n); the only difference comes from the insert that follows the search. We saw before that the search performance of the list is poor, so we'll see whether the fast insertion can compensate for the slow search.
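Roughly, the workload looks like this (a simplified sketch, not the exact benchmark loop):

#include <algorithm>
#include <cstddef>
#include <cstdlib>

// Same workload for a std::vector<long>, std::deque<long> or std::list<long>
// already filled with the shuffled values [0, n): find a random value by
// linear search, then insert a new value just before it.
template<typename Container>
void random_insert(Container& container, std::size_t repeat){
    for(std::size_t i = 0; i < repeat; ++i){
        long target = std::rand() % container.size();
        auto it = std::find(container.begin(), container.end(), target);
        container.insert(it, target); // O(1) relink for list, O(n) element moves for vector/deque
    }
}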

The list is clearly slower than the other two data structures, which exhibit the same performance. This comes from the very slow linear search. Even if the two other data structures have to move a lot of data, the copies are cheap for small data types.

Let's increase the size a bit:

The results are interesting. The list is still the slowest, but with a smaller margin. This time deque is faster than the vector by a small margin.

Again, increasing the data size:

This time, the vector is clearly the loser, and deque and list have the same performance. We can say that with a size of 128 bytes, the time to move a lot of elements is more expensive than searching in the list.

A huge data type gives us clearer results:

The list is more than 20 times faster than the vector and an order of magnitude faster than the deque! The deque is also twice as fast as the vector.

The fact that the deque is faster than the vector is quite simple to explain. When an insertion is made in a deque, the elements can be moved either toward the end or toward the beginning, and the closer end is chosen. An insert in the middle is the most costly operation, with O(n/2) complexity. It is always more efficient to insert elements in a deque than in a vector because at most half as many elements are moved.

If we look at the non-trivial data type:

The results are about the same as for the previous graph, but the data type is only 16B. The cost of the copy constructors and assignment operators is very important for vector and deque. The list doesn't care, because no copy nor assignment of the existing elements is made during insertions (only the inserted element is copied).

Random Remove

In theory, random remove is the same case as random insert. Now that we've seen the results for random insert, we can expect the same behavior for random remove.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are removed from a random position in the container.

If we take the same data sizes as the random insert case:

The behavior of random remove is the same as the behavior of random insert, for the same reasons. The results are not very interesting, so, let's get to the next workload.

Push Front

The next operation that we will compare is inserting elements at the front of the collection. This is the worst case for a vector, because after each insertion, all the previously inserted elements have to be moved and copied. For a list or a deque, it makes no difference compared to pushing to the back.

So let's see the results:

The results are crystal-clear and as expected: vector is very bad at inserting elements at the front. The list and deque results are almost invisible in the graph because it is a free operation for these two data structures. This does not need further explanation. There is no need to change the data size; it would only make vector much slower and my processor hotter.

Sort

The next test measures the time necessary to sort the data structures. For the vector and the deque, std::sort is used, and for a list the sort member function is used.
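In other words, something like this (a minimal sketch):

#include <algorithm>
#include <deque>
#include <list>
#include <vector>

void sort_all(std::vector<long>& v, std::deque<long>& d, std::list<long>& l){
    std::sort(v.begin(), v.end()); // random-access iterators: std::sort works
    std::sort(d.begin(), d.end());
    l.sort();                      // std::list only has bidirectional iterators,
                                   // so its dedicated member function is used
}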

For a small data type, the list is several times slower than the other two data structures. This is again due to the very poor spatial locality of the list. The vector is slightly faster than the deque, but the difference is not very significant.

If we increase the size:

The order remains the same, but the difference between the list and the others is decreasing.

With a 1KB data type:

The list is almost five times faster than the vector and the deque, which both perform the same (with a very slight advantage for the vector).

If we use the non-trivial data type:

Again, the cost of the operators of this type has a strong impact on the vector and the deque.

Destruction

The next test measures the time necessary for the destruction of a container. The containers are dynamically allocated, filled with n numbers, and then their destruction time (via delete) is measured.

The results are already interesting. The vector is almost free to destroy, which is logical because it only involves freeing one array and the vector itself. The deque is slower due to the freeing of each segment. But the list is much more costly than the other two, more than an order of magnitude slower. This is expected because the list has to free the dynamic memory of each node and also has to iterate through all the elements, which we saw is slow.

If we increase the data type:

This time we can see that the deque is three times slower than a vector and that the list is still an order of magnitude slower than a vector! However, there is less difference than before.

With our biggest data type, now:

There is no longer any difference between list and deque. The vector is still twice as fast as both.

Even if the vector is always faster than the list and the deque, keep in mind that the graphs for destruction are in microseconds, so these operations are not very costly. It could make a difference in a very time-sensitive application, but unlikely in most applications. Moreover, destruction happens only once per data structure; generally, it is not a very important operation.

Number Crunching

Finally, we can also test a number crunching operation. Here, random elements are inserted into a container that is kept sorted: the position where the element has to be inserted is first found by iterating through the elements, and then the element is inserted there. As we talk about number crunching, only 8-byte elements are tested.
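A rough sketch of this workload (simplified, not the exact benchmark code):

#include <algorithm>
#include <cstddef>
#include <cstdlib>

// Insert random values while keeping the container sorted: the insertion
// point is found by walking the elements, then the value is inserted there.
template<typename Container>
void insert_sorted(Container& container, std::size_t count){
    for(std::size_t i = 0; i < count; ++i){
        long value = std::rand();
        auto it = std::find_if(container.begin(), container.end(),
                               [value](long x){ return x >= value; });
        container.insert(it, value);
    }
}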

Even with only 100'000 elements, the list is already an order of magnitude slower than the other two data structures. If we look at the curves of the results, it is easy to see that this will only get worse with larger collections. The list is absolutely not adapted to number crunching operations due to its poor spatial locality.

Conclusion

To conclude, we can get some facts about each data structure:

  • std::list is very very slow to iterate through the collection due to its very poor spatial locality.
  • std::vector and std::deque always perform faster than std::list with very small data
  • std::list handles large elements very well
  • std::deque performs better than a std::vector for inserting at random positions (especially at the front, which is constant time)
  • std::deque and std::vector do not handle data types with a high copy/assignment cost very well

From this, we can draw simple conclusions on the usage of each data structure:

  • Number crunching: use std::vector or std::deque
  • Linear search: use std::vector or std::deque
  • Random Insert/Remove:
    • Small data size: use std::vector
    • Large element size: use std::list (unless it is intended mainly for searching)
  • Non-trivial data type: use std::list unless you need the container especially for searching. But for multiple modifications of the container, it will be very slow.
  • Push to front: use std::deque or std::list

I have to say that before writing this new version of the benchmark, I did not know std::deque well. It is a very good data structure: it is very good at inserting at both ends and even in the middle, while exhibiting very good spatial locality. Even if it is sometimes slower than a vector, when the operations involve both searching and inserting in the middle, I would say that this structure should be preferred over vectors, especially for data types of medium size.

If you have the time, in practice, the best way to decide is always to benchmark each version, or even to try other data structures. Two operations with the same Big O complexity can perform quite differently in practice.

I hope that you found this article interesting. If you have any comment or have an idea about an other workload that you would like to test, don't hesitate to post a comment ;) If you have a question on results, don't hesitate as well.

The code source of the benchmark is available online: https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp

The older version of the article is still available: C++ benchmark – std::vector VS std::list

C++ benchmark - std::vector VS std::list

An updated version of this article is available: C++ benchmark – std::vector VS std::list VS std::deque

In C++, the two most used data structures are std::vector and std::list. In this article, we will compare their performance in practice on several different workloads. In this article, a list refers to the std::list implementation and a vector to the std::vector implementation.

It is generally said that a list should be used when random inserts and removes are performed (they are O(1) for a list versus O(n) for a vector). If we look only at the complexity, search in both data structures should be roughly equivalent, the complexity being O(n). When random insert/remove operations are performed on a vector, all the subsequent data needs to be moved, so each of those elements is copied. That is why the size of the data type is an important factor when comparing these two data structures.

However, in practice, there is a huge difference: the usage of the memory caches. All the data in a vector is contiguous, whereas a std::list allocates memory separately for each element. How does that change the results in practice?

Keep in mind that all the tests performed are made on vector and list even if other data structures could be better suited to the given workload.

In the graphs and in the text, n is used to refer to the number of elements of the collection.

All the tests have been performed on an Intel Core i7 Q 820 @ 1.73GHz. The code has been compiled in 64 bits with GCC 4.7.2 with -O2 and -march=native. The code has been compiled with C++11 support (-std=c++11).

Fill

The first test fills the data structures by adding elements to the back of the container. Two variations of vector are used, vector_pre being a std::vector with the size passed as a parameter to the constructor, resulting in only one memory allocation.

All data structures are impacted the same way when the data size increases, because there is more memory to allocate. The vector_pre is clearly the winner of this test, being one order of magnitude faster than a list and about twice as fast as a vector without pre-allocation. The results are directly linked to the allocations that have to be performed, allocation being slow. Whatever the data size, push_back to a vector will always be faster than to a list. This is logical because a vector allocates more memory than necessary and so does not need to allocate memory for each element.

But this test is not very interesting: generally, building the data structure is not critical. What is critical are the operations performed on the data structure, which will be tested in the next sections.

Random Find

The first operation that is tested is search. The container is filled with all the numbers in [0, N] and shuffled. Then, each number in [0, N] is searched in the container with std::find, which performs a simple linear search.

Yes, vector is present in the graph: its line coincides with the x axis! Performing a linear search in a vector is several orders of magnitude faster than in a list.

The only reason is the usage of cache lines. When data is accessed, it is fetched from main memory into the cache, and not just the accessed data: a whole cache line is fetched. As the elements of a vector are contiguous, when you access an element, the next elements are automatically in the cache. As main memory is orders of magnitude slower than the cache, this makes a huge difference. In the list case, the processor spends its whole time waiting for data to be fetched from memory into the cache.

If we augment the size of the data type to 1KB, the results remain the same, but slower:

Random Insert

In the case of random insert, in theory, the list should be much faster, its insert operation being in O(1) versus O(n) for a vector.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are inserted at a random position in the container. The random position is found by linear search. In both cases, the complexity of the search is O(n); the only difference comes from the insert that follows the search.

Where the vector should be slower than the list, it is almost an order of magnitude faster. Again, this is because finding the position in a list is much slower than copying a lot of small elements.

If we increase the size:

The two lines are getting closer, but vector is still faster.

Increase it to 1KB:

This time, the list outperforms the vector by an order of magnitude! The performance of random insert in a list is not impacted much by the size of the data type, whereas the vector suffers a lot when big sizes are used. We can also see that the list doesn't seem to care about the size of the collection. That is because the size of the collection only impacts the search, not the insertion, and as few searches are performed, it does not change the results a lot.

If the iterator was already known (no need for linear search), it would be faster to insert into a list than into the vector.

Random Remove

In theory, random remove is the same case as random insert. Now that we've seen the results for random insert, we can expect the same behavior for random remove.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are removed from a random position in the container.

Again, vector is several times faster and seems to scale better. Again, this is because it is very cheap to copy small elements.

Let's increase it directly to 1KB elements.

The two lines have been reversed !

The behavior of random remove is the same as the behavior of random insert, for the same reasons.

Push Front

The next operation that we will compare is inserting elements at the front of the collection. This is the worst case for a vector, because after each insertion, all the previously inserted elements have to be moved and copied. For a list, it makes no difference compared to pushing to the back.

The results are crystal-clear and as expected: vector is very bad at inserting elements at the front. This does not need further explanation. There is no need to change the data size; it would only make vector much slower.

Sort

The next test measures the performance of sorting a vector or a list. For a vector, std::sort is used, and for a list the sort member function is used.

We can see that sorting a list is several times slower. It comes from the poor usage of the cache.

If we increase the size of the element to 1KB:

This time the list is faster than the vector. It is not very clear on the graph, but the values for the list are almost the same as in the previous results. That is because std::list::sort() does not perform any copies: only the pointers to the elements are changed. On the other hand, swapping two elements in a vector involves at least three copies, so the cost of sorting increases as the cost of copying increases.

Number Crunching

Finally, we can also test a number crunching operation. Here, random elements are inserted into a container that is kept sorted: the position where the element has to be inserted is first found by iterating through the elements, and then the element is inserted there. As we talk about number crunching, only 8-byte elements are tested.

We can clearly see that the vector is more than an order of magnitude faster than the list, and this will only get worse as the size of the collection increases. This is because traversing the list is much more expensive than copying the elements of the vector.

Conclusion

To conclude, we can get some facts about each data structure:

  • std::vector is insanely faster than std::list to find an element
  • std::vector always performs faster than std::list with very small data
  • std::vector is always faster to push elements at the back than std::list
  • std::list handles large elements very well, especially for sorting or inserting at the front

From this, we can draw simple conclusions on the usage of each data structure:

  • Number crunching: use std::vector
  • Linear search: use std::vector
  • Random Insert/Remove: use std::list (if the data size is very small (< 64B on my computer), use std::vector)
  • Big data size: use std::list (not if intended for searching)

If you have the time, in practice, the best way to decide is always to benchmark both versions, or even to try other data structures.

I hope that you found this article interesting. If you have any comment or have an idea about an other workload that you would like to test, don't hesitate to post a comment ;) If you have a question on results, don't hesitate as well.

The code source of the benchmark is available online: https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp

GCC 4.7 vs CLang 3.1 on eddic

Now that eddic can be compiled with CLang, I wanted to compare the differences in compilation time and in performance of the generated executable between those two compilers. The tests are done using GCC 4.7.2 and CLang 3.1 on Gentoo.

Compilation Time

The first thing that I tested was the compilation time of the two compilers when building eddic with different flags. I tested the compilation in debug mode and with -O2 and -O3.

The most interesting fact in these results is that CLang is much faster than GCC. It takes half as long to compile eddic with CLang in debug mode as with GCC. The impact of optimizations on CLang's compilation time is also more important than on GCC's. For both compilers, -O3 does not seem to add a lot of overhead.

Runtime performance

Then, I tested the performance of the generated executable. I tested it on three things: the whole test suite and the two test cases that I know are the slowest for the EDDI Compiler. For each case, I took the slowest value of 5 consecutive executions.

The differences are very small. With -O2, GCC performs a bit better, but with -O3, the performance is equivalent. I was a bit disappointed by the results, because I thought that there would be bigger differences. It seems that CLang is not as far behind GCC as some people would like to say. It also certainly depends on the program being compiled.

Conclusion

It is clear that CLang is much faster than GCC at compiling eddic. Moreover, the performance of the generated executables is almost the same.

I will continue to use CLang as my development compiler and switch between the two when I'm doing performance benchmarking. I will try to update the benchmark once new versions of GCC / CLang are available.

EDDI Compiler 1.1.4 – Graph Coloring Register Allocation

I'm proud to announce the release of the version 1.1.4 of the EDDI Compiler (eddic).

This version has taken me much more time than I expected, but I also added many more features than planned.

There are few changes to the language itself; the main changes are in the optimization passes and in the compiler.

For the language, it is now possible to use dynamically allocated arrays. The this pointer is now implicit in member functions.

The standard library has been improved by the addition of a doubly-linked list. This list uses templates so that it is generic. It is possible to add elements to the front and the back of the list. The list is iterable using (bidirectional) iterators.

The template engine has been almost entirely rewritten. The previous version was too limited and there was code to handle templates almost throughout the whole front-end. Now, the templates are handled recursively at each point where they can appear. For now, the template instantiation depth is not limited, but this will be done in the next version of eddic.

The major change of this version is the use of a Graph Coloring Register Allocator! This allocator is based on a Chaitin-style allocator. It greatly improves the quality of the generated assembly. The LTAC compilation is now made in two phases: in the first one, only pseudo registers are used, and this first pass includes a first cleanup pass. Then, the register allocator replaces all the pseudo registers with actual registers. Finally, the LTAC IR is optimized like before. In the future, this will be improved further: the coalescing and renumbering passes are a bit limited for now, and Chaitin-Briggs optimistic coloring will be used later.

The data-flow framework has been improved to support data-flow analysis of LTAC programs. For now, the only such analysis is Live Registers Analysis. This analysis is used by the Register Allocator and by the Dead Code Elimination that is run on LTAC code.

The MTAC optimization engine has been greatly improved by the use of a powerful pass manager that runs the optimizations in the correct order and gives them the necessary information. The Control Flow Graph is now updated by the different passes and never invalidated; the CFG is computed only once before the optimizations.

The MTAC optimization engine also has new loop-related optimization passes: Loop Invariant Code Motion, Loop Strength Reduction and Complete Loop Peeling. The loops are discovered by a dominance analysis implemented using the Lengauer-Tarjan algorithm.

The inliner has also been greatly improved. The inlining decision is now taken at the call-site level, which means that only some calls to a function can be inlined rather than the whole function everywhere. The inliner now supports functions with string parameters. Moreover, the inlining heuristic takes the number of constant parameters at the call site into account when taking its decision.

On the compiler side, there are several other improvements.

Future Work

The next version of the EDDI compiler (eddic) will be version 1.2.0. This version will add support for inheritance, at least in a basic way. It will also add support for returning a structure by value. Structures will be able to contain arrays of a defined size. This version will also focus on removing the limitations that exist on some features (Function Call Left Values, for instance). It will also contain several necessary cleanups to the source files.

Download

You can find the EDDI Compiler sources on the Github repository: https://github.com/wichtounet/eddic

The version is available in the v1.1.4 tag on GitHub or directly in the master branch.

Integer Linear Time Sorting Algorithms

Update: The code is now more C++

Most of the sorting algorithms in common use are comparison sorts: elements of the collection are compared to each other to find out which one comes first. A comparison sort has a lower bound of Ω(n log n) comparisons: a sequence of k comparisons can distinguish at most 2^k different orderings, and since there are n! possible orderings, k must be at least log2(n!), which is Ω(n log n). That is why there is no comparison-based sorting algorithm better than O(n log n).

On the other hand, there are also sorting algorithms that perform better. This is the family of integer sorting algorithms. These algorithms use properties of integers to sort them without comparing them, so they can only be used to sort integers. Nevertheless, a hash function can be used to assign a unique integer to any value and so sort any kind of value. All these algorithms use extra space. There are several such algorithms; in this article, we will see three of them and I will present an implementation of each in C++. At the end of the article, I will compare them to std::sort.

In the article, I will use n as the size of the array to sort and m as the max number that is permitted in the array.

Bin Sort

Bin Sort, or Bucket Sort, is a very simple algorithm that partitions all the input numbers into a number of buckets. Then, all the buckets are output in order into the array, resulting in a sorted array. I decided to implement the simplest case of Bin Sort, where each number goes in its own bucket, so there are m buckets.

The implementation is pretty straightforward:

void binsort(std::vector<std::size_t>&& A){
    // One bucket per possible value (MAX is the maximum value allowed in the array)
    std::vector<std::vector<std::size_t>> B(MAX + 1);

    // Distribute each number into its own bucket
    for(std::size_t i = 0; i < SIZE; ++i){
        B[A[i]].push_back(A[i]);
    }

    // Output the buckets in order back into the array
    std::size_t current = 0;
    for(std::size_t i = 0; i <= MAX; ++i){
        for(auto item : B[i]){
            A[current++] = item;
        }
    }
}

B is the array of buckets. Each bucket is implemented as a std::vector. The algorithm starts by filling each bucket with the numbers from the input array. Then, it outputs them in order into the array.

This algorithm works in O(n + m) and requires O(m) extra memory. These properties make it a very limited algorithm: if you don't know the maximum number and you have to assume the maximum value of the array's type, you would have to allocate, for instance, 2^32 buckets. That won't be possible.

Counting Sort

An interesting fact about binsort is that each bucket contains only copies of the same number, so the size of the bucket would be enough. That is exactly what Counting Sort does: it counts the number of times each element is present instead of storing the elements themselves. I will present two versions. The first one uses a secondary array and then copies it back into the input array; the second one is an in-place sort.

void counting_sort(std::vector<std::size_t>&& A){
    std::vector<std::size_t> B(SIZE);
    std::vector<std::size_t> C(MAX + 1);

    // Count the occurrences of each value
    for (std::size_t i = 0; i < SIZE; ++i){
        ++C[A[i]];
    }

    // Prefix sums: C[i] now holds the position just after the last i in the sorted output
    for (std::size_t i = 1; i <= MAX; ++i){
        C[i] += C[i - 1];
    }

    // Place each element at its final position in the temporary array
    for (long i = SIZE - 1; i >= 0; --i) {
        B[C[A[i]] - 1] = A[i];
        --C[A[i]];
    }

    // Copy the sorted temporary array back into the input array
    for (std::size_t i = 0; i < SIZE; ++i){
        A[i] = B[i];
    }
}

The algorithm is also simple. It starts by counting the number of elements of each value. Then, it aggregates the counts by summing them to obtain the position of each element in the final sorted array. Then, all the elements are copied into the temporary array, and finally the temporary array is copied back into the input array. This algorithm works in O(m + n) and requires O(m + n) extra memory. This version is presented only because it is the one found in the literature. We can do much better by avoiding the temporary array and optimizing it a bit:

void in_place_counting_sort(std::vector<std::size_t>&& A){
    std::vector<std::size_t> C(MAX + 1);

    // Count the occurrences of each value
    for (std::size_t i = 0; i < SIZE; ++i){
        ++C[A[i]];
    }

    // Write each value directly into the array, as many times as it was counted
    std::size_t current = 0;
    for (std::size_t i = 0; i <= MAX; ++i){
        for(std::size_t j = 0; j < C[i]; ++j){
            A[current++] = i;
        }
    }
}

The temporary array is gone and the elements are written directly into the sorted array. The counts are used directly, not as positions, so there is no need to sum them. This version still works in O(m + n) but requires only O(m) extra memory. It is much faster than the previous version.

Radix Sort

The last algorithm that I will discuss here is Radix Sort. This algorithm sorts the numbers digit after digit in a given radix. It is a form of bucket sort where there is one bucket per digit value; like Counting Sort, only the counts are necessary. For example, radix sort in base 10 will first sort all the numbers by their least significant digit, then by the next one, and so on. It can work in any base, and that is its strength: with a well-chosen base, it can be very powerful. Here, we will focus on radices of the form 2^r. These radices have good properties: we can use shifts and masks to perform division and modulo, making the algorithm much faster.

The implementation is a bit more complex than the other implementations:

static const std::size_t digits = 2;             //Number of digit passes
static const std::size_t r = 16;                 //Bits per digit
static const std::size_t radix = 1 << r;         //Number of bins
static const std::size_t mask = radix - 1;       //Mask to extract one digit

void radix_sort(std::vector<std::size_t>&& A){
    std::vector<std::size_t> B(SIZE);
    std::vector<std::size_t> cnt(radix);

    for(std::size_t i = 0, shift = 0; i < digits; i++, shift += r){
        // Reset the counts for this digit
        for(std::size_t j = 0; j < radix; ++j){
            cnt[j] = 0;
        }

        // Count the occurrences of each digit value
        for(std::size_t j = 0; j < SIZE; ++j){
            ++cnt[(A[j] >> shift) & mask];
        }

        // Prefix sums give the final position of each digit value
        for(std::size_t j = 1; j < radix; ++j){
            cnt[j] += cnt[j - 1];
        }

        // Stable placement into the temporary array
        for(long j = SIZE - 1; j >= 0; --j){
            B[--cnt[(A[j] >> shift) & mask]] = A[j];
        }

        // Copy back for the next pass
        for(std::size_t j = 0; j < SIZE; ++j){
           A[j] = B[j];
        }
    }
}

r indicates the power of two used as the radix (2^r). The mask is used to compute the modulo quickly. The algorithm repeats the steps for each digit; here digits equals 2, which means that we support values up to 2^32: a 32-bit value is sorted in two passes. The steps are very similar to Counting Sort. The occurrences of each digit value are counted, then the counts are summed to give the position of each number. Finally, the numbers are put in order in the temporary array and copied back into A.

This algorithm works in O(digits * (n + radix)) and requires O(n + radix) extra memory. A very good thing is that the memory required by the algorithm does not depend on the maximum value, only on the radix.

Results

It's time to compare the different implementations in terms of runtime. For each size, each version is tested 25 times on different random arrays. The arrays are the same for each algorithm. The reported number is the time necessary to sort the 25 arrays. The benchmark has been compiled with GCC 4.7.

The first test is made with very few duplicates (m = 10n).

Radix Sort turns out to be the fastest in this case, twice as fast as std::sort. In-place counting sort has almost the same performance as std::sort. The others perform worse.

The second test is made with few duplicates (m ~= n).

The numbers are impressive. In-place counting sort is 3-4 times faster than std::sort and radix sort is twice as fast as std::sort! Bin Sort does not perform very well, and counting sort, even if generally faster than std::sort, does not scale very well.

Let's test with more duplicates (m = n / 2).

The performance of std::sort and radix sort does not change a lot, but the other sorts are performing better. In-place counting sort is still the leader, with a higher margin.

Finally, with a lot of duplicates (m = n / 10).

Again, the performance of std::sort and radix sort is stable, but in-place counting sort is now ten times faster than std::sort!

Conclusion

To conclude, we have seen that these algorithms can outperform std::sort by a high factor (10 times for in-place counting sort when m << n). If you have to sort integers, you should consider these two cases:

  • m > n or m is unknown: use radix sort, which is about twice as fast as std::sort.
  • m << n: use in-place counting sort, which can be much faster than std::sort.

I hope you found this article interesting. The implementation can be found on Github: https://github.com/wichtounet/articles/tree/master/src/linear_sorting

CMakeLatex 1.0.2 - Support for nomenclature and better filters

I released a new version of CMakeLatex, the version 1.0.2.

First of all, this version restores the support for nomenclature. It also adds filters for makeindex (including makeglossaries and makenomenclature). The filters hide all the information in the output stream except the errors. The filters for pdflatex are also improved.

CMakeLatex is a CMake script to build LaTeX documents using CMake / Make. It supports glossaries, indexes, bibliographies and nomenclature. It can automatically convert your images to the right format using ImageMagick or cairosvg (for SVG to PDF conversion).

You can download it on Github: CMakeLatex Github repository

If you have any idea for improvement, don't hesitate to contact me or to create a Feature Request on Github.

eddic compiles with CLang 3.1

I finally added support for compiling eddic with LLVM CLang 3.1 !

The current development version can be completely compiled with CLang. Starting with version 1.1.4, all versions of eddic will support GCC and CLang.

The changes have not been as painful as I first thought.

  • The main problem that I had was about a static const variable of a class that had no user-provided constructor. GCC allows that, but it is not standard compliant and CLang was complaining (a hypothetical example of this kind of code follows the list).
  • Another problem that I encountered was about the use of bit flags and template metaprogramming. I simplified that with a simple type trait and it worked. I don't really know why it did not work at first.
  • The remaining effort was to fix the various warnings that CLang reported.
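I don't have the exact code at hand anymore, but the first issue was of this kind (a purely hypothetical reconstruction, not the actual eddic code):

struct options {
    int level; // no user-provided constructor
};

struct compiler {
    static const options default_options; // the declaration itself is fine
};

// Default-initializing a const object of a class without a user-provided
// default constructor is not allowed by the standard; GCC accepted this
// definition anyway, while CLang rejects it.
const options compiler::default_options;    // error with CLang
// const options compiler::default_options{}; // OK: explicitly value-initialized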

CLang also helped me fix a bug in my code, with a warning on an assignment that was not supposed to be an assignment. Thanks, CLang.

The most interesting fact about CLang is that it is twice as fast as GCC at building eddic. I think I'm going to use it during development to shorten compile times. Moreover, even if I have only worked with it for two days, it seems that its error messages are indeed better than GCC's.

I haven't tried to compare the performance of eddic in both cases, but I will do that in the future, soon after the 1.1.4 version is released.

I tried the CLang static analyzer on eddic, but it didn't find any bugs. Moreover, it crashed on several of my files. I haven't found out why yet, but I will continue to investigate; perhaps I'm not using it correctly.

I expect to publish the next version of eddic in the next two weeks. This version has many more improvements than I thought at first, and I have less time to work on it now that I'm working on my Master's thesis.

More information on CLang: the official site.