deque and plf::colony

Baptiste Wicht

Already more than three years ago, I've written a benchmark of some of the STL containers, namely the vector, the list and the deque. Since this article was very popular, I decided to improve the benchmarks and collect again all the results. There are now more benchmarks and some problems have been fixed in the benchmark code. Moreover, I have also added a new container, the plf::colony. Therefore, there are four containers tested:

The std::vector: This is a dynamically-resized array of elements. All the elements are contiguous in memory. If an element is inserted or removed it at a position other than the end, the following elements will be moved to fill the gap or to open a gap. Elements can be accessed at random position in constant time. The array is resized so that it can several more elements, not resized at each insert operation. This means that insertion at the end of the container is done in amortized constant time.
The std::deque: The deque is a container that offer constant time insertion both at the front and at the back of the collection. In current c++ libraries, it is implementation as a collection of dynamically allocated fixed-size array. Not all elements are contiguous, but depending on the size of the data type, this still has good data locality. Access to a random element is also done in constant time, but with more overhead than the vector. For insertions and removal at random positions, the elements are shifted either to the front or to the back meaning that it is generally faster than the vector, by twice in average.
The std::list: This is a doubly-linked list. It supports constant time insertions at any position of the collection. However, it does not support constant time random access. The elements are obviously not contiguous, since they are all allocated in nodes. For small elements, this collection has a very big memory overhead.
The plf::colony: This container is a non-standard container which is unordered, it means that the insertion order will not necessarily be preserved. It provides strong iterators guarantee, pointers to non-erased element are not invalidated by insertion or erasure. It is especially tailored for high-insertion/erasure workloads. Moreover, it is also specially optimized for non-scalar types, namely structs and classes with relatively large data size (greater than 128 bits on the official documentation). Its implementation is more complicated than the other containers. It is also implemented as a list of memory blocks, but they are of increasingly large sizes. When elements are erased, there position is not removed, but marked as erased so that it can be reused for fast insertion later on. This container uses the same conventions as the standard containers and was proposed for inclusion to the standard library, which is the main reason why it's included in this benchmark. If you want more information, you can consult the official website.

In the text and results, the namespaces will be omitted. Note that I have only included sequence containers in my test. These are the most common containers in practices and these also the containers I'm the most familiar with. I could have included multiset in this benchmark, but the interface and purpose being different, I didn't want the benchmark to be confusing.

All the examples are compiled with g++-4.9.4 (-std=c++11 -march=native -O2) and run on a Gentoo Linux machine with an Intel Core i7-4770 at 3.4GHz.

For each graph, the vertical axis represent the amount of time necessary to perform the operations, so the lower values are the better. The horizontal axis is always the number of elements of the collection. For some graph, the logarithmic scale could be clearer, a button is available after each graph to change the vertical scale to a logarithmic scale.

The tests are done with several different data types. The trivial data types are varying in size, they hold an array of longs and the size of the array varies to change the size of the data type. The non-trivial data type is composed of a string (just long enough to avoid SSO (Small String Optimization) (even though I'm using GCC)). The non-trivial data types comes in a second version with noexcept move operations. Not all results are presented for each data types if there are not significant differences between in order to keep this article relatively short (it's already probably too long :P).

Here are direct links to all results:

Fill Benchmark
Emplace Benchmark
Fill Front Benchmark
Linear Search Benchmark
Iterate and modify Benchmark
Random Insert Benchmark
Random Remove Benchmark
Iterate and Erase Benchmark
Number Crunching Benchmark
Sort Benchmark
Destruction Benchmark

Fill

Let's start with the most obvious benchmark, insertion. A series of elements will be inserted at the back of the container. Vector and colony both have a reserve function that is able to preallocate memory for the necessary elements. For both containers, both the standard version without reserve and the version with reserve are benchmarked.

Since colony is unordered, this test is using insert in place of push_back.

For a very small data type, the results are very clear, the vector with reserve is the winner, directly followed by the deque. After this, almost three times slower is the colony and vector without pre-allocation. Interestingly, reserve on the colony container does not help at all. And finally, the list is the slowest container. The fact that the deque is faster than the vector (without reserve) is logical since the deque will not have to move all its elements after a reallocation.

With a type of 32bytes, the results are almost the same with less margin, let's see with something bigger:

This time, the vector without reserve() is clearly the slowest and we can two plateaus which are corresponding to the levels of cache. The fastest is now the deque almost on par with the vector preallocated. The colony and the list are now at the same speed.

With a large data type (4096 bytes), the list becomes the fastest, followed by the deque and a bit slower the colony versions and vector_reserve.

Let's see with a non-trivial type that is costly to copy but fast to move:

The vector and the deque are able to take good advantage of the move and are the fastest here. Even the vector without pre allocations is faring quite well here.

Let's see if there is a difference with noexcept on the move operation:

There is in fact a significant difference for the vector without pre allocation, that is 20% faster than with the previous version. Indeed, since it knows that the operation cannot throw it can use a faster path for reallocation and still guarantees its exception safety.

Overall, for insertions, the vector and deque are the fastest for small types and the list is the fastest for the very large types. colony offers a medium performance on this benchmark but is quite stable for different data types. When you know the size of the collection, you should always use reserve() on vectors. Moreover, if you can use noexcept operations, you should always do it since it can significantly speedup the vector performance.

Emplace

The next operation is very similar to the first except that we use emplace insertions instead of pushes.

As expected, there is no difference between push_back and emplace_back for trivial types. The preallocated vector is the fastest container, followed by the deque and then significantly slower are the colony and vector without preallocation.

However, here is a very large difference between the push version and the emplace version. Indeed, it is much slower. This may seem hard to believe that emplace is slower than normal insert since it should be at least as fast, and generally faster. This in fact due to the version of GCC that is still using Copy-On-Write for string. Therefore, the previous version was much faster because the copies were not done since the string was not modified and this saved a lot of time in that case. However, this is an artificial case since a collection filled of all the same string is not highly likely in practice. Generally, I think it's better to use Small-String-Optimization than Copy-On-Write and now COW is not allowed by the standard anymore in C++11.

Overall, for the tested types, emplace should have exactly the same performance as normal push_back. Except for the special case of COW for GCC that should not happen anymore if you use a recent compiler and C++11.

Fill Front

The next benchmark is again a fill benchmark but elements are inserted at the front of the container. Since colony is unordered, it's removed from the benchmark. This benchmark is mostly used because this is the worst case for vector.

As expected, the vector is terribly slower than the deque and list containers, by almost three orders of magnitude. As before, the deque is much faster than the list.

If you really need a collection that offers performance for front and back insertions, you should definitely prefer the deque over the vector. The list should only be preferred for very large data types.

Linear Search

The next operation is that is tested is the search. The container is filled with all the numbers in [0, N] and shuffled. Then, each number in [0,N] is searched in the container with std::find that performs a simple linear search. In theory, all the data structures should perform the same if we only consider their complexity.

The differences between the different data structures are very important. The list is about 8 times slower than the vector container. The deque is slightly slower than the vector. Interestingly, the colony container is even 50% slower than the list container.

This is simply due to the much better data locality of the vector and deque compared to the list. The list is in fact a terrible data structure for data locality. Every time the next element needs to be accessed, it is necessary to jump from memory positions to memory positions. Since the processor will always load a full cache line, several elements of the vector will be loaded in memory at once and the next few loads will directly reading from the cache rather than from the main memory. In the case of the list, this does not help since elements are not contiguous. It means that all the time will be wasted with cache misses. The vector is even faster than the deque because all elements are contiguous whereas in a deque, only packs of elements are contiguous.

For a bigger data type (128 bytes), the differences are smaller. The list is only three times slower than the vector and deque which are about at the same speed. The colony is now significantly faster than the list, but still much slower than vector and deque.

For a large data type (4KB), the list is still twice slower than the other containers. For such big data, the colony is now the fastest data type, followed by the deque and vector.

Overall, the speed of linear searching through a data structure is mostly limited by the data locality of the container. For this reason, the deque and vector containers are significantly faster than the list. Interestingly, the colony container becomes faster than the other containers for large data types. Overall, a list should never be used if the container is to be searched a lot.

Iterate and modify

The next test iterates over the entire collection and increment each number contained inside it. It uses the begin and end iterators for each container. The time should be mostly dominated by the iteration time.

As expected, the list is the slowest of the container for that sort of operations and the vector is the fastest. The deque is slightly slower than the vector and the colony slightly faster than the list. There is a 6 times difference between the best result, which is pretty significant.

As the data type size augments, the deque starts to gets slightly better than the vector and the colony starts to get much better than the list, but still not on par with the other containers.

Once increased to 128B, the colony really starts to get interesting being the fastest with the deque depending on the number of elements.

Interestingly, with very large data types (4KB), the vector takes the lead again, very close to the colony and the deque a bit behind.

Overall, one thing is clear, the list is really bad for iteration. The colony container starts to shine when the size of the data type is around 128B, but does not provide a very significant speedup. The vector is generally the fastest for this kind of workload.

Random Insert

The following test is performing insertions at random places. In the case of random insert, in theory, the list should be much faster, its insert operation being in O(1) versus O(n) for a vector or a deque. The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are inserted at a random position in the container. The random position is found by linear search. In both cases, the complexity of the search is O(n), the only difference comes from the insert that follow the search. We saw before that the performance of the list were poor for searching, so we'll see if the fast insertion can compensate the slow search. In my opinion, this is something that happens more in practice than simply inserting at a random position. To go to this random position, you need to iteration to this point (the list is not random, then to find an iterator position, you have no other choice). Again, since it's not possible to insert at a random position in a colony, it is removed from the benchmark.

Let's see how the containers are doing on this workload:

With a trivial number of 8 bytes, the vector and deque are clearly faster than the list. Even though the insertion itself is much faster on the list than on the other data structures, the search for the random positions is much slower. However, the list is only three times slower than the other containers.

At 32 bytes, we can see interesting things. The deque is now faster than the vector. This is logical since the deque can decide to move either to the front or the back and therefore should perform in average twice less move operations than the vector. Moreover, the overhead of the list is now getting lower, with only twice longer times.

At 128 bytes element, the vector is not the fastest container anymore. It is now becoming costly to move the elements and the iteration is becoming to be less important. Still, the deque is the fastest on this benchmark.

With 1024 bytes, the list is already significantly faster than the other containers. The deque is now almost twice faster than the vector. This is showing that the cost of iterating is now almost insignificant compared to the cost of insertion. The list is now 4 times faster than the deque and almost 7 times faster than the vector.

If we go to 4096 bytes, the list is 10 times faster than the vector.

If we use a non-trivial data types, it becomes interesting again, even if not that big data type. The list is already 5 times faster than the vector and almost three times faster than the deque. This is logical, since this time it is not enough to perform a memory copy, each object must be moved or copied from position to position. In that case, the list only performs one of such operations, but has the overhead of finding the position again.

If we add noexcept to the move operations, the vector and deque are getting about 50% faster than before. Although they are still slower than the list for big number of elements. This is simply because of the thread safety guarantee of these collections that is very costly on non-noexcept move operations.

Overall, the vector and deque are clearly the winner for small trivial data types. However, for large data types or non-trivial, the list is becoming very interesting because it will perform O(1) insertion and the cost of insertion of the other containers is now bigger than the cost of iterating through the list.

Random Remove

The following benchmark is quite similar than the previous except that instead of inserting an element at a random position, a random element is removed. However, this is done a bit differently. The collection is filled with the numbers of [0,N] and then shuffled. Then, the numbers in [0, 1000] are removed from the collection. This means that a linear search is done to find the position of the element to remove.

For each collection, two versions are performed. The first version is simply doing 1000 times a erase of the result of a find_if. The second version is using the erase remove idiom, but only once. Of course, the remove versions (named X_rem) are expected to be much faster than the naive versions.

(I advice to use the logarithmic scale to compare the results)

If we use a trivial small data types, the vector and the deque versions are the fastest. The list is three times slower than them and the colony is four times slower. As expected, the versions using one erase remove instead of several erase is significantly faster. In fact, it is around three orders of magnitude faster. This should not come as a surprise to anyone. What this will do is move the elements to be removed to the end of the collection and then erase in one go all the elements that have been moved. When using remove, the vector is twice faster than the deque and four times faster than the list and colony.

With a bigger data type, for naive versions, the deque is now getting faster than the vector because of more efficient removal, again because it can choose the direction in which the collection is moved. The colony is getting much faster too, but still 20% slower than the vector. The list is still significantly slower, but only 2.5 times slower than the deque now. For the remove versions, the vector and deque versions are now at the same speed and the colony is only slightly slower than them. The list is still about 2.5 times slower than the vector and deque.

With elements of 128 bytes, we see what was already observed in the Random Insert benchmark, the deque is twice faster than the vector and even the list is significantly faster than the vector. What is more interesting is that the colony is now the fastest collection for the naive version. This shows excellent removal and iteration performance for this kind of data. For the remove versions, only the list is about 20% slower than the other versions which are running at the same speed.

For the 4KB data type, the difference between the naive versions are very significant. The colony is significantly faster than all the other collections. It is 4 times faster than the list, 26 times faster than the deque and 46 times faster than the vector. The remove versions are almost all the same with the list being 20% slower again the other containers. h

For a non-trivial type, we can see the same behaviour as random insertions. The colony and the list are the fastest type, followed by the deque and finally by the vector. The remove versions are much faster, the list being the slowest.

Overall, on this benchmark we can see that again the vector and deque are the fastest containers. When the data type starts to be big, the colony starts to shine. The version performing a single erase with remove_if is much faster than the naive versions. Interestingly, with the remove version, all the collections are performing at almost the same speed, with the list being the slowest. In practice, if you know that you have to erase a series of elements, you should do a erase with remove_if instead of several erase.

Iterate and Erase

The next benchmark is about erasure again, but is quite different than the previous one. This benchmark iterates through the entire collection and at each point, randomly decides if the element is to be removed or not. This may seem almost the same as the naive version of the Random Erase benchmark but this is quite difference since each erasure does not an iteration, only one iteration though the whole container is done. This is probably something a bit more realistic than the other workloads. Moreover, this benchmark is done with several different erasure percentage.

Let's start with 1% probability of erasing an element.

Actually, I was quite surprised by these results. I would not have thought the difference to be this huge already for 1% removal. Indeed, the list and colony are already faster than the vector by one order of magnitude! The deque is slightly faster than the vector.

With 128B elements, the colony is starting to get faster than the list by about 50%. The deque is also much faster here than the vector. The colony is about two orders of magnitude faster than the vector and almost as much faster than the deque.

With 1024B elements, the colony is now 3 times faster than the list! It is now three orders of magnitude faster than the vector and deque.

If we go even higher, with 4096B elements, the colony is now 4 times faster than the list! It is now about four orders of magnitude faster than the vector and deque.

Let's see what happens with a 10% erase rate:

In the previous case, the list and colony were faster than the vector by about a factor 10 and now it's almost a factor 100!.

If we jump directly to the biggest data type, the colony is again about four times faster than the list and it's now 5 orders of magnitude faster than the vector and deque!

Just for the sake of it, let's see what happens if we erase with a probability of 50%:

This time the colony and list are faster by more than 3 orders of magnitude than the vector.

For the last configuration of this test, the colony is about 3 times faster than the list and about 5 orders than the vector. Just to put a number on it, the colony in this case is about 361'143 times faster than the vector!

It should be clear from this result that it's a terribly bad workload for the vector and deque which both have very bad erasure time. The results would be about the same if were were to insert instead of erase. In each test the vecotr and deque are several orders of magnitude slower than the other contenders. The colony is always as fast and the list and several times faster than the list with larger data types.

Number Crunching

The next test is about numbers. Random numbers are inserted into the container so that it is kept sorted. That means that a linear search will be performed to find the insertion point. Since colony is unordered, it is excluded from this benchmark. In practice, vector and deque could use binary search contrary to the list.

Let's see the result with a number of 8 bytes:

The results are quite clear. The list is more than 20 times slower and than the vector and the deque. This is because this benchmark is driven more by iterations than by modifications of the structure and therefore the vector and deque are much faster at this. Vector is still faster than the deque for its slightly better locality.

If we take elements of 32 bytes:

The list is only 8 times slower than the vector and deque. There is no doubt that the difference would be even slower as the size of the elements grows. Nevertheless, since we are talking about number crunching, this is rarely with bigger numbers.

Overall, for such a load, the vector and deque structures are shining because of their excellent iteration performance compared to the poor performance of the list. Moreover, keep into account that in practice this would be done using binary search for the vector and deque, probably meaning faster time for them (even though binary search is not cache-efficient).

Sort

Let's see how the different collections are efficient at being sorted. For list and colony, the sort member function is used while the other are using std::sort directly. Moreover, since the colony offers two different sort methods, both where tested. The first is using std::sort internally and the second is using a timsort algorithm.

Let's start with a very small type:

The vector is the fastest container here, closely followed by the deque. colony is about twice slower, with the timsort being slightly slowest. The list is about 8 times slower than the vector.

For a bigger data type, the differences are smaller between the containers. The vector is still the fastest, but only 2.3 times faster than the list. Interestingly, the timsort is now faster than the standard sort algorithm.

For a really big data type, the list becomes the fastest container. The vector and deque containers now the slowest containers. The colony is significantly faster than the vector on this data type but still twice slower than the list.

Overall, sorting a vector and deque is generally faster than a list unless the data type gets too big (>1KB). Again, the colony container is in a sort of middle ground with very stable performance for both large and small data types but is never the fastest on this benchmark.

Destruction

The last test that is done is used to measure the time necessary to delete a container. The containers are dynamically allocated, filled with n numbers, and then their destruction time (via delete) is computed. This is probably never a bottleneck in practice, but this is still interesting to benchmark in my opinion.

As you can see, the differences between the benchmarks are very significant. The list is 10'000 times slower than the vector, the colony is 2000 slower than it and even the deque is 200 times slower than the vector. The deallocation of a vector, for trivial type, is simply a memory deallocation so its speed purely depends on the speed on deallocating memory which is very fast on modern systems. The other containers need to deallocate all the small pieces they have allocated. Not only does that mean more deallocations but especially means walking through most of the elements.

For a larger data type, the results are changing significantly. The overhead of the deque is going up very quickly. This is actually normal each of the blocks of the deque are actually very few elements and therefore it becomes very close to a list in terms of deallocation and memory walks. What is very interesting here is that colony actually is going on par with the vector and sometimes slower than it. This shows that very deallocations are not necessary slower than several smaller deallocations. Moreover, this also shows that colony is especially good when the data type starts to become important.

For a very large data type, the vector and the colony are the fastest collection, followed by the deque and list, only 1.8 times slower. This shows that at this point, the deque makes as much allocations than the list.

For a non-trivial type, every collection has to go through each element and calls the necessary destructor. Therefore, the time is mostly related to the iteration time. This puts the list on the bottom and the three other containers at almost the same time.

Overall, the destruction of a vector for trivial types is significantly faster than the other collections, unless the data type becomes very big. Colony has a large overhead for small types but becomes interesting for large data types. The list is always a poor contender since it needs to walk through all elements in order to deallocate each node. Interestingly, the deque has more and more overhead as the data type grows since each block will be able to hold less and elements and therefore resembles a list. When types are non-trivial, the time for destruction is generally tied to the time necessary to walk through the entire collection and calls each of the destructor.

Conclusion

With all these results in mind, it's now time to try to get some conclusions about the different containers.

vector:

The best iteration performance
The best back insertion performance (with reserve)
Excellent number crunching performance
Very good for sorting small data types, bad for big types
Very fast destruction
Slow insertion/removal at random positions

deque:

Very good iteration performance
Very good insertion at the back (better with vector without reserve)
The best front insertion performance
Excellent number crunching performance
Very good for sorting small data types, bad for big types
Slow destruction for large data types
Better insertion/removal at random position than vector, but still slow for many modifications

list:

The slowest iteration performance in general
Slow destruction
Very fast insertion/removal from a known position

colony:

Bad iteration performance for small types
Good iteration performance starting from medium data types (and excellent for large ones)
No possibility of random insertions
The fastest insertion/removal from a known position
Good sort performance for large data types, bad performance for small data types

Overall, each container has some advantages and disadvantages. It highly depends on your workload:

If you have purely iteration oriented workload, the vector is your best option, followed by the deque and finally the colony is quite good too if you have large elements to store. The list is terrible at iteration, whatever the element type.
If you need to search a position and insert at this position, you should choose a deque or a vector for small trivial data types and a list or large data types or non-trivial ones.
If you have to search a position and remove at this position, it depends if you can use erase-remove_if instead of several erase. If you can use erase-remove, you should use a vector or a deque or maybe a colony (only for large data types). If you cannot, use a deque or a vector for small data types and a colony (or list if you can't use colony) for larger types.
If you iterate and do several removal and/or insertions during the same iteration, you should NOT use a vector or a deque and should use a colony (or list, if you can't) for all data types.
If you have a sort-driven workload, you should use a vector or a deque for small types, a colony for medium types and a list for the big data types.
If you have another workload, just benchmark it ;)

And maybe a few tips that can be extracted from the results:

If you need to decide on which container to use for your task, the best way to decide is still to benchmark for your specific workload.
Don't base your decisions only on Complexity Analysis, it does not take memory locality into account.
If you can, try to make noexcept move operations, this will significantly speedup your vector and deque operations (when insert or remove occurs in the middle).
If you need to erase several elements of a collection, you should always use one erase remove_if instead of several.
If you need to insert several elements in a vector, always reserve the memory first to save some time on reallocations.

I hope this overly long post was interesting to you and that you learned something. If you have any questions or comments, don't hesitate to post in the comments section below.

The code of the benchmark is available on Github. If you happen to run it, you should probably comment the eraseX benchmarks that takes an awful long time to run.

Another time, I hope to be able to run these benchmarks on several compilers, but I'll have to trim down a bit some of the benchmarks (mainly the Iterate and Erase benchmarks that takes a very long time). I'll also need to find a way to present the result for multiple compilers in a nice way.

C++ benchmark – std::vector VS std::list VS std::deque

C++11 Performance tip: When to use std::pow ?

C++ benchmark - std::vector VS std::list

C++ Benchmark - std::list VS boost::intrusive::list

GCC 4.7 vs CLang 3.1 on eddic

New WordPress Plugin - Google Visualization Charts