Here is the first post of the performance series in C++. In this series, I will write about how to improve the performance of frequently used algorithms in C++ by writing efficient code. I chose C++ because it gives a lot of control over performance, but the same concepts can be applied to other languages too. For starters, I will discuss how to implement the BFS algorithm efficiently, and which data structure is the optimal one to use for it.

Breadth-first search is arguably the most important graph algorithm known to humanity. It is very versatile and has many applications besides plain graph traversal, such as checking graph connectivity and measuring shortest distances. In my capacitated selfish games project, I had to use the BFS algorithm in 5 different ways to traverse the game graph! My program was spending so much time in various BFS loops that it got me thinking: how can I implement BFS in the most efficient way possible?

The BFS algorithm traverses the graph by picking some arbitrary node and exploring its neighbors before moving on to the next level of neighbors, and so on. The primary data structure in the BFS algorithm is a FIFO queue that holds the nodes being processed. Traditionally, the BFS queue is implemented using std::list. The goal of this experiment is to measure which data structure in the C++ STL (list, vector, queue, or deque) is the fastest choice for implementing the BFS algorithm, and why. The answer is, surprisingly, that the vector will out-perform even the obvious choice, deque, if used efficiently.

I benchmarked different implementations of the BFS algorithm to determine the fastest one. To evaluate the different data structures, I generated a random Erdős–Rényi graph with 1000 nodes and p = 0.1, where p is the probability that an edge exists between any two nodes. Then I implemented the BFS algorithm with each of the data structures mentioned earlier. In the STL, most of these data structures share a similar API for front(), push_back(), and pop_front(), so swapping data structures is relatively easy. Here is the sample code for the BFS algorithm using list:
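The original listing is not reproduced here; the following is a minimal sketch of what the list-based BFS might look like. The adjacency-list representation and the function name are my own, not the author's.

```cpp
#include <list>
#include <vector>

// BFS over an adjacency-list graph, using std::list as the FIFO queue.
// Returns the number of nodes reachable from `source`.
int bfs_list(const std::vector<std::vector<int>>& graph, int source) {
    std::vector<bool> visited(graph.size(), false);
    std::list<int> queue;        // the BFS queue under test
    queue.push_back(source);
    visited[source] = true;
    int count = 0;
    while (!queue.empty()) {
        int node = queue.front();
        queue.pop_front();       // O(1) on a linked list
        ++count;
        for (int neighbor : graph[node]) {
            if (!visited[neighbor]) {
                visited[neighbor] = true;
                queue.push_back(neighbor);
            }
        }
    }
    return count;
}
```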

I used Google Benchmark to run the different algorithms and measure their performance. Micro-benchmarking libraries such as Google Benchmark are better than manually inserting timing macros into the code. There are other good alternatives to Google Benchmark, such as Hayai, Benchpress, Nonius, and Celero. I prefer Google Benchmark simply because it is easy to deploy, has great documentation, and offers tons of functionality.

Here are the initial results on an Intel Core i7 (Broadwell) clocked at 3.1GHz, with 256KB of L2 and 4MB of L3 cache:

Understanding why the data structures behave like this requires digging deeper. Fortunately, macOS ships with a handy tool, very similar to Linux perf, that allows the user to monitor the performance counters of the CPU. These performance counters are registers inside the processor that keep count of the number of executed instructions, cache misses, branch-predictor misses, etc.

Google Benchmark runs each benchmark for as long as it takes to find the results statistically significant. As a result, some functions may run many more iterations than others. For example, in our previous experiment, the number of iterations for graph_constructors was 38, whereas for BFS_List it was 1000. This skews the performance counters. So I wrote a second benchmark in which each function executes exactly 1000 times. Here are the results:

Let us start with the BFS implementation that uses list. In the STL, std::list is implemented as a linked list, whereas vector and deque are backed by arrays. The standard states that deque does not guarantee contiguous storage. However, its behavior indicates a memory-allocation pattern similar to vector, at least in LLVM's standard library implementation.

Linked lists are extremely cache unfriendly, so their performance is very poor. The list implementation of the BFS algorithm executed about 8 billion instructions and had 31 million cache misses, whereas the deque implementation had about 7 million cache misses across 1 billion executed instructions. This impact is huge: everything else being equal, the list implementation executes 8X more instructions than deque for the same functionality, and causes 4.4X more cache misses. Cache misses are very expensive in modern processors. In a typical 3GHz processor, a single L2 cache miss can cause a main-memory reference, which takes about 100ns and can make that access roughly 100 times slower than hitting the cache. The simple rule is: never, ever use std::list.

The vector data structure does not support a pop_front() operation in O(1). As a result, I had to use v.erase(v.begin()) to remove the first element. The erase() operation shifts all the remaining elements in linear time, which results in a lot of overhead instructions on the processor. More importantly, erase() is very cache-unfriendly and can cause a lot of cache misses. This sounds very expensive; however, if the graph is sparse enough (i.e., the expected size of the BFS queue is small), it still beats the std::list implementation.
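For comparison, the erase-based vector variant described above can be sketched as follows (again, the graph representation and function name are illustrative, not the author's original code):

```cpp
#include <vector>

// Same BFS, but with std::vector as the queue.  pop_front() is emulated
// with erase(begin()), which shifts every remaining element: O(n) per pop.
int bfs_vector_erase(const std::vector<std::vector<int>>& graph, int source) {
    std::vector<bool> visited(graph.size(), false);
    std::vector<int> queue;
    queue.push_back(source);
    visited[source] = true;
    int count = 0;
    while (!queue.empty()) {
        int node = queue.front();
        queue.erase(queue.begin());  // linear-time removal of the head
        ++count;
        for (int neighbor : graph[node]) {
            if (!visited[neighbor]) {
                visited[neighbor] = true;
                queue.push_back(neighbor);
            }
        }
    }
    return count;
}
```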

The std::deque is the obvious choice for implementing BFS. It is a double-ended queue and therefore supports both pop_front() and push_back() natively. Furthermore, deque allocates its elements in contiguous chunks of memory, which makes it very cache friendly. If there are no constraints on the problem, deque is the optimal choice for the job and the de facto data structure for the BFS queue. However, after benchmarking, I noticed that my naive vector implementation was very close to deque. This made me wonder: can the vector implementation be improved to out-perform the deque?
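The deque version differs from the list version only in the queue type, which is exactly why swapping data structures is cheap here. A sketch, under the same illustrative graph representation as before:

```cpp
#include <deque>
#include <vector>

// BFS with std::deque: push_back() and pop_front() are both O(1),
// and the chunked-array storage keeps cache misses low.
int bfs_deque(const std::vector<std::vector<int>>& graph, int source) {
    std::vector<bool> visited(graph.size(), false);
    std::deque<int> queue;
    queue.push_back(source);
    visited[source] = true;
    int count = 0;
    while (!queue.empty()) {
        int node = queue.front();
        queue.pop_front();  // constant time, unlike vector's erase(begin())
        ++count;
        for (int neighbor : graph[node]) {
            if (!visited[neighbor]) {
                visited[neighbor] = true;
                queue.push_back(neighbor);
            }
        }
    }
    return count;
}
```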

If the ordering of the visited nodes doesn't matter, we can replace the erase() operation with swapping the first element with the last one and executing a pop_back(). Here is the code snippet:
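The original snippet is not included here; a minimal sketch of the swap-and-pop trick might look like this (function name and graph representation are my own):

```cpp
#include <utility>
#include <vector>

// Order-relaxed BFS: instead of erasing the head, swap it with the last
// element and pop_back().  Nodes are no longer visited in increasing
// distance order, but every reachable node is still visited exactly once.
int bfs_vector_swap(const std::vector<std::vector<int>>& graph, int source) {
    std::vector<bool> visited(graph.size(), false);
    std::vector<int> queue;
    queue.push_back(source);
    visited[source] = true;
    int count = 0;
    while (!queue.empty()) {
        std::swap(queue.front(), queue.back());
        int node = queue.back();
        queue.pop_back();  // O(1), no element shifting
        ++count;
        for (int neighbor : graph[node]) {
            if (!visited[neighbor]) {
                visited[neighbor] = true;
                queue.push_back(neighbor);
            }
        }
    }
    return count;
}
```

Note that swapping turns the container into something closer to a stack with a rotated head, so this is only valid when BFS is used for reachability-style queries rather than shortest distances.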

The vector implementation that uses swap() and pop_back() instead of erasing the first element out-performs the deque because it simply executes about 50% less code and incurs fewer branch-predictor and cache misses. The downside of pop_back() instead of pop_front() is that the BFS algorithm no longer visits the nodes in increasing order of minimum distance from the source. When the order of visited nodes doesn't matter, for example when checking connected components, it is better to implement BFS with std::vector; otherwise, the most efficient data structure for BFS is std::deque. Finally, here is the code that I used for benchmarking the different BFS algorithms.