What is the fastest way to add up the elements of a std::vector? That is the question I will pursue in the next posts. I use the single-threaded addition as the reference number. In further posts, I discuss atomics, locks, and thread-local data.

My strategy

My plan is to fill a std::vector with one hundred million arbitrary numbers between 1 and 10. I use a uniform distribution to generate the numbers. The task is to calculate the sum of all values.

As usual, I use my Linux desktop and my Windows laptop to get the numbers. The Linux PC has four cores, the Windows PC two. The details about my compilers are in the post Thread safe initialization of a singleton. I will measure the performance with and without maximum optimization.

A simple loop

The obvious strategy is to add the numbers in a range-based for loop.

The key line is line 27. The performance of std::accumulate corresponds to the performance of the range-based for loop, but not on Windows.

Without optimization

Maximum optimization

That was all. Now I have my numbers to compare the single-threaded with the multithreaded program. Really? Not quite: I'm very curious what happens if I protect the summation with a lock or use an atomic, because that shows the overhead of the protection.

Protection by a lock

If I protect access to the summation variable with a lock, I get the answers to my two questions.

1. How expensive is the synchronization with a lock?

2. How fast can a lock be if no concurrent access to the variable takes place?

Of course, I can rephrase point 2: if more than one thread accesses the shared variable, the access time increases.

The program is special in two regards. First, I check in line 26 whether the atomic uses a lock. That is crucial, because otherwise there would be no difference between using locks and using atomics. On all mainstream platforms I know of, atomics are lock-free. Second, I calculate the sum in two ways: in line 31 with the += operator, in line 42 with the method fetch_add. Both variants have comparable performance in the single-threaded case, but fetch_add lets me explicitly specify the memory model. More about that point in the next post.

But now the numbers.

Without optimization

Maximum optimization

The atomics are in the case of Linux 1.5 times slower, in the case of Windows 8 times slower than the std::accumulate algorithm. The gap widens even more with optimization: now Linux is 15 times, Windows 50 times slower.

I want to stress two points.

Atomics are 2 to 3 times faster than locks, on both Linux and Windows.

In particular for atomics, Linux is about 2 times faster than Windows.

All numbers compact

In case you lost your orientation among all the numbers, here is the overview in seconds.

What's next?

Single-threaded becomes multithreaded in the next post. The summation variable add becomes, in the first step, a shared variable used by four threads; in the second step, add will be an atomic.

Go to Leanpub/cpplibrary: "What every professional C++ programmer should know about the C++ standard library". Get your e-book. Support my blog.

