Iterator vs. Sentinel

Not without audible weeping and gnashing of teeth, we observed that end iterators are not real iterators. Most expressions that are otherwise valid on an iterator type are not allowed if an instance happens to represent an end iterator, such as:

Dereference: *it

Member dereference: it->member

Incrementing: ++it, it++

Offset: it += n, it -= n

Random access: it[n]

Valid expressions for end iterators:

Random access (random-access iterators only):

Distance: it_a - it_b

Order: it_a < it_b, …

Equality: it_a == it_b, it_a != it_b

Generally speaking, this leaves us with equality / inequality comparison for end iterators. Arguably, inequality is the only expression we really need for iteration:

The only reason we required the size of the sequence to be specified, and only allowed indices starting from 0, was to mimic an STL sequence container.
But apart from that, there is no need to restrict the beginning and length of the sequence. We could also allow sequences with an arbitrary starting point, or even unbounded ones.

But if no size is specified, there is no lazy_sequence::end(), so how could we iterate over the sequence?

Perhaps lazy_sequence was a bad idea to begin with?

So let’s stick to the plain STL: how would you specify iteration over exactly n elements of a std::list?

I assume you have acquired a healthy aversion to StackOverflow answers at this point.
The wide spectrum of recommended solutions to this very problem in this StackOverflow discussion - masterfully avoiding any correct or useful insight - should eliminate any residual doubt.

You would probably use a loop counter.
For classic loops, this is ugly but would get the job done:

auto it = list.begin();
for (int ct = 20; it != list.end() && ct > 0; ++it, --ct) {
    // This is just as useful as it looks
}

But we don’t like old-fashioned for loops, do we?
We prefer:

range-based for in case of in-order sequential iteration, or

algorithms like std::for_each, std::transform, … if we can be specific about operation semantics

Range-based for loops do not allow decrementing a counter in the loop header, and STL algorithms cannot be “canceled” at all. This is why there is an _n variant (std::for_each_n, std::copy_n, …) of at least the most basic algorithms.

In conclusion, it would be highly desirable to:

… specify iteration for half-open ranges

… decouple sequence length from iterator distance

… use a non-iterator concept for end

Sentinels as a concept replacing end iterators achieve all of this, and much more. They only provide equality comparison operators which effectively serve as a break condition.

The async construct uses an object pair called a promise and a future. The former has promised to eventually provide a value. The future is linked to the promise and can try to retrieve the value at any time via get(). If the promise hasn’t been fulfilled yet, get() simply waits until the value is ready. async hides most of this from us, except that it returns a future object. Again, since the compiler knows what this call to async returns, we can use auto to declare the future.

Exercise:

Use async to solve the sum of squares problem. Iterate up to 20 and add your future objects to a std::vector<std::future<int>>. Then iterate over all your futures, retrieve each value, and add it to your accumulator. This should take only a few modifications of the code above.

this_thread

Let’s make sure this really runs in parallel. Using the code from the last exercise, now add a cout that prints x inside square. Run your program again. Every time you run it, it should be listing the values of x in order. This seems awfully deterministic and is not characteristic of running things in parallel. We did start them in that order, so maybe the threads aren’t overtaking each other. We can check this by adding a sleep inside square, which we can pretend is the heavy computation of x * x:

this_thread::sleep_for(chrono::milliseconds(100));

Note that all these seemingly global objects are in the std namespace, but since we issued using namespace std, we made them visible globally. Okay, run this and time the execution. They are clearly taking turns and they are not running in parallel. Use cout to print this_thread::get_id().

Since the main execution is also considered a thread, try printing the thread ID inside main using this same function. What does this tell you?

The function async by default gives the program the option of running it asynchronously or deferred. The latter means square will not be called until we call get(), and it will then be executed in the same thread. Ideally, the program should make an intelligent decision, optimized for performance, but for some reason GCC always defers, so let’s not give it the choice.

Change the call to async as follows:

async(launch::async, &square, ...)

Run it again, timing the execution.

Note

We got a 20-fold speedup only because our threads aren’t actually heavy on the CPU while sleeping, so sleeping is not an ideal surrogate for x * x actually taking 100 ms of CPU time. In the real case, the speedup would be largely determined by how many cores the computer has.

Ideally, you should avoid starting more computationally intensive threads than your computer can truly run in parallel, since otherwise your CPU cores will start switching their attention between different threads. Each switch is called a context switch and comes with an overhead that hurts performance.

Shared Access

Let us imagine that x * x is a very costly operation and we want to calculate the sum of squares up to a certain number. It would make sense to parallelize the calculation of each square across threads.

This should sum all squares up to and including 20. We iterate up to 20 and launch a new thread in each iteration, handing it the assignment. After this, we call join() on all our threads, a blocking operation that waits for the thread to finish before execution continues.

This is important to do before we print accum, since otherwise our threads might not be done yet. You should always join your threads before leaving main, if you haven’t already.

Before moving on, also note that C++11 offers a terser iteration syntax for the vector class, very close to Java’s. We are also using the keyword auto instead of spelling out the type thread, which we can do whenever the compiler can unambiguously deduce the correct type. We added an & to obtain a reference rather than a copy of the object, since join changes the state of the object.

Now, run this. Chances are it spits out 2870, which is the correct answer.

Let’s list all distinct outputs from 1000 separate runs, including the count for each output:

Before we fix the race condition, since keeping accum as a global variable is poor style, we would rather pass it into the thread. Add a parameter int& accum to square. It is important that it’s a reference, since we want to be able to change the accumulator. However, we can’t simply call thread(&square, accum, i), since it will make a copy of accum and then call square with that copy. To fix this, we wrap accum in ref(), making it thread(&square, ref(accum), i).

Mutex

A mutex (mutual exclusion) allows us to encapsulate blocks of code that should only be executed by one thread at a time. Keeping the main function the same:

The problem should now be fixed. The first thread that calls lock() gets the lock. During this time, all other threads that call lock() simply halt, waiting at that line for the mutex to be unlocked. It is important to introduce the variable temp, since we want the x * x calculation to happen outside the lock–unlock block; otherwise we would be hogging the lock while running our calculation.

As a final polish, you should use std::lock_guard, a mutex wrapper that provides a convenient RAII-style mechanism for owning a mutex for the duration of a scoped block.

In the following variant, you also have unambiguous control over the locked section and obvious prevention of reordering:

We don’t need to introduce temp here: the locked scope consists of a single cheap statement, so there is no heavy work to keep outside of it.

Condition Variables

It is a common scenario to have one thread wait for another thread to finish processing, essentially sending a signal between the threads.

This can be done with mutexes, but it would be awkward. It can also be done using a global boolean “ready flag” that is set by one thread and polled by the other.

Setting notified to true may look atomic, but strictly speaking a plain bool written by one thread and read by another is a data race; std::atomic<bool> makes the flag well-defined without needing a mutex. Either way, reacting to a flag with low latency requires a busy loop, pinning CPU load at 100% in the waiting thread. We could add a sleep inside the polling loop to reduce the load, but that effectively raises the latency to the length of the sleep period.

A more principled way, however, is to wait on a condition variable inside the loop.

Condition variables can be spuriously awakened, so we still need a ready flag.