Async Tasks in C++11: Not Quite There Yet

If you expected std::async to be just syntactic sugar over thread creation, you can stop reading right now, because that’s what it is. If you expected more, read on.

Don’t get me wrong, std::async combines several useful concurrency concepts into a nice package: It provides a std::future for the return value, and hides the std::promise side of the future. It also provides options to run a task synchronously. (See the Appendix for a short refresher.)

But tasks have a slightly different connotation in parallel programming: they are the basic blocks of task-based parallelism. And C++11 tasks fall short on that account.

Task-Based Parallelism

Tasks are an answer to performance and scalability problems associated with threads. Operating system threads are rather heavy-weight; it takes time and system resources to create a thread. If you have an algorithm that naturally decomposes into a large number of independent computations, a.k.a. tasks, you’ll probably get your best performance not by creating a separate thread for each task, but by adjusting the number of threads depending on the amount of parallelism available on your particular system, e.g., the number of cores and their current loads. This can be done by hand, using thread pools and load balancing algorithms; or by using task-based systems.

In a task-based system, the programmer specifies what can be run in parallel but lets the system decide how much parallelism to actually use. The programmer splits the algorithm into tasks and the runtime assigns them to threads — often many tasks to a thread.

There are many implementations of task-based parallelism with varying language support. There’s the Cilk language which pioneered this approach; there’s the built in support in Haskell, F#, and Scala; and there are several C++ libraries, like Microsoft PPL or Intel TBB.

At the center of task-based systems are work-stealing queues. When a thread creates tasks, they are initially queued on the processor (core) that runs the thread. But if there are other idle processors, they will try to steal tasks from other queues. The stolen tasks are then run in parallel.

Notice that tasks must be able to migrate between threads. What’s more, efficient use of OS threads requires that tasks that are blocked, for instance waiting for I/O or waiting for other tasks to finish, should be taken off their threads, so that other tasks may reuse them.

C++11 Tasks

My expectation was that C++11 “tasks” that are created using std::async should be abstracted from threads, just as they are in task-based parallelism. When I started preparing a video tutorial about tasks in C++, I wrote a simple program to demonstrate it. I created async tasks using the default launch policy and waited for them to complete. Each task slept for one second and then printed its thread ID.

The results were surprising. The first six tasks executed in parallel, each in its own thread. But the rest of the tasks executed in the main thread one after another, separated by 1 second intervals. (Note: this behavior was fixed in v 1.7 of Just::Thread — read on).

The output of the task test

This approach to parallelism obviously doesn’t scale very well.

Then I wrote another program that lists directories recursively, creating a separate async task for each subdirectory. But this time I explicitly requested launch policy launch::async, which guarantees that each task will start in a new thread. This program worked up to a point, but when I tried to list my whole disk, it failed by exhausting Windows’ limited thread creation capability. Again, this approach doesn’t scale very well.

What was even worse, when the program didn’t fail, it performed better with launch::deferred policy, which forces all tasks to be executed serially, than with the launch::async policy. That’s because thread creation in Windows is so expensive that it can easily nullify performance gains of parallelism (although Windows 7 supports user-level threads, which might bypass these problems).

My first reaction was to blame Anthony Williams for badly implementing the Just::Thread library I was using. When he assured me that it was Standard compliant, I turned to Herb Sutter and Hans Boehm for confirmation and they sided with Anthony. It turns out that there are serious problems that prevented C++11 from standardizing task-based concurrency.

The Problems

The foundation of task-based parallelism is the ability for tasks to share threads and to migrate between threads. This sharing and migration must be transparent.

The requirements for the default-launch tasks are the following:

The runtime can either run such task asynchronously or synchronously

When it’s run synchronously, it should be run in the context of the parent thread

When it’s run asynchronously, it should behave as if it were run on a separate thread

Strictly speaking, a task could always call this_thread::get_id() and fool any attempt at thread sharing or migration by discovering the ID of the current thread. In general, the namespace std::this_thread, which also contains sleep functions, is thread-bound.

But let’s suppose that we only require that asynchronous tasks behave as if they were run on separate threads, except when they call functions in the this_thread namespace. There are still several problems.

Thread-Local Variables

C++11 introduced a new thread_local storage qualifier. A thread-local variable is separately initialized and destroyed in every thread. It must not survive thread termination. This requirement complicates thread sharing.

In our exchange, Hans Boehm clarified the termination requirement for tasks: Thread-local variables must be fully destroyed before the owning thread returns from calling get or wait on a future produced by the corresponding std::async; or before the destructor of that future returns, whichever comes first.

This actually leaves some wiggle room: A thread could be reused if the runtime guarantees that thread-local variables of terminating tasks are correctly destroyed. Unfortunately, this might be impossible if the programmer calls OS-specific API, like Windows’ TlsAlloc. Anthony also pointed out that it’s not clear how to deal with DLL_THREAD_DETACH handlers in DLLs, when switching to task granularity.

Locks

There’s another aspect of C++11 concurrency that is tied to threads — locking. The std::mutex object is thread aware. It requires that unlock is called from the same thread as lock. Why should this be a problem?

I haven’t mentioned yet that task migration might be necessary in the middle of execution, but that is what most task-based systems do. It’s an optimization in the case when you’re dealing with blocking tasks.

There are two major blocking scenarios: external and internal. External blocking happens when a task calls an OS function (directly or indirectly) that may block, for instance waiting for I/O. My directory listing program did a lot of that. Internal blocking, which is potentially easier to intercept, happens when tasks are blocked on futures. My program did a lot of that too, when waiting for the results of tasks that were listing subdirectories of the current directory.

A blocked task doesn’t use processor resources, but it does occupy a thread. That thread could be reused to run another task. But that requires a clean way of taking a task off a thread and then restoring its state once the call unblocks. Now, if the task takes a lock on a mutex before blocking, it cannot be migrated to another thread. The unlocking wouldn’t work from another thread.

Herb Sutter observed that, if we tried to restore the task to its original thread, we might get into a spurious deadlock, when the original thread is occupied be another task waiting for the same mutex.

The other problem with locks is in the use of a recursive_mutex. A thread may lock such a mutex multiple times before calling unlock (also multiple times). But if a second thread tries to lock a mutex that’s owned by the current thread, it will block. As long as tasks run is separate threads, this works. But if they share the same thread, they may successfully acquire the same mutex and cause data corruption.

Imagine the following scenario. Task A wants to modify a shared data structure and takes a lock on its recursive mutex. It then blocks on some OS call (probably not a good idea in general, but it may happen). The task gets taken off of the current thread, and task B starts executing. It takes a lock on the same mutex — successfully, as it is executing in the same thread, and reads or modifies a data structure that was in the middle of being modified by task A. A disaster unfolds.

Notice that this is not a problem if tasks are run serially in the same thread, as it happens with the launch::deferred policy, because each task runs to completion before allowing another task to run.

Finally, such migration of running tasks would also wreaks havoc with thread-local variables.

Possible Solutions

Optimizing the Default Launch Policy

The easiest part was to change the implementation of the default policy, to defer the decision whether to run a given task asynchronously or as deferred. Anthony was quick to notice this, and almost immediately released a fix — version 1.7 of Just::Thread.

The idea is simple, you schedule N tasks asynchronously — N being some runtime number dependent on the number of available cores — and put the rest on a queue. When any of the queued tasks is forced (by the call to get or wait on its future), it is executed synchronously in the context of the forcing thread — as if the launch::deferred policy were used. Otherwise, as soon as one of the asynchronous tasks finishes, the next task from the queue is scheduled to run asynchronously. Here’s the output of the same test program after the changes in Just::Thread:

The output of the test with the new library

This time each task ran in a separate thread, but because of the default launch policy, they ran in small batches that could effectively exploit the parallelism of a multicore machine. Still, without thread reuse, the runtime had to create 22 OS threads. The hope is that the operating system caches thread resources so that the creation of the second batch of threads is substantially cheaper than the first one.

(I suggest running this test when evaluating any implementation of a task library.)

Thread Reuse

The next step in improving task performance would be to use a thread pool and reuse threads instead of creating them from scratch. Because of the problem with thread-local variables, it might be impossible to implement thread reuse without some help from the language runtime. The task library would need hooks into every creation of a thread_local variable, so it can destroy them at task exit.

That still leaves the problem of tasks calling APIs like TlsAlloc directly. An atractive option (for library writers) would be to ignore the problem — after all the language provides a portable way of dealing with thread-local storage.

Task Migration

We would like to be able to remove a blocked task from a thread in order to run another task on it. This is not easy because of thread-locals and locks.

The problem with thread_local variables is that they should really be task-local. Or at least they should behave “as if” they were task-local. So when two tasks are sharing the same thread, there has to be some mechanism for “context switching” between them. The context would have to include the state of all thread-local variables.

Migrating a task that is holding a lock could only be done if locks were task-bound rather than thread-bound. Interestingly, there is a provision in the Standard for this kind of behavior. The definition of Lockable in (30.2.5) talks about “execution agents” that could be threads, but could also be something else. This comment is of particular interest:

[ Note: Implementations or users may introduce other kinds of agents such as processes or thread-pool tasks. —end note ]

However, the Standard Library mutex is bound to threads, not tasks. The intention of (30.2.5) is that, if you create your own separate task library with your own task-local variables and mutexes, you will still be able to use the standard utilities such as lock_guard or condition variables. But the implementation of std::async tasks must work with thread_local and std::mutex.

Deadlocks

Here’s a potential scenario where two tasks could deadlock if their threads are reused while they are blocked:

Task A runs on thread T1, takes the mutex M1, and makes a blocking call

The runtime takes A off T1 (saves its state, etc.) and puts it in a Blocked queue

Task B starts executing on the same thread, T1, and tries to take M1, which is locked by A

In order to unlock M1, task A would have to run on T1 — the same thread the lock was taken on — but T1 is now occupied by B, and A can’t make progress

The only way to resolve this deadlock is to take B off the thread. So that’s what a task migration system must do — guarantee that any blocked task is taken off its thread.

In general, any spurious deadlock would involve a bunch of blocked tasks. If all of them are blocked on locks, this is an actual deadlock which would happen anyway. If there is at least one task that can make progress when its blocking call returns, it can always be assigned back to its thread, either because the task running on it completes, or because it’s blocked and will be taken off of it.

Of course if we allow lock migration, as in associating locks with tasks rather than threads, the problem disappears on its own.

Conclusion

What I learned from this exercise was that std::async with default launch policy can be made usable. However its strong association with threads makes it virtually impossible to implement full-blown task-based parallelism. A task-based system could be implemented as a library but it would have to come with severe restrictions on the use of thread_local variables and standard mutexes. Such a system would have to implement its own version of task-local variables and mutexes.

The code calls std::async with a lambda (anonymous function) that takes a Boolean flag and returns a Boolean. The lambda can either throw an exception or return the flag back. The second argument to async (true, in this case) is passed to the lambda when it is executed.

The value or the exception is passed back to the parent code when it calls the get method on the future returned by async. The call to async may create a new thread, or defer the execution of the function until the call to get is made.

The same code may be implemented directly using std::thread, std::promise, and std::future but, among other things, it requires modifications to the thread function (here, to the lambda):

Related

23 Responses to “Async Tasks in C++11: Not Quite There Yet”

Another option, while slower, would be to have a dedicated mutex-manager thread which locks or unlocks via a message passing API. On the plus side it could apply some intelligence to locking policy knowing what threads are using what, and avoid deadlocks.

Interesting post. I found a related bug, or maybe “feature”. I wonder what you make of it. Even if you tell std::async to use std::launch::async, its still not truly parallell as it does not behave as a std::thread. Atleast thats how VS2011 works. It makes std::async almost useless to me. Here is a toy example that shows the problem (would be nice if you or someone tested it with Just::thread) http://pastebin.com/E6MfNnwZ

Why don’t you print thread ID inside each task? I’m pretty sure you’d get different IDs. If not, that would be a bug in the implementation of std::async. Your results show that the runtime is not scheduling tasks optimally. It would be a quality of implementation issue. VS 11 is still in beta though, so they might be still tweaking performance.

Thanks for having a look at that. Good idea about printing thread id’s to see what is going on. Turns out the threads themselves are reused, and not only scheduled inefficiently. As can be seen here: http://pastebin.com/5dWCjjNY

That’s interesting. I think I’ve heard that they were using their task runtime, ConcRT — the same as PPL — but I thought they weren’t serious. The next test would be to find out if thread local variables survive task termination, if the thread is reused. I don’t recall, have they implemented thread locals in this beta?

Just a follow up to put this to rest. I made a bugreport to MS and “STL” fixed the bug. With this interesting summary of how their implementation works:

“With this fix, launch::async doesn’t use std::thread. (If you want std::thread, you know where to find it.) It uses VC’s Concurrency Runtime, but differently than launch::async | launch::deferred. The differences are:

launch::async | launch::deferred may or may not run on a different thread, and ConcRT will avoid spinning up more threads than is necessary to saturate the hardware.

launch::async will run on a different thread, and ConcRT is willing to spin up many more threads than the hardware has logical cores for. Unlike std::thread, ConcRT will reuse threads if possible, for greater efficiency.

We believe that this conforms to the Standard, and results in desirable behavior for the vast majority of users (including your example).”

This is still a controversial topic. Herb Sutter thinks thread reuse is okay, Hans Boehm has reservations. I’ll know more next week after the meeting of the subcommittee on concurrency and parallelism for the next version of C++.

Hmm. As an end user I want the best of all options: I want the efficiency of a thread reuse solution, and the option to force a new thread if thread-specific side effects are important. Because consistant interfaces are also important, maybe the std::launch policy bitmap just needs a third option?

For what its worth, I think its simple enough using std::packaged_task to implement your own version of std::async which forces a new thread, while its not nearly as easy to implement a good thread pool w/ work stealing etc. Given that, I would tilt towards not guaranteeing thread uniqueness by default, and propose an explicit option for the next library update.

I’m not sure about the thread locals in this beta. I’ll leave it be. I tried to make an async replacement of my own, as a workaround, but I failed. I think I remember you had something of the sort in one of the video tutorials. Threads, promises, current_exception, etc. It seems the corensic.com website is down though so I’m having trouble stealing your code. 🙂

Thanks. I watched that video again and it helped me get started. Your videos are really good. I hope you make more. Watching it I thought I could make a simple async replacement of my own, but got stuck at the same point as last time I tried. I cant seem to get past the fact that the std::promise needs to outlive the async function (and live as long as the thread), so the promise cant live as a local variable in the async func. But the std::promise shouldn’t live in in the caller code either, as the caller only need to know about futures. And i dont know how to make the promise live in the thread function as async needs to return a future before it even calls the thread func. I’m scratching my head on this one. You wouldn’t happen to have any ideas? http://pastebin.com/yd1d8C4K

Thanks. for the reply. I tried that and cleaned up the code a bit http://pastebin.com/ZjqJN77n. However now I get these “Exception: future already retrieved” errors. I’m guessing it might be because the thread is accessing the promise by reference to local variable in myasync after it has been destroyed.