TBB Task DAG with Deferred Successors

I’m thinking about how to implement DAGs of tasks that can be used in a natural way. The problem is that a DAG is a more general construct than what task systems usually allow, due to valid performance concerns. Therefore, if I want to implement a DAG more generally, I need to come up with custom hacks.

Trees vs DAGs

The major difference between a tree and a DAG of tasks is that a DAG allows one task to have an arbitrary number of successor tasks. Also, with a DAG, a task from one subtree of tasks can send data to a task from a different subtree of tasks, which allows you to start tasks when all their inputs are ready rather than when all the previous tasks have run and finished. (Hmm… That makes me think of out-of-order versus in-order processors.) This expands on the functionality of a tree of tasks, since trees only allow outputs to be passed to their immediate parent, whereas DAGs can pass data to grandparents or great-grandparents, or to tasks in subsequent trees.

By default, tasks in task systems like TBB and Cilk are designed to have only one successor: a parent task, or a continuation task. Having a parent task makes it possible to have nested tasks, which is useful for naturally spawning tasks from tasks, similarly to how functions can call other functions. Continuation tasks make it potentially more efficient to spawn a follow-up task to handle the results of a task, and they can do so without affecting the reference count of the parent task.

DAG Implementation

To implement a general DAG, you need to (in one way or other) keep track of the connections between the outputs of some tasks and the inputs of other tasks in a more general way. There are two aspects to this connection:

How is memory allocated for the data passed from predecessor task to successor task?

How is the successor task spawned when its inputs are all satisfied?

According to Intel’s documentation (See: General Acyclic Graphs of Tasks), it’s suggested that the memory for the passed data is stored within the successor task itself. Each task object contains the memory for its inputs, as well as a counter that keeps track of how many inputs need to be received before the task can be spawned. In the TBB example, each task also keeps a list of successors, which allows predecessors to write their outputs to their successor’s inputs, and allows the predecessor to decrement their successor’s count of missing arguments (and can finally spawn the successor task if the predecessor finds that it just gave the successor its final missing input.)

The Problem with DAGs

In the TBB DAG example, all tasks are spawned up-front, which is easy to do in their example since the task graph is structured in a formal way. In my case, I want to use a DAG to implement something that looks like a sequence of function calls, except using tasks instead of functions, to allow different parts of the code to be executed in parallel. I want to use a DAG to make it possible to establish dependencies between these sequential tasks, to allow the programmer to create tasks that start when their inputs are available. In short, instead of creating the tasks up front, I want to create the tasks in roughly sequential order, for the purpose of readability.

The problem with what I want to do is that I can’t directly store the outputs of predecessors inside the successor. Since the predecessors need to be passed a pointer to where their outputs should be stored, the successor (which stores the inputs) needs to be allocated before the predecessors. This means you roughly need to allocate your tasks backwards (successor before predecessor), but spawn the tasks forward (predecessors before successors). I don’t like this pattern, since I’d rather have everything in order (from predecessor to successor). It might not be absolutely as efficient, but I’m hoping that the productivity and readability improvement is worth it.

The Interface

Instead of spawning successor tasks up front, I’d like to allocate the storage for the data up front, similarly to Cilk. Cilk has a “cont” qualifier for variables, which can be used to communicate data from one task to another. For example, the fibonacci example from the Cilk paper contains the following code:

This code computes fib(n-1) and fib(n-2), then passes the results of those two computations to a sum continuation task, which implements the addition within fib(n) = fib(n-1) + fib(n-2). The data is passed through the x and y variables, which are marked cont. I don’t know why the Cilk authors put the call to spawn sum before the calls to fib in this example, but perhaps this alternate arrangement of the code is possible:

With this alternative arrangement of the code, the order of the tasks being spawned mirrors the equivalent code in plain sequential C:

int x, y;
fib(&x, n-1);
fib(&y, n-2);
sum(&k, x, y);

Implementing cont

If the successor task is allocated before the predecessors, the predecessor that supplies that final missing input to the successor can also spawn the successor. However, if the successor is allocated after the predecessors, it’s possible that all predecessors finish their work before the successor is allocated. If we allocate the successor and find that all its inputs are already available, we can simply immediately spawn it. However, if some inputs of the successor are still not available, then we need to find a way to spawn the successor when those inputs become available.

To spawn successors when a cont becomes available, each cont can keep a list of successors that are waiting for it. When the cont is finally set, it can pass that input to its successors, and potentially also spawn successors if this cont was their final missing input. The difficulty with implementing this system lies in resolving the race condition between the predecessor, the cont, and successors.

Here’s an example of a race condition between predecessor/cont/successor. Suppose a successor task is spawned with some cont inputs. The successor might see that a cont input has not yet been set, so the successor adds itself to the list of successors in the cont. However, it might be possible that the cont suddenly becomes set in a different thread while the successor is adding itself to the cont’s successor list. The thread that is setting the cont might run before it sees the new successor added, so it might not notify the successor that the input is now complete (by decrementing its counter and possibly spawning it.)

cont successor linked list

It might be possible to solve the problem described above if it’s possible for the successor to atomically check that the cont is not satisfied and if so add itself to the cont’s list of successors. This might be possible if the cont uses a linked list of successors, since the successor could do a compare-and-swap that sets a new head for the list only if the input is unsatisfied.

If that compare-and-swap fails because the cont became set just before the CAS, the successor can just treat the input as satisfied and move on to checking the next input. If the CAS fails because another successor registered themselves concurrently, then the registration needs to be tried again in a loop. On the other side of that race condition, if the cont’s compare-and-swap to set its completion fails because a successor added themselves to the list concurrently, then the cont just tries again in a loop, which allows the cont to notify the successor that interrupted it of completion.

If the cont becomes set while the successor is hooking up other inputs, the cont will pass the input and decrement the counter, which itself is not a problem. However, the last cont might finish and spawn the successor before the function hooking the successor finishes, which shouldn’t be a problem as long as nothing else needs to happen after the successor finishes hooking itself up to its inputs. If something does need to happen immediately after the hookups, the successor can initialize its counter with an extra 1, which it only decrements at the end of the hookup (and potentially then spawns the task.)

A detail of this implementation is the allocation of the nodes of the linked list. The successor task needs to have allocated with it at least one linked list node per cont. Since this is a fixed number, the allocation can be done more efficiently.

There’s a difficulty with the compare-and-swap, which is that two different compare-and-swaps need to be done. First, the successor should only be added to the list if the cont is not yet set. Second, appending to a linked list in a CAS can fail if two successors try to add themselves to the list simultaneously. To solve this problem, I propose that the cont’s “is set” boolean is stored as the least significant bit of the head of the linked list. This allows operations on the cont to both switch the head of the list and compare and set completeness simultaneously. If pointers are aligned then the least significant bit is always 0, so no information is lost by reusing that bit. We just need to make sure to mask out that bit before dereferencing any pointers in the linked list.

cont: Derived from cont_base, stores data in “std::optional-style” to pass data from predecessor to successors.

spawn_when_ready: Spawns a task when a list of conts are all set. Interface still a bit low-level.

I’ve only done a few tests, so I don’t know if it’s 100% correct. I only really have one test case, which I’ve debugged in a basic way by adding random calls to sleep() to test different orders of execution. I wouldn’t consider using this in production without a thorough analysis. It’s honestly not that much code, but I’m not an expert on the intricacies of TBB.

Also, I was lazy and used sequential consistency for my atomic operations, which is probably overkill. Any lock-free experts in the house? 🙂 (Update: I’ve replaced the seq_cst with acquires and releases. Fingers crossed.)

I’ve also not sugar-coated the syntax much, so there’s still lots of low-level task management. I’d like to come up with syntax to spawn tasks in a much simpler way than manually setting up tbb::task derivatives, setting reference counts, allocating children, etc. This is a topic for another post.

With this implementation, I was able to build the following task dependency graph, with each edge annotated with its type of of dependency.