Often, a parallel workflow is described in terms of a Directed Acyclic Graph (DAG). A popular
library for working with graphs is NetworkX. Here, we will walk through a demo mapping
a NetworkX DAG to task dependencies.

The full script that runs this demo can be found in
examples/parallel/dagdeps.py.

The ‘G’ in DAG is ‘Graph’. A Graph is a collection of nodes and edges that connect
the nodes. For our purposes, each node would be a task, and each edge would be a
dependency. The ‘D’ in DAG stands for ‘Directed’. This means that each edge has a
direction associated with it. So we can interpret the edge (a,b) as meaning that b depends
on a, whereas the edge (b,a) would mean a depends on b. The ‘A’ is ‘Acyclic’, meaning that
there must not be any closed loops in the graph. This is important for dependencies,
because if a loop were closed, then a task could ultimately depend on itself, and never be
able to run. If your workflow can be described as a DAG, then it is impossible for your
dependencies to cause a deadlock.
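
To make the direction convention concrete, here is a minimal NetworkX sketch (the node
names are illustrative, not part of the demo):

import networkx as nx

G = nx.DiGraph()
G.add_edge('a', 'b')  # b depends on a
G.add_edge('b', 'c')  # c depends on b
print(nx.is_directed_acyclic_graph(G))  # True: no closed loops

G.add_edge('c', 'a')  # now a depends on c, closing a loop
print(nx.is_directed_acyclic_graph(G))  # False: a, b, and c each ultimately depend on themselves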

For demonstration purposes, we have a function that generates a random DAG with a given
number of nodes and edges.

import networkx as nx
from random import randint

def random_dag(nodes, edges):
    """Generate a random Directed Acyclic Graph (DAG) with a given number of nodes and edges."""
    G = nx.DiGraph()
    for i in range(nodes):
        G.add_node(i)
    while edges > 0:
        a = randint(0, nodes - 1)
        b = a
        while b == a:
            b = randint(0, nodes - 1)
        G.add_edge(a, b)
        if nx.is_directed_acyclic_graph(G):
            edges -= 1
        else:
            # we closed a loop!
            G.remove_edge(a, b)
    return G

So first, we start with a graph of 32 nodes, with 128 edges:

In [2]: G = random_dag(32, 128)

Now, we need to build our dict of jobs corresponding to the nodes on the graph:

In [3]: jobs = {}

# in reality, each job would presumably be different
# randomwait is a function that sleeps for a random interval
In [4]: for node in G:
   ...:     jobs[node] = randomwait
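
For reference, randomwait could be implemented along these lines (a sketch; the demo
script defines its own version):

import time
from random import random

def randomwait():
    """Sleep for a random interval of up to one second, and return the elapsed time."""
    t = random()
    time.sleep(t)
    return t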

Once we have a dict of jobs matching the nodes on the graph, we can start submitting jobs
and linking up the dependencies. Since a job's msg_id, which is necessary for building
dependencies, is not known until the job is submitted, it is critical that we never submit
a job before the jobs it depends on. Fortunately, NetworkX provides a
topological_sort() function which ensures exactly this. It returns an iterable that
guarantees that when you arrive at a node, you have already visited all of the nodes
on which it depends:
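
A submission loop in this spirit might look like the following. This is a sketch, assuming
view is a LoadBalancedView obtained from an IPython.parallel Client (the variable name is
an assumption); temp_flags applies the after dependency only to submissions made inside
the with-block:

results = {}
for node in nx.topological_sort(G):
    # every predecessor has already been submitted, so its
    # AsyncResult is available to use as a dependency
    deps = [results[n] for n in G.predecessors(node)]
    # submit the job, to run only after all of its dependencies complete
    with view.temp_flags(after=deps, block=False):
        results[node] = view.apply(jobs[node])

# block until every task has finished
view.wait(list(results.values()))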

Now, at least we know that all the jobs ran and did not fail (calling get() on an
AsyncResult would have raised an error if its task failed). But we don't know that the
ordering was properly respected. For this, we can use the metadata attribute of each
AsyncResult.

These objects store a variety of metadata about each task, including various timestamps.
We can validate that the dependencies were respected by checking that each task was
started after all of its predecessors were completed:

def validate_tree(G, results):
    """Validate that jobs executed after their dependencies."""
    for node in G:
        started = results[node].metadata.started
        for parent in G.predecessors(node):
            finished = results[parent].metadata.completed
            assert started > finished, "%s should have happened after %s" % (node, parent)
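
Running this check over our results should pass silently if the scheduler honored every
dependency:

validate_tree(G, results)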

We can also validate the graph visually. If we draw the graph with each node's x-position
set to its start time, then all arrows must point to the right if the dependencies were
respected. To spread the nodes out, the y-position will be the runtime of the task, so
long tasks will be at the top, and quick, small tasks will be at the bottom.
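
A plotting sketch along these lines, assuming matplotlib is installed and results holds
the AsyncResults from above (matplotlib's date2num converts the datetime timestamps in
the metadata to plain numbers):

from matplotlib.dates import date2num
import matplotlib.pyplot as plt
import networkx as nx

pos = {}
for node in G:
    md = results[node].metadata
    start = date2num(md.started)
    runtime = date2num(md.completed) - start
    # x-position: start time; y-position: runtime
    pos[node] = (start, runtime)

nx.draw(G, pos, with_labels=True)
plt.show()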