The asynchronous scheduler maintains indexed data structures that track which
tasks depend on which data, which data is already available, which data is still
waiting on tasks to finish before it can be released, and which tasks are
currently running. It can update these structures in constant time, regardless
of the total number of tasks or of available tasks. These indexed structures
make the dask async scheduler scalable to very many tasks on a single machine.
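
To make this concrete, here is a minimal sketch of the kind of state such a
scheduler might keep for a tiny graph. The structure names below
(`dependencies`, `dependents`, `waiting`, `ready`, `running`) are illustrative
rather than a faithful copy of dask's internals.

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# A small dask-style graph: keys map to data or to (function, *args) tasks
dsk = {'a': 1,
       'b': (inc, 'a'),
       'c': (inc, 'a'),
       'd': (add, 'b', 'c')}

# Illustrative scheduler state for this graph (hypothetical names, not dask internals)
state = {
    'dependencies': {'b': {'a'}, 'c': {'a'}, 'd': {'b', 'c'}},   # task -> keys it needs
    'dependents':   {'a': {'b', 'c'}, 'b': {'d'}, 'c': {'d'}},   # key  -> tasks that need it
    'cache':        {'a': 1},           # data that is currently available
    'waiting':      {'d': {'b', 'c'}},  # tasks still waiting on other tasks
    'ready':        ['b', 'c'],         # tasks whose inputs are all available
    'running':      set(),              # tasks currently executing
}

# When a task finishes, each index is updated with a handful of dict/set
# operations, so the bookkeeping cost does not grow with the size of the graph.
```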

To keep the memory footprint small, we choose to keep ready-to-run tasks in a
LIFO stack so that the most recently made-available tasks get priority. This
encourages the completion of chains of related tasks before new chains are started.
The stack can also be queried in constant time.
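
As a sketch of why a LIFO stack helps, consider a ready list used stack-style:
tasks enabled by the task that just finished are pushed last and popped first,
so the scheduler tends to finish the chain it is already working on before
starting a new one. The helpers below are illustrative, not dask's actual
implementation.

```python
# Hypothetical ready stack: a plain Python list used LIFO (append/pop are O(1))
ready = []

def task_finished(key, dependents, waiting):
    """Mark `key` as done and push any newly runnable dependents onto the stack."""
    for dep in dependents.get(key, ()):
        waiting[dep].discard(key)
        if not waiting[dep]:       # all inputs of `dep` are now available
            ready.append(dep)      # most recently enabled task goes on top

def next_task():
    return ready.pop()             # LIFO: favors completing the current chain
```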

EDIT: The experiments run in this section are now outdated. Anecdotal testing
shows that performance has improved significantly. There is now roughly 200 µs
of overhead per task and about 1 ms of startup time.

tl;dr The threaded scheduler overhead behaves roughly as follows:

*  1 ms overhead per task
*  100 ms startup time (if you wish to make a new ThreadPool each time; see the sketch after this list for reusing one)
*  Constant scaling with number of tasks
*  Linear scaling with number of dependencies per task
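
The startup cost comes from building a new pool of worker threads. As a rough
sketch of how to pay that cost once rather than on every call, assuming
`dask.threaded.get` accepts a `pool=` keyword (as it has historically):

```python
from multiprocessing.pool import ThreadPool

from dask.threaded import get as threaded_get

def inc(x):
    return x + 1

dsk = {('x', i): (inc, i) for i in range(1000)}
keys = list(dsk)

pool = ThreadPool(4)                        # pay the ~100 ms startup once
for _ in range(10):
    threaded_get(dsk, keys, pool=pool)      # reuse the same pool on every call
pool.close()
pool.join()
```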

Schedulers introduce overhead. This overhead effectively limits the
granularity of our parallelism. Below we measure the overhead of the async
scheduler with different apply functions (threaded, sync, multiprocessing) and
under different kinds of load (embarrassingly parallel, dense communication).

The quickest/simplest test we can do is to use IPython’s timeit magic:
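
For example, something along these lines (a sketch, assuming the
`dask.threaded.get` and `dask.get` entry points, with an `inc` helper and an
embarrassingly parallel graph that we define ourselves; in IPython you would
apply `%timeit` to the same `get` calls, the script below uses the standard
timeit module instead):

```python
import timeit

import dask
from dask.threaded import get as threaded_get

def inc(x):
    return x + 1

# Embarrassingly parallel graph: 1,000 independent single-argument tasks
n = 1_000
dsk = {('x', i): (inc, i) for i in range(n)}
keys = list(dsk)

# In IPython this would simply be:  %timeit threaded_get(dsk, keys)
for name, get in [('threaded', threaded_get), ('sync', dask.get)]:
    seconds = min(timeit.repeat(lambda: get(dsk, keys), number=1, repeat=3))
    print(f"{name:>8}: {seconds / n * 1e6:6.1f} us per task")
```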

As we increase the number of tasks in a graph, we see that the total scheduling
overhead grows linearly. The asymptotic cost per task depends on the scheduler:
the schedulers that depend on some sort of asynchronous pool cost a few
milliseconds per task, while the single-threaded schedulers cost a few microseconds.
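
To see that scaling directly, one can time graphs of increasing size and look
at the cost per task; a minimal sketch (again assuming `dask.threaded.get` and
`dask.get`, with our own `inc` helper) might look like this:

```python
from timeit import default_timer as timer

import dask
from dask.threaded import get as threaded_get

def inc(x):
    return x + 1

# Embarrassingly parallel graphs of increasing size; per-task cost should stay
# roughly flat for each scheduler, so total overhead grows linearly.
for n in [1_000, 4_000, 16_000]:
    dsk = {('x', i): (inc, i) for i in range(n)}
    keys = list(dsk)
    for name, get in [('threaded', threaded_get), ('sync', dask.get)]:
        start = timer()
        get(dsk, keys)
        per_task_us = (timer() - start) / n * 1e6
        print(f"{name:>8}  n={n:>6}  {per_task_us:6.1f} us/task")
```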