Dask supports a real-time task framework that extends Python’s
concurrent.futures
interface. This interface is good for arbitrary task scheduling, like
dask.delayed, but is immediate rather than lazy, which
provides some more flexibility in situations where the computations may evolve
over time.

These features depend on the second generation task scheduler found in
dask.distributed (which,
despite its name, runs very well on a single machine).

You can pass futures as inputs to submit. Dask automatically handles dependency
tracking; once all input futures have completed they will be moved onto a
single worker (if necessary), and then the computation that depends on them
will be started. You do not need to wait for inputs to finish before
submitting a new task; Dask will handle this automatically.

c = client.submit(add, a, b)  # calls add on the results of a and b
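For context, a minimal end-to-end sketch might look like the following (the inc and add functions and the local Client() are assumptions for illustration):

from dask.distributed import Client

client = Client()  # starts a local scheduler and workers

def inc(x):
    return x + 1

def add(x, y):
    return x + y

a = client.submit(inc, 1)     # returns a Future immediately
b = client.submit(inc, 2)     # runs concurrently with a
c = client.submit(add, a, b)  # waits for a and b, then runs add on their results
print(c.result())             # blocks until the result is ready -> 5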

Similar to Python’s map, you can use Client.map to call the same
function on many inputs:

futures = client.map(inc, range(1000))

However, note that each task comes with about 1ms of overhead. If you want to
map a function over a large number of inputs then you might consider
dask.bag or dask.dataframe instead.
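As a rough sketch of the alternative, the same thousand increments can be batched into a dask.bag so that each task covers many inputs rather than one (the partition count is an arbitrary choice, and inc is the function from above):

import dask.bag as db

b = db.from_sequence(range(1000), npartitions=10)  # ~100 elements per partition
results = b.map(inc).compute()                     # one task per partition, not per element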

Passing local data directly to submit and scattering it first both accomplish
the same result, but using scatter can sometimes be faster. This is especially
true if you use processes or distributed workers (where data transfer is
necessary) and you want to use df in many computations. Scattering the data
beforehand avoids excessive data movement.
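A hedged sketch of the two approaches, assuming a local pandas DataFrame df and a function process that the workers can run:

# Option 1: ship df with every task
future = client.submit(process, df)

# Option 2: scatter df once and reuse the resulting future
df_future = client.scatter(df)  # data moves to the workers once
futures = [client.submit(process, df_future) for _ in range(10)]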

Calling scatter on a list scatters all elements individually. Dask will spread
these elements evenly throughout workers in a round-robin fashion.
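For example, a minimal sketch (assuming a connected client):

>>> futures = client.scatter([1, 2, 3])  # one future per element
>>> len(futures)
3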

Dask will only compute and hold onto results for which there are active
futures. In this way your local variables define what is active in Dask. When
a future is garbage collected by your local Python session, Dask will feel free
to delete that data or stop ongoing computations that were trying to produce
it.

>>> del future  # deletes remote data once future is garbage collected

You can also explicitly cancel a task using the Future.cancel or
Client.cancel methods.

>>> future.cancel()  # deletes data even if other futures point to it

If a future fails, then Dask will raise the remote exception and traceback when
you try to get the result.
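A small sketch of this behavior, with a deliberately failing task:

>>> def fail():
...     raise ValueError("something went wrong")
>>> future = client.submit(fail)
>>> future.result()  # re-raises the remote ValueError along with its traceback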

As noted above, Dask will stop work that doesn’t have any active futures. It
thinks that because no one has a pointer to this data that no one cares. You
can tell Dask to compute a task anyway, even if there are no active futures,
using the fire_and_forget function:

>>> from dask.distributed import fire_and_forget
>>> fire_and_forget(c)

This is particularly useful when a future may go out of scope, for example as
part of a function:

def process(filename):
    out_filename = 'out-' + filename
    a = client.submit(load, filename)
    b = client.submit(process, a)
    c = client.submit(write, b, out_filename)
    fire_and_forget(c)
    return  # here we lose the reference to c, but that's now ok

for filename in filenames:
    process(filename)

However, each running task takes up a single thread, and so if you launch many
tasks that launch other tasks then it is possible to deadlock the system if you
are not careful. You can call the secede function from within a task to
have it remove itself from the dedicated thread pool into an administrative
thread that does not take up a slot within the Dask worker.
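A rough sketch with a recursive task that launches subtasks (fib itself is an assumed example; get_client, secede, and rejoin come from dask.distributed):

from dask.distributed import get_client, secede, rejoin

def fib(n):
    if n < 2:
        return n
    client = get_client()
    a_future = client.submit(fib, n - 1)  # launch tasks from within this task
    b_future = client.submit(fib, n - 2)
    secede()                              # free this worker thread while we wait
    a, b = client.gather([a_future, b_future])
    rejoin()                              # reclaim a thread before continuing
    return a + b

future = client.submit(fib, 10)
future.result()  # 55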

In the section above we saw that you could have multiple clients running at the
same time, each of which generated and manipulated futures. These clients can
coordinate with each other using Dask Queue and Variable objects, which
can communicate futures or small bits of data between clients sensibly.

Dask queues follow the API for the standard Python Queue, but now move futures
or small messages between clients. Queues serialize sensibly and reconnect
themselves on remote clients if necessary.

Queues can also send small pieces of information, anything that is msgpack
encodable (ints, strings, bools, lists, dicts, etc.). This can be useful to
send back small scores or administrative messages.
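A minimal sketch, assuming two connected clients and a queue name chosen for illustration:

from dask.distributed import Queue

scores = Queue('scores')        # clients sharing this name see the same queue

# client A
scores.put({'accuracy': 0.95})  # any msgpack-encodable message

# client B
msg = scores.get()              # blocks until a message arrives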

Under normal operation Dask will not run any tasks for which there is not
an active future (this avoids unnecessary work in many situations).
However sometimes you want to just fire off a task, not track its future,
and expect it to finish eventually. You can use this function on a future
or collection of futures to ask Dask to complete the task even if no active
client is tracking it.

The results will not be kept in memory after the task completes (unless
there is an active future) so this is only useful for tasks that depend on
side effects.

This opens up a new scheduling slot and a new thread for a new task. This
enables the client to schedule tasks on this node, which is
especially useful while waiting for other jobs to finish (e.g., with
client.gather).

>>> def mytask(x):
...     # do some work
...     client = get_client()
...     futures = client.map(...)  # do some remote work
...     secede()  # while that work happens, remove ourself from the pool
...     return client.gather(futures)  # return gathered results

The Client connects users to a dask.distributed compute cluster. It
provides an asynchronous user interface around functions and futures. This
class resembles executors in concurrent.futures but also allows
Future objects within submit/map calls.

Parameters:

address: string, or Cluster

This can be the address of a Scheduler server like a string
'127.0.0.1:8786' or a cluster object like LocalCluster()

timeout: int

Timeout duration for initial connection to the scheduler

set_as_default: bool (True)

Claim this scheduler as the global dask scheduler

scheduler_file: string (optional)

Path to a file with scheduler information if available

security: (optional)

Optional security information

asynchronous: bool (False by default)

Set to True if this client will be used within a Tornado event loop

name: string (optional)

Gives the client a name that will be included in logs generated on
the scheduler for matters relating to this client

heartbeat_interval: int

Time in milliseconds between heartbeats to scheduler

See also

distributed.scheduler.Scheduler

Internal scheduler

Examples

Provide cluster’s scheduler node address on initialization:

>>> client = Client('127.0.0.1:8786')

Use submit method to send individual computations to the cluster

>>> a = client.submit(add, 1, 2)
>>> b = client.submit(add, 10, 20)

Continue using submit or map on results to build up larger computations
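For example, combining the two futures from above:

>>> c = client.submit(add, a, b)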

This is true if the user signaled that we might be running in the event loop
when creating the client, as in the following:

client = Client(asynchronous=True)

However, we override this expectation if we can definitively tell that
we are running from a thread that is not the event loop. This is
common when calling get_client() from within a worker task. Even
though the client was originally created in asynchronous mode we may
find ourselves in contexts when it is better to operate synchronously.

You can specify data of interest either by providing futures or
collections in the futures= keyword or a list of explicit keys in
the keys= keyword. If neither are provided then all call stacks
will be returned.
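A rough usage sketch (slow_function is an assumed placeholder):

>>> future = client.submit(slow_function, 1)
>>> client.call_stack(futures=[future])  # call stack for just this task
>>> client.call_stack()                  # call stacks for everything currently running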

Starts computation of the collection on the cluster in the background.
Provides a new dask collection that is semantically identical to the
previous one, but now based off of futures currently in execution.

Parameters:

collections: sequence or single dask object

Collections like dask.array or dataframe or dask.value objects

optimize_graph: bool

Whether or not to optimize the underlying graphs

workers: str, list, dict

Which workers can run which parts of the computation.
If a string or list, then the output collections will run on the listed
workers, but other sub-computations can run anywhere.
If a dict, then keys should be (tuples of) collections and values
should be addresses or lists.

allow_other_workers: bool, list

If True then all restrictions in workers= are considered loose
If a list then only the keys for the listed collections are loose
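A minimal usage sketch of persist, using a small dask.array purely for illustration:

import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
x = x + x.T

x = client.persist(x)       # computation starts on the cluster in the background
result = x.sum().compute()  # later work reuses the persisted pieces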

This calls a function on all currently known workers immediately,
blocks until those results come back, and returns the results as a
dictionary keyed by worker address. This method is generally used for
side effects, such as collecting diagnostic information or installing
libraries.

If your function takes an input argument named dask_worker then
that variable will be populated with the worker itself.
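A small sketch of both forms (the helper functions are assumptions for illustration):

import os

def pid():
    return os.getpid()

client.run(pid)  # e.g. {'tcp://127.0.0.1:41234': 12345, ...}

def worker_address(dask_worker):
    return dask_worker.address  # dask_worker is populated with the Worker object

client.run(worker_address)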

This moves data from the local client process into the workers of the
distributed scheduler. Note that it is often better to submit jobs to
your workers to have them load the data rather than loading data
locally and then scattering it out to them.

This allows multiple clients to share futures or small bits of data between
each other with a multi-producer/multi-consumer queue. All metadata is
sequentialized through the scheduler.

Elements of the Queue must be either Futures or msgpack-encodable data
(ints, strings, lists, dicts). All data is sent through the scheduler so
it is wise not to send large objects. To share large objects scatter the
data and share the future instead.

This allows multiple clients to share futures and data between each other
with a single mutable variable. All metadata is sequentialized through the
scheduler. Race conditions can occur.

Values must be either Futures or msgpack-encodable data (ints, lists,
strings, etc.). All data will be kept and sent through the scheduler, so
it is wise not to send too much. If you want to share a large amount of
data then scatter it and share the future instead.
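A minimal sketch, assuming connected clients and a variable name chosen for illustration:

from dask.distributed import Variable

stop = Variable('stopping-criterion')  # clients sharing this name see the same variable
stop.set(False)                        # one client stores a small value

# elsewhere, another client reads the current value
should_stop = Variable('stopping-criterion').get()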