Debugging parallel programs is hard. Normal debugging tools like logging and
using pdb to interact with tracebacks stop working normally when exceptions
occur on far-away machines, or in different processes or threads.

Dask has a variety of mechanisms to make this process easier. Depending on
your situation some of these approaches may be more appropriate than others.

These approaches are ordered from lightweight or easy solutions to more
involved solutions.

When a task in your computation fails the standard way of understanding what
went wrong is to look at the exception and traceback. Often people do this
with the pdb module, IPython %debug or %pdb magics, or by just
looking at the traceback and investigating where in their code the exception
occurred.

Normally when a computation runs in a separate thread or on a different
machine these approaches break down. Dask provides a few mechanisms to
recreate the normal Python debugging experience.

By default, Dask already copies the exception and traceback wherever they
occur and reraises that exception locally. If your task failed with a
ZeroDivisionError remotely then you’ll get a ZeroDivisionError in your
interactive session. Similarly you’ll see a full traceback of where this error
occurred, which, just like in normal Python, can help you to identify the
troublesome spot in your code.

However, you cannot use the pdb module or %debug IPython magics with
these tracebacks to look at the value of variables during failure. You can
only inspect things visually. Additionally, the top of the traceback may be
filled with functions that are dask-specific and not relevant to your
problem; you can safely ignore these.

Dask ships with a simple single-threaded scheduler. This doesn’t offer any
parallel performance improvements, but does run your Dask computation
faithfully in your local thread, allowing you to use normal tools like pdb,
the %debug IPython magic, and profiling tools like the cProfile module and
snakeviz. This allows you to use all of your normal Python debugging tricks
in Dask computations, as long as you don’t need parallelism.
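For example, in recent versions of Dask the single-threaded scheduler can be
selected with the scheduler keyword, either per call or globally; the exact
option name has varied between releases, so treat this as a sketch:

>>> import dask

>>> x.compute(scheduler="single-threaded")    # run the whole graph in this thread

>>> # or set it for the whole debugging session
>>> dask.config.set(scheduler="single-threaded")
>>> x.compute()

Once an exception is raised here, %debug or pdb.pm() work exactly as they
would for ordinary Python code.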

This only works for single-machine schedulers. It does not work with
dask.distributed unless you are comfortable using the Tornado API (look at the
testing infrastructure
docs, which accomplish this). Also, because this operates on a single machine
it assumes that your computation can run on a single machine without exceeding
memory limits. It may be wise to use this approach on smaller versions of your
problem if possible.

If a remote task fails, we can collect the function and all inputs, bring them
to the local thread, and then rerun the function in hopes of triggering the
same exception locally, where normal debugging tools can be used.

With the single-machine schedulers, use the rerun_exceptions_locally=True
keyword:

x.compute(rerun_exceptions_locally=True)
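If the rerun does trigger the same error, you can then drop into the debugger
as usual. Purely as an illustration, in an IPython session:

>>> %pdb on                                   # drop into pdb whenever an exception is raised
>>> x.compute(rerun_exceptions_locally=True)  # the failing task reruns in this thread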

On the distributed scheduler use the recreate_error_locally method on
anything that contains Futures.
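A minimal sketch of how this looks with a distributed Client, continuing the
ZeroDivisionError example from above (the method lives on the Client; check
your version of distributed for the exact spelling):

>>> x.compute()
ZeroDivisionError(...)

>>> %pdb on
>>> future = client.compute(x)               # submit the computation, get back a Future
>>> client.recreate_error_locally(future)    # rerun the failing task in this local thread
ZeroDivisionError(...)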

Sometimes only parts of your computations fail, for example if some rows of a
CSV dataset are faulty in some way. When running with the distributed
scheduler you can remove chunks of your data that have produced bad results if
you switch to dealing with Futures.
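For example, assuming a Dask DataFrame df that has been persisted on the
cluster, dropping the partitions that errored looks roughly like this
(futures_of and from_delayed exist in distributed and dask.dataframe
respectively, but treat the details as a sketch):

>>> import dask.dataframe as dd
>>> from distributed.client import futures_of

>>> df = df.persist()                                    # start computing on the cluster
>>> futures = futures_of(df)                             # the futures backing each partition
>>> good = [f for f in futures if f.status != 'error']   # keep only the partitions that succeeded
>>> df = dd.from_delayed(good, meta=df._meta)            # rebuild the dataframe from them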

Not all errors present themselves as Exceptions. For example in a distributed
system workers may die unexpectedly or your computation may be unreasonably
slow due to inter-worker communication or scheduler overhead or one of several
other issues. Getting feedback about what’s going on can help to identify
both failures and general performance bottlenecks.

For the single-machine scheduler see diagnostics
documentation. The rest of the section will assume that you are using the
distributed scheduler where
these issues arise more commonly.

First, the distributed scheduler has a number of diagnostic web pages showing dozens of
recorded metrics like CPU, memory, network, and disk use, a history of previous
tasks, allocation of tasks to workers, worker memory pressure, work stealing,
open file handle limits, etc. Many problems can be correctly diagnosed by
inspecting these pages. By default these are available at
http://scheduler:8787/, http://scheduler:8788/, and http://worker:8789/,
where scheduler and worker should be replaced by the addresses of the
scheduler and each of the workers. See web diagnostic docs for more information.

The scheduler, workers, and client all emit logs using Python’s standard
logging module. By default
these emit to standard error. When Dask is launched by a cluster job scheduler
(SGE/SLURM/YARN/Mesos/Marathon/Kubernetes/whatever) that system will track
these logs and will have an interface to help you access them. If you are
launching Dask on your own they will probably dump to the screen unless you
redirect stderr to a file.

You can control the logging verbosity in the ~/.dask/config.yaml file.
Defaults currently look like the following:

logging:
  distributed: info
  distributed.client: warning
  bokeh: error

So, for example, you could add a line like distributed.worker: debug to get
very verbose output from the workers.

If you are using the distributed scheduler from a single machine you may be
setting up workers manually using the command line interface, or you may be
using LocalCluster,
which is what runs when you just call Client():

>>> from dask.distributed import Client, LocalCluster
>>> client = Client()  # This is actually the following two commands
>>> cluster = LocalCluster()
>>> client = Client(cluster.scheduler.address)

LocalCluster is useful because the scheduler and workers are in the same
process as you, so you can easily inspect their state while
they run (they are running in a separate thread).
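For example, from the same interactive session you can poke at the scheduler
object directly; attribute names have shifted between versions of distributed,
so the lines below are only a rough sketch:

>>> cluster.scheduler            # the live Scheduler object, running in this process
>>> cluster.scheduler.tasks      # low-level task state the scheduler is tracking
>>> cluster.workers              # handles to the workers started by LocalCluster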

You can also do this for the workers if you run them without nanny processes.

>>> cluster = LocalCluster(nanny=False)
>>> client = Client(cluster)

This can be very helpful if you want to use the dask.distributed API and still
want to investigate what is going on directly within the workers. Information
is not distilled for you like it is in the web diagnostics, but you have full
low-level access.
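Because the workers now live in your own process, you can reach into them
directly. The exact shape of cluster.workers (a list, or a dict keyed by
worker name) depends on your version of distributed, so this is only a sketch:

>>> cluster.workers              # the Worker objects running in this process
>>> w = cluster.workers[0]       # indexing is version-dependent
>>> w.address, len(w.data)       # where this worker listens and how many results it holds in memory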

Sometimes you want to inspect the state of your cluster, but you don’t have the
luxury of operating on a single machine. In these cases you can launch an
IPython kernel on the scheduler and on every worker, which lets you inspect
state on the scheduler and workers as computations are completing.
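With the distributed Client this has historically looked like the following;
the start_ipython_* methods and the %scheduler and %remote magics come from
the distributed library, and their availability depends on your version, so
treat this as a sketch (the worker address shown is made up):

>>> client.start_ipython_scheduler()         # launch an IPython kernel on the scheduler
>>> %scheduler scheduler.processing          # run a statement inside that kernel

>>> info = client.start_ipython_workers()    # launch an IPython kernel on each worker
>>> %remote info['192.168.1.101:5752'] worker.data   # inspect state on one worker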

This does not give you the ability to run %pdb or %debug on remote
machines; the tasks are still running in separate threads, and so are not
easily accessible from an interactive IPython session.