For many problems the built-in dask collections (dask.array,
dask.dataframe, dask.bag, and dask.delayed) are sufficient. For
cases where they aren't, it's possible to create your own dask collections. Here
we describe the methods required to fulfill the dask collection interface.

Warning

The custom collection API is experimental and subject to change
without going through a deprecation cycle.

Note

This is considered an advanced feature. For most cases the built-in
collections are probably sufficient.

__dask_postcompute__()

Return the finalizer and (optional) extra arguments to convert the computed
results into their in-memory representation.

Used to implement dask.compute.

Returns:

finalize : callable

A function with the signature finalize(results, *extra_args).
Called with the computed results in the same structure as the
corresponding keys from __dask_keys__, as well as any extra
arguments as specified in extra_args. Should perform any necessary
finalization before returning the corresponding in-memory collection
from compute. For example, the finalize function for
dask.array.Array concatenates all the individual array chunks into
one large numpy array, which is then the result of compute.

extra_args : tuple

Any extra arguments to pass to finalize after results. If there are
no extra arguments, this should be an empty tuple.
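
As a rough sketch, compute conceptually applies these two return values as
follows (collection and results here are hypothetical placeholders):

# Hypothetical sketch: `collection` is a dask collection and `results`
# has the same structure as collection.__dask_keys__()
finalize, extra_args = collection.__dask_postcompute__()
in_memory_value = finalize(results, *extra_args)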

__dask_postpersist__()

Return the rebuilder and (optional) extra arguments to rebuild an equivalent
dask collection from a persisted graph.

Used to implement dask.persist.

Returns:

rebuild : callable

A function with the signature rebuild(dsk, *extra_args). Called
with a persisted graph containing only the keys and results from
__dask_keys__, as well as any extra arguments as specified in
extra_args. Should return an equivalent dask collection with the
same keys as self, but with the results already computed. For
example, the rebuild function for dask.array.Array is just the
__init__ method called with the new graph but the same metadata.

extra_args : tuple

Any extra arguments to pass to rebuild after dsk. If there are no
extra arguments, this should be an empty tuple.
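
As a rough sketch, persist conceptually applies these two return values as
follows (collection and persisted_graph here are hypothetical placeholders):

# Hypothetical sketch: `persisted_graph` maps the keys from
# collection.__dask_keys__() to their computed (or future) results
rebuild, extra_args = collection.__dask_postpersist__()
new_collection = rebuild(persisted_graph, *extra_args)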

Graph Merging & Optimization

First the individual collections are converted to a single large graph and
nested list of keys. How this happens depends on the value of the
optimize_graph keyword, which each function takes:

If optimize_graph is True (default) then the collections are first
grouped by their __dask_optimize__ methods. All collections with the
same __dask_optimize__ method have their graphs merged and keys
concatenated, and then a single call to each respective
__dask_optimize__ is made with the merged graphs and keys. The
resulting graphs are then merged.

If optimize_graph is False then all the graphs are merged and all
the keys concatenated.

After this stage there is a single large graph and nested list of keys which
represents all the collections.

Computation

After the graphs are merged and any optimizations performed, the resulting
large graph and nested list of keys are passed on to the scheduler. The
scheduler to use is chosen as follows:

If a get function is specified directly as a keyword, use that.

Otherwise, if a global scheduler is set, use that.

Otherwise, fall back to the default scheduler for the given collections.
Note that an error is raised if the collections don't all share the same
__dask_scheduler__.
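
For example, to override the scheduler for a single call you can pass a get
function directly, as described above (x here is assumed to be some dask
collection; the threaded scheduler is just an arbitrary choice):

import dask.threaded

# Pass a scheduler ``get`` function directly for this call only;
# otherwise the global or collection default is used
result = x.compute(get=dask.threaded.get)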

Once the appropriate scheduler get function is determined, it’s called
with the merged graph, keys, and extra keyword arguments. After this stage
results is a nested list of values. The structure of this list mirrors
that of keys, with each key substituted with its corresponding result.
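
For example (the key names here are made up for illustration):

# If the merged keys were this nested list:
keys = [[('x', 0), ('x', 1)], [('y', 0)]]

# then ``results`` would mirror its structure, with each key replaced
# by its computed value:
results = [[10, 20], [30]]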

Postcompute

After the results are generated the output values of compute need to be
built. This is what the __dask_postcompute__ method is for.
__dask_postcompute__ returns two things:

A finalize function, which takes in the results for the corresponding
keys

A tuple of extra arguments to pass to finalize after the results

To build the outputs, the list of collections and results is iterated over,
and the finalizer for each collection is called on its respective results.

In pseudocode this process looks like:

def compute(*collections, **kwargs):
    # 1. Graph Merging & Optimization
    # -------------------------------
    if kwargs.pop('optimize_graph', True):
        # If optimization is turned on, group the collections by
        # optimization method, and apply each method only once to the
        # merged sub-graphs.
        optimization_groups = groupby_optimization_methods(collections)

        graphs = []
        for optimize_method, cols in optimization_groups:
            # Merge the graphs and keys for the subset of collections
            # that share this optimization method
            sub_graph = merge_graphs([x.__dask_graph__() for x in cols])
            sub_keys = [x.__dask_keys__() for x in cols]

            # kwargs are forwarded to ``__dask_optimize__`` from compute
            optimized_graph = optimize_method(sub_graph, sub_keys, **kwargs)

            graphs.append(optimized_graph)
        graph = merge_graphs(graphs)
    else:
        graph = merge_graphs([x.__dask_graph__() for x in collections])

    # Keys are always the same
    keys = [x.__dask_keys__() for x in collections]

    # 2. Computation
    # --------------
    # Determine the appropriate get function based on collections,
    # global settings, and keyword arguments
    get = determine_get_function(collections, **kwargs)
    # Pass the merged graph, keys, and kwargs to ``get``
    results = get(graph, keys, **kwargs)

    # 3. Postcompute
    # --------------
    output = []
    # Iterate over the results and collections
    for res, collection in zip(results, collections):
        finalize, extra_args = collection.__dask_postcompute__()
        out = finalize(res, *extra_args)
        output.append(out)

    # `dask.compute` always returns tuples
    return tuple(output)

Persist is very similar to compute, except for how the return values are
created. It too has three stages:

Graph Merging & Optimization

Same as in compute.

Computation

Same as in compute, except in the case of the distributed scheduler,
where the values in results are futures rather than concrete values.

Postpersist

Similar to __dask_postcompute__, __dask_postpersist__ is used to
rebuild values in a call to persist. __dask_postpersist__ returns
two things:

A rebuild function, which takes in a persisted graph. The keys of
this graph are the same as __dask_keys__ for the corresponding
collection, and the values are computed results (for the single machine
scheduler) or futures (for the distributed scheduler).

A tuple of extra arguments to pass to rebuild after the graph

To build the outputs of persist, the list of collections and results is
iterated over, and the rebuilder for each collection is called on the graph
for its respective results.

In pseudocode this looks like:

def persist(*collections, **kwargs):
    # 1. Graph Merging & Optimization
    # -------------------------------
    # **Same as in compute**
    graph = ...
    keys = ...

    # 2. Computation
    # --------------
    # **Same as in compute**
    results = ...

    # 3. Postpersist
    # --------------
    output = []
    # Iterate over the results and collections
    for res, collection in zip(results, collections):
        # res has the same structure as keys
        keys = collection.__dask_keys__()
        # Get the computed graph for this collection.
        # Here flatten converts a nested list into a single list
        subgraph = {k: r for (k, r) in zip(flatten(keys), flatten(res))}

        # Rebuild the output dask collection with the computed graph
        rebuild, extra_args = collection.__dask_postpersist__()
        out = rebuild(subgraph, *extra_args)
        output.append(out)

    # dask.persist always returns tuples
    return tuple(output)

Defining the above interface will allow your object to be used by the core
dask functions (dask.compute, dask.persist, dask.visualize, etc.). To add
corresponding method versions of these, subclass from
dask.base.DaskMethodsMixin, which adds implementations of compute,
persist, and visualize based on the interface above.

Here we create a dask collection representing a tuple. Every element in the
tuple is represented as a task in the graph. Note that this is for illustration
purposes only - the same user experience could be achieved using normal tuples
containing dask.delayed elements.

# Saved as dask_tuple.py
import dask.threaded
from dask.base import DaskMethodsMixin
from dask.optimization import cull

# We subclass from DaskMethodsMixin to add common dask methods to our
# class. This is nice but not necessary for creating a dask collection.
class Tuple(DaskMethodsMixin):
    def __init__(self, dsk, keys):
        # The init method takes in a dask graph and a set of keys to use
        # as outputs.
        self._dsk = dsk
        self._keys = keys

    def __dask_graph__(self):
        return self._dsk

    def __dask_keys__(self):
        return self._keys

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        # We cull unnecessary tasks here. Note that this isn't necessary,
        # dask will do this automatically, this just shows one optimization
        # you could do.
        dsk2, _ = cull(dsk, keys)
        return dsk2

    # Use the threaded scheduler by default.
    __dask_scheduler__ = staticmethod(dask.threaded.get)

    def __dask_postcompute__(self):
        # We want to return the results as a tuple, so our finalize
        # function is `tuple`. There are no extra arguments, so we also
        # return an empty tuple.
        return tuple, ()

    def __dask_postpersist__(self):
        # Since our __init__ takes a graph as its first argument, our
        # rebuild function can just be the class itself. For extra
        # arguments we also return a tuple containing just the keys.
        return Tuple, (self._keys,)

    def __dask_tokenize__(self):
        # For tokenize to work we want to return a value that fully
        # represents this object. In this case it's the list of keys
        # to be computed.
        return tuple(self._keys)
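
A brief sketch of using this class (the example graph and values are made up
for illustration):

>>> from dask_tuple import Tuple
>>> from operator import add, mul

# Define a dask graph; tuples like (add, 'k0', 1) are dask tasks
>>> dsk = {'k0': 1,
...        ('x', 0): (add, 'k0', 1),
...        ('x', 1): (add, ('x', 0), 1),
...        ('x', 2): (mul, ('x', 1), 2),
...        ('x', 3): (add, ('x', 2), ('x', 1))}

# The output keys for this graph
>>> keys = [('x', 0), ('x', 1), ('x', 2), ('x', 3)]

>>> x = Tuple(dsk, keys)

# compute (from DaskMethodsMixin) finalizes the results into a tuple
>>> x.compute()
(2, 3, 6, 9)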

Dask implements its own deterministic hash function to generate keys based on
the value of arguments. This function is available as dask.base.tokenize.
Many common types already have implementations of tokenize, which can be
found in dask/base.py.
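
For example, equal inputs always hash to the same token:

>>> from dask.base import tokenize
>>> tokenize([1, 2, 3]) == tokenize([1, 2, 3])
True
>>> tokenize([1, 2, 3]) == tokenize([1, 2, 4])
False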

When creating your own custom classes you may need to register a tokenize
implementation. There are two ways to do this:

Note

Both dask collections and normal python objects can have
implementations of tokenize using either of the methods
described below.

The __dask_tokenize__ method

Where possible, it’s recommended to define the __dask_tokenize__ method.
This method takes no arguments and should return a value fully
representative of the object.
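
For example, a minimal sketch for a hypothetical Point class:

from dask.base import tokenize

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __dask_tokenize__(self):
        # Return a value that fully represents this object
        return ("Point", self.x, self.y)

# Equal points now produce equal tokens
assert tokenize(Point(1, 2)) == tokenize(Point(1, 2))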

Register a function with dask.base.normalize_token

If defining a method on the class isn't possible, you can register a tokenize
function with the normalize_token dispatch. The function should take the
object as its sole argument and return a value fully representative of it.
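
For example, a sketch registering the hypothetical Point class from above
without defining a method on it:

from dask.base import normalize_token

@normalize_token.register(Point)
def tokenize_point(p):
    # Return the same representative value as __dask_tokenize__ would
    return ("Point", p.x, p.y)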

In both cases the implementation should be the same, only the location of the
definition is different.