The most common way to use dask-yarn is to distribute an archived Python
environment throughout the YARN cluster as part of the application. Packaging
the environment for distribution is typically handled using a tool such as
venv-pack (for virtual environments) or conda-pack (for conda environments).

You can package a virtual environment using venv-pack. The virtual environment
can be created using either venv or virtualenv. Note that the Python
interpreter the virtual environment links to must exist and be accessible on
every node in the YARN cluster. If the environment was created with a different
Python, you can change the link path using the --python-prefix flag. For more
information see the venv-pack documentation.
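For example, creating and packaging a fresh virtual environment might look like
the following sketch (the environment and archive names are arbitrary, and it
assumes venv-pack is available on your PATH):

```shell
# Create and activate a virtual environment
python -m venv my-env
source my-env/bin/activate

# Install dask-yarn (and any other packages your workload needs),
# plus venv-pack itself for the archiving step
pip install dask-yarn venv-pack

# Archive the environment for distribution to the YARN cluster
venv-pack -o my-env.tar.gz
```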

You can now start a cluster with the packaged environment by passing the
path to the constructor, e.g. YarnCluster(environment='my-env.tar.gz',...).

Note that if the environment is a local file, the archive will be automatically
uploaded to a temporary directory on HDFS before starting the application. If
you find yourself reusing the same environment multiple times, you may want to
upload the environment to HDFS once beforehand to avoid repeating this process
for each cluster (the environment is then specified as
hdfs:///path/to/my-env.tar.gz).
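The one-time upload might look like the following sketch (the destination
directory /envs is an arbitrary example, and it assumes the hdfs CLI is
configured for your cluster):

```shell
# Copy the packaged environment to a shared location on HDFS
# (the /envs directory here is just an example path)
hdfs dfs -mkdir -p /envs
hdfs dfs -put my-env.tar.gz /envs/my-env.tar.gz

# Clusters can then reference it as hdfs:///envs/my-env.tar.gz
```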

After startup you may want to verify that your versions match with the
following:

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    cluster = YarnCluster(environment='my-env.tar.gz')
    client = Client(cluster)
    client.get_versions(check=True)  # check that versions match between all nodes

If the same Python environment already exists at a consistent location on every
node, you can skip the packaging step and point dask-yarn at the environment
(or Python executable) directly:

    from dask_yarn import YarnCluster

    # Use a conda environment at /path/to/my/conda/env
    cluster = YarnCluster(environment='conda:///path/to/my/conda/env')

    # Use a virtual environment at /path/to/my/virtual/env
    cluster = YarnCluster(environment='venv:///path/to/my/virtual/env')

    # Use a Python executable at /path/to/my/python
    cluster = YarnCluster(environment='python:///path/to/my/python')

As before, these environments can contain any Python packages, but must include
dask-yarn (and its dependencies) at a minimum. It is also very important that
these environments are uniform across all nodes; mismatched environments can
lead to hard-to-diagnose issues. To check this, you can use the
Client.get_versions method:

    from dask.distributed import Client

    client = Client(cluster)
    client.get_versions(check=True)  # check that versions match between all nodes