The sheer volume of time series data can demand analysis at scale, so the time series need to be processed on a group of computational units instead of on a single machine.

Accordingly, it may be necessary to distribute the extraction of time series features to a cluster.
Indeed, it is possible to extract features with tsfresh in a distributed fashion.
This page explains how to set up a distributed tsfresh.

Essentially, a Distributor organizes the application of the feature calculators to chunks of data.
It maps the feature calculators onto the data chunks and then reduces the results, meaning that it combines the outputs of the
individual mappings into one object, the feature matrix.

Finally, it closes all connections, shuts down all resources and cleans everything up (via its close() method).
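Schematically, the map/reduce flow described above can be sketched as follows. This is a simplified illustration only, not the actual tsfresh implementation; the function and variable names are made up for this example:

```python
# Schematic sketch of a Distributor's map/reduce step.
# NOTE: simplified illustration, not the real tsfresh code.
def map_reduce(map_function, data_chunks):
    # "map": apply the feature calculators to every data chunk
    partial_results = [map_function(chunk) for chunk in data_chunks]
    # "reduce": combine the individual chunk results into one flat object
    return [feature for chunk_result in partial_results
            for feature in chunk_result]

# toy usage: double every value in each chunk
print(map_reduce(lambda chunk: [2 * x for x in chunk], [[1, 2], [3]]))
# prints [2, 4, 6]
```

A real Distributor replaces the list comprehension in the "map" step with parallel execution on its pool of workers; the "reduce" step stays a cheap local combination of the partial results.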

So, how can you use such a Distributor to extract features with tsfresh?
You have to pass it as the distributor argument to the extract_features()
method.

The following example shows how to define a MultiprocessingDistributor, which will distribute the calculations to a
local pool of worker processes:

from tsfresh.examples.robot_execution_failures import \
    download_robot_execution_failures, \
    load_robot_execution_failures
from tsfresh.feature_extraction import extract_features
from tsfresh.utilities.distribution import MultiprocessingDistributor

# download and load some time series data
download_robot_execution_failures()
df, y = load_robot_execution_failures()

# We construct a Distributor that will spawn the calculations
# over four workers on the local machine
Distributor = MultiprocessingDistributor(n_workers=4,
                                         disable_progressbar=False,
                                         progressbar_title="Feature Extraction")

# we just have to pass the Distributor object to the
# feature extraction, along with the other parameters
X = extract_features(timeseries_container=df,
                     column_id='id', column_sort='time',
                     distributor=Distributor)

This example actually corresponds to the existing multiprocessing tsfresh API, where you just specify the number of
jobs, without the need to construct the Distributor:

We also provide a Distributor for the Dask framework, where
"Dask is a flexible parallel computing library for analytic computing."

Dask is a great framework for distributing analytic calculations to a cluster.
It scales up and down, meaning that you can even use it on a single machine.
The only thing you need to run tsfresh on a Dask cluster is the IP address and port number of the
dask-scheduler.

Let's say that your Dask scheduler is running at 192.168.0.1:8786; then we can easily construct a
ClusterDaskDistributor that connects to the scheduler and distributes the
time series data and the calculations to the cluster:
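Under that assumption (a dask-scheduler reachable at 192.168.0.1:8786), the code would look like the following sketch; df is assumed to be a time series container such as the robot execution failures data from the first example:

```python
from tsfresh.feature_extraction import extract_features
from tsfresh.utilities.distribution import ClusterDaskDistributor

# connect to the dask-scheduler of the (hypothetical) cluster
Distributor = ClusterDaskDistributor(address="192.168.0.1:8786")

X = extract_features(timeseries_container=df,
                     column_id='id', column_sort='time',
                     distributor=Distributor)
```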

Compared to the MultiprocessingDistributor example from above, we only had to
change one line to switch from one machine to a whole cluster.
It is as easy as that.
By changing the Distributor you can easily deploy your application to run on a cluster instead of your workstation.

You can also use a local Dask cluster on your local machine to emulate a Dask network.
The following example shows how to set up a LocalDaskDistributor on a local cluster
of 3 workers: