Tools for performing hyperparameter optimization of Scikit-Learn models using
Dask.

Introduction

This library provides Dask-based implementations of Scikit-Learn's
GridSearchCV and RandomizedSearchCV. They support many (but not all) of the
same parameters, and should be drop-in replacements for the subset that they
do support. For certain problems, these implementations can be more efficient
than those in Scikit-Learn, as they avoid expensive repeated computations.

:ref:`Avoid repeated work <avoid-repeated-work>`. Candidate estimators with
identical parameters and inputs are fit only once. For composite estimators
such as Pipeline, this can be significantly more efficient, because shared
early steps are not refit for every candidate.

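Install

Dask-searchcv is available on PyPI and conda-forge. Install it with
``pip install dask-searchcv`` or ``conda install dask-searchcv -c conda-forge``.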

Walkthrough

Drop-In Replacement

Dask-searchcv provides (almost) drop-in replacements for Scikit-Learn's
GridSearchCV and RandomizedSearchCV. With the exception of a few
keyword arguments, the APIs are the same, and often only an import
change is necessary.
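
A minimal sketch of that import change, assuming Scikit-Learn is installed. The dataset, estimator and parameter grid are illustrative choices, not taken from this document. The fallback import is only there so the sketch still runs without dask-searchcv, which is possible because the two classes share the arguments used here:

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

try:
    # The only line that changes relative to plain Scikit-Learn code.
    from dask_searchcv import GridSearchCV
except ImportError:
    # Fallback so the sketch still runs when dask-searchcv is not installed.
    from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# Hypothetical small grid, chosen only to keep the example fast.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.001, 0.01]}

# Only arguments common to both implementations are used here.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The fitted object exposes the usual attributes, such as best_params_ and best_score_, whichever import was used.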

Flexible Backends

Dask-searchcv can use any of the Dask schedulers. The threaded scheduler is
used by default, but it can be swapped for the multiprocessing or distributed
scheduler.

Works Well With Dask Collections

Dask collections such as dask.array, dask.dataframe and
dask.delayed can be passed to fit. This means you can use Dask for
data loading and preprocessing as well, which keeps the whole workflow in one
place. It also lets you work with remote data on a cluster without ever
pulling it onto your local machine.

Avoid Repeated Work

When searching over composite estimators like sklearn.pipeline.Pipeline or
sklearn.pipeline.FeatureUnion, dask-searchcv will avoid fitting the same
estimator + parameter + data combination more than once. For pipelines with
expensive early steps this can be substantially faster, because each early
step is fit once per distinct parameter setting and training split rather
than once per candidate.

The Scikit-Learn version ends up fitting earlier steps in the pipeline
multiple times with the same parameters and data. Because Dask is more
flexible than Joblib, dask-searchcv can merge these tasks in the graph and
perform the fit step only once for any parameter/data/estimator combination.
For pipelines that have relatively expensive early steps, this can be a big
win when performing a grid search.
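
The saving can be estimated with a toy count. The sketch below is not how dask-searchcv is implemented (it merges tasks in the Dask graph). It only models the bookkeeping for a hypothetical two-step pipeline, where the first step depends on its own parameter and the training fold. The function name and grid values are made up for illustration:

```python
from itertools import product


def count_first_step_fits(first_params, second_params, n_folds, share_work):
    """Count how often the first pipeline step is fit during a grid search."""
    seen = set()
    fits = 0
    for fold, p1, _p2 in product(range(n_folds), first_params, second_params):
        # The first step only depends on its own parameter and the fold.
        key = (fold, p1)
        if share_work and key in seen:
            continue  # same estimator + parameters + data: reuse the fit
        seen.add(key)
        fits += 1
    return fits


first_params = [16, 32]          # hypothetical settings for the first step
second_params = [0.1, 1.0, 10.0]  # hypothetical settings for the second step

naive = count_first_step_fits(first_params, second_params, 3, share_work=False)
shared = count_first_step_fits(first_params, second_params, 3, share_work=True)
print(naive, shared)  # 18 6
```

With three folds, the first step is fit 18 times when every candidate is fit independently and 6 times when identical fits are shared. The gap grows with the number of settings searched for later steps.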