This document lists general directions that core contributors are interested
in seeing developed in scikit-learn. The fact that an item is listed here is
in no way a promise that it will happen, as resources are limited. Rather, it
is an indication that help is welcomed on this topic.

A more subtle change over the last decade is that, due to changing interests
in ML, PhD students in machine learning are now more likely to contribute to
PyTorch, Dask, etc. than to scikit-learn, so our contributor pool is very
different from what it was a decade ago.

Scikit-learn remains very popular in practice for trying out canonical
machine learning techniques, particularly for applications in experimental
science and in data science. Much of what we provide is now very mature.
But it is costly to maintain, and we therefore cannot include arbitrary
new implementations. Yet scikit-learn is also essential in defining an API
framework for the development of interoperable machine learning components
external to the core library.

The list is numbered not as an indication of the order of priority, but to
make referring to specific points easier. Please add new entries only at the
bottom.

1. Everything in Scikit-learn should conform to our API contract

- Pipeline and FeatureUnion modify their input parameters in fit. Fixing
  this requires a good grasp of their use cases to ensure that all current
  functionality is maintained (see the sketch below). #8157, #7382
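A minimal illustration of the current behaviour: the estimator instances
passed to the Pipeline constructor are fitted in place, i.e. fit modifies
its input parameters:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(random_state=0)
    scaler = StandardScaler()
    pipe = Pipeline([("scale", scaler), ("clf", LogisticRegression())])
    pipe.fit(X, y)

    # The instance passed at construction time was fitted in place, so the
    # input parameter of Pipeline was modified by fit:
    print(hasattr(scaler, "mean_"))  # True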

2. Improved handling of categorical features

- Tree-based models should be able to handle both continuous and categorical
  features #4899

- In dataset loaders

- As generic transformers to be used with ColumnTransformer (e.g. ordinal
  encoding supervised by correlation with target variable); a sketch of the
  ColumnTransformer pattern follows this list.
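A sketch of the ColumnTransformer pattern such transformers would plug into.
The supervised ordinal encoder mentioned above does not exist, so
OneHotEncoder stands in for it, and df is assumed to be a pandas DataFrame
with the named columns:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "city"]),
        ("num", StandardScaler(), ["age", "income"]),
    ])
    Xt = pre.fit_transform(df)  # df assumed to be a pandas DataFrame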

3. Improved handling of missing data

- Making sure meta-estimators are lenient towards missing data

- Non-trivial imputers

- Learners directly handling missing data

- An amputation sample generator to make parts of a dataset go missing
  (a sketch of such a generator follows this list)

- Handling mixtures of categorical and continuous variables
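A minimal sketch of what an amputation helper could look like; the name
ampute is hypothetical and only the simplest missing-completely-at-random
mechanism is shown:

    import numpy as np

    def ampute(X, missing_rate=0.1, random_state=None):
        """Hypothetical helper: set a random fraction of entries to NaN."""
        rng = np.random.default_rng(random_state)
        X = np.asarray(X, dtype=float).copy()
        mask = rng.random(X.shape) < missing_rate
        X[mask] = np.nan
        return X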

4. Passing around information that is not (X, y): Sample properties

- We need to be able to pass sample weights to scorers in cross validation.

- We should have standard/generalised ways of passing sample-wise properties
  around in meta-estimators (see the sketch below). #4497, #7646
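To make the gap concrete, here is the loop users must write by hand today
to weight both fitting and scoring, followed by one possible (hypothetical)
shape for a generalised API:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = rng.integers(0, 2, size=100)
    w = rng.random(100)
    clf = LogisticRegression()

    # Manual cross validation so that weights reach fit *and* the scorer:
    scores = []
    for train, test in KFold(n_splits=5).split(X):
        clf.fit(X[train], y[train], sample_weight=w[train])
        scores.append(accuracy_score(y[test], clf.predict(X[test]),
                                     sample_weight=w[test]))

    # Hypothetical generalised routing (NOT current API):
    # cross_val_score(clf, X, y, props={"sample_weight": w})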

5. Passing around information that is not (X, y): Feature properties

- Feature names or descriptions should ideally be available to fit for,
  e.g. . #6425, #6424

- Per-feature handling (e.g. “is this a nominal / ordinal / English language
  text?”) should ideally not need to be provided to estimator constructors,
  but should be available as metadata alongside X. #8480

6. Passing around information that is not (X, y): Target information

- We have problems getting the full set of classes to all components when
  the data is split/sampled (see the illustration below). #6231, #8100

- We have no way to handle a mixture of categorical and continuous targets.
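A small illustration of the class-propagation problem on made-up data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(104, 3))
    y = np.array([0] * 50 + [1] * 50 + [2] * 4)  # class 2 is rare

    # A subsample (e.g. one CV training fold) can miss the rare class
    # entirely, so the fitted component never learns that class 2 exists:
    clf = LogisticRegression().fit(X[:100], y[:100])
    print(clf.classes_)  # [0 1] -- predict_proba columns differ across folds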

7. Make it easier for external users to write Scikit-learn-compatible
components

- More flexible estimator checks that do not select by estimator name
  #6599, #6715

- Example of how to develop a meta-estimator (a minimal sketch follows
  this list)

- More self-sufficient running of scikit-learn-contrib or a similar resource
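Such an example could start from a minimal sketch like the following (the
class name is hypothetical; it delegates to a cloned sub-estimator and keeps
fitted state in a trailing-underscore attribute, per convention):

    from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone

    class DelegatingEstimator(MetaEstimatorMixin, BaseEstimator):
        """Hypothetical minimal meta-estimator wrapping another estimator."""

        def __init__(self, estimator):
            # Store constructor params unmodified (API contract).
            self.estimator = estimator

        def fit(self, X, y=None, **fit_params):
            # Clone so the user's instance is not modified in place.
            self.estimator_ = clone(self.estimator).fit(X, y, **fit_params)
            return self

        def predict(self, X):
            return self.estimator_.predict(X)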

8. Grid search and cross validation are not applicable to most clustering
tasks. Stability-based selection is more relevant (a rough sketch is given
below).
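A rough sketch of what stability-based selection can look like for k-means,
assuming that agreement between models fit on disjoint halves of the data is
an acceptable stability proxy; one would pick the n_clusters with the
highest score:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def stability(X, n_clusters, n_draws=20, seed=0):
        """Agreement of two k-means models fit on disjoint halves of X."""
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_draws):
            idx = rng.permutation(len(X))
            half = len(X) // 2
            km_a = KMeans(n_clusters=n_clusters,
                          random_state=0).fit(X[idx[:half]])
            km_b = KMeans(n_clusters=n_clusters,
                          random_state=0).fit(X[idx[half:]])
            # Label the same held-out points with both models and compare.
            scores.append(adjusted_rand_score(km_a.predict(X[idx[half:]]),
                                              km_b.labels_))
        return float(np.mean(scores))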

9. Improved tracking of fitting

- Verbose is not very friendly and should use a standard logging library
  #6929

- Callbacks or a similar system would facilitate logging and early stopping
  (see the sketch below)
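For illustration only, a callback protocol could be as small as the sketch
below; nothing like this exists in scikit-learn today, it is merely one
possible shape for such an API:

    import logging

    logger = logging.getLogger("sklearn")

    class LoggingCallback:
        """Hypothetical callback: log progress, request early stopping."""

        def __init__(self, patience=5):
            self.patience = patience
            self.best = float("inf")
            self.stale = 0

        def __call__(self, n_iter, loss):
            logger.info("iteration %d: loss=%.6f", n_iter, loss)
            self.stale = 0 if loss < self.best else self.stale + 1
            self.best = min(self.best, loss)
            # Returning True would ask the estimator to stop early.
            return self.stale >= self.patience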

10. Use scipy BLAS Cython bindings

- This will make it possible to get rid of our partial copy of suboptimal
  Atlas C-routines (see the Cython sketch below). #11638

- This should speed up the Windows and Linux wheels
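As a reference point, calling scipy's Cython BLAS bindings (here ddot)
instead of a vendored routine looks like this minimal Cython sketch:

    # Cython
    from scipy.linalg.cython_blas cimport ddot

    def dot(double[::1] x, double[::1] y):
        """Dot product of two contiguous double vectors via scipy's BLAS."""
        cdef int n = <int>x.shape[0]
        cdef int inc = 1
        return ddot(&n, &x[0], &inc, &y[0], &inc)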

11. Allow fine-grained parallelism in Cython

Now that we no longer use fork-based multiprocessing in joblib, it is
possible to use prange / OpenMP thread management, which makes very
efficient thread-based parallelism possible at the Cython level. Example
with K-Means: #11950
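A minimal Cython sketch of the kind of loop this enables (row sums computed
by parallel threads; the module must be compiled with OpenMP flags):

    # Cython, compiled with OpenMP support
    import numpy as np
    from cython.parallel import prange

    def row_sums(double[:, ::1] X):
        """Sum each row of X, with rows distributed over threads."""
        cdef Py_ssize_t i, j
        cdef Py_ssize_t n = X.shape[0]
        cdef Py_ssize_t m = X.shape[1]
        out_arr = np.zeros(n)
        cdef double[::1] out = out_arr
        for i in prange(n, nogil=True):
            for j in range(m):
                out[i] += X[i, j]
        return out_arr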

12. Distributed parallelism

- Joblib can now plug into several backends, some of which can distribute
  the computation across computers (see the sketch below)

- However, we want to stay high level in scikit-learn
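For instance, with a running Dask cluster (the scheduler address below is a
placeholder, and search stands for any joblib-backed estimator such as a
GridSearchCV), computation can already be distributed today:

    from dask.distributed import Client  # importing registers the backend
    from joblib import parallel_backend

    client = Client("tcp://scheduler:8786")  # placeholder address

    with parallel_backend("dask"):
        # e.g. search = GridSearchCV(...); its joblib tasks run on the
        # cluster instead of local processes/threads.
        search.fit(X, y)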

13. A way forward for more out of core

Dask enables easy out-of-core computation. While the dask model probably
cannot be adapted to all machine-learning algorithms, most machine
learning is on smaller data than ETL, hence we can maybe adapt to very
large scale while supporting only a fraction of the patterns.
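Independently of dask, the existing partial_fit API already covers a simple
out-of-core pattern; chunk_iterator below is a placeholder for any source of
data batches:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    all_classes = np.array([0, 1])  # full set of classes, known up front
    for X_chunk, y_chunk in chunk_iterator:  # placeholder batch source
        clf.partial_fit(X_chunk, y_chunk, classes=all_classes)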

14. Currently serialization (with pickle) breaks across versions. While we
may not be able to get around other limitations of pickle (security, etc.),
it would be great to offer cross-version safety from version 1.0. Note: Gael
and Olivier think that this can cause a heavy maintenance burden and we
should manage the trade-offs. A possible alternative is presented in the
following point.
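Until such safety exists, a common mitigation is to store the library
version next to the pickle and fail loudly on mismatch; a minimal sketch:

    import pickle

    import sklearn

    def save_model(model, path):
        with open(path, "wb") as f:
            pickle.dump({"sklearn_version": sklearn.__version__,
                         "model": model}, f)

    def load_model(path):
        with open(path, "rb") as f:
            payload = pickle.load(f)
        if payload["sklearn_version"] != sklearn.__version__:
            raise RuntimeError(
                "model was pickled with scikit-learn %s, running %s"
                % (payload["sklearn_version"], sklearn.__version__))
        return payload["model"]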

15. Documentation and tooling for model lifecycle management

- Document good practices for model deployment and lifecycle. Before
  deploying a model: snapshot the code versions (numpy, scipy, scikit-learn,
  custom code repo), the training script and an alias on how to retrieve
  historical training data; snapshot a copy of a small validation set; and
  snapshot the predictions of the model (predicted probabilities for
  classifiers) on that validation set (see the sketch below).
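A sketch of what such a snapshot step could look like; the manifest keys and
placeholder values are illustrative only, and model / X_val are assumed to
exist:

    import json
    import pickle

    import numpy as np
    import scipy
    import sklearn

    manifest = {
        "versions": {"numpy": np.__version__,
                     "scipy": scipy.__version__,
                     "scikit-learn": sklearn.__version__},
        "code_repo_commit": "<git sha of custom code>",  # placeholder
        "training_data_alias": "<how to fetch the data>",  # placeholder
    }
    with open("deploy_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

    # Snapshot the validation set and the model's predictions on it.
    with open("val_snapshot.pkl", "wb") as f:
        pickle.dump({"X_val": X_val,
                     "proba": model.predict_proba(X_val)}, f)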

- Document and provide tools to make it easy to manage upgrades of
  scikit-learn versions (a sketch of this check follows):

  - Try to load the old pickle; if it works, use the validation set
    prediction snapshot to check that the serialized model still behaves
    the same;

  - If joblib.load / pickle.load does not work, use the version-controlled
    training script and the historical training set to retrain the model,
    then use the validation set prediction snapshot to check that it is
    possible to recover the previous predictive performance: if this is not
    the case, there is probably a bug in scikit-learn that needs to be
    reported.
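A minimal sketch of this procedure, assuming the snapshot layout from the
previous point:

    import pickle

    import numpy as np

    def check_after_upgrade(model_path, snapshot_path):
        with open(snapshot_path, "rb") as f:
            snap = pickle.load(f)
        try:
            with open(model_path, "rb") as f:
                model = pickle.load(f)
        except Exception:
            return "retrain"  # fall back to the versioned training script
        same = np.allclose(model.predict_proba(snap["X_val"]),
                           snap["proba"], atol=1e-7)
        return "ok" if same else "investigate"  # possible scikit-learn bug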

- (Optional) Improve the scikit-learn common test suite to make sure that
  (at least frequently used) models have stable predictions across versions
  (to be discussed);

- Extend the documentation to mention how to deploy models in Python-free
  environments, for instance with ONNX, and use the above best practices to
  assess predictive consistency between scikit-learn and ONNX prediction
  functions on the validation set (see the sketch below).
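For instance, with the skl2onnx and onnxruntime packages the check could
look like the following sketch (a regressor is assumed for simplicity;
n_features, X_val and the tolerance are assumptions):

    import numpy as np
    import onnxruntime as rt
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    onx = convert_sklearn(
        model, initial_types=[("X", FloatTensorType([None, n_features]))])
    sess = rt.InferenceSession(onx.SerializeToString())

    # Compare ONNX predictions with scikit-learn's on the validation set.
    pred_onnx = sess.run(None, {"X": X_val.astype(np.float32)})[0]
    assert np.allclose(pred_onnx, model.predict(X_val), atol=1e-4)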

- Document good practices to detect temporal distribution drift for deployed
  models, and good practices for retraining on fresh data without causing
  catastrophic predictive performance regressions (a sketch of a simple
  drift check follows).
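As one simple example, a two-sample Kolmogorov-Smirnov test can compare
scores snapshotted at deployment time with scores on fresh data (the input
variables and the threshold are assumptions):

    from scipy.stats import ks_2samp

    # proba_at_deploy: scores snapshotted when the model was deployed
    # proba_fresh: scores of the same model on recent production data
    stat, pvalue = ks_2samp(proba_at_deploy, proba_fresh)
    if pvalue < 0.01:  # arbitrary threshold
        print("possible drift: investigate and consider retraining")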

16. More didactic documentation

More and more options have been added to scikit-learn. As a result, the
documentation is crowded, which makes it hard for beginners to get the big
picture. Some work could be done in prioritizing the information.