Monday, July 29, 2013

Yesterday a week-long scikit-learn coding sprint in Paris ended.
And let me just say: a week is pretty long for a sprint. I think most of us were pretty exhausted in the end. But we put together a release candidate for 0.14 that Gael Varoquaux tagged last night.

You can install it via:

pip install -U https://github.com/scikit-learn/scikit-learn/archive/0.14a1.zip

The purpose of the release candidate is to give users a chance to give us feedback before the release. So please try it out and report back if you have any issues.

New Website

Before I start talking about the release candidate and the sprint, I want to mention the new face of scikit-learn.org. I think it is an improvement with respect to design, but it is an even bigger and more important improvement with respect to navigation and accessibility of the docs. The new page was drafted by Nelle Varoquaux, Vincent Michel and me, and the design is mostly due to Gilles Louppe, who I think did an amazing job.

Basically we redid the front page to give a short overview of the package, and added a documentation overview, to make it easier to find things.
Feedback on design and navigation is more than welcome, either on the mailing list or on the issue tracker.

We are also trying to address the issue of having completely separate pages for different versions, and of Google links that do not point to the latest stable version. But we will have to see how that will play out.

Release Candidate for 0.14

Now to the release candidate:
This is the first time we have done a release candidate. It means that people can choose to install the upcoming release, and we can include changes based on their feedback before switching the default install version to 0.14.

There are a lot of real killer features in this version. You can find the full change log here. Let me say a bit about the sexiest new features.

Python 3 Support

We now have full Python 3 support, or more precisely Python 3.3 support. Using six, we support Python 2.6, 2.7 and 3.3 from a single code base.

Faster Trees and Forests

Gilles did a complete rewrite of the tree module, with the goal of decreasing runtime and memory consumption of all tree based estimators.
As a consequence, random forests are about 50%-300% faster, and for extremely randomized trees it looks even better. That makes the scikit-learn implementation about as fast as the commercial implementation by wise.io.
It is very hard to create a fair benchmark as run times vary widely with parameter settings and data sets. To his credit, Gilles didn't want to publish any timing results before he had time to perform extensive tests. But it does look pretty good.

AdaBoost Classification and Regression

AdaBoost is a classical weighted boosting method that was implemented for scikit-learn by Noel Dawe and Gilles. By default, the implementation uses decision trees or stumps, but it can be used with any other estimator that supports sample weights.
The algorithm is generally applicable and often performs very well in practice.
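As a rough illustration of the default setup described above, here is a minimal sketch. The synthetic data set, the train/test split and the parameter values are assumptions made for this example, not taken from the scikit-learn example gallery.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# With no base estimator given, AdaBoostClassifier boosts decision stumps
# (decision trees of depth 1).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print("held-out accuracy: %.2f" % test_accuracy)
```

Any classifier that accepts sample weights in fit could be passed as the base estimator instead of the default stumps.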
Unfortunately there is an issue with building the Sphinx documentation, and the API is currently not visible on the dev website. You can still look at the docstring in IPython, though, and it will be fixed for the release.
Here is one of the examples:

Restricted Boltzmann Machines

This one is a guest performance by Yann Dauphin, an expert in deep learning and feature learning. Restricted Boltzmann Machines are usually used as feature extraction algorithms or for matrix completion problems.
They are a generative graphical model that can approximate very complex data distributions, and were made popular as an initialization for neural networks in the deep learning paradigm.
The scikit-learn implementation is of the Bernoulli Restricted Boltzmann Machine, which means that the input as well as the learned features are binary.
Often, this is relaxed to input and output that is continuous between 0 and 1, but concentrated at these two values.
This is one of the basic building blocks for deep learning, and RBMs can be made into a Deep Belief Network simply by stacking them using a Pipeline.
On the downside, RBMs often take long to train on the CPU, and it is not always clear whether they perform better in feature extraction tasks than simpler methods, such as K-Means based encodings.
One of the benefits of having an implementation in scikit-learn will be that many more people will be experimenting with it, which will lead to a better understanding of the behavior in practice.
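The stacking idea can be sketched roughly as follows. The digits data set, the layer sizes and the learning settings are assumptions chosen to keep the example small; they are not tuned and not taken from the scikit-learn documentation.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

digits = load_digits()
# Scale pixel intensities from [0, 16] to [0, 1], as BernoulliRBM expects.
X, y = digits.data / 16.0, digits.target

# Two RBM layers feeding a linear classifier. Each RBM is trained
# unsupervised on the output of the previous step.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.06,
                          n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.06,
                          n_iter=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Features learned by the stack: hidden-unit activation probabilities.
hidden1 = model.named_steps["rbm1"].transform(X)
hidden2 = model.named_steps["rbm2"].transform(hidden1)
print(hidden2.shape)
```

Note that this only does the greedy layer-wise feature learning; there is no joint fine-tuning of the layers.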
Of course, no mention of deep learning is complete without a plot of learned filters:

Missing Value Imputation

A very recent addition, and the product of the Google Summer of Code by Nicolas Trésegnie. Until now, scikit-learn did not support missing values in any estimators, as they are often hard to handle. Unfortunately, missing values pop up frequently in practical applications.
Nicolas introduced a new estimator, the Imputer, which can be used to preprocess data and fill in missing values using several strategies.
Currently, only simple, but still effective methods are implemented, such as using the mean or median of a feature. For details, see the documentation.
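A minimal sketch of mean imputation on a toy array follows. One caveat: in later scikit-learn releases the Imputer class was replaced by SimpleImputer in sklearn.impute, so the import below tries the 0.14 location first and falls back to the newer one. The toy data is made up for illustration.

```python
import numpy as np

try:
    # Location in the 0.14 series.
    from sklearn.preprocessing import Imputer
except ImportError:
    # Later releases provide the same functionality as SimpleImputer.
    from sklearn.impute import SimpleImputer as Imputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN by the mean of the observed values in its column.
imputer = Imputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
# Column means are (1 + 7) / 2 = 4.0 and (2 + 3) / 2 = 2.5, so the result is
# [[1.0, 2.0], [4.0, 3.0], [7.0, 2.5]]
```

Because it is an estimator with fit and transform, it can be used as the first step of a Pipeline.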

Randomized Parameter Optimization

Randomized parameter search, as an alternative to grid search, is one of the few things that I did for the current release. It is an implementation of the approach put forward by James Bergstra.
The basic idea is to overcome the curse of dimensionality for hyper-parameters by using random sampling. Let me elaborate: for algorithms with many hyper-parameters, such as complicated pipelines or neural networks, it is often not feasible to do a grid search over all parameter settings of interest.
On top of that, it is not always clear which parameters are relevant and which are not.
By specifying distributions over the parameter space and sampling from them, it is possible to overcome this problem in part.
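In code, specifying distributions might look like the sketch below. The estimator, the data set and the ranges of the distributions are assumptions picked for illustration. In 0.14 the class lives in sklearn.grid_search; later releases moved it to sklearn.model_selection, hence the guarded import.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.svm import SVC

try:
    from sklearn.model_selection import RandomizedSearchCV
except ImportError:
    # Location in the 0.14 series.
    from sklearn.grid_search import RandomizedSearchCV

iris = load_iris()
X, y = iris.data, iris.target

# scipy.stats.uniform(loc, scale) is uniform on [loc, loc + scale].
param_distributions = {
    "C": uniform(0.1, 10),
    "gamma": uniform(0.001, 0.1),
}

# n_iter fixes the budget: 25 sampled settings, however many
# hyper-parameters are being searched.
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=25, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

The budget is decoupled from the number of parameters, which is what makes this attractive for larger search spaces.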

Let me illustrate that with an example:
Imagine you have two continuous parameters, one of which is completely irrelevant (which you don't know in advance). Say both parameters lie between 0 and 1. If you use a standard grid search with five values per parameter (steps of 0.2, say 0.1, 0.3, 0.5, 0.7 and 0.9), you need 25 fitting runs to obtain the results for the whole grid, and you will have obtained an estimate for only 5 different values of the relevant parameter.
If you instead sample randomly from the uniform distribution over both parameters 25 times, you will get 25 settings of the relevant parameter, giving you a much finer search with the same number of fits.
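The counting argument can be checked with a few lines of plain Python. The particular grid values and the random seed are assumptions for this sketch; no model is fitted, only the candidate settings are generated.

```python
import itertools
import random

# Grid search: 5 values per parameter, 5 * 5 = 25 settings in total.
grid_values = [0.1, 0.3, 0.5, 0.7, 0.9]
grid = list(itertools.product(grid_values, grid_values))

# Random search: 25 settings drawn uniformly from the unit square.
rng = random.Random(0)
samples = [(rng.random(), rng.random()) for _ in range(25)]

# Treat the first coordinate as the relevant parameter and count how
# many distinct values of it each strategy gets to try.
distinct_grid = len({relevant for relevant, _ in grid})
distinct_random = len({relevant for relevant, _ in samples})

print(len(grid), distinct_grid)        # same budget, 5 distinct values
print(len(samples), distinct_random)   # same budget, 25 distinct values
```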

Model evaluation and selection with more scoring functions

Until now, our API for grid search and cross validation allowed only functions that get a vector of ground truth values y_true and a vector of predictions y_hat. That made it impossible to use scores such as area under the ROC or precision-recall curve, or ranking losses, which all need certainty estimates rather than hard predictions.

In 0.14, we introduced a new interface that is much more flexible. We now support any callable with arguments (estimator, X_test, y_test), i.e. a fitted estimator, the test data and the ground truth labels. This allows for quite sophisticated evaluation schemes that even have full access to the fitted model.
For convenience, we also allow string options for all the common methods.
A list can be found in the documentation.
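A hand-written scorer with this signature might look like the sketch below. The function name auc_scorer, the data set and the classifier are hypothetical choices for this example. In 0.14, cross_val_score lives in sklearn.cross_validation; later releases moved it to sklearn.model_selection, hence the guarded import.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

try:
    from sklearn.model_selection import cross_val_score
except ImportError:
    # Location in the 0.14 series.
    from sklearn.cross_validation import cross_val_score


def auc_scorer(estimator, X_test, y_test):
    """Hypothetical scorer: area under the ROC curve from decision values."""
    # The scorer receives the fitted estimator, so it can ask for
    # continuous certainty estimates instead of hard predictions.
    decision_values = estimator.decision_function(X_test)
    return roc_auc_score(y_test, decision_values)


X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression()

# Any callable with this signature can be passed as scoring ...
custom_scores = cross_val_score(clf, X, y, scoring=auc_scorer, cv=5)
# ... and common metrics are also available as string options.
string_scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=5)
print(custom_scores.mean(), string_scores.mean())
```

Since the scorer sees the fitted estimator, it could just as well inspect model attributes, for example to penalize the number of non-zero coefficients.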

There have been numerous other improvements and bug-fixes. In particular, the metrics module for model evaluation and the corresponding documentation were greatly improved by Arnaud Joly. Olivier Grisel implemented out-of-core learning for naive Bayes estimators using partial_fit. Another great improvement has been the rewrite of the neighbors module by Jake Vanderplas, which made many neighbors-based algorithms more efficient.
We now also use the neighbors module in the DBSCAN clustering, which makes our implementation much faster and more scalable.
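To give an idea of what out-of-core learning with partial_fit looks like, here is a small sketch. The synthetic count data and the chunk size are assumptions; in a real out-of-core setting each chunk would be read from disk or a stream rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
# Synthetic count features and a label that depends on two of them.
X = rng.randint(0, 5, size=(1000, 20))
y = (X[:, 0] > X[:, 1]).astype(int)

# All classes must be declared up front, because a single chunk
# is not guaranteed to contain every class.
classes = np.unique(y)

incremental = MultinomialNB()
chunk_size = 100
for start in range(0, X.shape[0], chunk_size):
    stop = start + chunk_size
    incremental.partial_fit(X[start:stop], y[start:stop], classes=classes)

# For comparison: the same model fitted on all data at once.
batch = MultinomialNB().fit(X, y)
print(np.allclose(incremental.feature_log_prob_, batch.feature_log_prob_))
```

Naive Bayes only accumulates per-class counts, so feeding the data in chunks gives the same model as a single fit on the full data set.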

The Sprint

First and foremost, I want to thank Nelle (president of afpy) for the organization of the sprint. She did a spectacular job in organizing travel, not losing people in Paris, and generally holding it all together. Also, she brought croissants every morning.
I also want to thank Alex Gramfort, who got us a place at Telecom ParisTech for most of the sprint, and the people at tinyclues, who gave us their office for the weekend.

As I already mentioned some of the great contributions of the sprint above, and you can read the rest in the change log, here is just a brief account of my personal experience (i.e. the interesting part of the blog post, if any, ends here ;)

The sprint was a very different experience for me than the last one, or any coding session I have had so far, as I spent a lot of time on organization and on pushing work on the website.
I'm very bad at web design, and I don't have much experience with jinja. But I have been convinced for quite some time that we needed to revamp the website, in particular to make the documentation more accessible and easier to navigate.
Luckily, I found some much more experienced web designers, who did the actual work: Gilles Louppe, Nelle Varoquaux and Jaques Grobler. I really like the result, even though it is not finished yet.
In particular, I think the new documentation overview is a great improvement.

For the rest of the time, I mostly reviewed pull requests, discussed API and tried to find priorities for the release. This made me task-switch quite a lot, and I don't feel I actually accomplished much. I am very happy with what the team achieved overall, though, and I guess I did my part.

Maybe we should do only five days next time, and not put out a release (candidate) immediately afterwards. On the other hand, it is rare that so many people reserve so much time for the project, and it is good to get things done.

It should, but I don't see the point. AdaBoost has built-in multi-class support. Or do you want to do multi-label? That is in principle also possible with AdaBoost directly, but would require hacking the code a bit, I think.

Thank you very much! This is an amazing tool, and I look forward to enjoying all your new implementations (AdaBoost in particular is a welcome addition, and the RF speedup is great!). My research work is all with scikit-learn. I've been teaching our Bioinformatics and computational labs all about it whenever I can!