We (Gaël Varoquaux, Cheng Soon Ong and Antti Honkela) are organising another MLOSS workshop at ICML 2015 in Lille, France this July. The theme for this edition is "Open Ecosystems", with which we wish to provoke discussion on the benefits (or drawbacks?) of multiple tools in the same ecosystem. Our invited speakers (John Myles White and Matthew Rocklin) will share some of their experiences with Julia and Python, and we would be happy to hear from others, on the same or different ecosystems, through contributed talks. The usual demonstrations of great new software are naturally also welcome!

In addition to the talks, we have planned two more active sessions:

an open discussion with themes voted by workshop participants similar to MLOSS 2013; and

a hackathon for planning and starting to develop infrastructure for measuring software impact.

If you have any comments or suggestions regarding these, please add a comment here or email the organisers!

How many of the papers in the top 100 most cited are about software?

21, with an additional 12 papers which are not specifically about software itself, but about methods or statistics that were implemented later in software. When you take a step back and think about the myriad areas of research and the stratospheric numbers of citations the top 100 get, it is quite remarkable that one fifth of the papers are actually about software. I mean really about software, not software as an afterthought. Some examples:

To put in perspective how rarefied the air is in the top 100 citations: if we combined all citations received by all JMLR papers in the last five years (according to SCImago), this one gigantic paper would not even make it into the top 100.

Yes, yes, citations do not directly measure the quality of the paper, and there are size of community effects and all that. To be frank, being highly cited seems to be mostly luck.

What would you include in a Linux distribution to customise it for machine learning researchers and developers? Which tools would cover the needs of 90% of PhD students working on topics related to machine learning? How would you customise a mainstream Linux distribution to include (by default) packages that let the user quickly start doing machine learning on their laptop?

There are several communities which have their own custom distribution:

Scientific Linux, which is based on Red Hat Enterprise Linux, is focused on making things easy for system administrators of larger organisations. The two big users are FermiLab and CERN, who each have their own custom "spin". Because of its experimental physics roots, it does not have a large collection of pre-installed scientific software, but makes it easy for users to install their own.

Bio-Linux is at the other end of the spectrum. Based on Ubuntu, it aims to provide an easy-to-use bioinformatics workstation by including more than 500 bioinformatics programs, along with graphical menus for them and sample data for testing them. It is targeted at the end user, with simple instructions for running it live from DVD or USB, installing it, and dual booting it.

Fedora Scientific is the latest entrant, providing a nice list of numerical tools, visualisation packages and also LaTeX packages. Its documentation lists packages for C, C++, Octave, Python, R and Java. Version control is also not forgotten. A recent summary of Fedora Scientific was written as part of Open Source Week.

It would seem that Fedora Scientific would satisfy the majority of machine learning researchers, since it provides packages for most things already. Some additional tools that may be useful include:

tools for managing experiments and collecting results, to make our papers replicable

GPU packages for CUDA and OpenCL

Something for managing papers for reading, similar to Mendeley

Something for keeping track of ideas and to do lists, similar to Evernote

There's definitely tons of stuff that I've forgotten!

Perhaps a good way to start is to have the list of package names useful for the machine learning researcher in some popular package managers such as yum, apt-get, dpkg. Please post your favourite packages in the comments.
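As a starting point, such a list could be kept as a small mapping from package manager to package names. The sketch below is illustrative only: the package names are assumptions based on common Debian/Ubuntu and Fedora naming and have not been checked against any particular release, and install_command is a hypothetical helper.

```python
# Hypothetical starter lists. The package names are assumptions and may
# differ between distribution releases.
PACKAGES = {
    "apt-get": ["python3-numpy", "python3-scipy", "python3-matplotlib",
                "r-base", "octave", "git"],
    "yum": ["python3-numpy", "python3-scipy", "python3-matplotlib",
            "R", "octave", "git"],
}


def install_command(manager, packages=PACKAGES):
    """Return the shell command that would install the listed packages."""
    names = packages[manager]
    return "sudo {} install -y {}".format(manager, " ".join(names))


if __name__ == "__main__":
    for manager in sorted(PACKAGES):
        print(install_command(manager))
```

Keeping the list as data rather than as a shell script makes it easy to extend from the comments and to generate commands for other package managers later.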

GSoC 2014 is between 19 May and 18 August this year. The students should now be just sinking their teeth into the code, and hopefully having a lot of fun while gaining invaluable experience. This amazing program is in its 10th year now, and it is worth repeating how it benefits everyone:

students - You learn how to write code in a team, and work on projects that are long term. Suddenly, all the software engineering lectures make sense! Having GSoC in your CV really differentiates you from all the other job candidates out there. Best of all, you actually have something to show your future employer that cannot be made up.

mentors - You get help for your favourite feature in a project that you care about. For many, it is a good introduction to project management and supervision.

organisation - You recruit new users and, if you are lucky, new core contributors. GSoC experience also tends to push projects to be more beginner friendly, and to make it easier for new developers to get involved.

I was curious about how many machine learning projects were in GSoC this year and wrote a small ipython notebook to try to find out.

Looking at the organisations with the most students, I noticed that the Technical University Vienna has come together and joined as a mentoring organisation. This is an interesting development, as it allows different smaller projects (the titles seem disparate) to come together and benefit from a more sustainable open source project.

On to machine learning... Using a bunch of heuristics, I tried to identify machine learning projects from the organisation names and project titles. I found more than 20 projects with variations of "learn" in them. This obviously misses projects from R, some of which are clearly machine learning related, but I could not find a rule to capture them. I am pretty sure I am missing others too. I played around with some topic modelling, but this was hampered by the fact that I could not figure out a way to scrape the project descriptions from the dynamically generated list of project titles on the GSoC page.
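The kind of heuristic described above can be sketched in a few lines. This is not the code from the notebook, only a minimal illustration of matching variations of "learn"; the organisation names and titles are made up, and looks_like_ml is a hypothetical helper.

```python
import re

# Matches "learn", "learning", "learner", ... anywhere in a string.
LEARN_PATTERN = re.compile(r"learn", re.IGNORECASE)


def looks_like_ml(organisation, title):
    """Heuristic: flag a project if the organisation or title mentions 'learn'."""
    return bool(LEARN_PATTERN.search(organisation) or LEARN_PATTERN.search(title))


# Made-up examples, not real GSoC 2014 entries.
projects = [
    ("Example ML Toolkit", "Online learning for large datasets"),
    ("Example Stats Project", "Faster random forests"),
    ("Example Web Framework", "Improve the admin interface"),
]

flagged = [p for p in projects if looks_like_ml(*p)]
print(flagged)
```

The second made-up entry shows the weakness of the rule: a title about random forests is clearly machine learning related, yet it is not flagged because it never mentions "learn".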

Best practices

These articles provide a great resource to get started on the long road to doing "proper science". Some common suggestions which are relevant to practical machine learning include:

Use version control

Start now. No, not after your next paper, do it right away! Learn one of the modern distributed version control systems, git or mercurial currently being the most popular, and get an account on github or bitbucket to start sharing. Even if you don't share your code, it is a convenient offsite backup. Github is the most popular for open source projects, but bitbucket has the advantage of free private accounts. If you have an email address from an educational institution, you get the premium features for free too.

Distributed version control systems can be conceptually daunting, but it is well worth the trouble to understand the concepts instead of just robotically typing in commands. There are numerous tutorials out there; two which I personally found entertaining are git foundations and hginit. For those who don't like the command line, have a look at GUIs such as sourcetree, tortoisegit, tortoisehg, and gitk. If you work with other people, it is worth learning the fork and pull request model, and using the gitflow convention.

Please add your favourite tips and tricks in the comments below!

Open source your code and scripts

Publish everything. Even the two lines of Matlab that you used to plot your results. The readers of your NIPS and ICML papers are technical people, and it is often much simpler for them to look at your Matlab plot command than to parse the paragraph that describes the x and y axes, the meaning of the colours and line types, and the specifics of the displayed error bars. Tools such as ipython notebooks and knitr are examples of easy to implement literate programming frameworks that allow you to make your supplement a live document.

It is often useful to try to conceptually split your computational code into "programs" and "scripts". There is no hard and fast rule for where to draw the line, but one useful way to think about it is to contrast code that can be reused (something to be installed), and code that runs an experiment (something that describes your protocol). An example of the former is your fancy new low memory logistic regression training and testing code. An example of the latter is code to generate your plots. Make both types of code open, document and test them well.

Make your data a resource

Your result is also data. When open data is mentioned, most people immediately conjure images of the inputs to prediction machines. But intermediate stages of your workflow are often left out of making things available. For example, if in addition to providing the two lines of code for plotting, you also provided your multidimensional array containing your results, your paper now becomes a resource for future benchmarking efforts. If you made your precomputed kernel matrices available, other people can easily try out new kernel methods without having to go through the effort of computing the kernel.
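As a minimal sketch of what publishing results as data might look like, the example below writes a small results array to a JSON file together with enough metadata to interpret it. The method names, dataset names and numbers are placeholders rather than results from any paper, and save_results and load_results are hypothetical helpers.

```python
import json


def save_results(path, methods, datasets, accuracy):
    """Write a methods-by-datasets accuracy table to a JSON file."""
    record = {
        "axes": ["method", "dataset"],
        "methods": methods,
        "datasets": datasets,
        "accuracy": accuracy,  # accuracy[i][j]: method i on dataset j
    }
    with open(path, "w") as handle:
        json.dump(record, handle, indent=2)


def load_results(path):
    """Read the table back, e.g. for a later benchmarking study."""
    with open(path) as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    save_results("results.json",
                 methods=["baseline", "proposed"],
                 datasets=["toy-a", "toy-b"],
                 accuracy=[[0.81, 0.77], [0.84, 0.80]])
    print(load_results("results.json")["accuracy"][1][0])
```

A plain-text format is used here so that the file can be read without any particular numerical library; for large arrays a binary format would be the more practical choice.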

Efforts such as mldata.org and mlcomp.org provide useful resources to host machine learning oriented datasets. If you do create a dataset, it is useful to get an identifier for it so that people can give you credit.

Challenges to open science

Although the articles call these rules "simple", they are by no means easy to implement. They are easy to state, but there are many practical hurdles to making every step of your research reproducible.

Social coding

Unlike publishing a paper, where you do all your work before publication, publishing a piece of software often means that you have to support it in future. It is remarkably difficult to keep software available in the long term, since most junior researchers move around a lot and often leave academia altogether. It is also challenging to find contributors that can help out in stressful periods, and to keep software up to date and useful. Open source software suffers from the tragedy of the commons, and it quickly becomes difficult to maintain.

While it is generally good for science that everything is open and mistakes are found and corrected, the current incentive structure in academia does not reward support for ongoing projects. Funding is focused on novel ideas, publications are used as metrics for promotion and tenure, and software gets left out.

The secret branch

When developing a new idea, it is often tempting to do so without making it open to public scrutiny. This is similar to the idea of a development branch, but you may wish to keep it secret until publication. The same argument applies for data and results, where there may be a moratorium. I am currently unaware of any tools that allow easy conversion between public and private branches. Github allows forks of repositories, which you may be able to make private.

Once researchers get fully involved in an application area, it is inevitable that they start working on the latest data generated by their collaborators. This could be the real-time stream from Twitter or the latest double-blind drug study. Such datasets are often embargoed from being made publicly available due to concerns about privacy. In the area of biomedical research there are efforts to allow bona fide researchers access to data, such as dbGaP, which seamlessly provides a resource for both public and private data. Rather than being a hurdle, a convenient mechanism to facilitate the transition from private to open science would encourage many new participants.

What is the right access control model for open science?

Data is valuable

It is a natural human tendency to protect a scarce resource that gives one a competitive advantage. For researchers, these resources include source code and data. While it is understandable that authors of software or architects of datasets would like to be the first to benefit from their investment, it often happens that these resources are not made publicly available even after publication.

We were very lucky this year to have an amazing set of keynote speakers at ACML 2013 who have made key contributions to getting machine learning into the real world. Here are some links to the open source software projects that they mentioned during their talks. The videos of the talks should be available at some point on the ACML website.

We started off with Geoff Holmes, who spoke at MLOSS 06. He told us about how WEKA has been used in industry (satisfying Kiri Wagstaff's Challenge #2), and the new project for streaming data MOA. Later in the day, Chih-Jen Lin told us how important it was to understand both machine learning and optimisation, such that you can exploit the special structure for fast training of SVMs. This is how he obtained amazing speedups in LIBLINEAR. On the second day, Ralf Herbrich (who also gave a tutorial) gave us a behind the scenes tour of TrueSkill, the player matching algorithm used on XBox Live. Source code in F# is available here and the version generalised to track skill over time is available here.

I was recently asked to become an Action Editor for the
Machine Learning and Open Source Software (MLOSS) track of the Journal of
Machine Learning Research. Of course, I gladly accepted since the aim of
the JMLR MLOSS track (as well as the broader MLOSS project) -- to encourage
the creation and use of open source software within machine learning -- is well
aligned with my own interests and attitude towards scientific software.

Shortly after I joined, one of the other editors raised a question about how we
are to interpret an item in the review criteria that states that reviewers
should consider the "freedom of the code (lack of dependence on proprietary
software)" when assessing submissions. What followed was an engaging email
discussion amongst the Action Editors about how to clarify our position.

After some discussion (summarised below), we settled on the following guideline
which tries to ensure MLOSS projects are as open as possible while recognising
the fact that MATLAB, although "closed", is nonetheless widely used within the
machine learning community and has an open "work-alike" in the form of
GNU Octave:

Dependency on Closed Source Software

We strongly encourage submissions that do not depend on closed source and
proprietary software. Exceptions can be made for software that is widely used
in a relevant part of the machine learning community and accessible to most
active researchers; this should be clearly justified in the submission.

The most common case here is the question whether we will accept software
written for Matlab. Given its wide use in the community, there is no strict
reject policy for MATLAB submissions, but we strongly encourage submissions to
strive for compatibility with Octave unless absolutely impossible.

The Discussion

There were a number of interesting arguments raised during the discussion, so I
offered to write them up in this post for posterity and to solicit feedback from
the machine learning community at large.

Reviewing and decision making

A couple of arguments were put forward in favour of a strict "no proprietary
dependencies" policy.

Firstly, allowing proprietary dependencies may limit our ability to find reviewers
for submissions -- an already difficult job. Secondly, stricter policies have the
benefit of being unambiguous, which would avoid future discussions about the
acceptability of future submissions.

Promoting open ports

An argument made in favour of accepting projects with proprietary
dependencies was that doing so may actually increase the chances of its code being
forked to produce a version with no such dependencies.

Mikio Braun explored this idea further along with
some broader concerns in a blog post about the role of curation and how it
potentially limits collaboration.

Where do we draw the line?

Some of us had concerns about what exactly constitutes a proprietary dependency
and came up with a number of examples that possibly fall into a grey area.

For example, how do operating systems fit into the picture?
What if the software in question only compiles on Windows or OS X? These are both
widely used but proprietary. Should we ensure MLOSS projects also work on Linux?

Taking a step up the development chain, what if the code base is most easily built
using proprietary development tools such as Visual Studio or Xcode?
What if libraries such as MATLAB's Statistics Toolbox or Intel's MKL library are
needed for performance reasons?

Things get even more subtle when we note that certain data formats (e.g., for
medical imaging) are proprietary. Should such software be excluded even though the
algorithms might work on other data?

These sorts of considerations suggested that a very strict policy may be difficult
to enforce in practice.

What is our focus?

It is pretty clear what position Richard Stallman or other fierce free software
advocates would take on the above questions: reject all of them!
It is not clear that such an extreme position would necessarily suit the goals of
the MLOSS track of JMLR.

Put another way, is the focus of MLOSS the "ML" or the "OSS"?
The consensus seemed to be that we want to promote open source software to
benefit machine learning, not the other way around.

Looking At The Data

Towards the end of the discussion, I made the argument that if we cannot be
coherent we should at least be consistent and presented some data on all
the accepted MLOSS submissions.
The list below shows the breakdown of languages used by the 50 projects that
have been accepted to the JMLR track to date. I'll note that some projects use
and/or target multiple languages and that, because I only spent half an hour
surveying the projects, I may have inadvertently misrepresented some (if
I've done so, let me know).

C++: 15; Java: 13; MATLAB: 11; Octave: 10; Python: 9; C: 5; R: 4.

From this we can see that MATLAB is fairly well-represented amongst the accepted
MLOSS projects. I took a closer look and found that of the 11 projects that are
written in (or provide bindings for) MATLAB, all but one of them also support
GNU Octave.
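For readers who want to check the arithmetic, the short sketch below tallies the counts listed above. The variable names are my own; the numbers are those given in this post.

```python
# Counts as listed in the post; a project can appear under several languages.
language_counts = {
    "C++": 15, "Java": 13, "MATLAB": 11, "Octave": 10,
    "Python": 9, "C": 5, "R": 4,
}
accepted_projects = 50

total_mentions = sum(language_counts.values())
# Mentions beyond the number of projects come from multi-language projects.
extra_mentions = total_mentions - accepted_projects

# Share of MATLAB projects that also support Octave ("all but one" of 11).
matlab_with_octave = language_counts["MATLAB"] - 1
octave_share = matlab_with_octave / language_counts["MATLAB"]

print(total_mentions, extra_mentions, round(100 * octave_share))
```

The language mentions add up to more than the 50 accepted projects, which is consistent with the note that some projects use or target several languages.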

Closing Thoughts

I think the position we've adopted is realistic, consistent, and suitably
aspirational.
We want to encourage and promote projects that strive for openness and the positive
effects it enables (e.g., reproducibility and reuse) but do not want to strictly
rule out submissions that require a widely used, proprietary platform such as MATLAB.

Of course, a project like MLOSS is only as strong as the community it serves, so we
are keen to get feedback about this decision from people who use and create machine
learning software. Feel free to leave a comment or contact one of us by email.

How good is the software associated with scientific papers? There seems to be a general impression that the quality of scientific software is not that great. How do we check for software quality? Well, by doing code review.

GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development which is bigger than the standard "one semester" programming project that they are usually exposed to at university.

Some statistics:

177 of 417 projects were accepted, which is a success rate of 42%.

40 of the 177 projects were accepted for the first time, so 23% of them are new blood.

These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some type of psychological "mean" which reviewers gravitate to when they are evaluating submissions. For example, consider that out of the 4258 students that applied for projects in 2012, 1212 students got accepted, a rate of 28%.
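The percentages quoted above can be reproduced with a one-line helper. This is only a sketch for checking the arithmetic; rate is a hypothetical function name.

```python
def rate(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


print(rate(177, 417))    # projects accepted out of those that applied
print(rate(40, 177))     # accepted projects that are first-timers
print(rate(1212, 4258))  # students accepted in 2012
```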

To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!