Venturing into machine learning? These tools do the heavy lifting for you

Over the past year, machine learning has gone mainstream in an unprecedented way. The trend isn't fueled by cheap cloud environments and ever more powerful GPU hardware alone; it's also driven by the explosion of frameworks now available for machine learning. All are open source, but even more important is how they are designed to abstract away the hardest parts of machine learning and make those techniques available to a broad class of developers.

Here's a baker's dozen of machine learning frameworks, either freshly minted or newly revised within the past year. All caught our attention for being the product of a major presence in IT, for attempting to bring novel simplicity to their problem domain, or for targeting a specific challenge associated with machine learning.

Apache Spark may be best known for being part of the Hadoop family, but this in-memory data processing framework was born outside of Hadoop and is making a name for itself outside the Hadoop ecosystem as well. Spark has become a go-to machine learning tool, thanks to its growing library of algorithms that can be applied to in-memory data at high speed.

Spark isn’t standing still, as the algorithms available in Spark are constantly being expanded and revised. Last year's 1.5 release added many new algorithms, improved existing ones, and further bolstered MLlib support in Python, a major platform for math and stats users. The newly released Spark 1.6 makes it possible, among other things, to suspend and resume Spark ML jobs via persistent pipelines.
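The persistent-pipeline idea is that a fitted sequence of processing stages can be written to storage and reloaded later to resume work. A minimal plain-Python sketch of that concept (this is not Spark's actual API; the `Scaler` and `Pipeline` classes here are illustrative stand-ins):

```python
import os
import pickle
import tempfile

class Scaler:
    """Toy 'transformer' stage: scales values by a factor learned in fit()."""
    def fit(self, data):
        self.factor = 1.0 / max(data)
        return self
    def transform(self, data):
        return [x * self.factor for x in data]

class Pipeline:
    """Toy stand-in for an ML pipeline that can be persisted and resumed."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self
    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)
    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

pipe = Pipeline([Scaler()]).fit([2.0, 4.0, 8.0])
path = os.path.join(tempfile.gettempdir(), "pipe.pkl")
pipe.save(path)                 # suspend: fitted state goes to disk
resumed = Pipeline.load(path)   # resume: reload it in a later session
print(resumed.stages[0].transform([8.0]))
```

In Spark itself the same shape of workflow applies, but the stages are Spark ML transformers and estimators and the save/load round-trip works across cluster sessions.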

“Deep learning” frameworks power heavy-duty machine-learning functions, such as natural language processing and image recognition. Singa, recently accepted into the Apache Incubator, is an open source framework intended to make it easy to train deep-learning models on large volumes of data.

Deep-learning framework Caffe is “made with expression, speed, and modularity in mind.” Originally developed in 2013 for machine vision projects, Caffe has since expanded to include other applications, such as speech and multimedia.

Speed is a major priority, so Caffe has been written entirely in C++, with CUDA acceleration support, although it can switch between CPU and GPU processing as needed. The distribution includes a set of free and open source reference models for common classification jobs, with other models created and donated by the Caffe user community.

Given the sheer amount of data and computational power needed to perform machine learning, the cloud is an ideal environment for ML apps. Microsoft has outfitted Azure with its own pay-as-you-go machine learning service, Azure ML Studio, with monthly, hourly, and free-tier versions. (The company’s HowOldRobot project was created with this system.)

Azure ML Studio allows users to create and train models, then turn them into APIs that can be consumed by other services. Users get up to 10GB of storage per account for model data, although you can also connect your own Azure storage to the service for larger models. A wide range of algorithms are available, courtesy of both Microsoft and third parties. You don’t even need an account to try out the service; you can log in anonymously and use Azure ML Studio for up to eight hours.

Amazon’s general approach to cloud services has followed a pattern. Provide the basics, bring in a core audience that cares, let them build on top of it, then find out what they really need and deliver that.

The same could be said of its first foray into offering machine learning as a service, Amazon Machine Learning. It connects to data stored in Amazon S3, Redshift, or RDS, and can run binary classification, multiclass categorization, or regression on said data to create a model. However, the service is highly Amazon-centric. Aside from being reliant on data stored in Amazon, the resulting models can’t be imported or exported, and datasets for training models can’t be larger than 100GB. Still, it’s a start, and it shows how machine learning is being made a practicality instead of a luxury.
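To make "regression on said data to create a model" concrete, here is a minimal sketch of what such a task boils down to, fitting a line to toy data with gradient descent in plain Python; the data and learning rate are invented for illustration and have nothing to do with Amazon's service internals:

```python
# Toy dataset following y = 2x + 1; a regression task learns w and b back.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Gradients of mean squared error with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges close to 2.0 and 1.0
```

A hosted service wraps this kind of loop behind an API: you point it at data, it picks and tunes the model, and you query the result for predictions.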

The more computers you have to throw at any machine learning problem, the better -- but ganging together machines and developing ML applications that run well across all of them can be tricky. Microsoft’s DMTK (Distributed Machine Learning Toolkit) framework tackles the issue of distributing various kinds of machine learning jobs across a cluster of systems.

DMTK is billed as a framework rather than a full-blown, out-of-the-box solution, so the number of actual algorithms included with it is small. But the design of DMTK allows for future expansion, and it lets users make the most of clusters with limited resources. For instance, each node in the cluster has a local cache, reducing the amount of traffic with the central server node that provides parameters for the job in question.
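The local-cache design can be sketched in a few lines of plain Python; the class names here are illustrative, not DMTK's API, and the "server" is a local object standing in for a remote node:

```python
class ParameterServer:
    """Central node holding model parameters; counts simulated round-trips."""
    def __init__(self, params):
        self.params = params
        self.fetches = 0
    def get(self, key):
        self.fetches += 1
        return self.params[key]

class CachingWorker:
    """Cluster node that keeps a local parameter cache, as DMTK's design does."""
    def __init__(self, server):
        self.server = server
        self.cache = {}
    def get(self, key):
        if key not in self.cache:          # only hit the server on a cache miss
            self.cache[key] = self.server.get(key)
        return self.cache[key]

server = ParameterServer({"w0": 0.5, "w1": -1.2})
worker = CachingWorker(server)
for _ in range(100):                       # 100 reads per key...
    worker.get("w0")
    worker.get("w1")
print(server.fetches)                      # ...but only 2 round-trips
```

A real system also has to decide when cached parameters go stale, which is where most of the engineering lives; the sketch shows only why the cache cuts traffic.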

Much like Microsoft's DMTK, Google TensorFlow is a machine learning framework designed to scale across multiple nodes. As with Kubernetes, it was built to solve problems internally at Google, which eventually elected to release it as an open source product.

TensorFlow implements what are called data flow graphs, where batches of data (“tensors”) can be processed by a series of algorithms described by a graph. The movements of the data through the system are called “flows” -- hence, the name. Graphs can be assembled with C++ or Python and can be processed on CPUs or GPUs. Google’s long-term plan is to have TensorFlow fleshed out by third-party contributors.
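The data flow graph idea can be illustrated with a toy evaluator in plain Python -- this is the concept only, not TensorFlow's API, and the `Node` class is invented for the example:

```python
class Node:
    """A graph node: applies an operation to the outputs of its input nodes."""
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs
    def eval(self):
        # Evaluating a node pulls data through the graph recursively
        return self.op(*(n.eval() for n in self.inputs))

def const(v):
    return Node(lambda: v)

a = const([1.0, 2.0, 3.0])      # a "tensor" entering the graph
b = const([4.0, 5.0, 6.0])
add = Node(lambda x, y: [i + j for i, j in zip(x, y)], a, b)
total = Node(sum, add)          # reduce the element-wise sum to a scalar

print(total.eval())
```

TensorFlow's contribution is making graphs like this scale: the same graph description can be partitioned across CPUs, GPUs, and machines, with the framework handling the data movement between them.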

Hot on the heels of releasing the DMTK, Microsoft unveiled yet another machine learning toolkit, the Computational Network Toolkit, or CNTK for short.

CNTK is similar to Google TensorFlow, since it lets users create neural networks by way of a directed graph. Microsoft also considers it comparable to projects like Caffe, Theano, and Torch. Its main touted advantage over those frameworks is speed, specifically the ability to exploit both multiple CPUs and multiple GPUs in parallel. Microsoft claims using CNTK in conjunction with GPU clusters on Azure accelerated speech recognition training for Cortana by an order of magnitude.

Originally developed as part of Microsoft's research into speech recognition, CNTK was first offered as an open source project in April 2015, but it has since been re-released on GitHub under a much more liberal, MIT-style license.

Veles, Samsung's distributed platform for deep-learning applications, is written in C++ like TensorFlow and DMTK, although it uses Python to perform automation and coordination between nodes. Datasets can be analyzed and automatically normalized before being fed to the cluster, and a REST API allows the trained model to be used in production immediately (assuming your hardware's good enough).

Veles’ use of Python goes beyond merely employing it as glue code. IPython (now Jupyter), the data-visualization and analysis tool, can visualize and publish results from a Veles cluster. Samsung hopes releasing the project as open source will stimulate further development, such as ports to Windows and Mac OS X.

Developed over the course of 2015 by doctoral students Klaus Greff and Rupesh Srivastava at IDSIA (the Dalle Molle Institute for Artificial Intelligence) in Lugano, Switzerland, the Brainstorm project aims "to make deep neural networks fast, flexible, and fun." Support is already included for a variety of recurrent neural network models, such as LSTM.

Brainstorm uses Python to provide two "handlers," or data management APIs -- one for CPU computation via the Numpy library, and another that leverages GPUs via CUDA. Most of the work is done through Python scripting, so don't expect a rich front-end GUI beyond whatever you bring yourself. But the long-term plan is to create something that employs "lessons learned from earlier open source projects" and uses "new design elements compatible with multiple platforms and computing back ends."
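The handler idea -- the same network code running against interchangeable compute back ends -- can be sketched as follows. Only a CPU handler is shown, and the names below are illustrative; Brainstorm's real handlers wrap Numpy and CUDA with their own API:

```python
class CpuHandler:
    """Toy CPU back end exposing the primitive ops a network needs."""
    def dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))
    def relu(self, v):
        return [max(0.0, x) for x in v]

def forward(handler, weights, inputs):
    """One dense layer plus ReLU, expressed only in terms of handler ops,
    so a GPU handler with the same methods could be swapped in unchanged."""
    pre = [handler.dot(row, inputs) for row in weights]
    return handler.relu(pre)

h = CpuHandler()
out = forward(h, [[1.0, -1.0], [0.5, 0.5]], [2.0, 4.0])
print(out)
```

The payoff of this layering is that model code stays backend-agnostic: moving from CPU to GPU means supplying a different handler, not rewriting the network.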

Our previous roundup of machine learning resources touched on mlpack, a C++-based machine learning library originally rolled out in 2011 and designed for "scalability, speed, and ease-of-use," according to the library's creators. mlpack can be used through a cache of command-line executables for quick-and-dirty, "black box" operations, or through a C++ API for more sophisticated work.

The 2.0 version brings many refactorings and new features, including new algorithms and changes to existing ones that speed them up or slim them down. For example, it ditches the Boost library's random number generator in favor of C++11's native random functions.

One long-standing disadvantage is a lack of bindings for any language other than C++, meaning users of everything from R to Python can’t make use of mlpack unless someone rolls their own wrappers for said languages. Work’s been done to add MATLAB support, but projects like this tend to enjoy greater uptake when they’re directly useful in the major environments where machine learning work takes place.

Another relatively recent arrival, the Marvin neural network framework is a product of the Princeton Vision Group. It was "born to be hacked," as its creators explain in the project's documentation, and relies on only a few files written in C++ and the CUDA GPU framework. Although the code itself is deliberately minimal, the project comes with a number of pretrained models that can be reused with proper citation and improved via pull requests, just like the project's own code.

Nervana, a company building its own deep-learning hardware and software platform, has offered up a deep-learning framework named Neon as an open source project. It uses pluggable modules to allow the heavy lifting to be done on CPUs, GPUs, or Nervana’s own custom hardware.

Neon is written chiefly in Python, with a few pieces in C++ and assembly for speed. That makes it immediately available to others doing data-science work in Python, as well as to almost any other framework with Python bindings.