Machine learning with Oryx: Run wild with real-time ML

The open source machine learning frameworks just keep on coming. Oryx 2 focuses on real-time, large-scale machine learning and draws its power from three tiers. Grab it by the horns and create custom applications.

Large scale machine learning

From its website, Oryx is a “realization of the lambda architecture built on Apache Spark and Apache Kafka”. Apache Spark has the benefit of being incredibly fast thanks to in-memory processing, handily beating Hadoop MapReduce on raw speed. (Of course, Spark and Hadoop serve different purposes, so your preference between the two may depend on more than speed alone. As you will see under the requirements, Oryx needs both.) Meanwhile, Apache Kafka is a distributed streaming platform used to build real-time streaming applications and data pipelines.

If it sounds familiar to you, it should! Oryx 2 is the successor to the original Oryx project, rewritten around a new architecture consisting of three tiers that can be used together or independently of one another.

Three-tiered cake

A generic lambda architecture tier, providing batch/speed/serving layers, which is not specific to machine learning

A specialization on top providing ML abstractions for hyperparameter selection, etc.

An end-to-end implementation of the same standard ML algorithms as an application (ALS, random decision forests, k-means) on top

It’s all about mixing and matching: you don’t have to use every layer, but they are designed to work together. Again, let’s take it straight from the oryx’s mouth and learn more about each layer:

A Batch Layer, which computes a new “result” (think model, but it could be anything) as a function of all historical data and the previous result. This may be a long-running operation that takes hours and runs a few times a day, for example.

A Speed Layer, which produces and publishes incremental model updates from a stream of new data. These updates are intended to happen on the order of seconds.

A Serving Layer, which receives models and updates and implements a synchronous API exposing query operations on the result.

A data transport layer, which moves data between layers and receives input from external sources.
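The division of labor between the layers can be sketched as plain functions. The snippet below is purely illustrative — it is not the Oryx API, and the map-based "model" is a stand-in — but it shows the shape of the idea: a slow full recomputation (batch), fast incremental folds (speed), and an in-memory result that queries read (serving).

```java
import java.util.HashMap;
import java.util.Map;

public class LambdaSketch {
    // Batch layer: recompute the result from all historical data (slow, runs rarely).
    static Map<String, Double> batchRecompute(Map<String, Double> history) {
        return new HashMap<>(history); // stand-in for an expensive full model build
    }

    // Speed layer: fold a single new event into the current model (fast, runs constantly).
    static void speedUpdate(Map<String, Double> model, String key, double value) {
        model.merge(key, value, Double::sum);
    }

    public static void main(String[] args) {
        Map<String, Double> history = new HashMap<>();
        history.put("itemA", 1.0);

        // Serving layer: loads the batch result into memory...
        Map<String, Double> model = batchRecompute(history);

        // ...and applies incremental updates as they stream in (via Kafka, in Oryx).
        speedUpdate(model, "itemA", 0.5);
        speedUpdate(model, "itemB", 2.0);

        System.out.println(model.get("itemA")); // prints 1.5
        System.out.println(model.get("itemB")); // prints 2.0
    }
}
```

The next batch run would fold the streamed events back into `history`, replace the in-memory model wholesale, and the cycle repeats.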

The Batch and Speed Layers are implemented as Spark Streaming processes running on a Hadoop cluster. Meanwhile, the data transport layer is an Apache Kafka topic, and the Serving Layer keeps the model state in memory so it can answer queries quickly.
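In practice, the tiers are wired together through a single configuration file. The fragment below is loosely adapted from the project's ALS example; the exact key and class names are assumptions and may differ between Oryx versions, so treat it as a sketch of the shape rather than a copy-paste config:

```
# Hypothetical Oryx 2 app config (HOCON) -- key names may vary by version
kafka-brokers = "broker.example.com:9092"

oryx {
  id = "ALSExample"
  input-topic {
    broker = ${kafka-brokers}
    message.topic = "OryxInput"          # raw data in (data transport layer)
  }
  update-topic {
    broker = ${kafka-brokers}
    message.topic = "OryxUpdate"         # models and updates out
  }
  batch {
    update-class = "com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
    streaming.generation-interval-sec = 300
  }
  speed {
    model-manager-class = "com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"
  }
  serving {
    model-manager-class = "com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
    api.port = 8080
  }
}
```

Note how each tier gets its own block, while the two Kafka topics tie them together.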

The project’s GitHub page provides a helpful architecture diagram if you want to see how all the pieces fit together.

Sarah Schlothauer is an assistant editor for JAXenter.com. She received her Bachelor's degree from Monmouth University in Long Branch, New Jersey and is currently enrolled at Goethe University in Frankfurt, Germany, where she is working on her master's degree. She lives in Frankfurt with her husband and cat.