Building and deploying large-scale machine learning pipelines

Check out the Data science and machine learning sessions at Strata Data in New York, September 25-28, 2017, for more on current trends and practical use cases in applied data science.

Data scientists have to manage and maintain complex data projects, and the analytic problems they need to tackle usually involve specialized machine learning pipelines.

Some of these primitives might be specific to particular domains and data types (text, images, video, audio, spatiotemporal) or more general purpose (statistics, machine learning).

“We’re trying to put (machine learning systems) in self-driving cars, power networks … If we want machine learning models to actually have an impact in everyday experience, we’d better come out with the same guarantees as one of these complicated airplane designs.” — Can we bound approximation errors and convergence rates for layered pipelines?

And their longer term goal is to be able to derive performance characteristics and analyze the robustness of complex, distributed software systems directly from pseudocode.

(A related AMPLab project, Velox, provides a framework for managing models in production.) As algorithms become even more pervasive, we need better tools for building complex yet robust and stable machine learning systems.

While other systems like scikit-learn and GraphLab support pipelines, a popular distributed framework like Apache Spark takes these ideas to extremely large data sets and a wider audience.

Lessons Learned from Building Scalable Machine Learning Pipelines

As a short case study we present an ML pipeline design for an advertising use case: Click prediction in an RTB (Real Time Bidding) setting.

The goal of this ML pipeline is to gather data on inventory, users, and advertiser information, and to train ML models to predict the likelihood of someone clicking on an ad, at the time of auction.

This pipeline faces many challenges and requirements: To perform this large scale machine learning task, many components work together to provide a reliable and efficient way for handling these ML product requirements.

By keeping the right level of granularity at the observation level in the ML feature data warehouse, research teams are able to use custom look-backs for both back and live testing without having to pre-process data sets in redundant ways.
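To make the point about custom look-backs concrete, here is a minimal sketch. It assumes observation-level rows are dictionaries with a timestamp and a click flag; the field names `ts` and `clicked` and the function name are hypothetical and do not come from the paper. Because the rows are not pre-aggregated, the same data can answer any window length.

```python
from datetime import datetime, timedelta


def click_through_rate(observations, as_of, look_back):
    """Compute CTR over [as_of - look_back, as_of) from observation-level rows."""
    start = as_of - look_back
    window = [o for o in observations if start <= o["ts"] < as_of]
    if not window:
        return None  # no impressions in the window
    return sum(1 for o in window if o["clicked"]) / len(window)


# Toy observation-level data: one row per impression.
observations = [
    {"ts": datetime(2016, 12, 20), "clicked": True},
    {"ts": datetime(2017, 1, 1), "clicked": True},
    {"ts": datetime(2017, 1, 5), "clicked": False},
    {"ts": datetime(2017, 1, 8), "clicked": True},  # at as_of, so excluded
]
as_of = datetime(2017, 1, 8)
week_ctr = click_through_rate(observations, as_of, timedelta(days=7))
month_ctr = click_through_rate(observations, as_of, timedelta(days=30))
```

The same function serves back-testing (an `as_of` in the past) and live testing (an `as_of` of now) without building a separate data set for each window.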

Therefore, the model serving engine must at all costs reduce both the amount of time it spends on converting bid-request data to supported features and on calculating the score from the currently attached predictive models.
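The two costs named above, feature conversion and scoring, can be illustrated with a small sketch. The source does not say which model family or feature encoding the serving engine uses, so the hashed features, the sparse logistic model, and every name below (`NUM_BUCKETS`, `hash_features`, `click_probability`) are assumptions chosen because both steps are cheap per request.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 18  # hypothetical size of the hashed feature space


def hash_features(bid_request):
    """Map each key=value pair of a bid request to a bucket index."""
    return [
        zlib.crc32(f"{key}={value}".encode("utf-8")) % NUM_BUCKETS
        for key, value in sorted(bid_request.items())
    ]


def click_probability(bid_request, weights, bias=0.0):
    """Sparse logistic score: sigmoid(bias + sum of weights of active buckets)."""
    z = bias + sum(weights.get(i, 0.0) for i in hash_features(bid_request))
    return 1.0 / (1.0 + math.exp(-z))


request = {"site": "news.example", "ad_size": "300x250", "device": "mobile"}
weights = {index: 0.1 for index in hash_features(request)}  # toy weights
score = click_probability(request, weights, bias=-3.0)
```

Scoring is one dictionary lookup per feature, so the time spent at auction grows with the number of fields in the request rather than with the size of the model.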

However, in real business situations, KPIs and model health are the first aspects that need to be tracked efficiently, and at the right level of aggregation/abstraction, while allowing deeper investigation for transient issues that are common in large scale distributed systems like ours.

The paper contains other details and solutions related to this pipeline that can help future ML practitioners design their ML pipelines, including some of the lessons learned that we found useful for our work.

Data Science for Startups: Data Pipelines

Part three of my ongoing series about building a data science discipline at a startup.

While my previous blog post discussed what type of data to collect and how to send data to an endpoint, this post will discuss how to process data that has been collected, enabling data scientists to work with the data.

This post will show how to set up a scalable data pipeline that sends tracking data to a data lake, database, and subscription service for use in data products.

Before deploying a data pipeline, you’ll want to answer the following questions, which resemble our questions about tracking specs: In a small organization, a data scientist may be responsible for the pipeline, while larger organizations usually have an infrastructure team that is responsible for keeping the pipeline operational.

Based on my experience, I’ve noticed four different approaches to pipelines: Each of the steps in this evolution supports the collection of larger data sets, but may introduce additional operational complexity.

For a startup, the goal is to be able to scale data collection without scaling operational resources, and the progression to managed services provides a nice solution for growth.

The data pipeline that we’ll walk through in the next section of this post is based on the most recent era of data pipelines, but it’s useful to walk through different approaches because the requirements for different companies may fit better with different architectures.

While many game companies were already collecting massive amounts of data about gameplay, most telemetry was stored in the form of log files or other flat file formats that were stored locally on the game servers.

This approach is simple and enables teams to save data in whatever format is needed, but has no fault tolerance, does not store data in a central location, has significant latency in data availability, and lacks standard tooling for building an ecosystem for analysis.

The main difference from the approach at SOE was that instead of having game servers scp files to a central location, we used Amazon Kinesis to stream events from servers to a staging area on S3.

A team of a few analysts working with a few months of gameplay data may work fine, but after collecting years of data and growing the number of analysts, query performance can be a significant problem, causing some queries to take hours to complete.

The main benefits of this approach are that all event data is available in a single location queryable with SQL and great tooling is available, such as Tableau and DataGrip, for working with relational databases.

The drawbacks are that it’s expensive to keep all data loaded into a database like Vertica or Redshift, events need to have a fixed schema, and truncating tables may be necessary to keep the servers performant.

Another issue with using a database as the main interface for data is that machine learning tools such as Spark’s MLlib cannot be used effectively, since the relevant data needs to be unloaded from the database before it can be operated on.

The main downside is that it introduces additional complexity, and can result in analysts having access to less data than if a traditional database approach were used, due to lack of tooling or access policies.

The downsides are that it may involve significant operational overhead, may introduce large event latencies, and may lack mature tooling for the end users of the data lake.

Serverless Era

In the current era, analytics platforms incorporate a number of managed services, which enable teams to work with data in near real-time, scale up systems as necessary, and reduce the overhead of maintaining servers.

The main drawbacks are that managed services can be expensive, and taking this approach will likely result in using platform specific tools that are not portable to other cloud providers.

For a startup, the serverless approach is usually the best way to start building a data pipeline, because it can scale to match demand and requires minimal staff to maintain the data pipeline.

The pipeline reads messages from PubSub and then transforms the events for persistence: the BigQuery portion of the pipeline converts messages to TableRow objects and streams directly to BigQuery, while the AVRO portion of the pipeline batches events into discrete windows and then saves the events to Google Storage.

To deploy this data pipeline, you’ll need to set up a Java environment with the Maven dependencies listed above, set up a Google Cloud project and enable billing, enable billing on the storage and BigQuery services, and create a PubSub topic for sending and receiving messages.

The data pipeline defined in this tutorial shows how to output events to both BigQuery and a data lake that can be used to support a large number of analytics business users.

While this post doesn’t show how to utilize these files in downstream ETLs, having a data lake is a great way to maintain a copy of your data set in case you need to make changes to your database.

The code below applies transformations that convert the PubSub messages into String objects, group the messages into 5 minute intervals, and output the resulting batches to AVRO files on Google Storage.
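The post's own code is not reproduced here. As a stand-in, the grouping step can be illustrated in plain Python. This is not the Apache Beam API; it only shows the fixed-window idea, and the names `WINDOW_SECONDS`, `window_start`, and `batch_by_window` are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5 minute fixed windows, as described in the text


def window_start(timestamp):
    """Return the start (in epoch seconds) of the window containing timestamp."""
    return int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS


def batch_by_window(messages):
    """Group (timestamp, payload) pairs into batches keyed by window start."""
    batches = defaultdict(list)
    for timestamp, payload in messages:
        batches[window_start(timestamp)].append(payload)
    return dict(batches)


messages = [(0, "a"), (299, "b"), (300, "c"), (901.5, "d")]
batches = batch_by_window(messages)
# batches == {0: ["a", "b"], 300: ["c"], 900: ["d"]}
```

Each batch would then be written out as one file, so the window length trades off file count against how quickly events become available in storage.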

The transform step reads the message payloads from PubSub, parses the message as a JSON object, extracts the eventType and eventVersion attributes, and creates a TableRow object with these attributes in addition to a timestamp and the message payload.
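A rough sketch of that transform, written in plain Python rather than with Beam's `TableRow` class: a dictionary stands in for the row, and the column names `server_time` and `message` are assumptions, since the post's schema is not shown in this excerpt.

```python
import json
from datetime import datetime, timezone


def to_row(payload, now=None):
    """Parse a JSON message payload (bytes) into a row-like dict."""
    text = payload.decode("utf-8")
    event = json.loads(text)
    now = now or datetime.now(timezone.utc)
    return {
        "eventType": event.get("eventType"),
        "eventVersion": event.get("eventVersion"),
        "server_time": now.isoformat(),  # hypothetical column name
        "message": text,  # raw payload kept for downstream ETLs
    }


payload = b'{"eventType": "session_start", "eventVersion": "1", "user": "u1"}'
row = to_row(payload, now=datetime(2018, 1, 1, tzinfo=timezone.utc))
```

Keeping the full payload alongside the two extracted attributes means the raw table stays schema-free while still being filterable by event type and version.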

In order to effectively use these events for queries, you’ll need to build additional ETLs for creating processed event tables with schematized records, but you now have a data collection mechanism in place for storing tracking events.

In order to deploy to the cloud and take advantage of the auto scaling capabilities of this data pipeline, you need to specify a new runner class as part of your runtime arguments.

In order to deploy a job that scales up based on demand, you’ll need to specify additional attributes, such as: Additional details on setting up a DataFlow task to scale to heavy workload conditions are available in this Google article and this post from Spotify.

The code below shows how to parse raw events, add additional attributes to the PubSub message for filtering, and publish the events to a second topic.
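The post's code is not included in this excerpt. The following self-contained sketch shows the same idea with an in-memory stand-in for the second topic; `InMemoryTopic`, `republish`, and `subscribe` are hypothetical names, not part of any PubSub client library.

```python
import json


class InMemoryTopic:
    """Stand-in for a message topic that stores (data, attributes) pairs."""

    def __init__(self):
        self.messages = []

    def publish(self, data, **attributes):
        self.messages.append((data, attributes))


def republish(raw_messages, topic):
    """Parse each raw event and publish it with attributes used for filtering."""
    for data in raw_messages:
        event = json.loads(data.decode("utf-8"))
        topic.publish(
            data,
            eventType=str(event.get("eventType", "unknown")),
            eventVersion=str(event.get("eventVersion", "unknown")),
        )


def subscribe(topic, **wanted):
    """Return payloads whose attributes match every requested key/value."""
    return [
        data
        for data, attributes in topic.messages
        if all(attributes.get(key) == value for key, value in wanted.items())
    ]


raw = [
    b'{"eventType": "purchase", "eventVersion": 2, "amount": 4.99}',
    b'{"eventType": "session_start", "eventVersion": 1}',
]
topic = InMemoryTopic()
republish(raw, topic)
purchases = subscribe(topic, eventType="purchase")
```

Putting the event type in the message attributes lets consumers filter without parsing every payload.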

A fourth approach that can be used is having downstream ETL processes apply schemas to the raw events and break apart the raw events table into event specific tables.
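As an illustration of this downstream ETL step, the sketch below splits raw rows into per-event tables and casts each field according to a schema. The schemas, the row layout (`eventType` plus a JSON `message`), and the function name are all assumptions made for the example.

```python
import json
from collections import defaultdict

# Hypothetical schemas: event type -> {column: cast function}.
SCHEMAS = {
    "session_start": {"user_id": str, "platform": str},
    "purchase": {"user_id": str, "amount": float},
}


def split_by_event_type(raw_rows):
    """Apply schemas to raw rows; return (tables, rejected_rows)."""
    tables = defaultdict(list)
    rejected = []
    for row in raw_rows:
        schema = SCHEMAS.get(row["eventType"])
        body = json.loads(row["message"])
        if schema is None or any(column not in body for column in schema):
            rejected.append(row)  # unknown event type or missing column
            continue
        try:
            record = {column: cast(body[column]) for column, cast in schema.items()}
        except (TypeError, ValueError):
            rejected.append(row)  # value could not be cast to the schema type
            continue
        tables[row["eventType"]].append(record)
    return dict(tables), rejected


raw_rows = [
    {"eventType": "purchase", "message": '{"user_id": "u1", "amount": "4.99"}'},
    {"eventType": "purchase", "message": '{"user_id": "u2", "amount": "free"}'},
    {"eventType": "session_start", "message": '{"user_id": "u1", "platform": "ios"}'},
    {"eventType": "level_up", "message": '{"user_id": "u1"}'},
]
tables, rejected = split_by_event_type(raw_rows)
```

Rows that fail the schema are kept aside rather than dropped, so schema changes can be handled by fixing the ETL and reprocessing the raw table.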

We covered the types of data in a pipeline, desired properties of a high functioning data pipeline, the evolution of data pipelines, and a sample pipeline built on GCP.

Using managed resources enables small teams to take advantage of serverless and autoscaling infrastructure to scale up to massive event volumes with minimal infrastructure management.

While the approach presented here isn’t directly portable to other clouds, the Apache Beam library used to implement the core functionality of this data pipeline is portable and similar tools can be leveraged to build scalable data pipelines on other cloud providers.

Meet Michelangelo: Uber’s Machine Learning Platform

While data scientists were using a wide variety of tools to create predictive models (R, scikit-learn, custom algorithms, etc.), separate engineering teams were also building bespoke one-off systems to use these models in production.

Prior to Michelangelo, it was not possible to train models larger than what would fit on data scientists’ desktop machines, and there was neither a standard place to store the results of training experiments nor an easy way to compare one experiment to another.

Michelangelo is designed to address these gaps by standardizing the workflows and tools across teams through an end-to-end system that enables users across the company to easily build and operate machine learning systems at scale.

Then, the delivery-partner needs to get to the restaurant, find parking, walk inside to get the food, then walk back to the car, drive to the customer’s location (which depends on route, traffic, and other factors), find parking, and walk to the customer’s door to complete the delivery.

We generally prefer to use mature open source options where possible, and will fork, customize, and contribute back as needed, though we sometimes build systems ourselves when open source solutions are not ideal for our use case.

Michelangelo is built on top of Uber’s data and compute infrastructure, providing a data lake that stores all of Uber’s transactional and logged data, Kafka brokers that aggregate logged messages from all Uber’s services, a Samza streaming compute engine, managed Cassandra clusters, and Uber’s in-house service provisioning and deployment tools.

We designed Michelangelo specifically to provide scalable, reliable, reproducible, easy-to-use, and automated tools to address the following six-step workflow: Next, we go into detail about how Michelangelo’s architecture facilitates each stage of this workflow.

They should also provide strong guard rails and controls to encourage and empower users to adopt best practices (e.g., making it easy to guarantee that the same data generation/preparation process is used at both training time and prediction time).
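One simple way to get the guarantee mentioned in the example above is to route both paths through a single preparation function. The sketch below is only an illustration of that idea, not Michelangelo's mechanism; the record fields (`distance_km`, `is_rush_hour`, `label`) and function names are hypothetical.

```python
import math


def prepare_features(record):
    """Single source of truth for feature preparation."""
    return [
        math.log1p(record["distance_km"]),
        1.0 if record["is_rush_hour"] else 0.0,
    ]


def build_training_set(history):
    """Training path: features come from prepare_features, labels from the record."""
    return [(prepare_features(record), record["label"]) for record in history]


def build_prediction_input(request):
    """Prediction path: reuses exactly the same preparation code."""
    return prepare_features(request)


history = [{"distance_km": 3.0, "is_rush_hour": True, "label": 25.0}]
training_set = build_training_set(history)
online_input = build_prediction_input({"distance_km": 3.0, "is_rush_hour": True})
```

Because neither path has its own copy of the transformation, a change to the preparation logic cannot reach training without also reaching prediction.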

Currently, the offline pipelines are used to feed batch model training and batch prediction jobs and the online pipelines feed online, low latency predictions (and in the near future, online learning systems).

We provide containers and scheduling to run regular jobs to compute features which can be made private to a project or published to the Feature Store (see below) and shared across teams, while batch jobs run on a schedule or a trigger and are integrated with data quality monitoring tools to quickly detect regressions in the pipeline–either due to local or upstream code or data issues.

Models that are deployed online cannot access data stored in HDFS, and it is often difficult to compute some features in a performant manner directly from the online databases that back Uber’s production services (for instance, it is not possible to directly query the UberEATS order service to compute the average meal prep time for a restaurant over a specific period of time).

We support two options for computing these online-served features, batch precompute and near-real-time compute, outlined below: We found great value in building a centralized Feature Store in which teams around Uber can create and manage canonical features to be used by their teams and shared with others.
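The two options can be pictured with a toy in-memory store: values loaded in bulk from a batch job, and values aggregated from recent events. This is a sketch under stated assumptions, not Uber's implementation. The class name, the rolling-mean aggregate, and the rule that recent values take precedence over batch values are all choices made for the example.

```python
from collections import defaultdict, deque


class OnlineFeatureStore:
    """Toy store serving batch-precomputed and near-real-time features."""

    def __init__(self, window_size=3):
        self._batch = {}
        self._recent = defaultdict(lambda: deque(maxlen=window_size))

    def load_batch(self, feature, values):
        """Batch precompute: bulk-load values produced by an offline job."""
        for entity, value in values.items():
            self._batch[(feature, entity)] = value

    def observe(self, feature, entity, value):
        """Near-real-time compute: fold a new event into a rolling window."""
        self._recent[(feature, entity)].append(value)

    def get(self, feature, entity):
        """Serve the rolling mean if events exist, else the batch value."""
        recent = self._recent.get((feature, entity))
        if recent:
            return sum(recent) / len(recent)
        return self._batch.get((feature, entity))


store = OnlineFeatureStore(window_size=3)
store.load_batch("avg_prep_minutes", {"restaurant_1": 12.0})
from_batch = store.get("avg_prep_minutes", "restaurant_1")
for minutes in (10.0, 20.0, 30.0, 40.0):
    store.observe("avg_prep_minutes", "restaurant_1", minutes)
from_stream = store.get("avg_prep_minutes", "restaurant_1")
```

The lookup at prediction time is a key read in both cases; the two options differ only in how fresh the stored value is.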

At a high level, it accomplishes two things: At the moment, we have approximately 10,000 features in Feature Store that are used to accelerate machine learning projects, and teams across the company are adding new ones all the time.

We currently support offline, large-scale distributed training of decision trees, linear and logistic models, unsupervised models (k-means), time series models, and deep neural networks.

For every model that is trained in Michelangelo, we store a versioned object in our model repository in Cassandra that contains a record of: The information is easily available to the user through a web UI and programmatically through an API, both for inspecting the details of an individual model and for comparing one or more models with each other.
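A minimal sketch of a versioned model record and a comparison call is shown below. The list of fields Michelangelo actually stores is not reproduced in this excerpt, so the fields here (`config`, `metrics`) and all class and method names are hypothetical, and a dictionary stands in for Cassandra.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRecord:
    model_id: str
    name: str
    version: int
    config: dict   # hypothetical: training configuration
    metrics: dict  # hypothetical: evaluation metrics


class ModelRepository:
    """In-memory stand-in for a model repository."""

    def __init__(self):
        self._records = {}
        self._latest_version = {}

    def save(self, name, config, metrics):
        """Store a new immutable record with the next version number."""
        version = self._latest_version.get(name, 0) + 1
        record = ModelRecord(str(uuid.uuid4()), name, version, dict(config), dict(metrics))
        self._records[record.model_id] = record
        self._latest_version[name] = version
        return record

    def compare(self, first_id, second_id, metric):
        """Return how much the second model's metric differs from the first's."""
        first, second = self._records[first_id], self._records[second_id]
        return second.metrics[metric] - first.metrics[metric]


repo = ModelRepository()
v1 = repo.save("eta", {"max_depth": 4}, {"rmse": 5.0})
v2 = repo.save("eta", {"max_depth": 6}, {"rmse": 4.0})
improvement = repo.compare(v1.model_id, v2.model_id, "rmse")
```

Never overwriting a record is what makes experiment comparison possible after the fact.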

In the case of decision tree models, we let the user browse through each of the individual trees to see their relative importance to the overall model, their split points, the importance of each feature to a particular tree, and the distribution of data at each split, among other variables.

Selecting two features lets the user understand the feature interactions as a two-way partial dependence diagram, as showcased below: Michelangelo has end-to-end support for managing model deployment via the UI or API and three modes in which a model can be deployed: In all cases, the required model artifacts (metadata files, model parameter files, and compiled DSL expressions) are packaged in a ZIP archive and copied to the relevant hosts across Uber’s data centers using our standard code deployment infrastructure.

In the case of offline models, the predictions are written back to Hive where they can be consumed by downstream batch jobs or accessed by users directly through SQL-based query tools, as depicted below: More than one model can be deployed at the same time to a given serving container.

When a new model expects the same set of features as the old model it replaces, users can deploy the new model to the same tag as the old model, and the container will start using the new model immediately.
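The tag mechanism described above can be sketched as a level of indirection between a stable name and a model identifier. The `ServingContainer` class and its methods are hypothetical and only illustrate the idea.

```python
class ServingContainer:
    """Toy container that resolves predictions by model ID or by tag."""

    def __init__(self):
        self._models = {}  # model_id -> prediction function
        self._tags = {}    # tag -> model_id

    def deploy(self, model_id, predict_fn, tag=None):
        """Load a model; if a tag is given, point the tag at this model."""
        self._models[model_id] = predict_fn
        if tag is not None:
            self._tags[tag] = model_id

    def predict(self, features, model_id=None, tag=None):
        """Score with an explicit model ID, or with whatever the tag points to."""
        if model_id is None:
            model_id = self._tags[tag]
        return self._models[model_id](features)


container = ServingContainer()
container.deploy("model-old", lambda features: sum(features), tag="eta")
before = container.predict([1.0, 2.0], tag="eta")
container.deploy("model-new", lambda features: 2 * sum(features), tag="eta")
after = container.predict([1.0, 2.0], tag="eta")
pinned = container.predict([1.0, 2.0], model_id="model-old")
```

Clients that call by tag pick up the replacement without a code change, while clients that pin a model ID keep the old behavior.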

For A/B testing of models, users can simply deploy competing models either via UUIDs or tags and then use Uber’s experimentation framework from within the client service to send portions of the traffic to each model and track performance metrics.
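Uber's experimentation framework is not described in this excerpt. As a generic stand-in, a client service could split traffic deterministically by hashing a stable identifier; the function name, the default model tags, and the use of SHA-256 are assumptions for illustration.

```python
import hashlib


def assign_model(user_id, treatment_share=0.5, control="model-a", treatment="model-b"):
    """Deterministically route a user to one of two deployed models."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return treatment if bucket < treatment_share else control


assignments = [assign_model(f"user-{i}") for i in range(10_000)]
treatment_fraction = assignments.count("model-b") / len(assignments)
```

Hashing instead of drawing a random number keeps each user on the same model across requests, which keeps per-user performance metrics attributable to a single variant.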

To make sure that a model is working well into the future, it is critical to monitor its predictions so as to ensure that the data pipelines are continuing to send accurate data and that the production environment has not changed such that the model is no longer accurate.

In the case of a regression model, we publish R-squared/coefficient of determination, root mean square logarithmic error (RMSLE), root mean square error (RMSE), and mean absolute error metrics to Uber’s time series monitoring systems so that users can analyze charts over time and set threshold alerts, as depicted below: The last important piece of the system is an API tier.
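For reference, the four regression metrics named above can be computed from paired actual and predicted values as in the sketch below. These are the standard textbook definitions; the function name is hypothetical, and the sketch assumes the actual values are not all identical (so R-squared is defined) and that all values are greater than -1 (so the logarithm in RMSLE is defined).

```python
import math


def regression_metrics(actual, predicted):
    """Return R-squared, RMSE, MAE, and RMSLE for paired observations."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    log_errors = [math.log1p(p) - math.log1p(a) for a, p in zip(actual, predicted)]
    return {
        "r_squared": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "rmsle": math.sqrt(sum(e * e for e in log_errors) / n),
    }


metrics = regression_metrics(actual=[1.0, 2.0, 3.0], predicted=[2.0, 2.0, 2.0])
```

In this toy case the model always predicts the mean of the actual values, so R-squared is 0; a perfect model would score 1 with all three error metrics at 0.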

As the platform layers mature, we plan to invest in higher level tools and services to drive democratization of machine learning and better support the needs of our business.

If you are interested in tackling machine learning challenges that push the limits of scale, consider applying for a role on our team!

Jeremy Hermann is an Engineering Manager and Mike Del Balso is a Product Manager on Uber’s Machine Learning Platform team.