Track: Predictive Architectures in the Real World

Location: Cyril Magnin I

Day of week: Tuesday

Predictive data pipelines have become essential to building engaging experiences on the web today. Whether you enjoy personalized news feeds on LinkedIn and Facebook, profit from near real-time updates to search engines and recommender systems, or benefit from near real-time fraud detection on a lost or stolen credit card, you have come to rely on the fruits of predictive data pipelines as an end user.

Running a successful machine learning project in production takes more than a clever algorithm. In this track, the experts who built some of the most successful commercial recommendation systems will tell us what it really takes. How do you build the architectures, data pipelines, and DevOps best practices that drive real-world machine learning?

Track Host: Gwen Shapira

Gwen is a principal data architect at Confluent, helping customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating microservices, relational databases, and big data technologies. She currently specializes in building real-time, reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka: The Definitive Guide” and “Hadoop Application Architectures,” and a frequent presenter at industry conferences. She is also a committer on the Apache Kafka and Apache Sqoop projects. When Gwen isn't coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

The “People You May Know” (PYMK) recommendation service helps LinkedIn’s members identify other members they might want to connect with, and it is the major driver of growth for LinkedIn's social network. The principal challenge in developing a service like PYMK is the sheer scale of computation needed to make precise recommendations with high recall. The PYMK service at LinkedIn has been operational for over a decade, during which it has evolved from an Oracle-backed system that took weeks to compute recommendations, to a Hadoop-based system that took a few days, to its most modern embodiment, which computes recommendations in near real time.

This talk will present the evolution of PYMK to its current architecture. We will focus on the various systems we built along the way, with an emphasis on those behind our most recent architecture: Gaia, our real-time graph computing capability, and Venice, our online feature store with scoring capability. We will describe how we integrate these individual systems to generate recommendations in a timely, agile, and cost-efficient manner, and briefly discuss the lessons learned about the scalability limits of our past and current design choices and how we plan to tackle the scalability challenges of the next phase of growth.

Early detection of abnormal events can be critical for many business applications; however, there are numerous challenges in implementing real-time anomaly models at scale. Server failures, developer errors, and malicious activities are very different scenarios with different engineering requirements. Moreover, most analytical models have traditionally been designed for the batch processing paradigm and usually cannot easily be adapted to unbounded datasets and real-time latencies.

At PayPal, we must be able to analyze billions of events every day in real time across a wide range of services, devices, and locations. In a collaboration between our platform engineering and data science teams, we have built a generic framework for developing robust and scalable anomaly detection streaming applications, focusing on the flexibility to support different types of statistical and machine learning models. Inspired by the design of scikit-learn and Spark MLlib, we have designed a simple pipeline-based API on top of Spark Structured Streaming that captures common patterns of the anomaly detection domain.

At the base of the framework, we took advantage of Spark Structured Streaming's fast and scalable execution engine, together with stream-oriented building blocks, to allow easy extension to new production-grade models. We have found real-time anomaly detection to provide powerful capabilities in many different fields; internally, we use the framework for a variety of use cases ranging from fraud prevention to operations and even security.
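To make the "pipeline-based API" idea concrete, here is a minimal sketch in plain Python of a scikit-learn-style chain: stages extract features from each event, and a terminal detector flags anomalies. This is a hypothetical, simplified illustration of the pattern — not PayPal's actual framework, which runs on Spark Structured Streaming:

```python
from collections import deque

class ZScoreDetector:
    """Flags values more than `threshold` rolling standard deviations
    from a rolling mean; warms up before emitting any score."""
    def __init__(self, window=50, threshold=3.0, min_samples=10):
        self.window = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def score(self, value):
        if len(self.window) < self.min_samples:
            self.window.append(value)
            return 0.0  # still warming up
        mean = sum(self.window) / len(self.window)
        std = (sum((v - mean) ** 2 for v in self.window) / len(self.window)) ** 0.5
        z = 0.0 if std == 0 else abs(value - mean) / std
        self.window.append(value)
        return z

    def is_anomaly(self, value):
        return self.score(value) > self.threshold

class Pipeline:
    """Chains feature-extraction stages in front of a detector,
    scikit-learn style: each stage transforms the event in turn."""
    def __init__(self, stages, detector):
        self.stages = stages
        self.detector = detector

    def process(self, event):
        for stage in self.stages:
            event = stage(event)
        return self.detector.is_anomaly(event)

# Usage: flag a latency spike in a stream of events.
pipe = Pipeline(stages=[lambda e: e["latency_ms"]],
                detector=ZScoreDetector())
normal = [pipe.process({"latency_ms": 100 + (i % 5)}) for i in range(200)]
spike = pipe.process({"latency_ms": 5000})
assert not any(normal) and spike
```

The value of the pipeline abstraction is that swapping in a different statistical or ML model only means replacing the detector stage, which is the flexibility the talk describes.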

Production machine learning involves intentionally deploying and running some of the ugliest, hardest-to-debug spaghetti code that you have ever seen (i.e., code that was generated by a computer) into the critical path of your operational environment. Because so much of machine learning code has an academic origin and most experienced practitioners have primarily worked in offline, batch-oriented computing environments, there is often an impedance mismatch between devops and machine learning practitioners that causes unnecessary pain for everyone involved. In this talk, we're going to go deep into the monitoring and visibility needs of machine learning models in order to bridge these gaps and make everyone's working life a bit simpler, more pleasant, and more productive.
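One concrete piece of the monitoring story the talk alludes to is watching a model's live score distribution drift away from its training-time baseline. A minimal sketch using a Population Stability Index-style check (illustrative only — not tied to any particular production system):

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-4):
    """Population Stability Index between two samples of model scores
    in [0, 1). A common rule of thumb treats PSI > 0.2 as significant
    drift worth alerting on."""
    def proportions(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(bins - 1, int(x * bins))] += 1
        # Clamp empty bins to eps so the log term stays defined.
        return [max(c / len(xs), eps) for c in counts]
    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
baseline = [random.betavariate(2, 5) for _ in range(5000)]  # training-time scores
stable   = [random.betavariate(2, 5) for _ in range(5000)]  # same distribution
drifted  = [random.betavariate(5, 2) for _ in range(5000)]  # shifted distribution

assert psi(baseline, stable) < 0.05   # quiet: only sampling noise
assert psi(baseline, drifted) > 0.2   # alert: scores have drifted
```

Checks like this give operators a model-health signal without needing to read the "spaghetti code" itself, which is exactly the devops/ML bridge the talk argues for.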

Feature engineering can be loosely described as the process of extracting useful signals from the underlying raw data for use in predictive decisioning systems such as machine learning (ML) models or business rules engines. The raw data is often available via heterogeneous types of underlying systems, such as offline/batch data computed in Hadoop or other data warehouses, key-value datastores, production microservices, and streaming data jobs or services. Traditionally, such engineering has been achieved via ad hoc data pipelines or feature serving layers/services. In our experience at Uber, such practices have turned out to be quite fragile, resulting in hard-to-maintain infrastructure and a large amount of redundant engineering. Moreover, with ML models, it has exposed serious problems such as training/serving skew.

In this talk, we'll be presenting the infrastructure we're building within Uber's Michelangelo ML Platform that:

Enables a general approach to feature engineering across diverse data systems, such as offline/batch data warehouses (e.g., Apache Hive), real-time data in Uber's key-value stores (such as Cassandra) or production microservices, and near real-time data via stream processing infrastructure based on Apache Kafka.

Demonstrates how the ML training/serving skew problem is addressed by ensuring data parity across online/serving and offline/training systems.

Discusses the scalability challenges and the sensitivities around serving data in single-digit milliseconds.
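The parity idea behind avoiding training/serving skew can be sketched in a few lines: a single feature definition feeds both the offline (training) path and the online (serving) path, so the model never sees a transformation at serving time that differs from the one it was trained on. All names below are illustrative, not Michelangelo's actual API:

```python
def trips_last_7d(daily_counts):
    """Single source of truth for the feature transformation."""
    return sum(daily_counts[-7:])

# Offline path: a batch job materializes features for training.
offline_store = {}
def backfill(user_id, daily_counts):
    offline_store[user_id] = trips_last_7d(daily_counts)

# Online path: the SAME function populates the serving key-value store,
# so online and offline values cannot diverge.
online_store = {}
def refresh_online(user_id, daily_counts):
    online_store[user_id] = trips_last_7d(daily_counts)

history = [0, 2, 1, 3, 0, 4, 2, 5, 1]
backfill("user42", history)
refresh_online("user42", history)

# Parity check: training and serving must agree on the feature value.
assert offline_store["user42"] == online_store["user42"] == 16
```

Skew typically creeps in when the two paths reimplement the transformation independently (say, a SQL window function offline and hand-rolled service code online); centralizing the definition is the essence of the data-parity guarantee the talk describes.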

This talk will take the audience through the evolution of Spotify's architecture that serves recommendations (playlists, albums, etc.) on the home tab. We'll discuss the tradeoffs of the different architectural decisions we made and how we went from batch pipelines to services to a combination of services and streaming pipelines.

Join a panel of experts and discuss the unique challenges of building and running data architectures for predictions, recommendations, and machine learning. Josh Wills built the infrastructure for Slack’s search, learning, and intelligence products. Sumit Rangwala is constantly improving People You May Know recommendations at LinkedIn. Eric Chen leads the team that built Michelangelo, Uber's massive-scale, soup-to-nuts machine learning infrastructure. Emily Samuels and Anil Muppalla evolved Spotify’s recommendations from batch to real time. If you have ever wondered what it really takes to build data architectures that support large-scale data products, this is the time and place to ask, learn, and get inspired.