Building a Centralised Machine Learning Pipeline with Spark and Kafka

Many organisations struggle to get machine learning projects to market quickly and to let data science teams share features with one another. In this talk, I will discuss the machine learning pipeline that a large Australian telecommunications company built with Kafka and Spark to achieve both goals, as well as the challenges faced along the way. I’ll begin with the utility of and motivation for a centralised feature store, then look at the complexities of such an undertaking, both technical and organisational. I will then dig into the implementation: the scalability headaches we faced and the solutions we used to drastically improve the speed and organisational scalability of the system. Topics include providing a declarative API that allowed us to compile feature definitions into optimised Spark code, the complexity of a true streaming dedupe, adjusting the workflow for different machine learning use cases, fine-tuning resource allocation to avoid unnecessary bottlenecks, and supporting both streaming and batch data sources. Finally, I will touch on lessons learnt along the way and offer advice on things to avoid and on how to take things to the next stage.
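
To give a flavour of why a true streaming dedupe is hard, here is a minimal pure-Python sketch. It is not the pipeline's implementation: the `StreamingDeduper` class, the `(event_id, event_time)` event shape and the `allowed_lateness` parameter are all assumptions made for illustration. The idea it shows is the standard trade-off: an unbounded stream cannot remember every id forever, so state is evicted behind a watermark, and anything arriving later than the watermark can no longer be checked reliably. (Spark Structured Streaming expresses a similar idea by combining `withWatermark` with `dropDuplicates`.)

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, Iterator, Tuple

# Hypothetical event shape for this sketch: (event_id, event_time_in_seconds).
Event = Tuple[Hashable, float]


@dataclass
class StreamingDeduper:
    """Drop repeated event ids while keeping state bounded by a watermark."""

    allowed_lateness: float  # how far an event may trail the newest event time
    _seen: Dict[Hashable, float] = field(default_factory=dict)
    _max_event_time: float = float("-inf")

    def process(self, events: Iterable[Event]) -> Iterator[Event]:
        for event_id, event_time in events:
            # The watermark trails the largest event time observed so far.
            self._max_event_time = max(self._max_event_time, event_time)
            watermark = self._max_event_time - self.allowed_lateness

            # Evict ids older than the watermark so state cannot grow forever.
            # (A linear scan per event keeps the sketch short; it is not efficient.)
            self._seen = {k: t for k, t in self._seen.items() if t >= watermark}

            if event_time < watermark:
                # Too late: its id may already have been evicted, so we cannot
                # tell whether it is a duplicate. This sketch drops it.
                continue
            if event_id in self._seen:
                continue  # duplicate within the retained window

            self._seen[event_id] = event_time
            yield (event_id, event_time)


deduper = StreamingDeduper(allowed_lateness=10.0)
stream = [("a", 0.0), ("b", 1.0), ("a", 2.0), ("c", 20.0), ("a", 21.0), ("d", 5.0)]
result = list(deduper.process(stream))
print(result)
```

In this run the second `"a"` (at time 2.0) is dropped as a duplicate, but the third `"a"` (at time 21.0) is emitted again because the original was evicted once the watermark passed it, and `"d"` is discarded for arriving behind the watermark. Choosing the lateness bound is therefore a trade-off between memory use and how many duplicates or late events slip through.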