A Guide to Data Engineering Talks at Spark + AI Summit 2019

Selected highlights from the new track

Data quality issues and data pipeline complexity are the bane of big data practitioners' existence. Whether you are charged with advanced analytics, developing new machine learning models, providing operational reporting, or managing data infrastructure, data quality is a common concern. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.

Recognizing the importance of data engineering, this year the Spark + AI Summit includes a new track dedicated to the discipline, where presenters will share their experience applying Apache Spark to their use cases.

Apache Spark is regarded by many as the de facto big data processing engine. In his talk Migrating to Apache Spark at Netflix, Ryan Blue will describe Netflix's large-scale migration to Spark from Pig and other MapReduce-based engines.

Concern about production at scale is never far from a data engineer’s heart. In their talk Scaling Apache Spark on Kubernetes at Lyft, Li Gao and Rohit Menon from Lyft will discuss challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale.
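For readers new to Spark on Kubernetes, a rough sketch of what submitting a job to a Kubernetes cluster looks like is below. This is not Lyft's actual setup; the master URL, container image, namespace, and resource values are placeholders, though the flags themselves come from Spark's Kubernetes deployment support (Spark 2.3+).

```shell
# Submit a Spark batch job to a Kubernetes cluster in cluster mode.
# All host names, image names, and counts below are hypothetical.
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=registry.example.com/spark:2.4.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```

Each executor runs as a pod in the given namespace, which is exactly where the scheduling, autoscaling, and resource-isolation challenges discussed in the talk come into play.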

Apache Spark is a living, thriving project, and each new release presents an opportunity to upgrade and adopt newer capabilities. In their talk Apache Spark at Airbnb, Hao Wan and Liyin Tang will share their major production use cases, including both streaming and batch applications, the lessons they learned, and tips for migrating to Spark 2.x.

The last talk I want to highlight here is Understanding Query Plans and Spark UIs, by Xiao Li, Apache Spark Committer and PMC member at Databricks. His talk will address how to read and tune query plans for better performance, and will also cover the major related features in recent and upcoming releases of Apache Spark.

Active Learning

If you are someone who learns best by doing, don't forget to consider the Building Robust Production Data Pipelines with Databricks Delta tutorial: a 90-minute session that pairs an expert-led introduction to the next-generation unified analytics engine with hands-on exercises.