Thursday May 17th, 15:30-16:15

Spark 2.x and Beyond

About this Talk

Apache Spark aims to solve the problem of working with large-scale distributed data. At only 3 years old, the project has become one of the premier data processing frameworks.

This talk will introduce a new way to run SQL queries on structured, distributed data in Spark 2.x. We'll walk through how to get started with Spark and, with the help of live code, answer practical, common questions about some fun data sets. We'll show how fast and easy it is to both explore and process data using Scala and Spark SQL, and leave you with the tools to get started on your own distributed data.
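To give a flavor of the workflow the talk demos, here is a minimal Spark 2.x sketch in Scala: create a `SparkSession`, register a DataFrame as a temporary view, and query it with Spark SQL. The app name, sample data, and column names are all illustrative, not taken from the talk's data sets; this assumes Spark 2.x is on the classpath (e.g. run it in `spark-shell`).

```scala
// Minimal Spark 2.x sketch (script-style, e.g. for spark-shell).
// SparkSession is the single entry point in 2.x, replacing the
// separate SparkContext / SQLContext of the 1.x APIs.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("intro-spark-sketch") // hypothetical app name
  .master("local[*]")            // run locally for exploration
  .getOrCreate()

import spark.implicits._

// A tiny in-memory data set standing in for a real file load
// (e.g. spark.read.csv(...) or spark.read.json(...)).
val pets = Seq(("cat", 4), ("dog", 4), ("parrot", 2))
  .toDF("animal", "legs")

// Register the DataFrame so it can be queried with plain SQL.
pets.createOrReplaceTempView("pets")

val fourLegged = spark.sql(
  "SELECT animal FROM pets WHERE legs = 4 ORDER BY animal")

fourLegged.show()
```

The same query could equally be written with the DataFrame API (`pets.filter($"legs" === 4)`); the SQL route is what the talk's title points at.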

Required knowledge

This talk assumes knowledge of Scala and SQL programming, but otherwise serves as an introduction to Spark and data engineering concepts.

Learning objectives

Attendees will learn how to get started with Spark and the differences between the 1.x and 2.x APIs, and will gain familiarity with how to manipulate data. This is a practical, hands-on talk designed to inspire attendees to play around with data themselves.
I've included an example setup so attendees can play along after the presentation covers the basics. You can download the project here: www.krobinson.me/files/intro-spark.zip. Once you unzip the folder, setup instructions are in the README.md. Setup may take about 15 minutes (longer if you need to install Java).

Speaker(s)

Kelley has worked in a variety of engineering roles, ranging from trading live cattle derivatives to building production data pipelines in Scala. She spends a lot of time thinking about how to make technical concepts accessible to new audiences. In her spare time, Kelley cooks and greatly enjoys reorganizing her tiny kitchen to accommodate completely necessary small appliance purchases.