Date

Friday, 01 Sep 2017 3:00 PM

This 4-hour training course introduces Apache Spark v2.2, the open-source cluster-computing framework whose in-memory processing can make analytics applications up to 100 times faster than widely deployed alternatives. Highly versatile across environments, and with a strong foundation in functional programming, Spark is known for REPL-driven development: exploratory code that is easy to write and scales up to production-grade quality relatively quickly.

The main focus will be on what is new in Spark v2.2, including Datasets (compile-time type-safe DataFrames), Structured Streaming, and the de-emphasis of RDDs.
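To make the Dataset point concrete, here is a minimal sketch of the type-safety difference. The `Person` case class, the sample data, and the object name are illustrative assumptions, not course material:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Long)

object DatasetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A Dataset[Person] is checked at compile time: a typo such as
    // `_.agee` would not compile, whereas the equivalent DataFrame
    // column lookup ("agee") would fail only at runtime.
    val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()
    val adultNames = people.filter(_.age >= 18).map(_.name)
    adultNames.show()

    spark.stop()
  }
}
```

The same query written against a plain DataFrame would reference columns by string name, deferring error detection to runtime; that trade-off is exactly what Datasets address.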

The plan is to follow the agenda below, but if participants want to dive deeper into high-complexity topics, I will instead focus on live-coded ad-hoc demos.

1. The first part of the workshop covers Spark SQL with Scala, starting from the small, self-contained examples emphasized by the Spark documentation and tutorials. Used in isolation, Spark SQL realistically serves only such didactic use cases: as a practitioner I know from experience that, when ingesting real-world datasets, Spark SQL very quickly shows its limitations, so more powerful techniques are needed.
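A sketch of the kind of introductory Spark SQL example this part starts from, registering a temporary view and querying it; the file name and column names are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

// A minimal Spark SQL example in the spirit of the official tutorials.
object SparkSqlDemo {
  // The query over the temporary view registered below.
  val query = "SELECT name, age FROM people WHERE age >= 18"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlDemo")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")   // convenient for demos; fragile on messy data
      .csv("people.csv")               // assumed input file

    df.createOrReplaceTempView("people")
    spark.sql(query).show()

    spark.stop()
  }
}
```

Note the reliance on `inferSchema`: handy in a tutorial, but one of the first things that breaks on real-world input, which motivates the second part of the workshop.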

2. The second part of the workshop covers those techniques, without which Spark SQL is largely ineffective on real data. This section is about sharing lessons learned the hard way, and experience gathered in the trenches of real-world data ingestion.
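The agenda does not enumerate these techniques in advance. As one plausible example of going beyond the tutorial defaults, the sketch below declares an explicit schema and quarantines malformed records instead of letting schema inference guess; the file name and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// One illustrative robustness technique (not necessarily the ones taught):
// an explicit schema plus a corrupt-record column, so bad rows are kept
// and inspected rather than silently coerced or dropped.
object RobustIngest {
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", LongType, nullable = true),
    StructField("_corrupt_record", StringType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RobustIngest")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")                            // keep bad rows...
      .option("columnNameOfCorruptRecord", "_corrupt_record")  // ...tagged here
      .json("events.json")                                     // assumed input file

    val bad  = df.filter(df("_corrupt_record").isNotNull)
    val good = df.filter(df("_corrupt_record").isNull).drop("_corrupt_record")

    println(s"good=${good.count()}, quarantined=${bad.count()}")
    spark.stop()
  }
}
```

The design choice here is to fail loudly and locally: malformed input lands in a side column you can count and audit, rather than corrupting downstream aggregates.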

3. The third part of the workshop, titled "Machine Learning By Example", covers multiclass classification using Spark ML's Pipeline API with Scala. Spark ML is the machine-learning library that ships with Spark.
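A minimal sketch of a multiclass Pipeline, chaining a label indexer, a feature assembler, and a logistic-regression classifier. The toy data, column names, and object name are assumptions for illustration, not the course's dataset:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// A minimal multiclass Pipeline sketch; data and columns are invented.
object PipelineDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PipelineDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy three-class dataset: two numeric features and a string label.
    val data = Seq(
      (5.1, 3.5, "setosa"), (6.2, 2.9, "versicolor"), (7.3, 2.8, "virginica"),
      (4.9, 3.0, "setosa"), (6.0, 2.7, "versicolor"), (7.1, 3.0, "virginica")
    ).toDF("f1", "f2", "species")

    // Stage 1: map string labels to numeric indices.
    val indexer = new StringIndexer().setInputCol("species").setOutputCol("label")
    // Stage 2: pack feature columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2")).setOutputCol("features")
    // Stage 3: logistic regression handles more than two classes multinomially.
    val lr = new LogisticRegression().setMaxIter(50)

    // fit() runs all stages in order and returns a single reusable model.
    val model = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(data)
    model.transform(data).select("species", "prediction").show()

    spark.stop()
  }
}
```

The Pipeline abstraction is what makes this production-friendly: the same fitted model object applies the identical indexing, assembling, and classification steps to new data.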