Apache Spark: core concepts, architecture and internals

This post covers core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation, and also describes the architecture and main components of the Spark Driver. There's a github.com/datastrophic/spark-workshop project created alongside this post which contains example Spark applications and a dockerized Hadoop environment to play with. Slides are also available on Slideshare.

Intro

Spark is a generalized framework for distributed data processing providing a functional API for manipulating data at scale, in-memory data caching, and reuse across computations. It applies a set of coarse-grained transformations over partitioned data and relies on the dataset's lineage to recompute tasks in case of failures. Worth mentioning is that Spark supports the majority of data formats, has integrations with various storage systems, and can be executed on Mesos or YARN.

A powerful and concise API in conjunction with rich libraries makes it easier to perform data operations at scale, e.g. performing backup and restore of Cassandra column families in Parquet format.
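A minimal sketch of such a backup/restore pair using the spark-cassandra-connector could look like the following (the demo keyspace, the events table and the Event case class are placeholders, not the workshop's actual code):

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// placeholder row model for the column family being backed up
case class Event(id: String, campaignId: String, value: Long)

// dump the Cassandra table to Parquet files
def backup(sc: SparkContext, sqlContext: SQLContext, path: String): Unit = {
  import sqlContext.implicits._
  sc.cassandraTable[Event]("demo", "events").toDF().write.parquet(path)
}

// load the Parquet files back into the same table
def restore(sqlContext: SQLContext, path: String): Unit = {
  sqlContext.read.parquet(path).rdd
    .map(r => Event(r.getAs[String]("id"), r.getAs[String]("campaignId"), r.getAs[Long]("value")))
    .saveToCassandra("demo", "events")
}
```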

Recap

Spark is built around the concepts of Resilient Distributed Datasets and Directed Acyclic Graphs representing transformations and dependencies between them.

A Spark application (often referred to as the Driver Program or Application Master) at a high level consists of a SparkContext and user code which interacts with it, creating RDDs and performing a series of transformations to achieve the final result. These transformations of RDDs are then translated into a DAG and submitted to the Scheduler to be executed on a set of worker nodes.
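A minimal driver program sketch (names and numbers are illustrative) shows where the SparkContext sits in this picture:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("minimal-app"))
    // transformations only declare the DAG...
    val doubled = sc.parallelize(1 to 1000).map(_ * 2)
    // ...the action below submits the job to the scheduler for execution
    println(doubled.reduce(_ + _))
    sc.stop()
  }
}
```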

RDD: Resilient Distributed Dataset

An RDD can be thought of as an immutable parallel data structure with failure recovery possibilities. It provides an API for various transformations and materializations of data as well as for control over caching and partitioning of elements to optimize data placement. An RDD can be created either from external storage or from another RDD and stores information about its parents to optimize execution (via pipelining of operations) and to recompute a partition in case of failure.

From a developer's point of view an RDD represents distributed immutable data (partitioned data + iterator) and lazily evaluated operations (transformations). As an interface, an RDD defines five main properties.
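In code these roughly correspond to the following members of the RDD abstract class (a simplified sketch; access modifiers and some details omitted):

```scala
// the set of partitions the dataset is split into
def getPartitions: Array[Partition]

// dependencies on parent RDDs
def getDependencies: Seq[Dependency[_]]

// a function for computing each split from its parents
def compute(split: Partition, context: TaskContext): Iterator[T]

// (optional) preferred locations to compute each split on, e.g. HDFS block locations
def getPreferredLocations(split: Partition): Seq[String] = Nil

// (optional) a partitioner for key-value RDDs
val partitioner: Option[Partitioner] = None
```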

Here's an example of the RDDs created during a call of the method sparkContext.textFile("hdfs://..."), which first loads HDFS blocks in memory and then applies a map() function to filter out keys, creating two RDDs:

HadoopRDD:

getPartitions = HDFS blocks

getDependencies = None

compute = load block in memory

getPreferredLocations = HDFS block locations

partitioner = None

MapPartitionsRDD:

getPartitions = same as parent

getDependencies = parent RDD

compute = compute parent and apply map()

getPreferredLocations = same as parent

partitioner = None
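In code, the chain producing these two RDDs could look like the following (the HDFS path and the key-extraction function are just examples):

```scala
// sc.textFile() is backed by a HadoopRDD whose partitions are the HDFS blocks;
// map() produces a MapPartitionsRDD that applies the function to its parent's data
val lines = sc.textFile("hdfs://namenode:8020/path/to/data")
val keys  = lines.map(line => line.split("\t")(0))
```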

RDD Operations
Operations on RDDs are divided into several groups:

Transformations

apply a user function to every element in a partition (or to the whole partition)

apply aggregation function to the whole dataset (groupBy, sortBy)

introduce dependencies between RDDs to form DAG

provide functionality for repartitioning (repartition, partitionBy)

Actions

trigger job execution

used to materialize computation results

Extra: persistence

explicitly store RDDs in memory, on disk or off-heap (cache, persist)

checkpointing for truncating RDD lineage

Here's a code sample of a job which aggregates data from Cassandra in lambda-architecture style, combining previously rolled-up data with the data from the raw storage, and demonstrates some of the transformations and actions available on RDDs.
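A sketch of such a job, assuming placeholder keyspace/table names, case classes and campaignId/watermark parameters:

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkContext

case class Event(eventType: String, campaignId: String, time: Long, value: Long)
case class Rollup(eventType: String, campaignId: String, value: Long)

def campaignTotals(sc: SparkContext, campaignId: String, watermark: Long) = {
  // raw events newer than the last rollup, aggregated on the fly
  val freshCounts = sc.cassandraTable[Event]("demo", "event")
    .filter(e => e.campaignId == campaignId && e.time > watermark)
    .map(e => ((e.eventType, e.campaignId), e.value))
    .reduceByKey(_ + _)          // wide transformation: introduces a shuffle
    .cache()                     // persistence: keep the aggregate in memory for reuse

  // previously rolled-up data (the batch view)
  val rolledUp = sc.cassandraTable[Rollup]("demo", "event_rollup")
    .map(r => ((r.eventType, r.campaignId), r.value))

  // combine the batch view with the fresh data and materialize on the driver
  rolledUp.join(freshCounts)
    .mapValues { case (rolled, fresh) => rolled + fresh }
    .collect()                   // action: triggers job execution
}
```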

Execution workflow recap

Here's a quick recap of the execution workflow before digging deeper into details: user code containing RDD transformations forms a Directed Acyclic Graph which is then split into stages of tasks by the DAGScheduler. Stages combine tasks which don't require shuffling/repartitioning of the data. Tasks run on workers and the results are then returned to the client.

DAG

Here's a DAG for the code sample above. Basically, any data processing workflow can be defined as reading the data source, applying a set of transformations and materializing the result in different ways. Transformations create dependencies between RDDs, and here we can see different types of them.

The dependencies are usually classified as "narrow" and "wide":

Narrow (pipelineable)

each partition of the parent RDD is used by at most one partition of the child RDD

allow for pipelined execution on one cluster node

failure recovery is more efficient as only lost parent partitions need to be recomputed

Wide (shuffle)

multiple child partitions may depend on one parent partition

require data from all parent partitions to be available and to be shuffled across the nodes

if some partition is lost from all the ancestors, a complete recomputation is needed
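To make the distinction concrete, here's a small sketch (the input values are arbitrary): map() keeps a narrow, one-to-one dependency on its parent, while reduceByKey() introduces a wide shuffle dependency, which shows up in the lineage printed by toDebugString:

```scala
val pairs   = sc.parallelize(1 to 1000).map(x => (x % 10, x))  // narrow: no data movement
val reduced = pairs.reduceByKey(_ + _)                          // wide: repartitions by key

// the indentation change in the printed lineage marks the shuffle boundary
println(reduced.toDebugString)
```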

Splitting DAG into Stages

Spark stages are created by breaking the RDD graph at shuffle boundaries:

RDD operations with "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks in each stage

operations with shuffle dependencies require multiple stages (one to write a set of map output files, and another to read those files after a barrier)

In the end, every stage will have only shuffle dependencies on other stages, and may compute multiple operations inside it. The actual pipelining of these operations happens in the RDD.compute() functions of various RDDs.

There are two types of tasks in Spark: ShuffleMapTask, which partitions its input for shuffle, and ResultTask, which sends its output to the driver. The same applies to types of stages: ShuffleMapStage and ResultStage, respectively.
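As an illustration (the input path is hypothetical), a classic word count breaks into exactly two such stages:

```scala
// Stage 0 (ShuffleMapStage): textFile + flatMap + map are pipelined into
// ShuffleMapTasks which partition their output by key for the shuffle.
// Stage 1 (ResultStage): ResultTasks read the shuffled blocks, finish the
// reduceByKey aggregation and send the collected results to the driver.
val counts = sc.textFile("hdfs://namenode:8020/input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffle boundary between the two stages
  .collect()            // action returning results to the driver
```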

Shuffle

During the shuffle, ShuffleMapTask writes blocks to the local drive, and then tasks in the next stages fetch these blocks over the network.