Spark Expositions: Understanding RDDs

Introduction

RDDs, or Resilient Distributed Datasets, are the fundamental data abstraction
in Apache Spark. An RDD is, in essence, a distributed in-memory dataset,
partitioned among the various nodes of a Spark cluster.
The official documentation defines an RDD as follows:

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

The Resilient part refers to fault tolerance: if a partition of the dataset is
lost because a node goes down, it can be recovered in a timely manner. This recoverability comes from the fact that the RDD's lineage graph is maintained by the driver program, so the lost partition can simply be recomputed.

The lineage graph consists of the series of transformations and actions to be
computed on a partition of data, and it is stored on the driver program.
An RDD is, in effect, a set of instructions that are lazily evaluated:
evaluation only occurs when an action is encountered.
Here is an example of what an RDD’s lineage graph looks like:

The Distributed part stems from the fact that the data is distributed among the worker nodes of a cluster. The driver program sends each worker the RDD along with the partition of the data that should be computed on that particular node. The program on the worker node responsible for executing the set of instructions encapsulated in the RDD is called the Executor. The partitioning itself appears in the RDD source, where the relevant method is declared as follows:

/**
 * Implemented by subclasses to return the set of partitions in this RDD. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getPartitions: Array[Partition]

Types of Operations

RDDs support two types of operations: Transformations and Actions.

Transformations: Transformations convert an RDD from one type to another.
They are lazily evaluated, meaning that nothing is executed until an action requires their output.

Actions: Actions trigger the execution of the chain of transformations and cause data to be returned. They are necessary for transformations to be evaluated. Without an action, an RDD is just a chain of transformations that are yet to be evaluated.

Contents of an RDD

An RDD is characterized by the following five properties:

A list of partitions that comprise the dataset.

A function to perform the computation for each partition.

A list of dependencies on other RDDs, i.e. parent RDDs.
A parent RDD is the RDD on which a transformation was
executed to produce this one.

A Partitioner for key-value/Pair RDDs (Pair RDDs are defined later).

A list of preferred locations/hosts on which to compute each of the
partitions into which the data has been split.

Where does the RDD live?

The RDD lives in the Spark Context on the driver program, which runs on the master node in a cluster setup. There are several concrete implementations of the RDD abstraction; some common ones are listed below.

MapPartitionsRDD – applies the provided function to every partition of the
parent RDD. It is what is normally returned when an RDD is created from a file via sc.textFile(..)

ParallelCollectionRDD – an RDD representing a collection of elements. It contains numSlices partitions and locationPrefs, which is a Map. It is obtained from a call to sc.parallelize(..) on an in-memory collection.

PipedRDD – an RDD that pipes the contents of each parent partition through an external command (printing them one per line) and returns the output as a collection of strings.

There are also more abstract kinds of RDDs, such as:

PairRDD – an RDD of key-value pairs. It can be, for example, a ParallelCollectionRDD containing key-value pairs. There is no concrete PairRDD class, since it is an abstraction; however, the class PairRDDFunctions provides a set of transformations that can be performed on Pair RDDs.