History

Ecosystem

Spark SQL
Spark SQL is a Spark library for structured data processing. Spark SQL brings native SQL support to Spark as well as the notion of DataFrames. Information workers are free to use either interface, or toggle between both, while the underlying execution engine remains the same.

Spark Streaming
Spark Streaming can ingest and process live streams of data at scale. Since Spark Streaming is an extension of the core Spark API, streaming jobs can be expressed in the same manner as writing a batch query.

GraphX (Graph Processing)
GraphX is a Spark library that allows users to build, transform and query graph structures with properties attached to each vertex (aka node) and edge (aka relationship).

Spark Core
Spark Core is the underlying execution engine that all other functionality is built on top of. Spark Core provides basic functionality such as task scheduling, memory management, and fault recovery, as well as Spark's primary data abstraction: Resilient Distributed Datasets (RDDs).

Apache Spark vs. MapReduce

TL;DR - Spark is faster and easier to use.

MapReduce, introduced back in 2004, is a mature software framework that has been the mainstay programming model for large-scale data processing. While MapReduce is great for single-pass computations, it is inefficient when multiple passes over the data are required. While not a big deal for batch processing, MapReduce can be painfully slow in scenarios that require the sharing of intermediate results. This is quite common for certain classes of applications such as interactive ad-hoc queries, machine learning, and real-time streaming.

MapReduce, as was the case for many frameworks at the time, would need to write interim state to disk (i.e. a distributed file system) in order to reuse data between computations. Each pass would therefore incur a significant performance overhead due to data replication, data serialisation, disk I/O, etc, consuming a substantial amount of the overall execution time.

In contrast, Spark's programming model revolves around Resilient Distributed Datasets (RDDs), an abstraction of distributed memory (in addition to distributed disk), making the framework an order of magnitude faster for algorithms that are iterative in nature.

In addition to being performant, Spark provides high-level operators (Transformations and Actions) that can be expressed in a number of language APIs (Java, Scala, Python, SQL, and R), making Spark easy to use in comparison to MapReduce, which can get quite verbose as developers are required to write low-level code.

Spark Operations

Spark supports two types of operations:

Transformations
Transformations take an input, perform some type of manipulation (e.g. map, filter, sample, distinct), and produce an output.

Since data structures within Spark are immutable (i.e. unable to be changed once created), the output of a transformation is not the results themselves but a new (transformed) data abstraction (i.e. RDD or DataFrame).

Transformations are lazy (i.e. Spark will not compute the results until an action requires the results to be returned). This allows Spark to optimise the physical execution plan right at the last minute to run as efficiently as possible.

Actions
Actions (e.g. collect, count, reduce) are the second type of operation: they trigger the evaluation of the accumulated transformations and return results back to the driver program.

Once downloaded, extract the zipped contents, navigate to the Spark directory, and start the Spark shell.

For example:

cd ~/Downloads/spark-2.3.1-bin-hadoop2.7

bin/pyspark

Note: In this example I have started the Spark shell in Python; alternatively, you can use Scala by typing bin/spark-shell.

In terms of next steps, check out Apache Spark's Quick Start guide, which contains sample code for both Scala and Python. In a subsequent post, I will cover how you can tap into Spark-as-a-Service in the cloud using Azure Databricks (stay tuned).