Introduction to Apache Spark

24 September 2017

What is Apache Spark

Apache Spark is a distributed, in-memory and disk based optimized system which does real-time analytics using Resilient Distributed Data(RDD) Sets.Spark includes a streaming library, and a rich set of programming interfaces to make data processing and transformation easier.It supports both in memory and disk based computation.

The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters. It is used to perform ETL, interactive queries (SQL), advanced analytics (e.g. machine learning) and streaming over large datasets in a wide range of data stores (e.g. HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python and Scala.

There are two styles of application that benefit greatly from Spark processing model

Iterative Algorithms

Here a function is applied to dataset repeatedly until an exits condition is met.

Interactive analysis

Here a user issues a series of ad hoc explolatory queries to Spark datasets

Components of Spark

General Execution: Spark Core

Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.

Streaming Analytics: Spark Streaming

Many applications need the ability to process and analyze not only batch data, but also streams of new data in real-time. Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics. It readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.

Graph Computation: GraphX

GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph structured data at scale. It comes complete with a library of common algorithms.

Structured Data: Spark SQL

Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is an engine for Hive data that enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data. It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning).

Machine Learning: MLlib

Machine learning has quickly emerged as a critical piece in mining Big Data for actionable insights. Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce). The library is usable in Java, Scala, and Python as part of Spark applications, so that you can include it in complete workflows.

Spark Streaming vs Spark Batch

Many of the IT based companies collects data in real-time from various sources like sensors ,IOT devices ,social networks,Mobile devices ,web applications and online transactions.These data needs to be monitored constantly and acted upon quickly which is not possible through a batch based Spark applications .To address this issue Spark streaming is used through which real-time stream processing is possible.

Spark Streaming Use Cases

There can be many situations where Spark streaming can be used which depends upon the overall objective and business case of the company .In general it can be divided into four broad ways in which spark streaming is being used today .

Streaming ETL: Data is continuously obtained from different sources which is then cleaned and aggregated in real-time before pushing to data warehouse or data marts.

Data enrichment: Live data is enriched with more information by adding or joining it with static content/datasets in real-time.

Complex sessions and continuous learning: Events related to a live session (eg click stream data) are grouped together ana analyzed .Some of these session information can be used to continuously update machine learning models.

Resilient Distributed DataSet(RDD) in Spark

RDD is the fault-tolerant primary data structure/abstraction in Apache Spark which is immutable distributed collection of objects .

It is read-only collection of objects that is
partitioned across multiple machines in a cluster.

.Datasets in RDD is divided into logical partition across the nodes of the cluster that can be operated in parallel.

The term ‘resilient’ in ‘Resilient Distributed
Dataset’ refers to the fact that a lost partition can be reconstructed automatically by
Spark by recomputing it from the RDDs that it was computed from.

Spark uses RDD to hide the data partitioning so that it can be used to design parallel computation framework

Spark uses RDD to hide data
Using RDD Spark hides data partitioning and so that in turn allowed them to design parallel computational framework with a higher-level programming interface (API) for four mainstream programming languages.

Actions: It triggers a computation on a RDD and does something with the results—either returning them to the user, or saving them to external storage.

Actions have an immediate effect, but transformations do not—they are lazy, in the
sense that they don’t perform any work until an action is performed on the transformed
RDD.

### Aggregation transformations

The three main transformations for aggregating RDDs of pairs by their keys are reduceByKey(), foldByKey() and aggregateByKey().

Types of RDD in Spark

HadoopRDD

JavaPairRDD

JavaDoubleRDD

EdgeRDD

VertexRDD

RandomRDD

JdbcRDD

Caching and Persisting Mechanism in Spark

Caching or persistence are optimisation techniques for (iterative and interactive) Spark computations. They help saving interim partial results so they can be reused in subsequent stages. These interim results as RDDs are thus kept in memory (default) or more solid storages like disk and/or replicated.

RDDs can be cached using cache operation and can be persisted using persist operation.
Calling cache() will persist each partition of the RDD in the executor’s memory. If an
executor does not have enough memory to store the RDD partition, the computation
will not fail, but instead the partition will be re-computed as needed.

Only syntactic difference between cache and persist .In fact
cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.

Closures in Spark

It is those variables and methods which must be visible for the executor to perform its computations on the RDD.
Closure is serialized and sent to each executor.

The variables within the closure sent to each executor

Features of RDDs

Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset is a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects (that represent records of the data you work with).

Cluster Manager Types

Apache Spark supports three cluster managers:

Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.

Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications.

Hadoop YARN: the resource manager in Hadoop 2.

Partitions and Partitioning in Spark

A partition is a logical chunk of large distributed data sets .Spark manages data using partitions that helps parallellize distributed data processing with minimal network traffic for sending data between executors.

Spark uses partitioner property to determine the algorithm to determine on which worker that particular record of RDD should be stored on.if partitioner is NONE that means partitioning is not based upon characteristic of data but distribution is random and guaranteed to be uniform across nodes.

Types of Partitioning in Spark

Hash Partitioning(Default Partitioner )

Range partitioning

Shared Variables in Spark

Shared variables are abstraction in Spark which is used in parallel operations in different nodes.
By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables and accumulators

Accumulator variables: They are variables that are only “added” to, such as counters and sums.
An accumulator is a shared variable that tasks can only add to, like counters in Map‐
Reduce . After a job has completed, the accumulator’s final value can be retrieved from the driver program

Broadcast Variables:
It is used to cache a value in memory on all nodes .A broadcast variable is serialized and sent to each executor, where it is cached so that later tasks can access it if needed.

Broadcast variables play a similar role to the distributed cache in MapReduce , although the implementation in Spark stores the data in memory, only spilling to disk when memory is exhausted.
A broadcast variable is created by passing the variable to be broadcast to the broadcast() method on SparkContext.

Parallellized Collections

Parallellized collections are created by calling SparkContext’s parallellize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

Data Locality in Spark

Data locality refers to how close data is to the code processing it .Having code and the data together tends to make the computation fast in Spark . If the code and data are separated ,either one has to move to the other .Moving the serialized code through the network is faster than moving the data through the network .

Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:

PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible

NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes

NO_PREF data is accessed equally quickly from anywhere and has no locality preference

RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch

RDD Lineage in Spark

Spark does not support data replication in the memory. In the event of any data loss, it is rebuilt using the “RDD Lineage”. It is a process that reconstructs lost data partitions.

The RDDs in Spark, depend on one or more other RDDs. The representation of dependencies in between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of persistent RDD is lost, the data that is lost can be recovered using the lineage graph information.

Minimizing data transfers when working with Spark

Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:

Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.

Using Accumulators – Accumulators help update the values of variables in parallel while executing.

The most common way is to avoid operations ByKey, repartition or any other operations which trigger shuffles.

Pair RDD in Spark

Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey () method that collects data based on each key and a join () method that combines different RDDs together, based on the elements having the same key.

Levels of persistence in Apache Spark

Apache Spark automatically persists the intermediary data from various shuffle operations, however it is often suggested that users call persist () method on the RDD in case they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory or as a combination of both with different replication levels.

The various storage/persistence levels in Spark are -

MEMORY_ONLY

MEMORY_ONLY_SER

MEMORY_AND_DISK

MEMORY_AND_DISK_SER, DISK_ONLY

OFF_HEAP

Lazy Evaluation in Spark

Spark is intellectual in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of it, so that it does not forget - but it does nothing, unless asked for the final result. When a transformation like map () is called on a RDD-the operation is not performed immediately. Transformations in Spark are not evaluated till you perform an action. This helps optimize the overall data processing workflow.