Spark Interview Questions Part-1

·Iterative Algorithms: MapReduce is generally not a good fit for iterative algorithms such as Machine Learning and graph processing. These algorithms are iterative by nature: they need the data in memory so that the same steps can run over it again and again, and the less they write to disk or transfer over the network, the better they perform.

·In-Memory Processing: MapReduce stores intermediate data on disk and reads it back from disk, which is not good for fast processing. Spark instead keeps data in memory (configurable), which saves a lot of time by avoiding the repeated disk reads and writes that happen in Hadoop (see the sketch below).
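
A minimal sketch of this configurable in-memory caching, assuming a working Spark setup; the input path and application name are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("InMemoryDemo"))

val logs = sc.textFile("hdfs:///data/logs")          // hypothetical input path
logs.persist(StorageLevel.MEMORY_AND_DISK)           // configurable storage level
println(logs.count())                                // first action materializes the cache
println(logs.filter(_.contains("ERROR")).count())    // reuses the in-memory data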

Ans:
Spark is often called a cluster computing engine or simply an execution engine. Spark uses many concepts from Hadoop MapReduce, and the two work well together: Spark with HDFS and YARN gives better performance and also simplifies work distribution on the cluster. HDFS serves as the storage engine for huge volumes of data, while Spark serves as the processing engine (in-memory as well as more efficient data processing).

HDFS:
It is used as the storage engine for Spark as well as Hadoop.

YARN:
It is a framework for managing the cluster with a pluggable scheduler.

Run more than MapReduce: With Spark you can run MapReduce-style algorithms as well as higher-level operators, for instance map(), filter(), reduceByKey(), groupByKey(), etc.
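
A short word count illustrating these operators; a minimal sketch, assuming a working Spark setup (the input path and application name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("OperatorsDemo"))

val counts = sc.textFile("hdfs:///data/text")        // hypothetical input path
  .flatMap(_.split("\\s+"))                          // split lines into words
  .filter(_.nonEmpty)                                // drop empty tokens
  .map(word => (word, 1))                            // pair each word with a count
  .reduceByKey(_ + _)                                // sum counts per word

counts.take(10).foreach(println)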

3.How can you use a Machine Learning library such as “SciKit” (scikit-learn), which is written in Python, with the Spark engine?

Ans:
A machine learning tool written in Python, e.g. the scikit-learn library, can be used with Spark either through the Pipeline API in Spark MLlib or by calling pipe() on an RDD to stream data through an external Python script.
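
A minimal sketch of the pipe() approach; the script name score.py and the record format are hypothetical (the script would read one record per line from stdin, apply the scikit-learn model, and write one prediction per line to stdout):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("PipeDemo"))

// One feature vector per line, as text, so the external process can parse it.
val features = sc.parallelize(Seq("1.0,2.0", "3.5,0.1"))

// pipe() streams each partition through the external command's stdin/stdout,
// so the Python/scikit-learn code runs outside the JVM.
val predictions = features.pipe("python score.py")   // score.py is hypothetical
predictions.collect().foreach(println)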

4.Why is Spark good at low-latency iterative workloads, e.g. graphs and Machine Learning?

Ans: Machine Learning algorithms, for instance logistic regression, require many iterations before converging to an optimal model, and graph algorithms similarly traverse all the nodes and edges repeatedly. Any algorithm that needs many iterations before producing results performs better when the intermediate partial results are stored in memory or on very fast solid-state drives.

Spark can cache/store
intermediate data in memory for faster model building and training.

Also, when graph algorithms are processed, they traverse the graph one connection per iteration with the partial results kept in memory. Less disk access and less network traffic can make a huge difference when you need to process lots of data.
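
A minimal sketch of an iterative job that benefits from caching; the dataset and the update rule are illustrative, not a real ML algorithm:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("IterativeDemo"))

// Parse the input once, cache it, then reuse it across iterations.
val points = sc.textFile("hdfs:///data/points.csv")  // hypothetical input path
  .map(_.split(",").map(_.toDouble))
  .cache()

var weight = 0.0
for (_ <- 1 to 10) {
  // Each pass reads `points` from memory rather than from disk.
  val gradient = points.map(p => p.sum * 0.01).sum()
  weight -= gradient
}
println(s"weight after 10 iterations: $weight")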

5.Which kinds of data processing are supported by Spark?

Ans:
Spark offers three kinds of data processing: batch, interactive (Spark Shell), and stream processing, all with a unified API and data structures.

6.How do you define SparkContext?

Ans: It is the entry point for a Spark job. Each Spark application starts by instantiating a SparkContext; you can say that a SparkContext constitutes a Spark application.

A Spark context
can be used to create RDDs, accumulators and broadcast variables, access Spark
services and run jobs.

A Spark context is essentially a client of Spark’s execution environment and it acts as the master of your Spark application.
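
A minimal sketch of those uses: a SparkContext creating an RDD, an accumulator, and a broadcast variable (names and data are illustrative, and longAccumulator assumes Spark 2.x or later):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ContextDemo"))

val numbers = sc.parallelize(1 to 100)                   // create an RDD
val multiplesOfTen = sc.longAccumulator("multiples")     // create an accumulator
val labels = sc.broadcast(Map(0 -> "even", 1 -> "odd"))  // create a broadcast variable

numbers.foreach { n =>
  if (n % 10 == 0) multiplesOfTen.add(1)
}
println(multiplesOfTen.value)                            // 10
println(numbers.map(n => labels.value(n % 2)).first())   // "odd"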

7.How can you define SparkConf?

Ans:
Spark properties control most application settings and are configured
separately for each application. These properties can be set directly on a SparkConf
passed to your SparkContext. SparkConf allows you to configure some of the
common properties (e.g. master URL and application name), as well as arbitrary
key-value pairs through the set() method. For example, we could initialize an
application with two threads as follows:

Note that we run with local[2], meaning two threads, which represents “minimal” parallelism and can help detect bugs that only exist when we run in a distributed context.

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)

8.What are the ways to configure Spark properties? Order them from the least important to the most important.

Ans:
There are the following ways to set up properties for Spark and user programs
(in the order of importance from the least important to the most important):

·conf/spark-defaults.conf - the default

·--conf - the command line option used by
spark-shell and spark-submit

·SparkConf - set programmatically in the application (see the sketch below)
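
A minimal sketch of the most important level, a programmatic SparkConf; the property value is illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Values set here override the same keys supplied via --conf or conf/spark-defaults.conf.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ConfPrecedenceDemo")
  .set("spark.executor.memory", "1g")   // arbitrary key-value pair via set()
val sc = new SparkContext(conf)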

9.What is the Default level of parallelism in
Spark?

Ans:
The default level of parallelism is the number of partitions used when a user does not specify it explicitly.

10.Is it possible to have multiple SparkContexts in a single JVM?

Ans: Yes, if spark.driver.allowMultipleContexts is set to true (default: false). When it is true, Spark logs a warning instead of throwing an exception when a new SparkContext is created while other SparkContexts are already active in the same JVM.
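
A minimal sketch of that setting; note that spark.driver.allowMultipleContexts applies to older Spark versions (it was removed in Spark 3.0), and running multiple contexts is not recommended:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("MultipleContextsDemo")
  .set("spark.driver.allowMultipleContexts", "true")

val sc1 = new SparkContext(conf)
val sc2 = new SparkContext(conf)   // logged as a warning instead of throwing an exception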