First steps with Apache Spark

I came across Apache Spark while looking for tools for an ETL process. I am a big fan of Scala, and Spark with Scala was mentioned a few times in an ETL context, so I couldn’t pass it up as a potential option.

Simple Start

Spark can be hosted on a local machine – such a deployment mode is called Spark Standalone. A pre-built package can be obtained from Spark’s download page. Once unpacked, it is ready to go! It contains a set of scripts for managing a local Spark cluster, which are well described in the documentation.

Spark can also be started in an embedded mode. This happens out of the box when running a main method that creates a SparkContext and connects to the “local” master. In this case the Spark binaries don’t have to be downloaded at all – Spark is simply a library dependency of the application.
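When building with sbt, for example, it is enough to declare Spark as a regular dependency – a minimal sketch (the version number is only an example):

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "2.4.8"   // needed later for Spark SQL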

Spark Console

Spark offers an interactive shell to play with its functionality. If you have downloaded Spark, running ./bin/spark-shell from the unpacked directory starts the console – a Scala REPL with a SparkContext already available as sc.
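For a quick first experiment, a session might look roughly like this (the sample data is made up):

scala> val numbers = sc.parallelize(1 to 100)   // distribute a local collection as an RDD
scala> numbers.filter(_ % 2 == 0).count()       // run a transformation and an action; evaluates to 50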

Structured & Unstructured Data

RDD

SparkContext is the main entry point for Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs (Resilient Distributed Datasets), accumulators and broadcast variables on that cluster. The RDD is at the core of Spark and provides an efficient way to work with unstructured data.

In order to create a SparkContext in a Spark application, code along the following lines can be used:
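import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("appName")     // the application name, visible e.g. in the Spark UI
  .setMaster("local[*]")     // run locally, using all available logical cores
val sc = new SparkContext(conf)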

The code above creates an application named appName that runs locally and uses all available logical cores of the machine (check the master URL docs for more information).
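With the SparkContext in place, RDDs can be created and transformed – for example, a small word count over a plain-text file (the path is just a placeholder):

// read an unstructured text file as an RDD of lines
val lines = sc.textFile("data/notes.txt")

// split lines into words and count the occurrences of each word
val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.take(10).foreach(println)   // fetch a few results back to the driver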

Spark SQL

If the data is in a structured or semi-structured format like JSON, CSV, or even a table in a database, Spark SQL is a perfect fit. It allows running SQL queries against the data and is optimised for storing and distributing it in the cluster. In order to use Spark SQL, a SparkSession has to be created. It can be done along the following lines:
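import org.apache.spark.sql.SparkSession

// reuse the application name and local master from the SparkContext example above
val spark = SparkSession.builder()
  .appName("appName")
  .master("local[*]")
  .getOrCreate()

// once created, structured data can be loaded and queried with SQL
// (the file path and the column name below are only hypothetical)
val people = spark.read.json("data/people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()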