Apache Spark intro notes

Apache Spark is one of the most active and fastest-growing distributed data processing (a.k.a. big data) frameworks (as of 2017). There are a bunch of reasons for this, the most important being:

A much more efficient programming model, a great improvement on (and replacement for) MapReduce: Spark evaluates lazily, lets you cache intermediate datasets, and keeps those intermediate datasets in memory. This gets rid of one of the frequent complaints about MapReduce's sub-optimal performance: the dataset at every intermediate stage is spilled to disk.

Simpler APIs: Spark uses abstractions like RDDs, DataFrames and Datasets to hide a lot of the complexity of the underlying data processing. This makes programming much simpler. As a database developer, I feel this is the equivalent of using SQL instead of working with lists of tuples in Python. Because of these abstractions, the Spark syntax is much more declarative, and it gets better with each release.

APIs in multiple languages: Spark has APIs in Python, Scala and Java, enabling different groups, from data engineers (who typically use Java and are increasingly using Scala) to data scientists and analysts (who are increasingly using Python), to work with the framework.

Because all of Spark's higher-level libraries are built on a shared core (for example, RDDs underlie all the implementations above), every use case benefits from improvements to that core. With projects like Project Tungsten, these core improvements reach all applications with each new release.

{quote}
For example, when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one.
{quote}

One of the common complaints about R is its performance on really large datasets. For machine learning, Spark lets you use libraries much like the ones you are familiar with, but run them on a large cluster (Spark does the heavy lifting).

One significant productivity gain: Spark's various advantages make it easier to write iterative algorithms, and because the libraries all share a similar API, you can combine different programming models as well.

The BDAS stack at https://amplab.cs.berkeley.edu/software/ is also pretty interesting (see the final picture on that page).

Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.

Spark enables data scientists to tackle problems with larger data sizes than they could before with tools like R or Pandas.

Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). It’s important to remember that Spark does not require Hadoop; it simply supports storage systems that implement the Hadoop APIs.

Who am I?

I am a data engineer with interests in databases, data science, algorithms and programming in general.