Tutorial: Quick Overview of Spark 1.6 Core Functionality

In this blog post we discuss Spark 1.6 core functionality and give a quick introduction to using Spark. We start with the basic functionality of RDDs, and later on we demonstrate Spark SQL and the DataFrame API. We have tried to cover the basics of Spark 1.6 core functionality and programming contexts.

Introduction to Apache Spark

Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It is a cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. Spark is designed for fast computation and is a framework for performing general data analytics on a distributed computing cluster such as Hadoop. Its main feature is in-memory cluster computing, which increases the processing speed of an application compared with MapReduce. Spark runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS). It can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.

Features of Apache Spark

Some of the features that make Spark stand out in the Big Data world:

1. Speed

Spark can be up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computing and other optimizations.

2. Ease of Use

Spark lets developers write and run applications in programming languages they already know, such as Scala, Java, Python, and R, and makes it easy to build parallel apps.

3. Combine SQL, Streaming & Complex Analytics

In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Users can also combine all these capabilities seamlessly in a single workflow.

4. Advanced Analytics

Spark supports not only “map” and “reduce” but also SQL queries, streaming data, machine learning (ML), and graph algorithms.

5. A Unified Engine

6. Runs Everywhere

Spark runs on Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3.

Spark Core

Spark Core provides the basic functionality of Spark, including components for fault recovery, memory management, interacting with storage systems, and more.

Initializing Spark

Before creating a SparkContext, you first need to build a SparkConf object, which contains information about your application. The appName parameter is a name for your application to show on the cluster UI. The master parameter is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode.
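As a rough sketch of this initialization in Python, assuming PySpark is installed: the function name create_context, the application name "QuickOverview" and the master string "local[2]" (local mode with two worker threads) are illustrative choices, not values from this tutorial.

```python
def create_context(app_name="QuickOverview", master="local[2]"):
    """Build a SparkConf, then use it to create a SparkContext."""
    # Imported inside the function so this file still loads on a machine
    # where PySpark is not installed.
    from pyspark import SparkConf, SparkContext

    # appName is the name shown on the cluster UI; master is a cluster URL
    # or a "local" string for local mode.
    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)
```

A typical use would be sc = create_context(), followed by sc.stop() once the application has finished.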

RDD Operations

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
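The difference between the two kinds of operation can be sketched as follows. This is a minimal example that assumes an already created SparkContext is passed in as sc; the helper names is_even and count_even are hypothetical.

```python
def is_even(n):
    """Predicate applied to every element of the RDD."""
    return n % 2 == 0


def count_even(sc, numbers):
    """Count the even values in a local list using an RDD."""
    # Distribute the local list as an RDD.
    rdd = sc.parallelize(numbers)
    # Transformation: filter() only describes a new RDD, no job runs yet.
    evens = rdd.filter(is_even)
    # Action: count() runs the computation and returns an int to the driver.
    return evens.count()
```

Here filter() is the transformation and count() is the action; Spark only starts computing when the action is called.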

RDDs support two types of operations:

1. Transformations

Transformations are operations on RDDs that return a new RDD. Some of the basic, commonly used transformations supported by Spark:

map()

Applies a function to each element in the RDD and returns an RDD of the results.
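A small sketch of map() in Python, again assuming an existing SparkContext is passed in as sc; the helper names square and squares are made up for illustration.

```python
def square(x):
    """Function that map() applies to each element."""
    return x * x


def squares(sc, numbers):
    """Square every value in a local list using an RDD."""
    # Distribute the local list as an RDD.
    rdd = sc.parallelize(numbers)
    # map() is a transformation: it returns a new RDD of the results.
    squared = rdd.map(square)
    # collect() is an action: it brings the results back to the driver as a list.
    return squared.collect()
```

The original RDD is left unchanged; map() always produces a new one.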

This is the start of our Spark tutorial. From next week onwards we will keep working on this topic to make it grow: we will look at how to make the tutorial more useful and keep adding more content to it. If you have any suggestions, feel free to share them with us 🙂 Stay tuned.