Introduction to Apache Spark


Apache Spark is a fast and general-purpose cluster computing system. The latest version can be downloaded from http://spark.apache.org/downloads.html. In this post, we will perform some basic data manipulation using Spark and Python.

Using the Spark Python Shell

The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster. When you use the Python shell, a context variable named “sc” is created automatically.
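In a standalone script, you create the context yourself. Here is a minimal sketch; the application name and master URL below are illustrative:

    from pyspark import SparkConf, SparkContext

    # "local[2]" runs Spark locally with two worker threads; on a real
    # cluster you would point this at the cluster's master URL instead.
    conf = SparkConf().setAppName("IntroToSpark").setMaster("local[2]")
    sc = SparkContext(conf=conf)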

To access a Python Spark shell, you can run the following command inside your Spark directory:
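    ./bin/pyspark

The shell starts with the sc variable already bound to a SparkContext, so you can begin creating RDDs right away.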

Resilient Distributed Datasets (RDDs)
Resilient distributed datasets (RDDs) are fault-tolerant collections of elements that can be operated on in parallel. The easiest way to create an in-memory RDD is by calling the parallelize function, as follows:
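A minimal sketch; the word list and the filter condition are illustrative:

    # Distribute a local Python list across the cluster as an RDD.
    words = sc.parallelize(["ham", "spam", "cheese", "spam", "bread", "eggs"])

    # Transformations such as filter() and distinct() are lazy;
    # collect() triggers the computation and returns the results.
    palatable = words.filter(lambda w: w != "spam").distinct()
    print(palatable.collect())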

As you can see, filtering out the unwanted entries leaves us with a more palatable set of words.

A more complex example

We will write a Python script named “weather.py” to analyze a file from the US National Weather Service. To get the initial data, download the compressed file http://cdo.ncdc.noaa.gov/qclcd_ascii/QCLCD200705.zip and extract “200705hourly.txt” from it. That file contains hourly weather observations for May 2007. We are going to aggregate that data for the whole month using RDDs.
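Here is a minimal sketch of what weather.py could look like. The field positions (station id in column 0, dry-bulb temperature in column 8) are assumptions about the file layout; check the file's header row and adjust them accordingly:

    from pyspark import SparkContext

    sc = SparkContext(appName="WeatherAnalysis")

    # Load the hourly observations file (comma-separated text).
    lines = sc.textFile("200705hourly.txt")

    # Skip the header row and split each record into fields.
    header = lines.first()
    records = lines.filter(lambda line: line != header) \
                   .map(lambda line: line.split(","))

    # Assumed layout: field 0 is the station (WBAN) id and field 8 is
    # the dry-bulb temperature; drop records without a numeric reading.
    def parse(fields):
        try:
            return (fields[0], (float(fields[8]), 1))
        except (ValueError, IndexError):
            return None

    temps = records.map(parse).filter(lambda x: x is not None)

    # Sum readings and counts per station, then derive the monthly average.
    totals = temps.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    averages = totals.mapValues(lambda s: s[0] / s[1])

    for station, avg in averages.take(10):
        print("%s: %.1f" % (station, avg))

    sc.stop()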

Spark is a great tool and, as we've now seen, quite easy to get started with. It also has many more features that we won't cover in this post, such as machine learning algorithms, cluster deployment, streaming, and graph analysis. All of these features can be accessed programmatically not only from Python, but also from Java and Scala if you're more familiar with those.