Developing and Running a Spark WordCount Application

This tutorial describes how to write, compile, and run a simple Spark word count application in two of the languages supported by Spark: Scala and Python. The Scala code was originally developed for a Cloudera tutorial written by Sandy Ryza.

Writing the Application

The example application is an enhanced version of WordCount, the canonical MapReduce example. In this
version of WordCount, the goal is to learn the distribution of letters in the most popular words in a corpus. The application:

Creates a SparkConf and SparkContext. A Spark application corresponds to an instance of the SparkContext class. When running a shell, the SparkContext is created for you.

Gets a word frequency threshold.

Reads an input set of text documents.

Counts the number of times each word appears.

Filters out all words that appear fewer times than the threshold.

For the remaining words, counts the number of times each letter occurs.
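The steps above can be sketched in plain Python, without a cluster, to show the data flow; the threshold value and input text below are hypothetical stand-ins for the command-line argument and the HDFS input:

```python
from collections import Counter

def letter_counts(text, threshold):
    """Count letters in words that appear at least `threshold` times."""
    # Count the number of times each word appears.
    word_counts = Counter(text.split())
    # Filter out all words that appear fewer times than the threshold.
    popular = [w for w, n in word_counts.items() if n >= threshold]
    # For the remaining words, count the number of times each letter occurs.
    letters = Counter()
    for word in popular:
        letters.update(word)
    return letters

counts = letter_counts("spark spark hadoop spark hadoop flink", 2)
# counts["a"] == 2: "a" occurs once in "spark" and once in "hadoop";
# "flink" appears only once, so its letters are excluded.
```

In the Spark versions, the same steps map onto RDD transformations (flatMap, map, reduceByKey, and filter) applied across the cluster.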

In MapReduce, this would require two chained MapReduce jobs, with the intermediate data persisted to HDFS between them. The Spark version requires about 90 percent fewer lines of code than an equivalent application developed with the MapReduce API.

Compiling and Packaging Scala Applications

The tutorial uses Maven to compile and package the programs. Excerpts of the tutorial pom.xml are included below. For best practices using Maven to build Spark applications, see
Building Spark Applications.
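The full pom.xml is not reproduced here. As a sketch, the dependency section of such a project typically declares Spark with provided scope, since the cluster supplies Spark at runtime; the artifact coordinates are the standard Spark ones, and the version shown is illustrative only:

```xml
<!-- Illustrative excerpt: Spark is supplied by the cluster at runtime,
     so the dependency is marked with "provided" scope. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.0</version>
  <scope>provided</scope>
</dependency>
```

The jar-with-dependencies suffix on the packaged jar indicates that the build also uses the Maven assembly plugin to bundle the application's non-provided dependencies.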

In the project directory, run mvn package to create sparkwordcount-1.0-SNAPSHOT-jar-with-dependencies.jar in the target directory.

Running the Application

The input to the application is a large text file in which each line contains all the words in a document, stripped of punctuation. Put an input file in a directory on HDFS. You can use the tutorial's example input file.
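As an illustration, the commands below stage a local file on HDFS and submit the packaged Scala application; the paths, file names, class name, and the threshold value of 2 are all hypothetical:

```shell
# Put the input file into an HDFS directory (paths are hypothetical).
hdfs dfs -mkdir -p /user/hdfs/input
hdfs dfs -put inputfile.txt /user/hdfs/input

# Submit the application; the arguments are the input path and the
# word frequency threshold (the class name is hypothetical).
spark-submit --class SparkWordCount --master yarn \
  target/sparkwordcount-1.0-SNAPSHOT-jar-with-dependencies.jar \
  /user/hdfs/input/inputfile.txt 2
```

The Python version is submitted the same way with spark-submit, passing the .py file in place of the --class option and jar.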

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.