Building and running Spark applications on IBM Open Platform with Apache Hadoop

Apache Spark is one of the most interesting and talked-about projects in today’s open source community. For all the attention it gets, it is still a long way from being a “user friendly” tool, especially when it comes to building applications.

When building your own standalone Java application, you would typically use something like Apache Ant, or the built-in tools of your IDE, to generate the required JAR file or any other artifact you need. Both approaches are covered in detail in many books and in online documentation. Spark applications, on the other hand, are typically built with a tool called SBT, which is short for “Simple Build Tool”. This build tool is mainly used within the Scala ecosystem, and in some cases it can become quite the opposite of “simple”. More about the tool can be found here: http://www.scala-sbt.org/.

When I started working with Apache Spark, I had some major issues with SBT and its integration with Spark. To help others avoid the same problems, I decided to post this hands-on tutorial.

With SBT and Spark, one can build either a Spark standalone application or an application that runs on a YARN-enabled commercial Hadoop distribution. Most of us will probably not configure Hadoop from scratch and will use a commercial distribution instead. For this purpose I used IBM BigInsights 4.0 Quick Start Edition (now called IBM IOP for Hadoop).

So, the tutorial contains two parts:
Part 1: building your first Spark standalone application
Part 2: building your first Spark application on IBM IOP for Hadoop (BigInsights), which is YARN-enabled

Part 1 – Building your first Spark standalone application

Step 1: install SBT on the target machine

Step 2: code the simple program
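
As a minimal sketch of what such a simple program could look like, the snippet below writes a small Scala source file that counts the lines containing "a" and "b" in the input file from Step 4. The file name SimpleApp.scala, the object name SimpleApp and the counting logic are assumptions modelled on the kind of example found in the Spark quick-start guide, not necessarily the program used in this tutorial.

```shell
#!/bin/sh
# Write a small Spark program to SimpleApp.scala in the current directory.
# Assumption: the file name, object name and logic are illustrative only.
cat > SimpleApp.scala <<'EOF'
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Input file created in Step 4 of this tutorial
    val inputFile = "/home/Spark/input.txt"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Load the file once and cache it, since it is scanned twice below
    val lines = sc.textFile(inputFile).cache()
    val numAs = lines.filter(line => line.contains("a")).count()
    val numBs = lines.filter(line => line.contains("b")).count()

    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()
  }
}
EOF
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding the `$numAs` and `$numBs` references inside the Scala string.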

Step 3: copy the file to the SBT-enabled system

Step 4: create the input text file at /home/Spark/input.txt
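
Any small text file will do as input. The helper below is a sketch that writes three placeholder lines to a path of your choice; the function name create_input_file and the sample text are assumptions made for illustration.

```shell
#!/bin/sh
# Write a few sample lines to the given path, creating the parent
# directory if needed. The sample text is a placeholder.
create_input_file() {
    target="$1"
    mkdir -p "$(dirname "$target")"
    cat > "$target" <<'EOF'
apache spark is a fast engine for big data
sbt builds scala applications
hadoop and yarn manage the cluster
EOF
}

# Path used in this tutorial; uncomment to create the file there.
# create_input_file /home/Spark/input.txt
```

Writing to /home/Spark may require appropriate permissions on the target machine, which is why the call is left commented out here.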

Step 5: create and edit the simple.sbt file
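
The build definition tells SBT the project name, the Scala version, and which Spark library to compile against. The sketch below writes a minimal simple.sbt; the project name and the version numbers (Scala 2.10.4, spark-core 1.2.1) are assumptions and should be changed to match the Spark and Scala versions installed on your system.

```shell
#!/bin/sh
# Write a minimal build definition to simple.sbt in the current directory.
# Assumption: the versions below are examples; align them with your cluster.
cat > simple.sbt <<'EOF'
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"
EOF
```

The blank lines between settings are deliberate, since older SBT releases require them in .sbt files. The `%%` operator makes SBT append the Scala binary version to the artifact name, so the Scala version and the Spark artifact have to be compatible.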

Step 6: create mkDirStructure.sh to automate the directory creation
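
SBT expects Scala sources under src/main/scala, relative to the directory that holds the .sbt file. A script along the following lines could set that up; this is a sketch under the assumption that the script only needs to create that folder and move any Scala sources from the project root into it, which may differ from the script actually used here.

```shell
#!/bin/sh
# mkDirStructure.sh: create the directory layout SBT expects for Scala sources.
# Usage: ./mkDirStructure.sh [project_dir]   (defaults to the current directory)

PROJECT_DIR="${1:-.}"

# SBT looks for Scala sources under src/main/scala by default.
mkdir -p "$PROJECT_DIR/src/main/scala"

# Assumption: any .scala files sitting in the project root belong in that folder.
for f in "$PROJECT_DIR"/*.scala; do
    [ -e "$f" ] || continue   # the glob matched nothing
    mv "$f" "$PROJECT_DIR/src/main/scala/"
done

echo "Created $PROJECT_DIR/src/main/scala"
```

After the script runs, the project root holds the .sbt file and the sources sit under src/main/scala, so running `sbt package` from the project root should be able to find them and build the JAR.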