Write MapReduce Jobs in Idiomatic Clojure with Parkour

Thanks to Marshall Bockrath-Vandegrift of Damballa, an advanced threat detection company (and CDH user), for the following post about his Parkour project, a library for writing MapReduce jobs in Clojure. Parkour has been tested (but is not supported) on CDH 3 and CDH 4.

Clojure is a Lisp-family functional programming language which targets the JVM. On the Damballa R&D team, Clojure has become the language of choice for implementing everything from web services to machine learning systems. One of Clojure’s key features for us is that it was designed from the start as an explicitly hosted language, building on rather than replacing the semantics of its underlying platform. Clojure’s mapping from language features to JVM implementation is frequently simpler and clearer even than Java’s.
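A small interop sketch (not from Parkour, just plain Clojure) shows what "hosted" means in practice: Java objects and methods are used directly, with no wrapper layer.

```clojure
;; Clojure strings are java.lang.String; Java methods are invoked
;; directly with (.method target args...) syntax:
(.toUpperCase "hadoop")
;;=> "HADOOP"

;; Clojure higher-order functions compose naturally with Java methods:
(filter #(.startsWith % "map") ["mapper" "reducer" "combiner"])
;;=> ("mapper")
```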

Parkour is our new Clojure library that carries this philosophy to Apache Hadoop’s MapReduce platform. Instead of hiding the underlying MapReduce model behind new framework abstractions, Parkour exposes that model with a clear, direct interface. Everything possible in raw Java MapReduce is possible with Parkour, but usually with a fraction of the code.

Example

Every new MapReduce library needs a word-count example, so let’s walk through Parkour’s.

    (ns parkour.examples.word-count
      (:require [clojure.string :as str]
                [parkour (conf :as conf) (mapreduce :as mr) (graph :as pg)
                         (tool :as tool)]
                [parkour.io (text :as text) (seqf :as seqf)])
      (:import [org.apache.hadoop.io Text LongWritable]))

Parkour is designed as a collection of layered APIs in separate namespaces, not an all-or-nothing framework. If you want to use Parkour’s core Clojure-MapReduce integration, but build the actual jobs from Java, Parkour provides the necessary flexibility.

    (defn mapper
      [input]
      (->> (mr/vals input)
           (mapcat #(str/split % #"\s+"))
           (map #(-> [% 1]))))

    (defn reducer
      [input]
      (->> (mr/keyvalgroups input)
           (map (fn [[word counts]]
                  [word (reduce + 0 counts)]))))

Parkour mappers and reducers look like Clojure collection functions because they are Clojure collection functions. Parkour treats the entire set of key-value tuples allocated to a task as a literal collection of those tuples. You write Parkour task functions in terms of Clojure’s rich library of lazy sequence and reducers operations, not just an API that looks like them.
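Because the task input is an ordinary reducible collection, the same mapper could just as well be expressed with clojure.core.reducers instead of lazy seqs. A hypothetical sketch (mapper-r is our name, not part of the example):

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; The word-count mapper re-expressed with reducers; `input` stands in
;; for the task's collection of key-value tuples.
(defn mapper-r
  [input]
  (->> (mr/vals input)                   ;; values only, as before
       (r/mapcat #(str/split % #"\s+"))  ;; split lines into words
       (r/map #(-> [% 1]))))             ;; emit [word 1] pairs
```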

Parkour mappers and reducers also are their associated Hadoop tasks. They have direct access to the job configuration, context, counters, and so on, and can do anything you could do in a raw Java MapReduce task. The provided task functions run directly in place of the equivalent Hadoop Java classes.

The Parkour APIs do require you to use named Clojure vars to specify all functions Hadoop invokes during job execution. Vars are the moral equivalent of Java’s named classes, and make explicit the boundary between local and remote execution.
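A var is named by namespace and symbol, which is what lets it cross that boundary. A hedged sketch of the distinction, reusing the job-graph calls from the example below:

```clojure
;; A var has a stable, fully qualified name, so it can be recorded in
;; the job configuration and re-resolved by remote task JVMs:
(-> (pg/input dseq)
    (pg/map #'mapper))   ;; names #'parkour.examples.word-count/mapper

;; An anonymous fn has no such name, so it cannot be shipped this way:
;; (-> (pg/input dseq)
;;     (pg/map (fn [input] ...)))   ;; not supported
```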

    (defn word-count
      [conf dseq dsink]
      (-> (pg/input dseq)
          (pg/map #'mapper)
          (pg/partition [Text LongWritable])
          (pg/combine #'reducer)
          (pg/reduce #'reducer)
          (pg/output dsink)
          (pg/execute conf "word-count")
          first))

Parkour recasts job configuration in terms of “configuration step” functions over Hadoop Job objects. These are equivalent to – and frequently invoke – standard job-setup methods like setMapperClass(). This abstraction allows any job to be specified as a simple composition of configuration steps. More importantly, because functions are first class, they may be passed in to job setup. This inverts the control pattern usually exposed by Java MapReduce job driver methods, allowing callers to inject arbitrarily complex portions of the job configuration.
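A configuration step is just a plain function of a Job. As an illustrative sketch (the function name and the Hadoop property are our assumptions; the property name varies across Hadoop versions), a caller-supplied step enabling output compression might look like:

```clojure
(import '[org.apache.hadoop.mapreduce Job])

;; Hypothetical configuration step: any function over a Job object.
;; "mapreduce.output.fileoutputformat.compress" is the Hadoop 2-era
;; property name; older MR1 releases use "mapred.output.compress".
(defn compress-output
  [^Job job]
  (.setBoolean (.getConfiguration job)
               "mapreduce.output.fileoutputformat.compress" true))
```

Because steps like this are first-class functions, a caller can hand one to job setup and have it composed with the input, output, and task-class steps the library applies itself.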

The Parkour job graph API provides helpers for adding all the commonly necessary steps in the right order, while leaving the freedom to add arbitrary additional steps.

    (defn tool
      [conf & args]
      (let [[outpath & inpaths] args
            input (apply text/dseq inpaths)
            output (seqf/dsink [Text LongWritable] outpath)]
        (->> (word-count conf input output) (into {}) prn)))

Parkour distributed sinks (dsinks) and distributed sequences (dseqs) are extensions of the “configuration step” concept. A distributed sink marries a function configuring a job for particular output with a function configuring a job to consume that output as an input. A distributed sequence wraps any function configuring a job for some input, and additionally allows local access to the same key-value tuples produced remotely by the backing InputFormat.

Job execution returns dseqs for the job output dsinks. Those dseqs may be passed as inputs to additional jobs or processed client-side as reducible collections. This combination of clear local/remote demarcation with seamless composition simplifies many programs involving multiple MapReduce jobs and/or local processing.
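For instance, a job’s output dseq can be post-processed locally without another MapReduce round trip. A sketch, where wc-dseq stands in for the dseq returned by the word-count job above:

```clojure
;; Realize the [word count] tuples locally and report the ten most
;; frequent words -- client-side processing of remote job output.
(->> wc-dseq
     (into {})         ;; reduce the dseq into a local map
     (sort-by val >)   ;; most frequent words first
     (take 10))
```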

    (defn -main
      [& args] (System/exit (tool/run tool args)))

Parkour contains a number of utility namespaces that integrate non-MapReduce Hadoop facilities with Clojure. As shown here, the parkour.tool namespace supports using plain Clojure functions as Hadoop Tools for command-line option parsing. Similarly, the parkour.fs namespace allows the Clojure standard I/O functions to work directly on Hadoop Paths, and the parkour.conf namespace provides a Clojure map-like API for working with Hadoop Configurations.
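A hedged sketch of what that integration looks like in use (the exact accessor names here – fs/path, conf/ig, conf/assoc! – are assumptions about the utility APIs, and the HDFS path is made up):

```clojure
(require '[clojure.java.io :as io]
         '[parkour.fs :as fs]
         '[parkour.conf :as conf])

;; parkour.fs (sketch): Hadoop Paths used with standard Clojure I/O.
(with-open [rdr (io/reader (fs/path "hdfs:///data/input.txt"))]
  (doall (line-seq rdr)))

;; parkour.conf (sketch): a Configuration handled like a Clojure map.
(-> (conf/ig)
    (conf/assoc! "mapreduce.job.reduces" 4))
```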

Next Steps

If you like what you see here, Parkour has detailed documentation, example programs, and Apache-licensed source code, all hosted on GitHub. Check it out and get started simplifying your MapReduce programs with Clojure!

Marshall Bockrath-Vandegrift is principal engineer for the Damballa R&D team, where he works to move new botnet-detection research from proof-of-concept to production. He lives in Atlanta with his wife and cats.