Programming with Weave, Part I

November 11, 2013

Terence Yim is a Software Engineer at Cask, responsible for designing and building realtime processing systems on Hadoop/HBase. Prior to Cask, Terence worked at both LinkedIn and Yahoo!, building high performance large scale distributed systems.

In this second blog post of the Weave series, we would like to show you how, with Weave, writing a distributed application can be as simple as writing a standalone Java application.

Writing applications to run on YARN

Before we dive into the details of Weave, let’s talk briefly about what a developer has to do in order to write a YARN application other than standard MapReduce. A YARN application always consists of three parts:

Application Client

Application Client is responsible for submitting the application to YARN.

It can query the application status after the application is submitted.

Application Master (AM)

The AM is the entry point through which the application negotiates with the YARN Resource Manager (RM) for Containers on the cluster.

A YARN Container is an abstract entity representing the resources, such as memory, CPU cores and network bandwidth, available to the process running inside it.

When a Container is assigned to an application, the AM can run commands inside the Container, e.g. start a Java application.

The AM knows when a Container is completed or killed and determines what actions need to be taken when that happens.

The AM is responsible for sending heartbeats to the RM so that YARN stays up to date on the application status.

Application

This is where you put the actual application logic. The Application Master launches this application, potentially as multiple instances, in the YARN Containers it has acquired.

Analogy to Standalone Java Application

It is not difficult to see that a YARN application is very similar to something many Java developers are already familiar with: the standalone Java application.

| YARN | Standalone Java App |
| --- | --- |
| Application Client | java command that provides options and arguments to the application. |
| Application Master | main() method preparing threads for the application. |
| Container Application | Runnable implementation, where each runs in its own thread. |
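To make the analogy concrete, here is a minimal sketch of the standalone side of the comparison. The class names `Greeter` and `StandaloneApp` are made up for illustration. The `java` command that launches the program plays the Application Client, `main()` plays the Application Master, and each `Runnable` instance plays a Container Application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Plays the "Container Application" role: the actual logic, one instance per thread.
class Greeter implements Runnable {
    private final int instanceId;
    private final List<String> output;

    Greeter(int instanceId, List<String> output) {
        this.instanceId = instanceId;
        this.output = output;
    }

    @Override
    public void run() {
        output.add("Hello World from instance " + instanceId);
    }
}

class StandaloneApp {
    // Starts `count` instances, each in its own thread, and waits for all of them.
    static List<String> runInstances(int count) {
        List<String> output = new CopyOnWriteArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(new Greeter(i, output), "greeter-" + i);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for instances", e);
            }
        }
        return output;
    }

    // Plays the "Application Master" role: prepares the threads for the application.
    // The `java` command that passes in `args` plays the "Application Client" role.
    public static void main(String[] args) {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        runInstances(count).forEach(System.out::println);
    }
}
```

Running it with `java StandaloneApp.java 3` starts three instances. The order of the printed lines is not guaranteed, since the threads run concurrently.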

This close resemblance between the two made us wonder: wouldn’t it be great if you could write a distributed application just like you write a standalone Java application? It inspired us to develop Weave, with the goal of enabling everyone to write distributed applications as easily as they write threads.

Programming in Weave

So, how simple is it to develop a distributed application on Weave? Let’s take a classic Hello World example:

As you can see from the example, the effort to make your application run on a YARN cluster is minimal. There are four steps:

Write your application by extending AbstractWeaveRunnable and implementing the run() method.

Create and start a WeaveRunnerService.

Submit your application through the WeaveRunnerService.

Control your application using WeaveController, for example, to wait for completion.
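The shape of these four steps can also be mimicked in-process, which may help when reading them without a cluster at hand. The sketch below is a plain-Java analogy and not Weave code: `LocalRunner` and `LocalController` are hypothetical stand-ins that run the work on a local thread pool, whereas the real WeaveRunnerService and WeaveController launch and control YARN Containers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a controller handle; NOT the Weave API.
class LocalController {
    private final CompletableFuture<Void> completion;

    LocalController(CompletableFuture<Void> completion) {
        this.completion = completion;
    }

    // Blocks until the submitted application has finished.
    void awaitCompletion() {
        completion.join();
    }
}

// Hypothetical stand-in for a runner service; uses local threads, not YARN Containers.
class LocalRunner {
    private ExecutorService executor;

    void start() {
        executor = Executors.newCachedThreadPool();
    }

    LocalController submit(Runnable application) {
        if (executor == null) {
            throw new IllegalStateException("start() must be called before submit()");
        }
        return new LocalController(CompletableFuture.runAsync(application, executor));
    }

    void stop() {
        if (executor != null) {
            executor.shutdown();
        }
    }
}

class HelloWorldSketch {
    // Step 1: the application logic lives in run().
    static class HelloWorldRunnable implements Runnable {
        private final Queue<String> sink;

        HelloWorldRunnable(Queue<String> sink) {
            this.sink = sink;
        }

        @Override
        public void run() {
            sink.add("Hello World");
        }
    }

    static List<String> runOnce() {
        Queue<String> sink = new ConcurrentLinkedQueue<>();
        LocalRunner runner = new LocalRunner();
        runner.start();                                        // Step 2: create and start the runner
        try {
            LocalController controller =
                runner.submit(new HelloWorldRunnable(sink));   // Step 3: submit the application
            controller.awaitCompletion();                      // Step 4: control it (wait for completion)
        } finally {
            runner.stop();
        }
        return new ArrayList<>(sink);
    }

    public static void main(String[] args) {
        runOnce().forEach(System.out::println);
    }
}
```

The point of the analogy is the division of labor: the application class only contains logic, while starting, submitting and waiting are handled through the runner and the controller handle it returns.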

Weave Architecture

If you tried to write the HelloWorld application above using the YARN API directly, it would require hundreds, if not thousands, of lines of code to do the same thing. We found that many YARN applications have very similar needs:

Start, stop, monitor and restart Containers.

Collect logs and metrics from live Containers.

Scale elastically by adding or removing Container instances.

Distribute commands to live Containers to adjust runtime behavior.

Provide a service registry for announcement and discovery.

Generate and renew delegation tokens when running in a secure Hadoop cluster.

Instead of making you code these again and again for every application you write, Weave provides a simple and intuitive API for setting up your application, and a generic Application Master to handle interactions with YARN. This is what the high-level architecture of Weave looks like:

There are several things to highlight in the architecture:

An application can contain one or more WeaveRunnable. Each WeaveRunnable can have multiple live instances.

Each instance of WeaveRunnable will be running in a separate YARN Container, which has dedicated system resources assigned to it.

In the HelloWorld example above, there is only one runnable, but the Weave API allows you to have more.

All logs that are emitted from a WeaveRunnable through the SLF4J API are sent to a Kafka instance that runs inside the Application Master.

In the client API, you can attach a LogHandler to receive logs in real time for further processing. In the HelloWorld example, the logs are simply printed to standard output.
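The handler idea can be illustrated with a small in-memory sketch. `LogChannel` is a hypothetical class written for this illustration and is not part of Weave. In Weave the log lines travel through Kafka in the Application Master before reaching the client, while here they are handed to the attached handlers directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical in-memory log channel; stands in for the Kafka-based transport.
class LogChannel {
    private final List<Consumer<String>> handlers = new CopyOnWriteArrayList<>();

    // Attach a handler; every log line published afterwards is delivered to it.
    void addHandler(Consumer<String> handler) {
        handlers.add(handler);
    }

    // Called by the running application to emit a log line.
    void publish(String line) {
        for (Consumer<String> handler : handlers) {
            handler.accept(line);
        }
    }
}

class LogHandlerSketch {
    static List<String> collect() {
        LogChannel channel = new LogChannel();
        List<String> collected = new ArrayList<>();
        channel.addHandler(System.out::println); // print to standard output
        channel.addHandler(collected::add);      // keep the lines for further processing
        channel.publish("Hello World");
        return collected;
    }

    public static void main(String[] args) {
        collect();
    }
}
```

Attaching more than one handler shows why a callback is convenient here: printing and further processing can happen side by side without the application knowing about either.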