Heron has been Open Sourced, woo! Heron is Twitter’s distributed stream computation system for running Storm compatible topologies in production.

A Heron topology is a directed acyclic graph used to process streams of data. Heron topologies consist of three basic components: spouts and bolts, which are connected via streams of tuples. Below is a visual illustration of a simple topology:

Spouts are responsible for emitting tuples into the topology, while bolts are responsible for processing those tuples. In the diagram above, spout S1 feeds tuples to bolts B1 and B2 for processing; in turn, bolt B1 feeds processed tuples to bolts B3 and B4, while bolt B2 feeds processed tuples to bolt B4.

This is just a simple example; you can use bolts and spouts to form arbitrarily complex topologies.

Topology Lifecycle

Once you’ve set up a Heron cluster, you can use Heron’s CLI tool to manage the entire lifecycle of a topology, which typically goes through the following stages:

submit the topology to the cluster. The topology is not yet processing streams but is ready to be activated.

activate the topology. The topology will begin processing streams in accordance with the topology architecture that you’ve created.

restart an active topology if, for example, you need update the topology configuration.

deactivate the topology. Once deactivated, the topology will stop processing but remain running in the cluster.

kill a topology to completely remove it from the cluster. It is no longer known to the Heron cluster and can no longer be activated. Once killed, the only way to run that topology is to re-submit it.

Spouts

A Heron spout is a source of streams, responsible for emitting tuples into the topology. A spout may, for example, read data from a Kestrel queue or read tweets from the Twitter API and emit tuples to one or more bolts. Information on building spouts can be found in Building Spouts.

Bolts

A Heron bolt consumes streams of tuples emitted by spouts and performs some set of user-defined processing operations on those tuples, which may include performing complex stream transformations, performing storage operations, aggregating multiple streams into one, emitting tuples to other bolts within the topology, and much more. Information on building bolts can be found in Building Bolts.

Data Model

Heron has a fundamentally tuple-driven data model. You can find more information in Heron’s Data Model.

Logical Plan

Physical Plan

A topology’s physical plan is related to its logical plan but with the crucial difference that a physical plan maps the actual execution logic of a topology, including the machines running each spout or bolt and more. Here’s a rough visual representation of a physical plan:

At this point you can navigate to the Mesos UI http://192.168.3.5:5050/#/ and see things working. Every topology is a Mesos Framework.

Tasks on the cluster

Start Heron Tracker

The Heron Tracker is a web service that continuously gathers information about your Heron cluster. You can launch the tracker by running the command (which is already installed if you started with your getting your toes wet, else ). You need to update the statemanager for your heron_tracker.yaml like so.

The Heron UI provides a lot of information about a topology or a part of it quickly, thus reducing debugging time considerably. Some of these features are listed below. A complete set of features can be found in following sections.

See logical plan of a topology

See physical plan of a topology

Configs of a topology

See some basic metrics for each of the instances and components

Links to get logs, memory histogram, jstack, heapdump, and exceptions of a particular instance

Now onto the Apache Kafka Spout & Bolt & Example topology which mirrors data from one topic to another (foo to bar or whatever set as the params above). Here is how to build and work on the topology yourself.

How Is Heron Better Than Storm?

The provisioning of resources is abstracted from the duties of the cluster manager so that Heron can play nice with the rest of the shared infrastructure.

Each Heron Instance executes only a single task so is easy to debug.

The design makes transparent which component of the topology is failing or slowing down as the metrics collection is granular and can easily map an issue to a specific process in the system.

Heron allows a topology writer to specify exactly the resources for each component, avoiding over-provisioning.

Having a Topology Manager per topology enables them to be managed independently and failure of one topology does not affect the others.

The backpressure mechanism gives a consistent rate of delivering results and a way to reason about the system. It is also a key mechanism that supports migration of topologies from one set of containers to another.

More good stuff like this from the Heron paper if you haven’t already read it.

There are also more ways to run Heron besides Native Mesos Framework. Heron supports deployment on Apache Auroraout of the box as well as for Slurm too. Apache REEF and others are in the works and looking forward to the future of topology driven as a service solutions for the enterprise.