The Heron Streamlet API

Create Heron topologies using a simplified interface inspired by functional programming

The Heron Streamlet API is in beta

The Heron Streamlet API is well tested and can be used to build and test topologies locally. The API is not yet fully stable, however, and breaking changes are likely in the coming weeks.

When it was first released, Heron offered a Topology API, heavily indebted to the Storm API, for developing topology logic. In the original Topology API, developers creating topologies were required to explicitly implement spout and bolt classes and to specify the connections between those spouts and bolts.

Problems with the Topology API

Although the Storm-inspired API provided a powerful low-level interface for creating topologies, the spouts-and-bolts model also presented a variety of drawbacks for Heron developers:

| Drawback | Description |
|---|---|
| Verbosity | In the original Topology API for both Java and Python, creating spouts and bolts required substantial boilerplate and forced developers to both provide implementations for spout and bolt classes and also to specify the connections between those spouts and bolts. |
| Difficult debugging | When spouts, bolts, and the connections between them need to be created “by hand,” it can be challenging to trace the origin of problems in the topology’s processing chain. |
| Tuple-based data model | In the older topology API, spouts and bolts passed tuples and nothing but tuples within topologies. Although tuples are a powerful and flexible data type, the topology API forced all spouts and bolts to implement their own serialization/deserialization logic. |

Advantages of the Streamlet API

In contrast with the Topology API, the Heron Streamlet API offers:

| Advantage | Description |
|---|---|
| Boilerplate-free code | Instead of needing to implement spout and bolt classes over and over again, the Heron Streamlet API enables you to create stream processing logic out of functions, such as map, flatMap, join, and filter functions. |
| Easy debugging | With the Heron Streamlet API, you don’t have to worry about spouts and bolts, which means that you can more easily surface problems with your processing logic. |
| Completely flexible, type-safe data model | Instead of requiring that all processing components pass tuples to one another (which implicitly requires serialization to and deserialization from your application-specific types), the Heron Streamlet API enables you to write your processing logic in accordance with whatever types you’d like, including tuples if you wish. |

In the Streamlet API for Java, all streamlets are typed (e.g. Streamlet<MyApplicationType>), which means that type errors can be caught at compile time rather than at runtime.

Streamlet API topology model

Instead of spouts and bolts, as with the Topology API, the Streamlet API enables you to create processing graphs that are then automatically converted to spouts and bolts under the hood. Processing graphs consist of the following components:

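Sources supply the graph with data inputs, which can come from pub-sub messaging systems, random generators, static files, and more.

Operators supply the graph’s processing logic, operating on data passed into the graph by sources.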

Sinks are the terminal endpoints of the processing graph, determining what the graph does with the processed data. Sinks can involve storing data in a database, logging results to stdout, publishing messages to a topic in a pub-sub messaging system, and much more.

Streamlets

The core construct underlying the Heron Streamlet API is that of the streamlet. A streamlet is an unbounded, ordered collection of elements of some data type (streamlets can consist of simple types like integers and strings or more complex, application-specific data types).

Source streamlets supply a Heron processing graph with data inputs. These inputs can come from a wide variety of sources, such as pub-sub messaging systems like Apache Kafka and Apache Pulsar (incubating), random generators, or static files like CSV or Apache Parquet files.

Source streamlets can then be manipulated in a wide variety of ways, for example with map, flatMap, filter, and join operations.

In this diagram, the source streamlet is produced by a random generator that continuously emits random integers between 1 and 100. From there:

A filter operation is applied to the source streamlet that filters out all values less than or equal to 30

A new streamlet is produced by the filter operation (with the Heron Streamlet API, you’re always transforming streamlets into other streamlets)

A map operation adds 15 to each item in the streamlet, which produces the final streamlet in our graph. We could hypothetically go much further and add as many transformation steps to the graph as we’d like.

Once the final desired streamlet is created, each item in the streamlet is sent to a sink. Sinks are where items leave the processing graph.
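As a rough illustration of the filter and map steps just described, here is a minimal sketch that uses the standard `java.util.stream` API as a stand-in for streamlets. It is not Heron code: the class name `GraphSketch`, the method `process`, and the bounded input list are assumptions made so that the example runs on its own, whereas a real streamlet is unbounded.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Stand-in for the processing graph described above, built on
// java.util.stream rather than on the Heron Streamlet API.
final class GraphSketch {
    // Applies the two operators from the graph to a finite batch of items.
    static List<Integer> process(List<Integer> source) {
        return source.stream()
                .filter(i -> i > 30)   // drop values less than or equal to 30
                .map(i -> i + 15)      // add 15 to every remaining item
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Source: random integers between 1 and 100 (bounded to 10 items here).
        List<Integer> source = Stream
                .generate(() -> ThreadLocalRandom.current().nextInt(1, 101))
                .limit(10)
                .collect(Collectors.toList());
        // Sink: print each processed item to stdout.
        process(source).forEach(System.out::println);
    }
}
```

Every value that survives the filter is at least 31, so every value reaching the sink is at least 46.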

Supported languages

The Heron Streamlet API and topologies

With the Heron Streamlet API you still create topologies, but only implicitly. Heron automatically performs the heavy lifting of converting the streamlet-based processing logic that you create into spouts and bolts and, from there, into containers that are then deployed using whichever scheduler your Heron cluster relies upon.

From the standpoint of both operators and developers managing topologies’ lifecycles, the resulting topologies are equivalent. From a development workflow standpoint, however, the difference is profound. You can think of the Streamlet API as a highly convenient tool for creating spouts, bolts, and the logic that connects them.

The basic workflow looks like this:

When creating topologies using the Heron Streamlet API, you simply write code (example below) in a highly functional style. From there:

that code is automatically converted into spouts, bolts, and the necessary connective logic between spouts and bolts

the spouts and bolts are automatically converted into a logical plan that specifies how the spouts and bolts are connected to each other

the logical plan is automatically converted into a physical plan that determines how the spout and bolt instances (the colored boxes above) are distributed across the specified number of containers (in this case two)

With a physical plan in place, the Streamlet API topology can be submitted to a Heron cluster.

Java processing graph example

The code below shows how you could implement the processing graph shown above in Java:

As you can see, the Java code for the example streamlet processing graph requires very little boilerplate and is heavily indebted to Java 8 lambda patterns.

Streamlet operations

In the Heron Streamlet API, processing data means transforming streamlets into other streamlets. This can be done using a wide variety of available operations, including many that you may be familiar with from functional programming.

In this example, a supplier streamlet emits an indefinite series of 1s. The map operation then adds 12 to each incoming element, producing a streamlet of 13s. The effect of this operation is to transform the Streamlet<Integer> into a Streamlet<Integer> with different values (map operations can also convert streamlets into streamlets of a different type).
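The behavior described in this paragraph can be approximated with the standard library. The sketch below uses `java.util.stream.Stream` as a stand-in for a streamlet (it is not the Heron API), and `firstMapped` and `firstMappedAsText` are hypothetical helpers that truncate the otherwise endless stream so that the program terminates.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class MapSketch {
    // Emits an endless series of 1s, adds 12 to each, and keeps the first n results.
    static List<Integer> firstMapped(int n) {
        return Stream.generate(() -> 1)
                .map(i -> i + 12)
                .limit(n)
                .collect(Collectors.toList());
    }

    // A map step can also change the element type, here from Integer to String.
    static List<String> firstMappedAsText(int n) {
        return Stream.generate(() -> 1)
                .map(i -> "value-" + (i + 12))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstMapped(3));        // prints [13, 13, 13]
        System.out.println(firstMappedAsText(2));  // prints [value-13, value-13]
    }
}
```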

FlatMap operations

FlatMap operations are like map operations but with the important difference that each element of the streamlet is “flattened” into a collection type. In the Java example below, a supplier streamlet emits the same sentence over and over again; the flatMap operation transforms each sentence into a Java List of individual words.

Java example

```java
Streamlet<String> sentences = builder.newSource(() -> "I have nothing to declare but my genius");
Streamlet<List<String>> words = sentences
    .flatMap((sentence) -> Arrays.asList(sentence.split("\\s+")));
```

The effect of this operation is to transform the Streamlet<String> into a Streamlet<List<String>> containing each word emitted by the source streamlet.

Filter operations

Filter operations retain some elements in a streamlet and exclude other elements on the basis of a provided filtering function.

Java example

Here, one streamlet is an endless series of “ooh”s while the other is an endless series of “aah”s. The union operation combines them into a single streamlet of alternating “ooh”s and “aah”s.
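To make the combining behavior concrete, here is a minimal sketch built on standard Java streams rather than on the Heron API. The helper `interleave` is hypothetical, and the strict alternation it enforces is an assumption of this sketch that mirrors the description above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

final class UnionSketch {
    // Takes elements alternately from two (possibly endless) streams until
    // `limit` elements have been collected or both streams are exhausted.
    static <T> List<T> interleave(Stream<T> first, Stream<T> second, int limit) {
        Iterator<T> a = first.iterator();
        Iterator<T> b = second.iterator();
        List<T> out = new ArrayList<>();
        while (out.size() < limit && (a.hasNext() || b.hasNext())) {
            if (a.hasNext()) {
                out.add(a.next());
            }
            if (out.size() < limit && b.hasNext()) {
                out.add(b.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Stream<String> oohs = Stream.generate(() -> "ooh");
        Stream<String> aahs = Stream.generate(() -> "aah");
        System.out.println(interleave(oohs, aahs, 4)); // prints [ooh, aah, ooh, aah]
    }
}
```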

Clone operations

Clone operations enable you to create any number of “copies” of a streamlet. Each of the “copy” streamlets contains all the elements of the original and can be manipulated just like the original streamlet.

Transform operations

Transform operations are suited to operations that don’t neatly fit into the other categories or into lambda-based logic.

Transform operations require you to implement three different methods:

A setup method that enables you to pass a context object to the operation and to specify what happens prior to the transform step

A transform method that performs the desired transformation

A cleanup method that allows you to specify what happens after the transform step

The context object available to a transform operation provides access to:

the current state of the topology

the topology’s configuration

the name of the stream

the stream partition

the current task ID

Here’s a Java example of a transform operation in a topology where a stateful record is kept of the number of items processed:

```java
import com.twitter.heron.streamlet.Context;
import com.twitter.heron.streamlet.SerializableTransformer;
import java.util.function.Consumer;

public class CountNumberOfItems implements SerializableTransformer<String, String> {
    private int numberOfItems;

    // Runs before the transform step: read the stored count, then save count + 1
    public void setup(Context context) {
        numberOfItems = (int) context.getState().get("number-of-items");
        context.getState().put("number-of-items", numberOfItems + 1);
    }

    public void transform(String in, Consumer<String> consumer) {
        // Apply some operation to the incoming value (upper-casing is only a placeholder)
        String transformedString = in.toUpperCase();
        consumer.accept(transformedString);
    }

    public void cleanup() {
        System.out.println(
                String.format("Successfully processed new state: %d", numberOfItems));
    }
}
```

This operation does a few things:

In the setup method, the Context object is used to access the current state (which has the semantics of a Java Map). The current number of items processed is incremented by one and then saved as the new state.

In the transform method, the incoming string is transformed in some way and then “accepted” as the new value.

In the cleanup step, the current count of items processed is logged.

Here’s that operation within the context of a streamlet processing graph:

```java
builder.newSource(() -> "Some string over and over")
    .transform(new CountNumberOfItems())
    .log();
```

a reduce function that produces a single value for each key in the streamlet

Reduce by key and window operations produce a new streamlet of key-value window objects (which include a key-value pair including the extracted key and calculated value, as well as information about the window in which the operation took place).
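The following standalone sketch illustrates the idea of reducing per key inside a window. It relies only on the Java standard library, not on the Heron API, and makes two simplifying assumptions: windows are tumbling and count-based, and each window’s result is represented as a plain `Map` rather than as key-value window objects. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class ReduceByKeyAndWindowSketch {
    // Splits `items` into tumbling windows of `windowSize` elements and, inside
    // each window, folds the values that share a key using `reduceFn`.
    static <T, K, V> List<Map<K, V>> reduce(
            List<T> items,
            int windowSize,
            Function<T, K> keyExtractor,
            Function<T, V> valueExtractor,
            BinaryOperator<V> reduceFn) {
        List<Map<K, V>> windows = new ArrayList<>();
        for (int start = 0; start < items.size(); start += windowSize) {
            int end = Math.min(start + windowSize, items.size());
            Map<K, V> perKey = new LinkedHashMap<>();
            for (T item : items.subList(start, end)) {
                perKey.merge(keyExtractor.apply(item), valueExtractor.apply(item), reduceFn);
            }
            windows.add(perKey);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "b", "b", "c");
        // Count occurrences of each word per window of three elements.
        System.out.println(reduce(words, 3, w -> w, w -> 1, Integer::sum));
        // prints [{a=2, b=1}, {b=2, c=1}]
    }
}
```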

All key-values across both the left and right stream, regardless of whether or not any given element has a matching key in the other stream

Inner joins

Inner joins operate over the Cartesian product of the left stream and the right stream, that is, over the set of all ordered pairs between the two streams. Imagine this set of key-value pairs accumulated within a time window:

| Left streamlet | Right streamlet |
|---|---|
| (“player1”, 4) | (“player1”, 10) |
| (“player1”, 5) | (“player1”, 12) |
| (“player1”, 17) | (“player2”, 27) |

An inner join operation would apply the join function to every pair of key-values with matching keys (3 × 2 = 6 pairs in total), drawing on this set of key-values:

| Included key-values |
|---|
| (“player1”, 4) |
| (“player1”, 5) |
| (“player1”, 10) |
| (“player1”, 12) |
| (“player1”, 17) |

Note that the ("player2", 27) key-value pair was not included in the stream because there’s no matching key-value in the left streamlet.

If the supplied join function, say, added the values together, then the resulting joined stream would look like this:

| Operation | Joined Streamlet |
|---|---|
| 4 + 10 | (“player1”, 14) |
| 4 + 12 | (“player1”, 16) |
| 5 + 10 | (“player1”, 15) |
| 5 + 12 | (“player1”, 17) |
| 17 + 10 | (“player1”, 27) |
| 17 + 12 | (“player1”, 29) |
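The pairing logic behind these six results can be sketched in plain Java. The code below is an illustration only and does not use the Heron API: `InnerJoinSketch`, `innerJoin`, and `kv` are hypothetical names, and key-values are modeled with `java.util.Map.Entry`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;
import java.util.function.BinaryOperator;

final class InnerJoinSketch {
    // Pairs every left element with every right element that has the same key
    // and combines the two values with joinFn.
    static List<Entry<String, Integer>> innerJoin(
            List<Entry<String, Integer>> left,
            List<Entry<String, Integer>> right,
            BinaryOperator<Integer> joinFn) {
        List<Entry<String, Integer>> joined = new ArrayList<>();
        for (Entry<String, Integer> l : left) {
            for (Entry<String, Integer> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    joined.add(new SimpleEntry<>(l.getKey(),
                            joinFn.apply(l.getValue(), r.getValue())));
                }
            }
        }
        return joined;
    }

    // Convenience factory for a key-value pair.
    static Entry<String, Integer> kv(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Entry<String, Integer>> left =
                Arrays.asList(kv("player1", 4), kv("player1", 5), kv("player1", 17));
        List<Entry<String, Integer>> right =
                Arrays.asList(kv("player1", 10), kv("player1", 12), kv("player2", 27));
        // Join function: add the two values together.
        System.out.println(innerJoin(left, right, Integer::sum));
        // prints [player1=14, player1=16, player1=15, player1=17, player1=27, player1=29]
    }
}
```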

Inner joins are the “default” join type in the Heron Streamlet API. If you call the join method without specifying a join type, an inner join will be applied.

Java example

```java
class Score {
    String playerUsername;
    int playerScore;

    // Setters and getters
}

Streamlet<Score> scores1 = /* A stream of player scores */;
Streamlet<Score> scores2 = /* A second stream of player scores */;

scores1
    .join(
        scores2,
        // Key extractor for the left stream (scores1)
        score -> score.getPlayerUsername(),
        // Key extractor for the right stream (scores2)
        score -> score.getPlayerUsername(),
        // Window configuration
        WindowConfig.TumblingCountWindow(50),
        // Join function (selects the larger score as the value
        // using a ternary operator)
        (x, y) -> (x.getPlayerScore() >= y.getPlayerScore()) ? x.getPlayerScore() : y.getPlayerScore()
    )
    .log();
```

In this example, two streamlets consisting of Score objects are joined. The join method is given a key extractor for each stream, a window configuration, and a join function. The resulting joined streamlet will consist of key-value pairs in which each player’s username is the key and the joined score (in this case the highest) is the value.

By default, an inner join is applied in join operations but you can also specify a different join type. Here’s a Java example for an outer right join:

```java
import com.twitter.heron.streamlet.JoinType;

scores1
    .join(
        scores2,
        // Key extractor for the left stream (scores1)
        score -> score.getPlayerUsername(),
        // Key extractor for the right stream (scores2)
        score -> score.getPlayerUsername(),
        // Window configuration
        WindowConfig.TumblingCountWindow(50),
        // Join type
        JoinType.OUTER_RIGHT,
        // Join function (selects the larger score as the value
        // using a ternary operator)
        (x, y) -> (x.getPlayerScore() >= y.getPlayerScore()) ? x.getPlayerScore() : y.getPlayerScore()
    )
    .log();
```

Outer left joins

An outer left join includes the results of an inner join plus all of the unmatched keys in the left stream. Take this example left and right streamlet:

| Left streamlet | Right streamlet |
|---|---|
| (“player1”, 4) | (“player1”, 10) |
| (“player2”, 5) | (“player4”, 12) |
| (“player3”, 17) | |

The resulting set of key-values within the time window:

| Included key-values |
|---|
| (“player1”, 4) |
| (“player1”, 10) |
| (“player2”, 5) |
| (“player3”, 17) |

In this case, key-values with a key of player4 are excluded because they are in the right stream but have no matching key with any element in the left stream.

Outer right joins

An outer right join includes the results of an inner join plus all of the unmatched keys in the right stream. Take this example left and right streamlet (from above):

| Left streamlet | Right streamlet |
|---|---|
| (“player1”, 4) | (“player1”, 10) |
| (“player2”, 5) | (“player4”, 12) |
| (“player3”, 17) | |

The resulting set of key-values within the time window:

| Included key-values |
|---|
| (“player1”, 4) |
| (“player1”, 10) |
| (“player4”, 12) |

In this case, key-values with a key of player2 or player3 are excluded because they are in the left stream but have no matching key with any element in the right stream.

Outer joins

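Outer joins include all key-values across both the left and right stream, regardless of whether or not any given element has a matching key in the other stream. If you want to ensure that no element is left out of a resulting joined streamlet, use an outer join. Take this example left and right streamlet (from above):

| Left streamlet | Right streamlet |
|---|---|
| (“player1”, 4) | (“player1”, 10) |
| (“player2”, 5) | (“player4”, 12) |
| (“player3”, 17) | |

The resulting set of key-values within the time window:

| Included key-values |
|---|
| (“player1”, 4) |
| (“player1”, 10) |
| (“player2”, 5) |
| (“player3”, 17) |
| (“player4”, 12) |

Note that all key-values were indiscriminately included in the joined set.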
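To compare the four join types side by side, the sketch below computes which keys take part in a join of each type. It is plain Java rather than Heron code, the enum `Kind` and the method `includedKeys` are hypothetical names, and it looks only at keys (not at values or windows) to keep the comparison short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class JoinTypeSketch {
    // Hypothetical enum for this sketch, one constant per join type described above.
    enum Kind { INNER, OUTER_LEFT, OUTER_RIGHT, OUTER }

    // Returns, in sorted order, the keys whose key-values are included in the join.
    static Set<String> includedKeys(List<String> leftKeys, List<String> rightKeys, Kind kind) {
        Set<String> right = new TreeSet<>(rightKeys);
        Set<String> included = new TreeSet<>(leftKeys);
        switch (kind) {
            case INNER:
                included.retainAll(right);  // only keys present on both sides
                break;
            case OUTER_LEFT:
                break;                      // matched keys plus unmatched left keys
            case OUTER_RIGHT:
                included = right;           // matched keys plus unmatched right keys
                break;
            case OUTER:
                included.addAll(right);     // every key from either side
                break;
        }
        return included;
    }

    public static void main(String[] args) {
        List<String> leftKeys = Arrays.asList("player1", "player2", "player3");
        List<String> rightKeys = Arrays.asList("player1", "player4");
        for (Kind kind : Kind.values()) {
            System.out.println(kind + " -> " + includedKeys(leftKeys, rightKeys, kind));
        }
    }
}
```

With the left keys player1, player2, player3 and the right keys player1, player4, only the outer join keeps all four players.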

Sink operations

In processing graphs like the ones you build using the Heron Streamlet API, sinks are essentially the terminal points in your graph, where your processing logic comes to an end. A processing graph can end with writing to a database, publishing to a topic in a pub-sub messaging system, and so on. With the Streamlet API, you can implement your own custom sinks.

Java example

```java
import com.twitter.heron.streamlet.Context;
import com.twitter.heron.streamlet.Sink;

public class FormattedLogSink<T> implements Sink<T> {
    private String streamletName;

    // Called once before any element is handled
    public void setup(Context context) {
        streamletName = context.getStreamletName();
    }

    // Called for each element that reaches the sink
    public void put(T element) {
        String message = String.format(
                "Streamlet %s has produced an element with a value of: '%s'",
                streamletName, element.toString());
        System.out.println(message);
    }

    public void cleanup() {}
}
```

In this example, the sink fetches the name of the enclosing streamlet from the context passed in the setup method. The put method specifies how the sink handles each element that is received (in this case, a formatted message is logged to stdout). The cleanup method enables you to specify what happens after the element has been processed by the sink.

Here is the FormattedLogSink at work in an example processing graph:

```java
Builder builder = Builder.newBuilder();

builder.newSource(() -> "Here is a string to be passed to the sink")
    .toSink(new FormattedLogSink<>());
```

Log operations rely on a log sink that is provided out of the box. You’ll need to implement other sinks yourself.

Consume operations

Consume operations are like sink operations except they don’t require implementing a full sink interface. Consume operations are thus suited for simple operations like formatted logging.
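The contrast with a full sink can be illustrated with a small standard-library sketch. It does not use the Heron API: the `consume` helper below is a hypothetical stand-in that simply hands each element of a bounded stream to a lambda, which is the kind of lightweight handling described above.

```java
import java.util.function.Consumer;
import java.util.stream.Stream;

final class ConsumeSketch {
    // Feeds every element of a bounded stream to the supplied consumer,
    // with no setup or cleanup methods to implement.
    static <T> void consume(Stream<T> elements, Consumer<T> consumer) {
        elements.forEach(consumer);
    }

    // Formats an element the way a simple logging consumer might.
    static String format(Object element) {
        return String.format("Received: '%s'", element);
    }

    public static void main(String[] args) {
        consume(Stream.of("ooh", "aah"), element -> System.out.println(format(element)));
    }
}
```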

Java example

Partitioning

In the Topology API, processing parallelism can be managed by adjusting the number of spouts and bolts performing different operations, enabling you to, for example, increase the relative parallelism of a bolt by running three instances of that bolt instead of two.

The Heron Streamlet API provides a different mechanism for controlling parallelism: partitioning. To understand partitioning, keep in mind that rather than physical spouts and bolts, the core processing construct in the Heron Streamlet API is the processing step. With the Heron Streamlet API, you can explicitly assign a number of partitions to each processing step in your graph (the default is one partition).

Repartition operations

As explained above, when you set a number of partitions for a specific operation (including for source streamlets), the same number of partitions is applied to all downstream operations until a different number is explicitly set.
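The inheritance rule can be sketched as a small calculation. The code below is only a model of the rule as stated here, not Heron code: `PartitionSketch` and `effectivePartitions` are hypothetical names, each list entry stands for one processing step in graph order, and `null` means that no partition count was set on that step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartitionSketch {
    // Resolves the partition count in effect at each step: an explicit value
    // applies to that step and to all downstream steps until the next explicit
    // value; before any explicit value, the default of one partition applies.
    static List<Integer> effectivePartitions(List<Integer> explicitPartitions) {
        List<Integer> effective = new ArrayList<>();
        int current = 1; // default: one partition
        for (Integer explicit : explicitPartitions) {
            if (explicit != null) {
                current = explicit;
            }
            effective.add(current);
        }
        return effective;
    }

    public static void main(String[] args) {
        // source (5 partitions) -> map -> filter -> repartition to 2 -> map
        System.out.println(effectivePartitions(Arrays.asList(5, null, null, 2, null)));
        // prints [5, 5, 5, 2, 2]
    }
}
```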