

Hadoop is at heart a batch-oriented big data solution and leaves gaps in ad-hoc and real-time data processing at massive scale, so some people have already started counting the days of Hadoop as we know it. As one alternative, we have already looked at Google BigQuery for ad-hoc analytics; this post is about Twitter’s Storm, a real-time computation engine that aims to fill the gap in real-time data analytics. Storm was originally developed by BackType and is now maintained under Twitter’s name, following Twitter’s acquisition of BackType. Nathan Marz explained the need for a dedicated real-time analytics solution as follows: “There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing…. The lack of a “Hadoop of realtime” has become the biggest hole in the data processing ecosystem. Storm fills that hole.”

Storm Architecture

The Storm architecture closely resembles the Hadoop architecture. It has two types of nodes: a master node and worker nodes. The master node runs Nimbus, which distributes the code to the cluster nodes and assigns tasks to the workers; its role is similar to that of the JobTracker in Hadoop. Each worker node runs a Supervisor, which starts and stops worker processes; its role is similar to that of a TaskTracker in Hadoop. All coordination and state between Nimbus and the Supervisors is managed by ZooKeeper.

Storm is written in Clojure and Java.

One of the key concepts in Storm is the topology. In essence, a Storm cluster executes a topology, which defines the data sources, the processing tasks and the data flow between the nodes. A Storm topology can be considered analogous to a MapReduce job in Hadoop.

Storm has a concept of streams, which are basically sequences of tuples; they represent the data that is passed between the Storm nodes. There are two main components for manipulating stream data. Spouts read data from a source (e.g. a queue or an API) and emit tuples, each of which is a list of fields. Bolts consume the data coming from input streams, process it, and then either emit a new stream or store the data in a database.

One important decision when you define a topology is how data will be passed between the nodes. As discussed above, a node (running either spouts or bolts) emits a stream. Stream grouping lets you decide which node(s) receive the emitted tuples. Storm has a number of grouping functions: shuffle grouping (tuples are sent to a randomly chosen bolt), fields grouping (tuples with the same values for a given set of fields are always sent to the same bolt), all grouping (the tuples are sent to all instances of the same bolt) and direct grouping (the source determines which bolt receives the tuples). You can also implement your own custom grouping method.
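To make the difference between shuffle grouping and fields grouping more concrete, here is a minimal plain-Java sketch. It is not Storm's implementation and uses no Storm classes: it simply assumes that a fields grouping can be modelled as hashing the grouping field and taking the result modulo the number of bolt tasks. The class and method names (GroupingSketch, fieldsGroupingTask, shuffleGroupingTask) are hypothetical.

```java
import java.util.Random;

// Hypothetical model of two stream groupings; not part of the Storm API.
final class GroupingSketch {
    private GroupingSketch() {}

    // Fields grouping (assumed model): hash the grouping field value and map it
    // onto one of numTasks bolt tasks, so equal values always reach the same task.
    static int fieldsGroupingTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Shuffle grouping (assumed model): any task may be chosen at random.
    static int shuffleGroupingTask(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }
}
```

Under this model, every tuple carrying the word "storm" lands on the same task index, which is what makes per-word counting possible later on, while shuffled tuples may land on any task.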

If you want to know more about Storm internals, you can download the code and find a great tutorial on GitHub.

A Storm application

The best way to start with Storm is to download the storm-starter package from GitHub. It contains a variety of examples, from a basic WordCount to more complex implementations. In this post we will have a closer look at WordCountTopology.java. The package has a Maven m2-pom.xml file, so you can compile and execute it using the mvn command.

RandomSentenceSpout has a method called nextTuple(), which is inherited from the ISpout interface. When this method is called, Storm is requesting that the spout emit tuples to the output collector. In this case, the tuples are sentences selected at random from a predefined String array.
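The idea behind nextTuple() can be sketched without any Storm dependency. The class below is a hypothetical stand-in, not the storm-starter code: it does not implement ISpout, it returns the sentence instead of emitting it to an output collector, and the sentences in the array are placeholders chosen for this sketch.

```java
import java.util.Random;

// Hypothetical stand-in for a sentence spout; it does not implement Storm's ISpout.
final class SentenceSourceSketch {
    // Placeholder sentences for illustration only.
    static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away",
        "four score and seven years ago"
    };

    private final Random random;

    SentenceSourceSketch(Random random) {
        this.random = random;
    }

    // Mirrors the idea of nextTuple(): each call yields one randomly chosen sentence.
    String nextSentence() {
        return SENTENCES[random.nextInt(SENTENCES.length)];
    }
}
```

Passing the Random in through the constructor is a choice made for this sketch so that a seeded generator can be used when testing.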

The next step in the topology definition is the SplitSentence bolt. It actually invokes a Python script, splitsentence.py, which splits the sentences into words using Python's split() method.
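For readers who prefer to stay in Java, the splitting step could look roughly like the sketch below. It is only an approximation of the Python behaviour, assuming that words are separated by runs of whitespace; SplitSentenceSketch is a hypothetical name and the class is not a Storm bolt.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical Java approximation of the splitting step; not a Storm bolt.
final class SplitSentenceSketch {
    private SplitSentenceSketch() {}

    // Splits on runs of whitespace and ignores leading and trailing whitespace.
    static List<String> split(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            // Avoid returning a list with one empty word for blank input.
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }
}
```

In a real bolt, each word in the returned list would be emitted as its own tuple.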

The final bolt, WordCount, has an execute() method that receives the words as tuples and uses a Map to count the occurrences of each word. It then emits the result.
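The counting step described above can be sketched in a few lines of plain Java. This is a simplified model rather than the storm-starter bolt: WordCounterSketch is a hypothetical class that returns the updated count instead of emitting it to a collector.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the counting step; not a Storm bolt.
final class WordCounterSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Called once per incoming word: increments the running count for that word
    // and returns the new total, which a bolt would emit downstream.
    int count(String word) {
        return counts.merge(word, 1, Integer::sum);
    }
}
```

Because the map lives inside a single instance, the counts are only correct if every occurrence of a given word reaches the same instance, which is why this step is typically combined with a fields grouping on the word.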

You can run a Storm topology in two modes (again, similar to Hadoop's stand-alone and distributed modes). One mode is based on the LocalCluster class, which lets you run the topology on your own machine, debug it, etc. When you are ready to run it on a Storm cluster, you use the StormSubmitter class to submit the topology to the cluster.

The parallelism can be controlled by various methods and arguments, such as setNumWorkers(), setMaxTaskParallelism() and the parallelism_hint argument used when building the topology; see e.g. the value 5 in the builder.setSpout() call below. The parallelism_hint defines the number of tasks that should be assigned to execute the given spout. Each task will run on a thread in a process somewhere in the cluster.

builder.setSpout("spout", new RandomSentenceSpout(), 5);

When we run the application, we can see multiple threads running in parallel: some emit the original random sentences, others split them into words, and yet others count the words.
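The effect of several tasks working on the same job in parallel can be imitated with a plain Java thread pool. The sketch below is only an analogy and says nothing about how Storm schedules tasks: it collapses the emit, split and count steps into one task per thread, and ParallelCountSketch with its run() parameters is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical thread-pool analogy for parallel tasks; not Storm's scheduler.
final class ParallelCountSketch {
    private ParallelCountSketch() {}

    // Starts `parallelism` threads; each one splits the sentence and adds its
    // words to a shared, thread-safe map `repeats` times.
    static Map<String, Integer> run(String sentence, int parallelism, int repeats) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        for (int task = 0; task < parallelism; task++) {
            pool.submit(() -> {
                for (int i = 0; i < repeats; i++) {
                    for (String word : sentence.trim().split("\\s+")) {
                        counts.merge(word, 1, Integer::sum); // atomic per key
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts;
    }
}
```

With 5 threads and 100 repeats of the sentence "a b a", the shared map ends up with 1000 for "a" and 500 for "b", regardless of how the threads interleave.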

Big data analytics comes in many flavours, from batch processing to ad-hoc analytics to real-time processing. Hadoop, the granddad of all big data, is focused on batch-oriented solutions; should you need to support real-time analytics, Storm can offer an interesting alternative.