Storm is a free and open source distributed real-time
computation system. Storm makes it easy to reliably process unbounded streams
of data, doing for real-time processing what Hadoop did for batch processing.
Storm is simple, can be used with any programming language, and is a lot of fun
to use!

Storm has many use cases: realtime analytics, online machine
learning, continuous computation, distributed RPC, ETL, and more. Storm is
fast: a benchmark clocked it at over a
million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed,
and is easy to set up and operate.

Hadoop is fundamentally a batch
processing system. Data is introduced into HDFS , processed by the nodes and
once process is complete, resulting data is back to HDFS. But the problem was
how to perform the realtime data processing. Storm came into the picture to
solve this problem.

Storm makes it easy to process
unbound stream of data with real-time processing. It process data into
topologies and continues the processing data as it arrives.

Storm
Components :

Topology : As on
Hadoop, you run "Map-Reduce jobs", on Storm, you will run
'Topologies'. Key difference between both is : MapReduce job eventually
finished, whereas a topology runs forever(until you kill it

Nimbus : master node runs a daemon called
"Nimbus" that is similar to Hadoop's "JobTracker". Nimbus
is responsible for distributing code around the cluster, assigning tasks to
machines, and monitoring for failures.

Supervisor : Each worker node runs a daemon called the
"Supervisor". The supervisor listens for work assigned to its
machine and starts and stops worker processes as necessary based on what
Nimbus has assigned to it. Each worker process executes a subset of a
topology; a running topology consists of many worker processes spread
across many machines.

Stream
: A stream is an unbounded
sequence of tuples. Storm processes transforming a stream into a new stream
in a distributed and reliable way. For example, transforming tweets stream into a stream of trending
topics.

Spout: It’s a source of streams. It reads tuple from
external source and emits those as stream in the topology.

Bolt
: It consumes input streams,
does some processing, and possibly emits new streams. Complex stream
transformations, like computing a stream of trending topics from a stream
of tweets, require multiple steps and thus multiple bolts. Bolts can do
anything from run functions, filter tuples, do streaming aggregations, do
streaming joins, talk to databases etc.

Zookeeper cluster works as coordinator
between Nimbus and supervisors. Nimbus and supervisor are stateless and fail-fast.
All states are kept in Zookeeper or local disk.

Why should anyone use Storm:

Now we have the clear idea of storm
in real-time processing. As Hadoop Map-reduce eases the batch processing, in
the same way Storm eases the parallel real-time computation.

Following are some key point, that
shows the importance of Storm.

Scalable : It scales massive numbers
of messages per second.To
scale a topology, all you have to do is add machines and increase the
parallelism settings of the topology. As an example, one of Storm's initial
applications processed 1,000,000 messages per second on a 10 node cluster,
including hundreds of database calls per second as part of the topology.
Storm's usage of Zookeeper for cluster coordination makes it scale to much
larger cluster sizes.

Guarantees no loss of data : Storm
guarantees that every message will be processed.

Robust: Goal of Storm is to make
user painless for Storm cluster management unlike hadoop clusters.