To run local and remote computation clusters, streamparse relies upon a JVM
technology called Apache Storm. The integration with this technology is
lightweight, and for the most part, you don’t need to think about it.

However, to get the library running, you’ll need

JDK 7+, which you can install with apt-get, homebrew, or an installer;
and

In the count_bolt bolt, we’ve told Storm that we’d like the stream of
input tuples to be grouped by the named field word. Storm offers
comprehensive options for
stream groupings,
but you will most commonly use a shuffle or fields grouping:

Shuffle grouping: Tuples are randomly distributed across the bolt’s tasks
such that each task is guaranteed to get an equal number of tuples.
This is the default grouping if no other is specified.

Fields grouping: The stream is partitioned by the fields specified in the
grouping. For example, if the stream is grouped by the “user-id” field,
tuples with the same “user-id” will always go to the same task, but tuples
with different “user-id”’s may go to different tasks.

There are more options to configure with spouts and bolts; we encourage you
to refer to our Topology DSL docs or
Storm’s Concepts for
more information.
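If you are using the Python Topology DSL, a fields grouping might be wired up
roughly like the sketch below. The module names spouts and bolts and the
WordSpout/WordCountBolt class names are assumptions carried over from the
quickstart, not definitions on this page:

from streamparse import Grouping, Topology

from bolts import WordCountBolt  # assumed module/class from the quickstart
from spouts import WordSpout     # assumed module/class from the quickstart


class WordCount(Topology):
    word_spout = WordSpout.spec()
    # group the stream of words by the "word" field, so that identical words
    # always go to the same WordCountBolt task
    count_bolt = WordCountBolt.spec(inputs={word_spout: Grouping.fields('word')},
                                    par=2)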

The general flow for creating new spouts and bolts using streamparse is to add
them to your src folder and update the corresponding topology definition.

Let’s create a spout that emits sentences until the end of time:

import itertools

from streamparse.spout import Spout


class SentenceSpout(Spout):
    outputs = ['sentence']

    def initialize(self, stormconf, context):
        self.sentences = [
            "She advised him to take a long holiday, so he immediately quit work and took a trip around the world",
            "I was very glad to get a present from her",
            "He will be here in half an hour",
            "She saw him eating a sandwich",
        ]
        self.sentences = itertools.cycle(self.sentences)

    def next_tuple(self):
        sentence = next(self.sentences)
        self.emit([sentence])

    def ack(self, tup_id):
        pass  # if a tuple is processed properly, do nothing

    def fail(self, tup_id):
        pass  # if a tuple fails to process, do nothing

The magic in the code above happens in the initialize() and
next_tuple() functions. Once the spout enters the main run loop,
streamparse will call your spout’s initialize() method.
After initialization is complete, streamparse will continually call the spout’s
next_tuple() method where you’re expected to emit tuples that match
whatever you’ve defined in your topology definition.

Now let’s create a bolt that takes in sentences, and spits out words:

import re

from streamparse.bolt import Bolt


class SentenceSplitterBolt(Bolt):
    outputs = ['word']

    def process(self, tup):
        sentence = tup.values[0]  # extract the sentence
        sentence = re.sub(r"[,.;!\?]", "", sentence)  # get rid of punctuation
        words = [word.strip() for word in sentence.split(" ") if word.strip()]
        if not words:
            # no words to process in the sentence, fail the tuple
            self.fail(tup)
            return

        for word in words:
            self.emit([word])
        # tuple acknowledgement is handled automatically

The bolt implementation is even simpler. We simply override the default
process() method which streamparse calls when a tuple has been emitted by
an incoming spout or bolt. You are welcome to do whatever processing you would
like in this method and can further emit tuples or not depending on the purpose
of your bolt.

If your process() method completes without raising an Exception, streamparse
will automatically ensure any emits you have are anchored to the current tuple
being processed and acknowledged after process() completes.

If an Exception is raised while process() is called, streamparse
automatically fails the current tuple prior to killing the Python process.

In the example above, we added the ability to fail a sentence tuple if it did
not provide any words. What happens when we fail a tuple? Storm will send a
“fail” message back to the spout where the tuple originated (in this case
SentenceSpout), and streamparse calls the spout’s
fail() method. It’s then up to your spout
implementation to decide what to do. A spout could retry a failed tuple, send
an error message, or kill the topology. See Dealing With Errors for
more discussion.

You can disable the automatic acknowledging, anchoring, or failing of tuples
by setting the class attributes auto_ack, auto_anchor, or auto_fail to
False. All three options are documented in
streamparse.bolt.Bolt.
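As a rough sketch of handling reliability by hand: the bolt name and its
processing logic below are made up for illustration, while the auto_ack,
auto_anchor, and auto_fail attributes and the emit(), ack(), and fail()
methods come from streamparse.bolt.Bolt:

from streamparse.bolt import Bolt


class ManualReliabilityBolt(Bolt):
    auto_ack = False     # don't ack the input tuple after process() returns
    auto_anchor = False  # don't anchor emitted tuples to the input tuple
    auto_fail = False    # don't fail the input tuple when process() raises

    def process(self, tup):
        try:
            word = tup.values[0]
            # anchor the emit to the input tuple explicitly
            self.emit([word.lower()], anchors=[tup])
            self.ack(tup)   # acknowledge the input tuple ourselves
        except Exception:
            self.fail(tup)  # fail the input tuple ourselves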

Tick tuples are built into Storm to provide some simple forms of
cron-like behaviour without actually having to use cron. You can
receive and react to tick tuples as timer events in your Python
bolts using streamparse too.

The first step is to override process_tick() in your custom
Bolt class. Once this is overridden, you can set the storm option
topology.tick.tuple.freq.secs=<frequency> to cause a tick tuple
to be emitted every <frequency> seconds.

You can see the full docs for process_tick() in
streamparse.bolt.Bolt.

Example:

from streamparse.bolt import Bolt


class MyBolt(Bolt):

    def process_tick(self, freq):
        # An action we want to perform at some regular interval...
        self.flush_old_state()

Then, for example, to cause process_tick() to be called every
2 seconds on all of your bolts that override it, you can launch
your topology with sparse run, setting the appropriate -o
option and value as in the following example:
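> sparse run -o "topology.tick.tuple.freq.secs=2"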

We’ve now defined a prod environment that will use the user storm when
deploying topologies. Before submitting the topology though, streamparse will
automatically take care of installing all the dependencies your topology
requires. It does this by SSHing into every one of the nodes in the workers
config variable and building a virtualenv using the project’s local
virtualenvs/<topology_name>.txt requirements file.
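For reference, the envs entry for such a prod environment in config.json
looks roughly like the sketch below; the hostnames and paths are placeholders,
and your actual config may contain additional keys:

{
    "topology_specs": "topologies/",
    "virtualenv_specs": "virtualenvs/",
    "envs": {
        "prod": {
            "user": "storm",
            "nimbus": "nimbus.example.com",
            "workers": [
                "storm-worker1.example.com",
                "storm-worker2.example.com"
            ],
            "virtualenv_root": "/data/virtualenvs"
        }
    }
}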

This implies a few requirements about the user you specify per environment:

Must have ssh access to all servers in your Storm cluster

Must have write access to the virtualenv_root on all servers in your
Storm cluster

streamparse also assumes that virtualenv is installed on all Storm servers.

Once an environment is configured, we could deploy our wordcount topology like
so:

> sparse submit

Seeing as we have only one topology and environment, we don’t need to specify
these explicitly. streamparse will now:

If you do not have ssh access to all of the servers in your Storm cluster, but
you know they have all of the requirements for your Python code installed, you
can set "use_virtualenv" to false in config.json.

If you have virtualenvs on your machines that you would like streamparse to
use, but not update or manage, you can set "install_virtualenv" to false
in config.json.
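Both of these are plain boolean settings; as a sketch (their placement
alongside your other environment settings is an assumption about your config
layout), you would add either:

"use_virtualenv": false

or

"install_virtualenv": false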

If you would like to pass command-line flags to virtualenv, you can set
"virtualenv_flags" in config.json, for example:

"virtualenv_flags":"-p /path/to/python"

Note that this only applies when the virtualenv is created, not when an
existing virtualenv is used.

If you would like to share a single virtualenv across topologies, you can set
"virtualenv_name" in config.json, which overrides the default behaviour of
naming the virtualenv after the topology. Updates to a shared virtualenv should
be done after shutting down topologies, as code changes in running topologies
may cause errors.
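For example (the virtualenv name here is just a placeholder):

"virtualenv_name": "shared_env"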

If you wish to use streamparse with unofficial versions of Storm (such as the HDP Storm),
you should set :repositories in your project.clj to point to the Maven repository
containing the JAR you want to use, and set the version in :dependencies to match
the desired version of Storm.

For example, to use the version supplied by HDP, you would set :repositories to:

The Storm supervisor needs to have access to the log.path directory for
logging to work (in the example above, /var/log/storm/streamparse). If you
have properly configured the log.path option in your config, streamparse
will use the value for the log.file option to set up log files for each
Storm worker in this path. The filename can be customized further by using
certain named placeholders. The default filename is set to:

pystorm_{topology_name}_{component_name}_{task_id}_{pid}.log

Where:

topology_name: is the topology.name variable set in Storm

component_name: is the name of the currently executing component as defined in your topology definition file (.clj file)

task_id: is the task ID running this component in the topology

pid: is the process ID of the Python process

streamparse uses Python’s logging.handlers.RotatingFileHandler and by
default will only keep 10 log files of 1 MB each (10 MB in total), but this
can be tuned with the log.max_bytes and log.backup_count variables.

The default logging level is set to INFO, but you can tune this with the
log.level setting, which can be one of critical, error, warning, info, or
debug. Note that if you perform sparse run or sparse submit with
the --debug flag set, this will override your log.level setting and set the
log level to debug.
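Putting these together, the log section of an environment in config.json
might look roughly like the sketch below; the values shown simply restate the
defaults and the example path described above and are not prescriptive:

"log": {
    "path": "/var/log/storm/streamparse",
    "file": "pystorm_{topology_name}_{component_name}_{task_id}_{pid}.log",
    "max_bytes": 1000000,
    "backup_count": 10,
    "level": "info"
}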

When running your topology locally via sparserun, your log path will be
automatically set to /path/to/your/streamparse/project/logs.