Introduction to Stroom Data

tl;dr Stroom Data is a simple and approachable framework for processing streams of data.

As an engineer at a startup, you are running as fast as you can to deliver the features needed to stay alive. You will knowingly leave a trail of technical debt behind you as well as solutions you know won’t scale if the startup succeeds.

A common shortcut engineers will take is to initially throw all data into their database of choice, be it MySQL, PostgreSQL, MongoDB or some other flavor of a SQL or NoSQL data store. This is a sound decision to keep moving forward; solve the problem for the scale at hand, but don’t spend too much time over-engineering for what you think might be needed in the future.

At DoubleDutch, we had been storing all client metrics data in a single table in a relational database. This worked really well for several years… until suddenly it didn’t anymore. By 2014 we were collecting 5-10 GB of data daily. Our mountain of metrics data was slowly reaching a size that made live queries problematic. Even exploratory queries by our data analysts were so slow that the process was a hindrance to their work. It was finally time to return to this part of our code base and decide on a solution that would let us scale beyond our present needs.

At that point, our primary solution for dealing with the amount of data we were receiving was a pipeline that pushed our metrics onto a Kafka queue, processed them using Samza, and delivered the end results as aggregates into a PostgreSQL database. The solution worked well as long as our needs aligned with the exact aggregates for which the system had been designed. It worked well enough, in fact, that two years later we are building the second generation of this system in almost the same way, but feeding into an Elasticsearch cluster. When our needs didn’t align with the initial design of this pipeline, the engineering effort required to change it became a real obstacle for exploratory projects.

While our product is very mature and full-featured at this point, we are still a startup and we are still exploring new features and projects at a very rapid pace. To support the data needs of these projects, we wrote a small, simple system that would store streams of JSON documents and maintain up-to-date aggregates over them. You define your aggregate processes through very simple map/reduce functions written in JavaScript. The result was that small teams of engineers were able to solve their needs for live views into our mountain of metrics data with very little effort, typically by writing just 5-10 lines of simple JavaScript code.

As a sample of what this actually looked like, you would post new metrics as JSON documents to /stream/:topic and set up an aggregate job in the form of a single JavaScript file. The following code shows a simple map/reduce job that would keep a running aggregate of views by eventId.
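A minimal sketch of such a job might look like the code below. The function names `map` and `reduce`, their signatures, the `type` and `eventId` document fields, and the small `runJob` helper that stands in for the framework are all assumptions made for illustration; they are not taken from the Stroom specification.

```javascript
// Hypothetical job shape: the names map/reduce, the document fields, and the
// runJob helper are illustrative assumptions, not the actual Stroom API.

// map: extract the part of each document that matters for this aggregate.
function map(doc) {
  if (doc.type !== 'view' || doc.eventId == null) {
    return null; // skip anything that is not a view metric
  }
  return { eventId: doc.eventId, views: 1 };
}

// reduce: fold one mapped value into the running accumulator.
function reduce(accumulator, value) {
  accumulator[value.eventId] = (accumulator[value.eventId] || 0) + value.views;
  return accumulator;
}

// Stand-in for the framework: push a stream of documents through map/reduce.
function runJob(docs) {
  let accumulator = {};
  for (const doc of docs) {
    const value = map(doc);
    if (value !== null) {
      accumulator = reduce(accumulator, value);
    }
  }
  return accumulator;
}

// A few sample metrics, as they might be posted to /stream/:topic.
const sample = [
  { type: 'view', eventId: 'e1' },
  { type: 'view', eventId: 'e2' },
  { type: 'click', eventId: 'e1' },
  { type: 'view', eventId: 'e1' },
];

const viewsByEvent = runJob(sample);
console.log(viewsByEvent);
```

The map step and the reduce step together come to roughly ten lines; everything else in the sketch is scaffolding that a framework would normally provide.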

The live aggregate accumulator would be exposed over HTTP as /aggregates/:name and could then be used by web-based dashboards. This was a very simple solution that would not scale as well as the same job written for Hadoop, but it enabled developers to create live aggregates with very little code and no configuration. This system has been in production at DoubleDutch for almost two years at this point.

Today we are announcing the first open source release of the second generation of this data processing framework. We are calling this system Stroom Data—stroom means stream in Dutch. The following video gives a quick demonstration of how to run Stroom, load a data set into it, and set up a map/reduce job.

Stroom is first and foremost a specification for a set of simple data processing components. However, along with the specification, we are delivering a reference implementation that we are currently using in production at DoubleDutch. The intention is that each of the components of Stroom is simple enough that it can easily be implemented in a way that fits with your technology stack and your deployment strategy.

Stroom consists of components that all work together on top of a simple HTTP-based stream interface. They are designed to function as basic building blocks that can be combined to solve a larger problem, in much the same way command line tools do in UNIX. When possible, they embrace the principle of CQRS to separate the processing of data from the querying of data.

Dig Deeper and Keep in Touch!

Both the specification and reference implementation of Stroom Data are released as an open source project under the Apache license. You can find the project on GitHub here: https://github.com/DoubleDutch/StroomData