
I have been working with the SMACK stack for a while now, and it is great fun from a developer’s point of view. Kafka is a very robust data buffer, Spark is great at streaming all that buffered data, and Cassandra is really fast at writing and retrieving it. Unless, of course, your data analysts come up with new queries for which there are no optimized Cassandra tables. In that case, you have to make a choice. You could use Spark itself to run those computations dynamically, for example with a notebook tool like Zeppelin. Depending on your data, this might take a while. Or you can store the data suitably shaped during ingestion in Cassandra – taking the penalty of duplicated data. And with the next new query, the cycle starts again…

Druid promises Online Analytical Processing (OLAP) capability in fast data contexts with realtime stream ingestion and sub-second queries. In this post, I am going to introduce the basic concepts behind Druid and show the tools in action.

Druid Segments

The equivalent to a table in Druid is called “data source”. The data source defines how data is stored and sharded.

Druid relies on the time dimension for every query, so data sets require a timestamp column. Other than the primary timestamp, data sets can contain a set of dimensions and metrics (measures in OLAP). Let’s use an example. We’re tracking registrations for events – the time the registration occurred, the name of the event, the country the event takes place in and the number of guests registered in that reservation:

| timestamp            | event name      | country | guests |
|----------------------|-----------------|---------|--------|
| 2016-08-14T14:01:35Z | XP Cologne 2017 | Germany | 12     |
| 2016-08-14T14:01:45Z | XP Cologne 2017 | Germany | 2      |
| 2016-08-14T14:02:36Z | data2day 2016   | Germany | 1      |
| 2016-08-14T14:02:55Z | JavaOne         | USA     | 9      |

“Event name” and “country” are dimensions – data fields that analysts might want to filter on. “Guests” is a metric – a field we can run calculations on, answering questions such as “How many guests have registered for XP Cologne 2017?” or “How many guests are attending conferences in Germany?”

Druid stores the data in immutable “segments”. A segment contains all data for a configured period of time. By default, a segment contains one day’s worth of data, but this can be set coarser or finer. So depending on the configuration, the data above could be in a single segment if the segment granularity is “day”, or in two different segments if the granularity is “minute”. In addition to segment granularity, there is also “query granularity”. This defines how data is rolled up within Druid. In the example above, an analyst might not care about single registration events in a segment but only about the aggregation. With a query granularity of one minute, we would get the following result for the data above:

| timestamp            | event name      | country | guests |
|----------------------|-----------------|---------|--------|
| 2016-08-14T14:01:00Z | XP Cologne 2017 | Germany | 14     |
| 2016-08-14T14:02:00Z | data2day 2016   | Germany | 1      |
| 2016-08-14T14:02:00Z | JavaOne         | USA     | 9      |

Data rollup can save you quite a bit in storage capacity, but of course you lose the ability to query individual events. Now that we know about basic data structures in Druid, let’s look at the components that make a Druid cluster.
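Rollup is easy to mimic outside Druid. Here is a small Python sketch that truncates each timestamp to the minute and sums the guests per (minute, event, country) group – in effect, what a query granularity of “minute” does to the raw rows above:

```python
from collections import defaultdict

# Raw registration events: (timestamp, event name, country, guests)
rows = [
    ("2016-08-14T14:01:35Z", "XP Cologne 2017", "Germany", 12),
    ("2016-08-14T14:01:45Z", "XP Cologne 2017", "Germany", 2),
    ("2016-08-14T14:02:36Z", "data2day 2016", "Germany", 1),
    ("2016-08-14T14:02:55Z", "JavaOne", "USA", 9),
]

def rollup(rows):
    """Sum guests per (minute, event, country), like a query granularity of 'minute'."""
    agg = defaultdict(int)
    for ts, event, country, guests in rows:
        minute = ts[:17] + "00Z"  # truncate seconds: ...T14:01:35Z -> ...T14:01:00Z
        agg[(minute, event, country)] += guests
    return dict(agg)

for key, guests in sorted(rollup(rows).items()):
    print(key, guests)
```

The four raw rows collapse into three rolled-up rows, matching the table above.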

Cluster components

Well, I never promised that this was going to be easy. Druid requires quite a few components and external dependencies that need to work together. Let’s step through them:

Historical Nodes

These are nodes that do three things really well: loading immutable data segments from deep storage, dropping those segments and serving queries about loaded segments. In a production environment, deep storage will typically be Amazon S3 or HDFS. A historical node only cares about its own loaded segments. It does not know anything about any other segment. How does it know what segments to load? For that, we have …

Coordinator nodes

Coordinator nodes are the housekeepers in a Druid cluster. They make sure that all serviceable segments are loaded by at least one Historical Node – depending on the configured replication factor. They also make sure that Historicals drop segments that are no longer needed, and they rebalance segments for a somewhat even distribution within the cluster. So that’s historical data covered. Communication between Historicals and Coordinators happens indirectly via another external dependency: Apache ZooKeeper, which you may already know from Kafka or Mesos. You can run multiple Coordinators for high availability; a leader will be elected.

So what about the promised realtime ingestion?

Realtime Ingestion

While Druid can be used to ingest batches of data, realtime ingestion is one of its strong points. At the moment, there are two options available for realtime ingestion. The first one is Realtime Nodes, but I will not go into detail about them because they seem to be on their way out. Superseding Realtime Nodes is the Indexing Service.

The Indexing Service is a bit more involved. It is centered around the notion of tasks. The component that accepts tasks from the outside is the “Overlord”. It forwards those tasks to so-called “Middle Managers”. These nodes in turn spawn so-called “Peons”, which run in a separate JVM and can only run one task at a time. A Middle Manager can run a specified number of Peons. If all Middle Managers are occupied and the Overlord is not able to assign a new task anywhere, there is some built-in autoscaling capability to create new Middle Managers in suitable environments (e.g. AWS).

For realtime ingestion, a task is created for every new segment that is to be created. Data becomes queryable immediately after ingestion. The segments are announced in the third external dependency – a *gasp* relational database. The database contains the metadata about each segment and some rules that decide when its data should be loaded or dropped by Historicals. Once a segment is complete, it is handed off to deep storage, its metadata is written and the task is completed.

The Indexing Service API is a bit low-level. To make working with it easier, the “Tranquility” project seems to have established itself as the entry point for realtime ingestion. Tranquility servers can provide HTTP endpoints for data ingestion or read data from Kafka.

OK then, we’ve got the data ingested and stored. How can you query it?

Broker Nodes

Broker Nodes are your gateway to the data stored in Druid. They know which segments are available on which nodes (from Zookeeper), query the responsible nodes and merge the results together. The primary query “language” is a REST API, but tool suites like the Imply Analytics Platform extend that support with tools like Pivot and PlyQL. We will see those later.

As you can hardly be expected to remember all this prose, here is a diagram that shows the components and their relations:

Example

In this example, we’ll set up a Druid cluster that ingests RSVP events from Meetup.com – these are published on a very accessible WebSocket API.

Instead of starting each Druid component on its own, we’ll take a shortcut and set up the Imply Analytics Platform mentioned above. That way we won’t gain too much insight into the inner workings, but we’ll get our hands on a working “cluster”. I put cluster in quotes because we will run all processes on a single machine. Deep storage will be simulated by the local file system, and the RDBMS is an in-memory Derby.

The data

Let’s take a look at the data first. We receive events of the following type from Meetup:
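An RSVP event looks roughly like this – the field names follow Meetup’s public RSVP stream, while the values are made up for illustration:

```json
{
  "rsvp_id": 123456789,
  "response": "yes",
  "guests": 1,
  "mtime": 1471183295000,
  "member": { "member_id": 1111, "member_name": "Jane Doe" },
  "event": { "event_id": "abcd1", "event_name": "Apache Big Data Meetup", "time": 1471300000000 },
  "group": {
    "group_name": "Big Data Cologne",
    "group_city": "Cologne",
    "group_country": "de",
    "group_lat": 50.93,
    "group_lon": 6.96
  },
  "venue": { "venue_name": "codecentric", "lat": 50.93, "lon": 6.96 }
}
```

The timestamp of the registration is the “mtime” field (milliseconds since the epoch), and “guests” is our natural metric.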

Simple local setup

Before we can send this data to Druid, we need to start it up. The guys at Imply make this really easy for us: following their quickstart guide for a local setup, we just download and unpack the distribution and start all services with bin/supervise -c conf/supervise/quickstart.conf.
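Next to the running services, Tranquility needs to know about our data source. A sketch of such an ingestion spec follows – the data source name, field names and values here are assumptions for illustration, not taken verbatim from the quickstart:

```json
{
  "dataSources": [
    {
      "spec": {
        "dataSchema": {
          "dataSource": "meetup",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": { "column": "mtime", "format": "millis" },
              "dimensionsSpec": {
                "dimensions": ["response", "event_name", "group_country", "group_city"],
                "spatialDimensions": [
                  { "dimName": "venue_coordinates", "dims": ["lat", "lon"] }
                ]
              }
            }
          },
          "metricsSpec": [
            { "type": "count", "name": "count" },
            { "type": "longSum", "name": "guests", "fieldName": "guests" }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "hour",
            "queryGranularity": "none"
          }
        },
        "tuningConfig": { "type": "realtime" }
      }
    }
  ],
  "properties": { "zookeeper.connect": "localhost" }
}
```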

This spec basically tells Tranquility and Druid which fields in our dataset are event timestamps, dimensions and metrics. We can also aggregate latitude and longitude into a spatial dimension. In the “granularity” section, we specify that we want our segments to cover an hour and the query granularity to be “none” – this preserves single records, which is nice for our example but probably not what you would do in production. After editing the file, we can start Druid with the steps shown above. Once it is running, we can ingest data by posting HTTP requests to http://localhost:8200/v1/post/meetup. To check that the Meetup ingestion is running, we can open the Overlord console:
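A single event can be pushed to the Tranquility endpoint like this (assuming a file rsvp.json containing an event like the one sketched above):

```
curl -XPOST -H 'Content-Type: application/json' \
     --data @rsvp.json http://localhost:8200/v1/post/meetup
```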

Accessing the data

So now we’re ingesting the data, how can we get to it? I am going to show three different ways.

Druid queries

The basic way to get data out of Druid is to run a plain Druid query. This means posting the query in JSON format to the Broker node. If, for example, we would like to get the count of all RSVPs from Germany or the US per hour in a specified period, we’d post the following data to http://localhost:8082/druid/v2/?pretty:
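A timeseries query for that could look as follows – the dimension name group_country and the country codes are assumptions based on the Meetup event fields; adjust them to your schema:

```json
{
  "queryType": "timeseries",
  "dataSource": "meetup",
  "granularity": "hour",
  "filter": {
    "type": "or",
    "fields": [
      { "type": "selector", "dimension": "group_country", "value": "de" },
      { "type": "selector", "dimension": "group_country", "value": "us" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "rsvps", "fieldName": "count" }
  ],
  "intervals": ["2016-08-13T00:00:00Z/2016-08-15T00:00:00Z"]
}
```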

The queries that Druid performs best are timeseries and TopN queries. The documentation gives you a good overview of what is possible.

Pivot

Imply provides “Pivot” at http://localhost:9095. Pivot is a GUI for analyzing a Druid datasource and is very accessible. You are greeted by something like this:

If we want to see the events with the biggest number of RSVPs in the last week including the split between “yes” and “no”, we certainly can do that:

We can also look at the raw data:

Playing around with Pivot to get a feeling for the tool and your data is certainly fun, and it works like this out of the box. Pivot is based on “Plywood” – a JavaScript library that serves as an integration layer between Druid data and visualization frontends and is also part of Imply.

PlyQL

Another part of Imply is “PlyQL”. As you can imagine from the name, it aims to provide SQL-like access to the data. For our Meetup data source, we start by looking at the set of tables that we can query:
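With PlyQL installed, that might look like this (the host points at the Broker; the second query is an illustrative example against the assumed group_country dimension):

```
plyql -h localhost:8082 -q 'SHOW TABLES'
plyql -h localhost:8082 -q 'SELECT COUNT(*) AS rsvps FROM meetup WHERE group_country = "de"'
```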

We could also use the -o switch to get the output of our queries as JSON.

Summary

This concludes our quick walkthrough of Druid. We talked about the basic concepts of Druid, ran an ingestion of realtime Meetup data and looked at ways to access the data with plain Druid and the very interesting Imply Analytics Platform. Druid is a very promising piece of technology that warrants an evaluation if you’re trying to run OLAP queries on realtime fast data.

Florian Troßbach has his roots in classic Java Enterprise development. After a brief detour into the world of classic SAP, he joined codecentric as an IT Consultant and focuses on Fast Data and the SMACK stack.