Happy New Year everyone! We’re excited to announce that we’ve added 8 new
committers to Druid. These committers have been making sustained contributions
to the project, and we look forward to working with them to continue to develop
the project in 2016.

We are excited to announce that we have formalized the governance of Druid to
be a community-led project! Druid has been informally community led for some
time, with committers from various organizations regularly adding new features,
improving performance, and making things easier to use. Project committers vote
on proposals, review/write pull requests, provide community support, and help
guide the technical direction of the project. You can find more information on
the project’s goals and governance on our recently updated Druid webpage.
Druid depends on its vibrant community of users for feedback on features and
documentation, and for very helpful bug reports.

We are very happy to announce that Druid has changed its license to Apache 2.0.
We believe this is a change the community will welcome. As engineers, we love
to see the things we make get used, and we want to give back to the broader
open source world that we have benefited from for so long. We believe that
switching to the Apache license will better promote the growth of the Druid
community, and we hope to send a clear message that we
are all equal participants in the Druid community, a sentiment that is very
important to us.

Everyone wants a great logo, but it’s notoriously difficult work—prone to
miscommunications, heated debates and countless revisions. Still, after three
years we couldn’t put it off any longer. Druid needed a visual identity, so we
partnered with the talented folks at Focus Lab for
help.

In February we were honored to speak at the O’Reilly Strata conference about
building a robust, flexible, and completely open source data analytics stack.
If you couldn’t make it, you can watch the video
here. Preparing for our talk got
us thinking about all the brilliant folks working on similar problems, so we
organized a panel that same night to continue the conversation.

We've already written about pairing R with RDruid, but Python has powerful and free open-source analysis tools too. Collectively, these are often referred to as the SciPy Stack. To pair SciPy's analytic power with the advantages of querying time-series data in Druid, we created the pydruid connector. This allows Python users to query Druid—and export the results to useful formats—in a way that makes sense to them.
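
For a taste of what that looks like, here is a minimal sketch of a pydruid query, assuming a recent version of the client; the broker URL, datasource, and column names below are placeholders rather than anything from a real cluster.

```python
from pydruid.client import PyDruid
from pydruid.utils.aggregators import doublesum

# Point the client at a Druid broker; the URL and endpoint are placeholders.
client = PyDruid('http://localhost:8083', 'druid/v2/')

# A simple timeseries query; the datasource and column names are made up.
result = client.timeseries(
    datasource='wikipedia_edits',
    granularity='hour',
    intervals='2013-10-01/2013-10-02',
    aggregations={'edits': doublesum('count')}
)

# Export the result into a pandas DataFrame and continue with the SciPy stack.
df = result.export_pandas()
print(df.head())
```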

We often get asked how fast Druid is. Although we have published some benchmark
numbers in previous blog posts and in our talks, until now we have not published
any data to back those claims up in a reproducible way. This post intends to
address that and make it easier for anyone to evaluate Druid and compare it to
other systems out there.

Sensors are everywhere these days, and that means sensor data is big data. Ingesting and analyzing sensor data at speed is an interesting problem, especially at scale. In this post, we'll access some real-world sensor data and show how Druid can be used to store that data and make it available for immediate querying.

At Metamarkets, we specialize in converting mountains of programmatic ad data
into real-time, explorable views. Because these datasets are so large and
complex, we’re always looking for ways to maximize the speed and efficiency of
how we deliver them to our clients. In this post, we’re going to continue our
discussion of some of the techniques we use to calculate critical metrics such
as unique users and device IDs with maximum performance and accuracy.

What if you could combine a statistical analysis language with the power of an analytics database for instant insights into realtime data? You'd be able to draw conclusions from analyzing data streams at the speed of now. That's what combining the prowess of a Druid database with the power of R can do.

Before we start querying Druid, we're going to finish setting up a complete cluster on localhost. In our previous posts, we set up a Realtime node. In this tutorial we will also set up the other Druid node types: Compute, Master, and Broker.

We recently attended Stanford XLDB and the experience was a blast. Once a year, XLDB invites speakers from different organizations to discuss the challenges of and solutions to dealing with Xtreme (with an X!) data sets. This year, Jeff Dean dropped knowledge bombs about architecting scalable systems, Michael Stonebraker provided inspiring advice about growing open source projects, CERN explained how they found the Higgs Boson, and several organizations spoke about their technology. We definitely recommend checking out the slides from the conference.

Without Whirr, to launch a Druid cluster you'd have to provision machines yourself and then install each node type manually. This process is outlined here. With Whirr, you can boot a Druid cluster by editing a simple configuration file and then issuing a single command!

I’d like to acknowledge Xavier Léauté for his extensive contributions (in
particular, for suggesting several algorithmic improvements and work on
implementation), helpful comments, and fruitful discussions. Featured image
courtesy of CERN.

In our last post, we got a realtime node working with example Twitter data. Now it's time to load our own data to see how Druid performs. Druid can ingest data in three ways: via Kafka and a realtime node, via the indexing service, and via the Hadoop batch loader. Data is ingested in realtime using a Firehose. In this post we'll outline how to ingest data from Kafka in realtime using the Kafka Firehose.
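
To make the pipeline concrete, here is a rough sketch of the producer side, assuming the kafka-python package; the topic name and event fields are invented for illustration, and the realtime node's Kafka Firehose would be configured to read from the same topic.

```python
import json
import time

from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Connect to a local Kafka broker; the address and topic name are placeholders.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda event: json.dumps(event).encode('utf-8'),
)

# Emit a few example events shaped like the rows Druid will ingest:
# a timestamp, a couple of dimensions, and a metric.
for i in range(5):
    event = {
        'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        'page': 'page_%d' % i,   # hypothetical dimension
        'language': 'en',        # hypothetical dimension
        'added': 10 * i,         # hypothetical metric
    }
    producer.send('druid_example_topic', event)

producer.flush()
```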

Druid is a rockin' exploratory analytical data store capable of offering interactive queries on big data in realtime, as data is ingested. Druid drives tens of billions of events per day for the Metamarkets platform, and Metamarkets is committed to building Druid in open source.

Danny Yuan, Cloud System Architect at Netflix, and I recently co-presented at
the Strata Conference in Santa Clara. The
presentation discussed how Netflix
engineers leverage Druid,
Metamarkets’ open-source, distributed, real-time, analytical data store, to
ingest 150,000 events per second (billions per day), equating to about 500MB/s
of data at peak (terabytes per hour) while still maintaining real-time,
exploratory querying capabilities. Before and after the presentation, we had
some interesting chats with conference attendees. One common theme from those
discussions was curiosity around the definition of “real-time” in the real
world and how Netflix could possibly achieve it at those volumes. This post is
a summary of the learnings from those conversations and a response to some of
those questions.

Big Data reflects today’s world, where data-generating events are measured in
the billions and business decisions based on insights derived from this data
are measured in seconds. Few tools provide deep insight into both live and
stationary data as business events are occurring; Druid was designed
specifically to serve this purpose.

In April 2011,
we introduced Druid, our distributed, real-time data store. Today I am
extremely proud to announce that we are releasing the Druid data store to the
community as an open source project. To mark this special occasion, I wanted to
recap why we built Druid, and why we believe there is broader utility for Druid
beyond Metamarkets' analytical SaaS offering.

The Metamarkets solution allows for arbitrary exploration of massive data sets. Powered by Druid, our in-house distributed data store and processor, users can filter time series and top list queries based on Boolean expressions of dimension values. Given that some of our dataset dimensions contain millions of unique values, the subset of things that may match a particular filter expression may be quite large. To design for these challenges, we needed a fast and accurate (not a fast and approximate) solution, and we once again found ourselves buried under a stack of papers, looking for an answer.
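
As a rough illustration of the idea (not Druid's actual implementation, which relies on compressed bitmaps), picture an inverted index that maps each dimension value to the set of row IDs containing it; a Boolean filter expression then reduces to set operations over those row IDs.

```python
# A toy inverted index: each (dimension, value) pair maps to the set of row
# IDs containing it. Druid stores these as compressed bitmaps; plain Python
# sets are used here only to illustrate the idea.
rows = [
    {'id': 0, 'gender': 'male',   'country': 'US'},
    {'id': 1, 'gender': 'female', 'country': 'US'},
    {'id': 2, 'gender': 'female', 'country': 'CA'},
    {'id': 3, 'gender': 'male',   'country': 'CA'},
]

index = {}
for row in rows:
    for dim in ('gender', 'country'):
        index.setdefault((dim, row[dim]), set()).add(row['id'])

# The filter (gender = 'female' AND country = 'US') becomes an intersection
# of the matching row-ID sets; OR becomes a union, NOT a set difference.
matching = index[('gender', 'female')] & index[('country', 'US')]
print(matching)  # {1}
```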

The nascent era of big data brings new challenges, which in turn require new
tools and algorithms. At Metamarkets, one such challenge focuses on cardinality
estimation: efficiently determining the number of distinct elements within a
dimension of a large-scale data set. Cardinality estimations have a wide range
of applications from monitoring network traffic to data mining. If leveraged
correctly, these algorithms can also be used to provide insights into user
engagement and growth, via metrics such as “daily active users.”
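
To give a flavor of this family of algorithms, here is a toy HyperLogLog-style estimator in Python. It is a simplified sketch for illustration, not the code Druid runs, and the register count and correction constants are standard textbook choices.

```python
import hashlib
import math

class ToyHyperLogLog:
    """A simplified HyperLogLog-style cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                 # 2**p registers; more registers -> lower error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        h = int(hashlib.sha1(str(value).encode('utf-8')).hexdigest(), 16)
        bucket = h & (self.m - 1)  # low p bits choose a register
        rest = h >> self.p
        # Rank = position of the lowest set bit in the remaining hash bits.
        rank = (rest & -rest).bit_length() if rest else 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        harmonic = sum(2.0 ** -r for r in self.registers)
        raw = alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

# Estimate "daily active users" from a stream of repeated user IDs.
hll = ToyHyperLogLog()
for i in range(100000):
    hll.add('user-%d' % (i % 20000))  # 20,000 distinct users
print(hll.estimate())                 # roughly 20,000, within a few percent
```

The appeal of sketches like this is that the memory footprint stays fixed (here, 4,096 small registers) no matter how many events flow through, which is what makes metrics like daily active users cheap to maintain at scale.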

In a previous blog
post we introduced the
distributed indexing and query processing infrastructure we call Druid. In that
post, we characterized the performance and scaling challenges that motivated us
to build this system in the first place. Here, we discuss three design
principles underpinning its architecture.

Here at Metamarkets we have developed a web-based analytics console that
supports drill-downs and roll-ups of high dimensional data sets – comprising
billions of events – in real-time. This is the first of two blog posts
introducing Druid, the data store that powers our console. Over the last twelve
months, we tried and failed to achieve scale and speed with relational databases
(Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase). So instead we did
something crazy: we rolled our own database. Druid is the distributed, in-memory
OLAP data store that resulted.