Introduction to pydruid

by Igal Levy · April 15, 2014

We've already written about pairing R with RDruid, but Python has powerful and free open-source analysis tools too. Collectively, these are often referred to as the SciPy Stack. To pair SciPy's analytic power with the advantages of querying time-series data in Druid, we created the pydruid connector. This allows Python users to query Druid—and export the results to useful formats—in a way that makes sense to them.

Getting Started

pydruid should run with Python 2.x, and is known to run with Python 2.7.5.

Install pydruid in the same way as you'd install any other Python module on your system. The simplest way is:

pip install pydruid

You should also install Pandas to execute the simple examples below:

pip install pandas

When you import pydruid in your script, it will attempt to load Pandas as well.

NOTE: Due to the way the wikipedia example is set up, you may see only a small number of results.

Here's a simple topN query in Python:

from pydruid.client import *

query = PyDruid('http://localhost:8083', 'druid/v2/')

top_langs = query.topn(
    datasource="wikipedia",
    granularity="all",
    intervals="2013-06-01T00:00/2020-01-01T00",
    dimension="language",
    filter=Dimension("namespace") == "article",
    aggregations={"edit_count": longsum("count")},
    metric="edit_count",
    threshold=4)

print top_langs  # Do this if you want to see the raw JSON

Let's break this query down:

query – The query object is instantiated with the location of the Druid realtime node. query exposes various querying methods, including topn.

datasource – This identifies the datasource. If Druid were ingesting from more than one datasource, this ID would identify the one we want.

granularity – The rollup granularity, which could be set to a specific value such as minute or hour. We want to see the sum count across the entire interval, and so we choose all.

intervals – The interval of time we're interested in. The value given is extended beyond our actual endpoints to make sure we cover all of the data.

filter – Filters are used to specify a selector. In this case, we're selecting pages that have a namespace dimension with the value article (therefore excluding edits to Wikipedia pages that aren't articles).

aggregations – We're interested in obtaining the total count of edited pages, per the language dimension, and we map it to a type of aggregation available in pydruid (longsum). We also rename this count metric to edit_count.
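The keyword arguments above map closely onto Druid's native topN query spec. As a hedged sketch (the field names below come from the Druid query documentation, not from pydruid's internals), the JSON that a call like this builds looks roughly like:

```python
import json

# Approximate Druid topN JSON corresponding to the pydruid call above
topn_query = {
    "queryType": "topN",
    "dataSource": "wikipedia",
    "granularity": "all",
    "intervals": ["2013-06-01T00:00/2020-01-01T00"],
    "dimension": "language",
    "filter": {"type": "selector", "dimension": "namespace", "value": "article"},
    "aggregations": [{"type": "longSum", "name": "edit_count", "fieldName": "count"}],
    "metric": "edit_count",
    "threshold": 4,
}

print(json.dumps(topn_query, indent=2))
```

Seeing the underlying JSON can be handy when debugging, since it is what actually gets POSTed to the Druid node.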

Bringing the Data Into Pandas

Now that Druid is returning data, we'll pass that data to a Pandas dataframe, which allows us to analyze and visualize it:

from pydruid.client import *
from pylab import plt  # Need to have matplotlib installed

query = PyDruid('http://localhost:8083', 'druid/v2/')

top_langs = query.topn(
    datasource="wikipedia",
    granularity="all",
    intervals="2013-06-01T00:00/2020-01-01T00",
    dimension="language",
    filter=Dimension("namespace") == "article",
    aggregations={"edit_count": longsum("count")},
    metric="edit_count",
    threshold=4)

print top_langs  # Do this if you want to see the raw JSON

df = query.export_pandas()         # The client imports Pandas; no need to do so separately
df = df.drop('timestamp', axis=1)  # We don't need the timestamp column here
df.index = range(1, len(df) + 1)   # Get a naturally numbered index
print df

df.plot(x='language', kind='bar')
plt.show()

Printing the results gives:

   edit_count language
1         834       en
2         256       de
3         185       fr
4          38       ja
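Once the results are in a dataframe, ordinary Pandas operations apply. For example, here is a hedged sketch that recreates the sample table above (using the printed numbers, not live data) and adds each language's share of the total edit count:

```python
import pandas as pd

# Recreate the sample results printed above
df = pd.DataFrame({
    "edit_count": [834, 256, 185, 38],
    "language": ["en", "de", "fr", "ja"],
}, index=range(1, 5))

# Each language's share of all edits, as a percentage
df["share_pct"] = 100.0 * df["edit_count"] / df["edit_count"].sum()

print(df)
```

In this sample, English accounts for just under two-thirds of all edits.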

The bar graph will look something like this: [bar chart of edit_count by language]

If you were to repeat the query, you should see larger numbers under edit_count, since the Druid realtime node is continuing to ingest data from Wikipedia.

Conclusions

In this blog post, we showed how you can run ad-hoc queries against a data set that is being streamed into Druid. While this is only a small example of pydruid and the power of Python, it demonstrates the benefits of pairing Druid's ability to make data available in real time with SciPy's powerful analytics tools.