RDruid and Twitterstream

by Igal Levy · February 3, 2014

What if you could combine a statistical analysis language with the power of an analytics database for instant insights into realtime data? You'd be able to draw conclusions from analyzing data streams at the speed of now. That's what combining the analytical prowess of a Druid database with the statistical power of R can do.

In this blog, we'll look at how to bring streamed realtime data into R using nothing more than a laptop, an Internet connection, and open-source applications. And we'll do it with only one Druid node.

What You'll Need

Get the R application for your platform.
We also recommend RStudio as the R IDE; it's what we used for this walkthrough.

You'll also need a free Twitter account to be able to get a sample of streamed Twitter data.

Set Up the Twitterstream

First, register with the Twitter API. Log in at the Twitter developer's site (you can use your normal Twitter credentials) and fill out the form for creating an application; use any website and callback URL to complete the form.

Make note of the API credentials that are then generated. Later you'll enter them when prompted by the Twitter-example startup script, or you can save them in a twitter4j.properties file (nicer if you ever restart the server). If you use a properties file, save it under $DRUID_HOME/examples/twitter. The file should contain the standard twitter4j OAuth properties, with your real keys in place of the placeholders:
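oauth.consumerKey=<your consumer key>
oauth.consumerSecret=<your consumer secret>
oauth.accessToken=<your access token>
oauth.accessTokenSecret=<your access token secret>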

Start Up the Realtime Node

From the Druid home directory, start the Druid Realtime node:

$DRUID_HOME/run_example_server.sh

When prompted, you'll choose the "twitter" example. If you're using the properties file, the server should start right up. Otherwise, you'll have to answer the prompts with the credentials you obtained from Twitter.

After the Realtime node starts successfully, you should see "Connected_to_Twitter" printed, along with a steady stream of log messages as tweets are ingested.

Querying the Realtime Node

Druid queries are JSON objects, but RDruid lets you express them as plain R function calls. Let's look at this with a simple query that returns the time range of the Twitter data currently in our Druid node.
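Here's a minimal sketch of that query, assuming the Twitter example's default "twitterstream" datasource and a Realtime node listening on port 8083 (adjust both to match your setup):

library(RDruid)

# Ask the local Realtime node for the time boundaries of the ingested data
druid.query.timeBoundary(druid.url("localhost", port = 8083),
                         dataSource = "twitterstream")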

At the very end comes the response to our query: a minTime and maxTime, the boundaries of our data set.

More Complex Queries

Now let's look at some real Twitter data. Say we are interested in the number of tweets per language during that time period. We need to do an aggregation via a groupBy query (see the RDruid help in RStudio for the full argument list).
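Here's a sketch of such a query; the interval is a placeholder, so substitute the minTime and maxTime returned by the timeBoundary query above:

library(RDruid)
library(lubridate)

druid.query.groupBy(
  url = druid.url("localhost", port = 8083),
  dataSource = "twitterstream",
  # placeholder interval -- use the minTime/maxTime returned earlier
  intervals = interval(ymd("2014-02-03"), ymd("2014-02-04")),
  granularity = granularity("P1D"),
  aggregations = sum(metric("tweets")),
  dimensions = list("lang")
)

A few of the arguments deserve a closer look: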

granularity – This sets the time period for each aggregation (as an ISO 8601 period). Since all our data falls within one day and we don't care about breaking it down by hour or minute, we choose per-day granularity ("P1D").

aggregations – This is where we specify and name the metrics that we're interested in summing up. We want tweets, and it just so happens that this metric is named "tweets" as it's mapped from the Twitter API, so we'll keep that name as the column header for our output.

dimensions – Here's the actual type of data we're interested in. Tweets are identified by language in their metadata (using ISO 639 language codes). We use the name of the dimension, "lang," to slice the data by language.

This gives an idea of which languages dominate Twitter (at least for the given time range). For visualization, you can use a library like ggplot2; its geom_bar function quickly produces a basic bar chart of the data. First, assign the result of the query above to a data frame (let's call it tweet_langs in this example), then subset it to keep only languages with more than a thousand tweets.
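A sketch, assuming the groupBy result was assigned to tweet_langs and came back with columns named lang and tweets:

library(ggplot2)

# Keep only the languages with more than a thousand tweets
big_langs <- subset(tweet_langs, tweets > 1000)

# Basic bar chart of tweet counts per language
ggplot(big_langs, aes(x = lang, y = tweets)) +
  geom_bar(stat = "identity")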

Metrics and Dimensions

How do you find out what metrics and dimensions are available to query? The metrics are listed in $DRUID_HOME/examples/twitter/twitter_realtime.spec. The dimensions are not as apparent: there's an easy way to query for them from certain other types of Druid nodes, but not from a Realtime node, which leaves the less-appetizing approach of digging through code. To save you that digging and allow for further experimentation, we list some here:

"first_hashtag"

"user_time_zone"

"user_location"

"is_retweet"

"is_viral"

Some interesting analyses of current events could be done using these dimensions and metrics. For example, you could filter on specific hashtags for events that happen to be spiking at the time.
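Here's a sketch using RDruid's filter DSL; the hashtag, interval, and output dimension are placeholders chosen purely for illustration:

# Tweet counts per time zone for a single hashtag
druid.query.groupBy(
  url = druid.url("localhost", port = 8083),
  dataSource = "twitterstream",
  intervals = interval(ymd("2014-02-03"), ymd("2014-02-04")),
  granularity = granularity("P1D"),
  filter = dimension("first_hashtag") == "SuperBowl",
  aggregations = sum(metric("tweets")),
  dimensions = list("user_time_zone")
)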

The point to remember is that this data is streamed into Druid and brought into R via RDruid in realtime. With an R script, the data could be continuously queried, updated, and analyzed as it arrives.