Hive + Splout SQL for a social media reporting webapp: A Big Data love story

(This is the second post in a series of three presenting Splout SQL 0.2.2's native integration with the main Hadoop processing tools: Cascading, Hive and Pig.)

In this post we’ll present an example Big Data use case: analyzing tweets and reporting consolidated, meaningful statistics to Twitter users through an interactive, low-latency webapp. For that we will marry Hive (an open-source warehousing solution that enables easy analysis and summarization of data in Hadoop) with Splout SQL, a high-performance, low-latency partitioned SQL database for Hadoop. We will build a very simple – yet scalable – analysis tool similar to The Tweet Archivist, and we will do it without writing any code. The tool will provide historical mentions, summarized hashtags and popular tweets for every account in the input dataset.

Twitter data

Twitter data is nowadays a representative, trendy example of Big Data. Various companies analyze it and make sense out of it: from social reputation sites like Klout or PeerIndex to social analytics sites like Topsy or Sprinklr. A few of them, like Gnip or DataSift, have access to the full firehose and resell the raw tweet stream.

Building tools that are able to analyze such an enormous dataset is a challenge. In this post we will use a scalable processing tool (Hive) and a scalable, low-latency serving database (Splout SQL), and we will see how well they integrate with each other.

Requirements

In order to follow the steps in this post you should:

Have Hadoop and Hive installed on your computer. We tested this with Hadoop CDH3 and Hive 0.10.0.

Have “hive-site.xml” properly configured and Hive’s conf/ directory in your CLASSPATH / HADOOP_CLASSPATH. Note that “hive-default.xml” is deprecated and may not work.

Have Splout SQL on your computer. To use Splout SQL you just need to download the latest distribution, unzip it and start both the QNode and DNode daemons. You can find more information in the “Getting started” section of the official webpage.
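The “getting started” steps look roughly like this. This is a sketch: the archive name, version and service-script invocation are assumptions based on the Splout SQL distribution layout, so check the official documentation for the exact commands:

```
# Unpack the downloaded distribution (file name/version assumed):
tar xvfz splout-distribution-0.2.2.tar.gz
cd splout-distribution-0.2.2

# Start the query node (QNode) and a data node (DNode):
bin/splout-service.sh qnode start
bin/splout-service.sh dnode start
```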

Preparing the input data (tweets)

We can download some example tweets from Twitter’s REST API using “curl“. We will use some popular accounts such as “BBCNews” or “Reuters”. We also need to put each tweet on its own line so that Hive can process each tweet independently. For that we will use “sed“.
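A rough sketch of this step follows. The curl endpoint shown in the comment is an assumption based on Twitter’s old v1 REST API (long since retired), so a tiny fake response is used here to show the sed step end-to-end:

```shell
# Download step (illustrative only -- the v1 endpoint no longer exists):
#   curl "https://api.twitter.com/1/statuses/user_timeline.json?screen_name=BBCNews&count=200" > bbcnews.json
# For illustration, fake a two-tweet JSON array response:
echo '[{"id":1,"text":"hello"},{"id":2,"text":"world"}]' > bbcnews.json

# Strip the surrounding brackets and split the array so that each line holds
# exactly one complete JSON tweet object (one record per line for Hive):
sed -e 's/^\[//' -e 's/\]$//' -e 's/},{/}\n{/g' bbcnews.json > tweets.json
cat tweets.json
```

Note that a real tweet’s text could itself contain the substring “},{“; a production pipeline would split with a JSON-aware tool rather than this naive pattern.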

JSON + Hive

There are various ways of using JSON data in Hive. The one we use here is the most flexible: we create the table with a single column (jsonData) and navigate the JSON as needed, depending on the query we want to execute. For that we use the native Hive function json_tuple. First of all, we add the needed contrib JAR to the classpath (substitute %HIVE_HOME% and %HIVE_VERSION% as needed):
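The original commands are not reproduced above, so here is a hedged sketch of what they look like inside the Hive CLI; the JAR path and the exact JSON fields extracted are assumptions:

```sql
-- Register the contrib JAR (substitute real paths for the placeholders):
ADD JAR %HIVE_HOME%/lib/hive-contrib-%HIVE_VERSION%.jar;

-- A single-column table: one raw JSON tweet per line.
CREATE TABLE tweets (jsonData STRING);
LOAD DATA LOCAL INPATH 'tweets.json' INTO TABLE tweets;

-- json_tuple is a UDTF, so it is used through LATERAL VIEW.
-- Example: pull the text and retweet_count out of each tweet:
SELECT t.text, t.retweet_count
FROM tweets
LATERAL VIEW json_tuple(tweets.jsonData, 'text', 'retweet_count') t
  AS text, retweet_count;
```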

Creating stats views with Hive

We will create three views: the mentions view, the retweets view and the hashtags view. The mentions view will contain every interaction between two Twitter users based on mentions in tweets. The retweets view will contain all tweets with a “retweet_count” > 0, and the hashtags view will contain the number of times each hashtag has been used by each user. Creating each of these views is as easy as executing a single Hive query, as we will see.
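The original queries are not reproduced here; as an illustration, the hashtags view could be expressed roughly as below. The column names and the hashtag-extraction strategy are assumptions – real tweet JSON carries a pre-parsed entities.hashtags array that would be more robust than splitting the text on spaces:

```sql
CREATE VIEW hashtags AS
SELECT usr, word, COUNT(*) AS cnt
FROM (
  SELECT get_json_object(jsonData, '$.user.screen_name') AS usr,
         get_json_object(jsonData, '$.text') AS text
  FROM tweets
) t
LATERAL VIEW explode(split(t.text, ' ')) w AS word
WHERE word LIKE '#%'
GROUP BY usr, word;
```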

The hashtags query simply produces a view with the user, the hashtag used and the total number of times it was used:

User          Word             Count
BarackObama   #WeDemandAVote   29
BarackObama   #JobsNotCuts     28

Deploying the views to Splout SQL

All this valuable information is useless if we cannot query it in sub-second time. We want to provide a per-user panel where each user can navigate through their historical information, and we want to do it in both a scalable and a flexible way. Rather than precomputing the panel, we will simply deploy the Hive tables to Splout SQL: a scalable, low-latency SQL database that will feed the agile frontend.

In order to deploy the views generated in Hive we will create a deployment descriptor, a JSON file called “hive_splout_example.json“. We can create this file in the local filesystem, in the Splout SQL installation directory:
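The descriptor itself is not reproduced above. As an illustration, its shape is roughly the following; the field names are assumptions recalled from the Splout SQL documentation and only one of the four tables is shown, so consult the docs for the exact schema:

```json
{
  "name": "hive_splout_example",
  "nPartitions": 2,
  "partitionedTables": [{
    "name": "hashtags",
    "partitionFields": "usr",
    "tableInputs": [{
      "inputType": "HIVE",
      "hiveDbName": "default",
      "hiveTableName": "hashtags"
    }]
  }]
}
```

The mentions table would appear twice in the full descriptor: once partitioned by the mentioning user and once by the mentioned user.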

In this JSON file we describe our Splout SQL tablespace, which is made up of four tables, all partitioned by user. Note that we import the same Hive table (mentions) twice: we are interested both in the mentions a certain user receives and in the mentions that user makes to others. Because the data is partitioned, being able to query both datasets for the same user requires importing the mentions table twice, partitioned by either the mentioner or the mentioned user.

To deploy the tablespace to Splout SQL we can use the following command line, executed from the Splout SQL installation directory:
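The command is not reproduced above; the invocation looks roughly like this two-step generate-then-deploy sequence. The subcommand names and flags are recalled from the Splout SQL documentation and may differ between versions:

```
# 1) Generate the tablespace binary files from the Hive tables:
hadoop jar splout-hadoop-*-hadoop.jar generate \
  -tf file://`pwd`/hive_splout_example.json -o out-hive-splout

# 2) Deploy the generated tablespace to the running QNode:
hadoop jar splout-hadoop-*-hadoop.jar deploy \
  -root out-hive-splout -ts hive_splout_example -q http://localhost:4412
```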

Querying / visualizing the results

Finally, we can go to Splout SQL’s administration panel to check that the tablespace has been deployed correctly and perform some test queries. As an example, we can obtain the hashtag fingerprint for the user “BarackObama”, as the picture illustrates:
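Splout SQL also exposes a REST query API, so the same fingerprint query could be issued from any frontend. This is a sketch: the port, URL shape and column names are assumptions based on the Splout SQL documentation:

```
# key routes the query to the right partition; sql is executed on that
# partition's underlying SQLite database.
curl "http://localhost:4412/api/query/hive_splout_example?key=BarackObama&sql=SELECT+word,cnt+FROM+hashtags+WHERE+usr='BarackObama'+ORDER+BY+cnt+DESC"
```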

But we could also do more complex things, such as displaying a timeline of mentions (since we have the date of every mention), aggregating by period (month, week, year, …) and so on.
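Since each Splout SQL partition is a plain SQLite database, such an aggregation is just a SQL query. A sketch of a monthly timeline, where the table and column names are assumptions and tweet_date must be stored in a format SQLite’s date functions understand:

```sql
-- Mentions received by a user, aggregated by month:
SELECT strftime('%Y-%m', tweet_date) AS month, COUNT(*) AS mentions
FROM mentions_by_mentioned
WHERE mentioned = 'BarackObama'
GROUP BY month
ORDER BY month;
```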

Conclusion

We managed to build a completely scalable Twitter summary tool without writing a single line of code! This was possible because of the integration between Hive and Splout SQL. Hive provides an easy, interactive interface to Hadoop that allows us to seamlessly populate “Big Data views”. Splout SQL, in turn, can import those views from Hive and index them so that users can query them with sub-second latency and under high load. Having SQL both in the backend and in the frontend allows us to be very flexible and agile. Don’t hesitate to try this example and give us feedback!

What version of Hadoop are you using? There is some incompatibility between Driver classes in Hadoop 1.0 and 2.0. Can you try with the latest Splout release, 0.2.3: http://search.maven.org/#browse%7C-1223220190 ? We solved that issue there. Use the MR1 build for Hadoop 1.0 (CDH3, 0.20.X, etc.) and the MR2 build for Hadoop 2.0.X, CDH4, etc.

I am using Splout release 0.2.3, but I still get this error:
Exception in thread “main” java.io.IOException: Error opening job jar: splout-distribution-0.2.3-hadoop.jar
	at org.apache.hadoop.util.RunJar.main(RunJar.java:135)
Caused by: java.io.FileNotFoundException: splout-distribution-0.2.3-hadoop.jar (No such file or directory)
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at java.util.jar.JarFile.<init>(Unknown Source)
	at java.util.jar.JarFile.<init>(Unknown Source)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:133)

How did you end up with a -distribution- JAR? The distribution is a .tar.gz that needs to be decompressed first. After decompressing it you will have everything, including the runnable JAR.
You can follow the getting started: http://sploutsql.com/gettingstarted.html

I think you are not in the right folder. You must be in the folder where Splout SQL is installed. In this folder there must be a file named splout-hadoop-0.2.3-SNAPSHOT-hadoop-mr1.jar or splout-hadoop-0.2.3-SNAPSHOT-hadoop-mr2.jar (depending on the Hadoop version you are using).