Blog

Databases exist to store data and answer questions about that data. The way we ask questions has to change based on the database itself – SQL changes between different RDBMS vendors. Things get even stranger when you venture outside the RDBMS world and start storing your data in Hive. Even though Hive Query Language (HiveQL) looks like SQL, there are some subtle but important differences to querying that first timers will want to keep in mind.

It All Starts With A Question

Everything starts with a business requirement. In this case the requirement was simple: “We want to see a rolling average of price ranges over the last 6 months.”

Depending on which version of SQL Server you’re using, this is a pretty easy query to build; it can be accomplished using a CTE or a sliding window function. Since a CTE is the simpler example, we’ll start using it for our sample query.

As a user of the database, we can safely assume that the database server is going to compute an average for the six months prior to x.date. SQL Server’s optimizer is going to do a lot of work behind the scenes to make sure this query gets executed efficiently. The upside of using a database with a complex query optimizer is that the optimizer does the hard work for us – it determines the best way to deliver the data and we only have to worry about the questions we want answered. Things don’t work quite the same way with Hive.

Rolling Averages with Big Data

Hive can be queried using the Hive Query Language (HiveQL). While based on the SQL-92 standard, HiveQL doesn’t conform to the standard; even where Hive supports SQL functionality, there are gotchas to keep in mind. The first of these is language support: Hive doesn’t support CTEs. Admittedly, CTEs aren’t a part of SQL-92 but developers will find many language features missing or slightly different. Back to the topic at hand: it’s possible to write this query without using a CTE. The first step is transforming the body of the CTE into a sub-select and joining to the derived table:

Unfortunately, this won’t fly in Hive – only equality conditions are supported on a join. There’s a valid reason for this – it’s very difficult to translate this type of join into a map/reduce job behind the scenes in Hive. Hive 0.10, just released last week, has support for CROSS JOIN, but this version of Hive isn’t a viable option since it is still under heavy development.

Although not quite as simple, it’s still possible to make this query work so we can produce rolling averages:

Tuning the Rolling Average

This query isn’t going to win any awards for performance. In fact, that rolling average query is a good candidate to win an award for terrible performance. Although Hive will perform some optimizations to the query we provided, the re-write process is nothing like what happens with full fledged cost-based optimizer. This query will produce a large amount of intermediate data before results are delivered and moving that data around the cluster takes a lot of time. What if there were a way to make this execute faster?

Let’s re-write this query so instead of computing a rolling average, we only compute a six month average based on today’s date. The first step is to grab just the data for today. Experienced SQL Server developers would say “Ah ha, I can do this using some kind of date math function in conjunction with CAST and GETDATE().” Hive has its own function names that accomplish the same things: date_sub, cast, and unix_timestamp.

What we’re trying to accomplish here is tell Hive, “Hey, I want to see all rows where date is greater than the current time (in seconds since 1970), converted to a TIMESTAMP, minus 180 days.” This doesn’t work, but it’s not by lack of syntactical correctness. There’s a bug in how Hive handles converting Unix time to a TIMESTAMP (HIVE-3454). The documenation and spec says that during a CAST, Hive should treat an INTEGER as seconds since epoch and other numeric types as milliseconds since epoch.

What HIVE-3454 means is that our query will give us an average of data since some time in early January, 1970. While interesting, that’s not what we’re looking for. To make sure that we get the right time, the best approach is to perform date comparison using seconds instead of time functions. This next query almost works (it finds anything less than 180 days from right now):

Since we really want to get an average based on the previous 180 days, we tell Hive to round to the current time to the nearest day from_unixtime(unix_timestamp(),'yyyy-MM-dd'), subtract 180 days datesub(value, 180) and then treat that as a temporal value unix_timestamp(value, 'yyyy-MM-dd'):

SELECT *
FROM table t
WHERE created > unix_timestamp(date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),180),'yyyy-MM-dd') ;

What’s The Best Solution?

Through developing these queries, I found that the best approach to computing a rolling average in Hive is to approach the problem in chunks. It’s necessary to be more explicit in the queries presented to Hive. While Hive doesn’t have a sophisticated query optimizer, Hive does give developers the ability to run queries across many servers and perform a massive amount of data processing very quickly. Although we have a lot of power at our disposal, we have to make a bit of extra effort when developing queries to ensure that they perform well.