What’s Next for Impala After Release 1.1

In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.

But first, a thank-you: Since the initial beta release, we’ve received a tremendous amount of feedback and validation about Impala — impressive in quality as well as quantity. To date, at least one person in each of approximately 4,500 unique organizations around the world has downloaded the Impala binary. And even after only a few months of GA, we’ve seen Cloudera Enterprise customers from multiple industries deploy Impala 1.x in business-critical environments with support via a Cloudera RTQ (Real-Time Query) subscription — including leading organizations in insurance, banking, retail, healthcare, gaming, government, telecom, and advertising.

Furthermore, based on the reaction from other vendors in the data management space, few observers would dispute the notion that Impala has made low-latency, interactive SQL queries for Hadoop as important a customer requirement as the high-latency, batch-oriented SQL queries enabled by Apache Hive. That’s a great development for Hadoop users everywhere!

What Was Delivered in Impala 1.0/1.1

Let’s begin with a report card on the previously published Impala 1.0/1.1 roadmap. Here’s the feature list, grouped by delivery status:

Furthermore, thanks to the addition of the Apache Sentry module (incubating), Impala 1.1 and later now also provide granular, role-based authorization, ensuring that the right users and applications have access to the right data. (With the recent contribution of Sentry to the Apache Incubator and of HiveServer2 to Hive by Cloudera, Hive 0.11 and later have that functionality, as well.)

A lot of work was done, but there is still plenty of work to do. Now, on to the Impala 2.0 wave.

Near-Term Roadmap

The following new Impala functionality will be released incrementally across near-term future releases, starting with Impala 1.2 in late 2013 and ending with Impala 2.0 in the first third of 2014. In addition, you’ll see more performance gains and SQL functionality enhancements in each release – with the goal of expanding Impala’s performance lead over the alternative SQL-on-Hadoop approaches of legacy relational database vendors as well as Hadoop distro vendors.

Please note that, as is always the case with roadmaps, timelines and features are subject to change. What you see below captures our current plan of record, however.

Impala 1.2

UDFs and extensibility – enables users to add their own custom functionality; Impala will support existing Hive Java UDFs as well as high-performance native UDFs and UDAFs
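As a rough sketch of how this might look, an existing Hive Java UDF could be registered and called from Impala along these lines (the exact syntax is subject to change since this feature is still in development; the jar path, class name, and table here are purely illustrative):

```sql
-- Hypothetical example: register an existing Hive Java UDF with Impala.
-- The jar location and Java class are illustrative placeholders.
CREATE FUNCTION to_upper(STRING) RETURNS STRING
LOCATION '/user/hive/udfs/my-udfs.jar'
SYMBOL='com.example.udf.ToUpper';

-- Once registered, the UDF is callable like any built-in function:
SELECT to_upper(name) FROM customers;
```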

Automatic metadata refresh – enables new tables and data to seamlessly become available for Impala queries as they are added, without having to issue a manual refresh on each Impala node
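For context, this is the manual step that automatic refresh is intended to eliminate. In Impala 1.1, after adding data or tables outside of Impala (e.g., via Hive), you must run something like the following on each node (table name illustrative):

```sql
-- Impala 1.1 behavior: manual metadata maintenance after out-of-band changes.
REFRESH sales;        -- pick up new data files added to an existing table
INVALIDATE METADATA;  -- pick up tables newly created outside of Impala
```

With automatic metadata refresh, these manual commands should no longer be necessary.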

Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
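A minimal sketch of what this could enable, assuming an HBase-backed table already mapped for Impala (for example, via a Hive `CREATE EXTERNAL TABLE` using the HBase storage handler; the table and values below are hypothetical):

```sql
-- Illustrative: insert a row into an HBase-backed table from Impala.
INSERT INTO hbase_users VALUES ('row1', 'Alice', 'alice@example.com');

-- In HBase terms, an "update" is an insert that reuses an existing row key:
INSERT INTO hbase_users VALUES ('row1', 'Alice', 'alice@new-example.com');
```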

External joins using disk – enables joins to spill to disk when the tables being joined are larger than the aggregate memory available

Subqueries inside WHERE clauses
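To illustrate the kind of query this would allow (table and column names are hypothetical):

```sql
-- Illustrative: a WHERE-clause subquery of the kind planned for Impala.
-- Find customers who have placed at least one large order.
SELECT c.name
FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders WHERE total > 1000);
```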

As we learn more about customer and partner requirements, this list will expand.

Conclusion

As you can see, Impala has evolved considerably since its beta release, and it will continue to evolve as we gather more feedback from users, customers, and partners.

Ultimately, we believe that Impala has already enabled our overall goal of allowing users to store all their data in native Hadoop file formats, and simultaneously run all batch, machine learning, interactive SQL/BI, math, search, and other workloads on that data in place. From here, it’s just a matter of continuing to build upon that very solid foundation with richer functionality and improved performance.

2 responses on “What’s Next for Impala After Release 1.1”

Is Impala the right fit for creating data cubes at this point in time? If not, how long do we need to wait for that? I am looking at HBase or Impala as options to serve as my data cube solution while migrating our current BI/DW to Hadoop. Please suggest.

To me, “data cube” implies a pre-defined schema. This is not necessary with Impala/Hadoop, which takes a “schema-on-read” approach (i.e. defined at query time) — you don’t need to worry about creating/updating data structures at all.
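To make that concrete, here is a minimal sketch of the schema-on-read approach (file layout and column names are assumptions for illustration): you point an external table at files already sitting in HDFS, and the schema is applied only when the data is read.

```sql
-- Hypothetical: define a table over existing delimited files in HDFS.
-- No data is moved or transformed; the schema applies at query time.
CREATE EXTERNAL TABLE events (ts STRING, user_id INT, action STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/events';

-- Query immediately, aggregating as a cube-style rollup would:
SELECT action, COUNT(*) FROM events GROUP BY action;
```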