Project Panthera: Better Analytics with SQL, MapReduce, and HBase

Over the last several years, we have been working closely with our users and customers on their next-generation data analytics platforms built on Hadoop and HBase. The Hadoop stack has laid a solid foundation for these systems, but building a flexible and efficient analytics platform still requires many new capabilities. "Project Panthera" is our open source effort to enable these new analytics capabilities on Hadoop/HBase. Yesterday I was at the Bay Area HUG (Hadoop User Group) Meetup, presenting our current work to over 300 Hadoopers, and I had some very interesting discussions with the people there. Please see my talk abstract and slide deck below. You may also want to check our GitHub repository (https://github.com/intel-hadoop/project-panthera) for more details.

Abstract

Project Panthera is an open source effort that showcases better data analytics capabilities on Hadoop/HBase (e.g., better integration with existing infrastructure using SQL, better query processing on HBase, and efficient use of new hardware platform technologies). In this talk, we will discuss two new capabilities that we are currently working on under Project Panthera: (1) a SQL Engine for MapReduce (built on top of Hive) that supports common SQL constructs used in analytic queries, including some important features (e.g., subqueries in WHERE clauses and multiple-table SELECT statements) that are not supported in Hive today; (2) a Document-Oriented Store on HBase for better Hive/SQL query processing, which brings up to a 3x reduction in table storage and up to a 1.8x speedup in query processing.
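
To make the two SQL constructs named in the abstract concrete, here is a minimal sketch of what a subquery in a WHERE clause and a multiple-table SELECT look like. It runs against an in-memory SQLite database through Python's standard sqlite3 module, purely as a stand-in: it does not use Panthera, Hive, or MapReduce, and the customers and orders tables, their columns, and the run_examples function are hypothetical names invented for illustration.

```python
import sqlite3


def run_examples():
    """Run the two query shapes on a tiny, made-up dataset."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical schema and rows, used only for this illustration.
    cur.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'west'), (2, 'east'), (3, 'west');
        INSERT INTO orders VALUES
            (10, 1, 50.0), (11, 2, 75.0), (12, 3, 20.0), (13, 1, 5.0);
    """)

    # Construct 1: a subquery inside the WHERE clause.
    subquery_rows = cur.execute("""
        SELECT id, amount
        FROM orders
        WHERE customer_id IN (SELECT id FROM customers WHERE region = 'west')
        ORDER BY id
    """).fetchall()

    # Construct 2: a multiple-table SELECT, with both tables listed in FROM
    # and the join condition expressed in WHERE.
    multi_table_rows = cur.execute("""
        SELECT o.id, o.amount
        FROM orders o, customers c
        WHERE o.customer_id = c.id AND c.region = 'west'
        ORDER BY o.id
    """).fetchall()

    conn.close()
    return subquery_rows, multi_table_rows


if __name__ == "__main__":
    sub, multi = run_examples()
    print(sub)
    print(multi)
```

On this sample data both queries select the same orders, those placed by customers in the 'west' region. Queries of these shapes are common in analytic workloads, which is why the abstract singles them out.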