Ibis on Impala: Python at Scale for Data Science

This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale.

Across the user community, you will find general agreement that the Apache Hadoop stack has progressed dramatically in just the past few years. For example, Search and Impala have moved Hadoop beyond batch processing, while developers are seeing significant productivity gains and additional use cases by transitioning from MapReduce to Apache Spark.

Thanks to such advances in the ecosystem, Hadoop has evolved into a robust and powerful open source data analysis stack. A centerpiece of that stack is Impala, the MPP query engine that is still the only open source option for a truly interactive, BI-style experience (an analytic database, if you will) on Hadoop. For business analysts in particular, who are the rank-and-file of big data consumers, the Hadoop experience is becoming all but indistinguishable from that of traditional data infrastructure but with unprecedented scale, flexibility, and cost-effectiveness under the covers.

The Rise of Python for Data Science

While SQL and BI remain the core capability of most analytics environments, advanced statistical analysis (now often known as data science) is becoming an increasingly popular tool in the expanding data analysis toolbox. Among other things, data science is characterized by workflows more complex than traditional SQL generally supports. In the data science world, Python has emerged as the most popular choice for expressing such workflows, as well as for programmatic data preparation (via the pandas library), because of its power, elegance, and robust libraries and third-party integrations.

Although Python is a de facto standard language for modern data engineering and data science, Python development has largely been confined to local, single-machine data processing, which limits its users to smaller data sets. Historically, to address bigger data workloads, Python developers have had to extract samples or aggregates, forcing compromises in data fidelity, adding ETL costs, and ultimately reducing both productivity and the range of addressable use cases.
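To make the fidelity compromise concrete, here is a minimal sketch using only the Python standard library. The synthetic event list, the helper names (make_events, sample_events), and the 1% sampling fraction are all illustrative assumptions, not taken from any real workload. It compares a count over the full data with the same count over an extracted sample.

```python
import random
from collections import Counter


def make_events(n, rare_every=50_000):
    # Hypothetical synthetic data: every `rare_every`-th event is "rare".
    return ["rare" if i % rare_every == 0 else "common" for i in range(n)]


def sample_events(events, fraction, seed=0):
    # Extract a fixed-size random sample, as one would before local analysis.
    rng = random.Random(seed)
    k = int(len(events) * fraction)
    return rng.sample(events, k)


events = make_events(200_000)

# Full-fidelity count over every event.
full_counts = Counter(events)

# Count over a 1% sample; scaling it back up multiplies by 100.
sample_counts = Counter(sample_events(events, 0.01))
estimated_rare = sample_counts["rare"] * 100

print("rare events in full data:", full_counts["rare"])
print("rare events estimated from sample:", estimated_rare)
```

Because the scaled-up estimate can only be a multiple of 100, it cannot match the true count of 4 rare events in this toy data set, whichever events the sample happens to include. Rare categories and tail behavior are exactly what sampling tends to lose.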

To plug that gap, today we are excited to announce a new open source project, called Ibis, that will deliver the great Python experience and ecosystem at any data volume and cluster size.

Ibis: Same Great Python Ecosystem at Hadoop Scale

Co-founded by the creators of pandas and Impala and now incubating in Cloudera Labs, Ibis is a new data analysis framework with the goal of enabling advanced data analysis on a 100% Python stack with full-fidelity data. With Ibis, for the first time, developers and data scientists will be able to use the last 15 years of advances in high-performance Python tools and infrastructure in a Hadoop-scale environment, without trading away user experience for performance. It’s exactly the same Python you know and love, only at scale!

In this initial (unsupported) Cloudera Labs release, Ibis offers comprehensive support for the analytical capabilities presently provided by Impala, enabling Python users to run big data workloads in a manner similar to that of “small data” tools like pandas. Next, we’ll extend Impala and Ibis in several ways to make the Python ecosystem a seamless part of the stack:

First, Ibis will enable more natural data modeling by leveraging Impala’s upcoming support for nested types (expected by the end of 2015).

Second, we’ll add support for Python user-defined logic so that Ibis will integrate with the existing Python data ecosystem—enabling custom Python functions at scale.

Finally, we’ll improve performance further through low-level integrations between Ibis and Impala, using a new Python-friendly, in-memory columnar format and Python-to-LLVM code generation. These updates are intended to let Python code run at native hardware speed.
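For readers less familiar with the “small data” style referred to above, the following is a minimal pandas sketch of a filter-then-aggregate workflow on a tiny in-memory table. It uses plain pandas only and does not show the Ibis API, which this post does not describe; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical in-memory table standing in for a much larger data set.
df = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west", "east"],
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Filter rows, group by a key, and compute per-group aggregates,
# expressed as a chain of method calls rather than a SQL string.
summary = (
    df[df["amount"] > 15]
    .groupby("region")["amount"]
    .agg(["count", "mean"])
    .reset_index()
)

print(summary)
```

The appeal of this style is that each step is an ordinary Python expression that can be composed, reused, and inspected interactively. The stated goal of Ibis is to keep that kind of experience while the data itself lives in Hadoop.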

Meeting of Two Great Communities

Ibis is also a significant milestone because, when fully realized, it will make Python a first-class citizen in the Hadoop ecosystem. We look forward to working with the Python open source data community to build an active contributor and user ecosystem around Ibis. By bringing these world-class development communities together, we can accomplish much more than either community could alone.

Summary

In summary, Ibis aims to provide:

An uncompromising Python experience for 100% Python end-to-end workflows, with full access to the ecosystem of Python tools and development extensibility

An interactive experience on a scalable architecture for full-fidelity analysis of big data at native hardware speeds

The Cloudera Labs preview of Ibis is available for installation today in the form of a standard Python package. (A separate post explains how to get started and how to contribute.) Although the Ibis vision is not yet fully realized in this early release, we’re confident that it will give you adequate insight into what Ibis will become over time. We look forward to bringing you more news about its progress and are excited to hear your feedback.

Marcel Kornacker is Chief Architect for Database Technology at Cloudera, and the creator of Impala.

Wes McKinney is a Software Engineer at Cloudera. He is the creator of Python’s ubiquitous pandas library and the author of the O’Reilly Media best-seller Python for Data Analysis. Previously, Wes was the founder and CEO of DataPad.