Apache Arrow aims to accelerate analytical workloads

Arrow is designed to serve as a common data representation for big data processing and storage systems, allowing data to be shared between systems and processes without the CPU overhead caused by serialization, deserialization or memory copies.

The lead developers of 13 major open source projects, along with the Project Management Committee (PMC) chairs of seven Apache Software Foundation projects, came together today to announce a new top-level Apache project that promises to dramatically improve the performance of analytical workloads.

While most new Apache efforts spend several years gaining steam as Apache Incubator projects before they become fully fledged projects, the new Apache Arrow is hitting the ground running as a top-level project from the beginning.

"It's because of the people involved," says Jacques Nadeau, CTO of startup Dremio (still in stealth), vice president of the Apache Drill project and now vice president of Apache Arrow. "Because of the support behind it and the people involved with it, I look at it as an opportunity to establish the next phase of heterogeneous data infrastructure."

"We expect that within a few years, the majority of all the world's data will move through the Arrow representation," he adds.

The roots of Arrow

Initially seeded by code from Apache Drill, a schema-free SQL query engine for large-scale datasets, Arrow is a high-performance cross-system data layer for columnar in-memory analytics. Nadeau says it will speed up both big data processing systems and big data storage systems by 10x to 100x by providing a common internal representation of data.
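To picture why a columnar layout helps analytics, here is a minimal pure-Python sketch (not Arrow's actual API) contrasting row-oriented records with columnar buffers. The field names are made up for illustration:

```python
from array import array

# Row-oriented: each record is a separate object; aggregating one
# field touches every record and scatters reads across the heap.
rows = [{"id": i, "price": float(i) * 1.5} for i in range(1000)]
row_total = sum(r["price"] for r in rows)

# Columnar: each field lives in one contiguous, typed buffer, so an
# aggregation scans a single cache-friendly array. Arrow standardizes
# a layout like this so different engines can operate on it directly.
ids = array("q", range(1000))                               # int64 column
prices = array("d", (float(i) * 1.5 for i in range(1000)))  # float64 column
col_total = sum(prices)

assert row_total == col_total
```

The speedups Nadeau cites come from exactly this kind of contiguous, vectorizable access pattern, shared across systems rather than reimplemented in each one.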

In many workloads, 70 percent to 80 percent of CPU cycles are spent serializing and deserializing data as it moves between systems and processes that each have their own custom data representations. With Arrow as the common representation, data can be shared between systems and processes with no serialization, deserialization or memory copies.
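One way to picture zero-copy sharing in plain Python (a sketch of the idea, not Arrow's implementation): two consumers read the same typed buffer through `memoryview`, with no serialization step and no copy between them.

```python
from array import array

# A producer writes one typed, contiguous buffer (an int64 "column").
column = array("q", [10, 20, 30, 40])

# Two consumers view the SAME memory through zero-copy memoryviews;
# nothing is serialized, deserialized, or copied between processes
# of this toy program.
view_a = memoryview(column)
view_b = memoryview(column)

total = sum(view_a)     # consumer A aggregates in place
column[0] = 99          # the producer mutates the buffer...
assert view_b[0] == 99  # ...and consumer B sees it immediately
```

With a common representation like Arrow's, the same principle applies across process boundaries via shared memory: each system reads the buffers where they sit instead of converting them into its own custom format.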

Behind Arrow's popularity

This gets at why Arrow is receiving such wide support from the very beginning, not just from some of the most prominent Apache committers and PMC members, with developers from projects such as Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu, Parquet, Phoenix, Spark, Storm, Pandas and Ibis, but also from vendors including Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce and Twitter. As a shared foundation for SQL execution engines, data analysis systems, streaming and queueing systems and storage systems, Nadeau says Arrow will give the various projects in those areas much faster performance and better interoperability.

In addition to traditional relational data, Arrow supports complex data and dynamic schemas. It can handle the JSON data commonly used in Internet of Things (IoT) workloads, modern applications and log files, and implementations are already available (or underway) for programming languages including Java, C++ and Python. Nadeau says implementations for R and JavaScript should arrive by the end of the year, and that Drill, Ibis, Impala, Kudu, Parquet and Spark will all adopt Arrow in the same timeframe. Additional projects are also expected to adopt Arrow by then.
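Arrow's support for complex data rests on representing variable-length and nested values with flat buffers plus offsets. A simplified pure-Python sketch of that idea for a list-of-integers column (no validity bitmap, names invented for illustration):

```python
from array import array

# A "list<int64>" column holding the JSON rows [[1, 2], [], [3, 4, 5]].
# Arrow-style layout: one flat values buffer plus an offsets buffer
# marking where each row's list starts and ends.
values = array("q", [1, 2, 3, 4, 5])   # all list elements, flattened
offsets = array("i", [0, 2, 2, 5])     # row i spans offsets[i]:offsets[i+1]

def row(i):
    """Reconstruct row i's list from the flat buffers."""
    return list(values[offsets[i]:offsets[i + 1]])

assert row(0) == [1, 2]
assert row(1) == []
assert row(2) == [3, 4, 5]
```

Because the nested structure is encoded in contiguous typed buffers rather than pointer-chasing objects, JSON-like data gets the same columnar scan performance as flat relational columns.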

"Real-world use cases often include complex combinations of structured and rapidly growing complex data," says Parth Chandra, Apache Drill PMC and Apache Arrow PMC. "Already tested with Apache Drill, the efficient in-memory columnar representation and processing in Arrow will enable users to enjoy the performance of columnar processing with the flexibility of JSON."

Nadeau expects the first formal release of Arrow to come within a few months.