Apache Arrow is an open-source standard for columnar in-memory execution that originated from Apache Drill’s own in-memory columnar data structures. It aims to become the de-facto way of efficiently holding data in-memory and exchange data between different execution engines, thereby avoiding serialisation. Apache Arrow is backed by key developers of 13 open source projects, mainly from Apache, including Calcite, Drill, Pandas, HBase, Spark, and Storm.

InfoQ interviewed Le Dem to find out the differences between Arrow and Parquet, a columnar on-disk storage format, and how both can enable a more efficient computation across execution engines.

InfoQ: Do you believe Apache Arrow, like Parquet, will be commoditized across execution engines like Apache Spark? Do you think it will narrow the performance difference between engines?

Le Dem: As pioneered by MonetDB, vectorized execution is the state of the art for efficient query processing. Many open source query engines are moving to this model, and we believe it makes sense to standardize the in-memory columnar representation to provide extremely efficient interoperability. What Parquet did for columnar storage, Arrow provides for columnar in-memory processing and interchange.

These standardization efforts greatly simplify integration between storage layers, query engines, DSLs and UDFs, and provide a much more efficient communication layer by removing the need for serialization. Standardization makes it easier, cheaper, and faster for all systems to interoperate by removing common bottlenecks. However, there is still plenty of room for each execution engine to innovate by providing specialized techniques that further improve performance, such as operating on compressed vectors or having a smarter query optimizer.

Le Dem: The tradeoffs between reading data from disk and from memory are different. Currently it is up to the engines to implement predicate push down. Eventually Apache Arrow will provide fast vectorized operations that can be reused across engines, although this effort has not yet started.

InfoQ: One of the Arrow's goals is to provide constant time access to data in-memory and enable vectorized operations with SIMD instructions being executed . Does Arrow also provide in-memory data compression like Parquet?

Le Dem: Arrow supports dictionary encoding that can provide significant compression and faster execution for operations such as aggregations and joins. There is also an ongoing discussion to provide generic buffer compression using general purpose algorithms such as snappy or gzip.

In this initial release, Arrow does not yet support other compression techniques such as bit packing. However, we intended for execution engines to be able to define custom vectors provided that standard vectors are used in interchange. This will allow more advanced techniques like operating on compressed vectors directly. An example that comes to mind is the BitWeaving project from University of Wisconsin. In the future the set of standard vectors will be expanded.

The first release of Arrow has native C++-backed integration between the Pandas library, Arrow, and Parquet, allowing Arrow's Record Batches to be manipulated as Pandas dataframes and exposed to SQL-on-Hadoop engines such as Apache Drill.

InfoQ: Apache Arrow supports interoperability allowing data to be transferred between processes without serialisation. Can you comment on the capabilities of the IPC layer of Arrow?

Le Dem: The IPC layer is still experimental and it is a true zero-copy layer. When an Arrow’s Record Batch is finalized it becomes immutable. In this state it can be shared with other processes in read-only mode using shared memory without worrying about concurrent access. The vector representation is independent from its memory address (no absolute pointer required) and is safe to use in shared memory where each process sees a different address for the buffers.

InfoQ: Like Parquet, Apache Arrow supports nested data types. What data types are currently supported and which are on the roadmap?

Le Dem: Arrow supports all the common data types, a pretty exhaustive list so far. Some of the more recently added types include SQL’s Timestamp and Interval types.

Apache Parquet is a on-disk columnar format while Apache Arrow is a in-memory data format. They serve different purposes: the first primary objective is to provide good compression and mechanisms that allow the software to open just the bits of the file that may be interesting. This is done with page header data that contains things like the min and max values of a certain column. Predicate pushdown operations leverage the information in the headers to decide if it should be scanned or skipped.

Arrow is a standard for in-memory columnar processing, that defines CPU-cache friendly data structures, interoperable between multiple engines (which avoids CPU waste on de/serialization). Arrow as a standard enables easier and more efficient data exchange between multiple engines (such as Apache Spark, and others). Many techniques used by engines like Apache Spark, Apache Drill, borrow concepts pioneered by MonetDB such as vectorization of operations.

InfoQ Weekly Newsletter

Join a community of over 250 K senior developers by signing up for our newsletter. If you are based in the EEA, please contact us so we can provide you with the protections afforded to you under EEA protection laws.

Is your profile up-to-date? Please take a moment to review and update.

Email Address

Note: If updating/changing your email, a validation request will be sent

Company name:

Keep current company name

Update Company name to:

Company role:

Keep current company role

Update company role to:

Company size:

Keep current company Size

Update company size to:

Country/Zone:

Keep current country/zone

Update country/zone to:

State/Province/Region:

Keep current state/province/region

Update state/province/region to:

Subscribe to our newsletter?

Subscribe to our architect newsletter?

Subscribe to our industry email notices?

By subscribing to this email, we may send you content based on your previous topic interests. See our privacy notice for details.

You will be sent an email to validate the new email address. This pop-up will close itself in a few moments.

We notice you're using an ad blocker

We understand why you use ad blockers. However to keep InfoQ free we need your support. InfoQ will not provide your data to third parties without individual opt-in consent. We only work with advertisers relevant to our readers. Please consider whitelisting us.