The Apache Arrow team is pleased to announce the 0.8.0 release. It is the
product of 10 weeks of development and includes 286 resolved JIRAs with
many new features and bug fixes to the various language implementations. This
is the largest release since 0.3.0 earlier this year.

As part of work toward stabilizing the Arrow format and making a 1.0.0
release sometime in 2018, we made a series of backwards-incompatible changes to
the serialized Arrow metadata that require Arrow readers and writers (0.7.1
and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We
expect such backwards-incompatible changes to be rare going forward.

We discuss some highlights from the release and other project news in this
post.

Projects “Powered By” Apache Arrow

A growing ecosystem of projects is using Arrow to solve in-memory analytics
and data interchange problems. We have added a new Powered By page to the
Arrow website where we can acknowledge open source projects and companies which
are using Arrow. If you would like to add your project to the list as an Arrow
user, please let us know.

Decimal support in C++, Python, consistency with Java

Phillip Cloud led efforts this release to harden details about exact
decimal values in the Arrow specification and ensure a consistent
implementation across Java, C++, and Python.

Arrow now supports decimals represented internally as a 128-bit little-endian
integer, with a set precision and scale (as defined in many SQL-based
systems). As part of this work, we needed to change Java’s internal
representation from big- to little-endian.

We are now integration testing decimals between Java, C++, and Python, which
will facilitate Arrow adoption in Apache Spark and other systems that use both
Java and Python.

In the future, we may implement support for smaller-precision decimals
represented by 32- or 64-bit integers.

C++ improvements: expanded kernels library and more

In C++, we have continued developing the new arrow::compute submodule
consisting of native computation functions for Arrow data. New contributor
Licht Takeuchi helped expand the supported types for type casting in
compute::Cast. We have also implemented new kernels Unique and
DictionaryEncode for computing the distinct elements of an array and
dictionary encoding (conversion to categorical), respectively.

We expect the C++ computation “kernel” library to be a major expansion area for
the project over the next year and beyond. Here, we can also implement SIMD-
and GPU-accelerated versions of basic in-memory analytics functionality.

As a minor breaking API change in C++, we have made the RecordBatch and Table
APIs “virtual” or abstract interfaces, to enable different implementations of a
record batch or table which conform to the standard interface. This will help
enable features like lazy IO or column loading.

There was significant work improving the C++ library generally and supporting
work happening in Python and C. See the change log for full details.

GLib C improvements: Meson build, GPU support

Development of the GLib-based C bindings has generally tracked work happening in
the C++ library. These bindings are being used to develop data science tools
for Ruby users and elsewhere.

The C bindings now support the Meson build system in addition to
autotools, which enables them to be built on Windows.

The Arrow GPU extension library is now also supported in the C bindings.

JavaScript: first independent release on NPM

Since the last release, we made a first JavaScript-only Apache release, version
0.2.0, which is now available on NPM. We decided to make separate
JavaScript releases to enable the JS library to release more frequently than
the rest of the project.

Python improvements

In addition to some of the new features mentioned above, we have made a variety
of usability and performance improvements for integrations with pandas, NumPy,
Dask, and other Python projects which may make use of pyarrow, the Arrow Python
library.

* Substantially improved serialization performance for pandas objects when
  using pyarrow.serialize and pyarrow.deserialize. This includes a special
  pyarrow.pandas_serialization_context which further accelerates certain
  internal details of pandas serialization
* Support for zero-copy reads of pandas.DataFrame using pyarrow.deserialize
  for data not containing Python objects
* Multithreaded conversions from pandas.DataFrame to pyarrow.Table (we
  already supported multithreaded conversions from Arrow back to pandas)
* More efficient conversion from 1-dimensional NumPy arrays to Arrow format
* New generic buffer compression and decompression APIs pyarrow.compress and
  pyarrow.decompress
* Enhanced Parquet cross-compatibility with fastparquet and improved Dask
  support
* Python support for accessing Parquet row group column statistics

Upcoming Roadmap

The 0.8.0 release includes some API and format changes, but upcoming releases
will focus on completing and stabilizing critical functionality to move the
project closer to a 1.0.0 release.

With the ecosystem of projects using Arrow expanding rapidly, we will be
working to improve and expand the libraries in support of downstream use cases.

We continue to look for more JavaScript, Julia, R, Rust, and other programming
language developers to join the project and expand the available
implementations and bindings to more languages.

Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.