Upcoming 1.0.0 Format Stability Release

We are planning for our next major release to move from 0.14.0 to
1.0.0. The major version number will indicate stability of the Arrow
columnar format and binary protocol. While the format has already been
stable since December 2017, we believe it is a good idea to make this
stability official and to indicate that it is safe to persist
serialized Arrow data in applications. This means that applications
will be able to safely upgrade to new Arrow versions without having to
worry about backwards incompatibilities. We will write in a future
blog post about the stability guarantees we intend to provide to help
application developers plan accordingly.

Packaging

We added support for the following platforms:

Debian GNU/Linux buster

Ubuntu 19.04

We dropped support for Ubuntu 14.04.

Development Infrastructure and Tooling

As the project has grown larger and more diverse, we are increasingly
outgrowing what we can test in public continuous integration services
like Travis CI and Appveyor. In addition, we share these resources
with the entire Apache Software Foundation, and given the high volume
of pull requests into Apache Arrow, maintainers are frequently waiting
many hours for the green light to merge patches.

The complexity of our testing is driven by the number of different
components and programming languages as well as increasingly long
compilation and test execution times as individual libraries grow
larger. The 50 minute time limit of public CI services is simply too
limited to comprehensively test the project. Additionally, the CI host
machines are constrained in their features and memory limits,
preventing us from testing features that are only relevant on large
amounts of data (10GB or more) or functionality that requires a
CUDA-enabled GPU.

Organizations that contribute to Apache Arrow are working on physical
build infrastructure and tools to improve build times and build
scalability. One such new tool is ursabot, a GitHub-enabled bot
that can be used to trigger builds either on physical build or in the
cloud. It can also be used to trigger benchmark timing comparisons. If
you are contributing to the project, you may see Ursabot being
employed to trigger tests in pull requests.

To help assist with migrating away from Travis CI, we are also working
to make as many of our builds reproducible with Docker and not reliant
on Travis CI-specific configuration details. This will also help
contributors reproduce build failures locally without having to wait
for Travis CI.

Columnar Format Notes

User-defined “extension” types have been formalized in the Arrow
format, enabling library users to embed custom data types in the
Arrow columnar format. Initial support is available in C++, Java,
and Python.

A new Duration logical type was added to represent absolute lengths
of time.

Arrow Flight notes

Flight now supports many of the features of a complete RPC
framework.

Authentication APIs are now supported across all languages (ARROW-5137)

On the protocol level, endpoints are now identified with URIs, to
support an open-ended number of potential transports (including TLS
and Unix sockets, and perhaps even non-gRPC-based transports in the
future) (ARROW-4651)

Application-defined metadata can be sent alongside data (ARROW-4626,
ARROW-4627).

Windows is now a supported platform for Flight in C++ and Python
(ARROW-3294), and Python wheels are shipped for all languages
(ARROW-3150, ARROW-5656). C++, Python, and Java have been brought to
parity, now that actions can return streaming results in Java
(ARROW-5254).

C++ notes

188 resolved issues related to the C++ implementation, so we summarize
some of the work here.

General platform improvements

A FileSystem abstraction (ARROW-767) has been added, which paves the
way for a future Arrow Datasets library allowing to access sharded
data on arbitrary storage systems, including remote or cloud
storage. A first draft of the Datasets API was committed in
ARROW-5512. Right now, this comes with no implementation, but we
expect to slowly build it up in the coming weeks or months. Early
feedback is welcome on this API.

The dictionary API has been reworked in ARROW-3144. The dictionary
values used to be tied to the DictionaryType instance, which ended
up too inflexible. Since dictionary-encoding is more often an
optimization than a semantic property of the data, we decided to
move the dictionary values to the ArrayData structure, making it
natural for dictionary-encoded arrays to share the same DataType
instance, regardless of the encoding details.

The FixedSizeList and Map types have been implemented, including in
integration tests. The Map type is akin to a List of Struct(key,
value) entries, but making it explicit that the underlying data has
key-value mapping semantics. Also, map entries are always non-null.

A Result<T> class has been introduced in ARROW-4800. The aim is to
allow to return an error as w ell as a function’s logical result
without resorting to pointer-out arguments.

The Parquet C++ library has been refactored to use common Arrow IO
classes for improved C++ platform interoperability.

Line-delimited JSON reader

A multithreaded line-delimited JSON reader (powered internally by
RapidJSON) is now available for use (also in Python and R via
bindings) . This will likely be expanded to support more kinds of JSON
storage in the future.

New computational kernels

A number of new computational kernels have been developed

Compare filter for logical comparisons yielding boolean arrays

Filter kernel for selecting elements of an input array according to
a boolean selection array.

Take kernel, which selects elements by integer index, has been
expanded to support nested types

C# Notes

The native C# implementation has continued to mature since 0.13. This
release includes a number of performance, memory use, and usability
improvements.

Go notes

Go’s support for the Arrow columnar format continues to expand. Go now
supports reading and writing the Arrow columnar binary protocol, and
it has also been added to the cross language integration
tests. There are now four languages (C++, Go, Java, and JavaScript)
included in our integration tests to verify cross-language
interoperability.

Java notes

Support for referencing arbitrary memory using ArrowBuf has been
implemented, paving the way for memory map support in Java

A number of performance improvements around vector value access were
added (see ARROW-5264, ARROW-5290).

The Map type has been implemented in Java and integration tested
with C++

Several microbenchmarks have been added and improved. Including a
significant speed-up of zeroing out buffers.

A new algorithms package has been started to contain reference
implementations of common algorithms. The initial contribution is
for Array/Vector sorting.

JavaScript Notes

MATLAB Notes

Version 0.14.0 features improved Feather file support in the MEX bindings.

Python notes

We fixed a problem with the Python wheels causing the Python wheels
to be much larger in 0.13.0 than they were in 0.12.0. Since the
introduction of LLVM into our build toolchain, the wheels are going
to still be significantly bigger. We are interested in approaches to
enable pyarrow to be installed in pieces with pip or conda rather
than monolithically.

It is now possible to define ExtensionTypes with a Python
implementation (ARROW-840). Those ExtensionTypes can survive a
roundtrip through C++ and serialization.

The Flight improvements highlighted above (see C++ notes) are all
available from Python. Furthermore, Flight is now bundled in our
binary wheels and conda packages for Linux, Windows and macOS
(ARROW-3150, ARROW-5656).

We will build “manylinux2010” binary wheels for Linux systems, in
addition to “manylinux1” wheels (ARROW-2461). Manylinux2010 is a
newer standard for more recent systems, with less limiting toolchain
constraints. Installing manylinux2010 wheels requires an up-to-date
version of pip.

Various bug fixes for CSV reading in Python and C++ including the
ability to parse Decimal(x, y) columns.

Parquet improvements

It’s now possible to configure “data page” sizes when writing a file
from Python

Ruby and C GLib notes

The GLib and Ruby bindings have been tracking features in the C++
project. This release includes bindings for Gandiva, JSON reader, and
other C++ features.

Rust notes

There is ongoing work in Rust happening on Parquet file support,
computational kernels, and the DataFusion query engine. See the full
changelog for details.

R notes

We have been working on build and packaging for R so that community
members can hopefully release the project to CRAN in the near
future. Feature development for R has continued to follow the upstream
C++ project.

Community Discussions Ongoing

There are a number of active discussions ongoing on the developer
dev@arrow.apache.org mailing list. We look forward to hearing from the
community there:

Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.