We have been implementing a series of optimizations in the Apache Parquet C++
internals to improve read and write efficiency (both performance and memory
use) for Arrow columnar binary and string data, with new “native” support for
Arrow’s dictionary types. This should have a big impact on users of the C++,
MATLAB, Python, R, and Ruby interfaces to Parquet files.

This post reviews work that was done and shows benchmarks comparing Arrow
0.12.1 with the current development version (to be released soon as Arrow
0.15.0).

Summary of work

One of the largest and most complex optimizations involves encoding and
decoding Parquet files’ internal dictionary-encoded data streams to and from
Arrow’s in-memory dictionary-encoded DictionaryArray
representation. Dictionary encoding is a compression strategy in Parquet, and
there is no formal “dictionary” or “categorical” type. I will go into more
detail about this below.

One of the challenges of developing the Parquet C++ library is that we maintain
low-level read and write APIs that do not involve the Arrow columnar data
structures. So we have had to take care to implement Arrow-related
optimizations without impacting non-Arrow Parquet users, which includes
database systems like Clickhouse and Vertica.

Background: how Parquet files do dictionary encoding

Many direct and indirect users of Apache Arrow use dictionary encoding to
improve performance and memory use on binary or string data types that include
many repeated values. MATLAB or pandas users will know this as the Categorical
type (see MATLAB docs or pandas docs) while in R such encoding is
known as factor. In the Arrow C++ library and various bindings we have
the DictionaryArray object for representing such data in memory.

For example, an array such as

['apple', 'orange', 'apple', NULL, 'orange', 'orange']

has dictionary-encoded form

dictionary: ['apple', 'orange']
indices: [0, 1, 0, NULL, 1, 1]

The Parquet format uses dictionary encoding to compress data, and it is
used for all Parquet data types, not just binary or string data. Parquet
further uses bit-packing and run-length encoding (RLE) to compress the
dictionary indices, so if you had data like

When writing a Parquet file, most implementations will use dictionary encoding
to compress a column until the dictionary itself reaches a certain size
threshold, usually around 1 megabyte. At this point, the column writer will
“fall back” to PLAIN encoding where values are written end-to-end in “data
pages” and then usually compressed with Snappy or Gzip. See the following rough
diagram:

Faster reading and writing of dictionary-encoded data

When reading a Parquet file, the dictionary-encoded portions are usually
materialized to their non-dictionary-encoded form, causing binary or string
values to be duplicated in memory. So an obvious (but not trivial) optimization
is to skip this “dense” materialization. There are several issues to deal with:

A Parquet file often contains multiple ColumnChunks for each semantic column,
and the dictionary values may be different in each ColumnChunk

We must gracefully handle the “fall back” portion which is not
dictionary-encoded

We pursued several avenues to help with this:

Allowing each DictionaryArray to have a different dictionary (before, the
dictionary was part of the DictionaryType, which caused problems)

We enabled the Parquet dictionary indices to be directly written into an
Arrow DictionaryBuilder without rehashing the data

When decoding a ColumnChunk, we first append the dictionary values and
indices into an Arrow DictionaryBuilder, and when we encounter the “fall
back” portion we use a hash table to convert those values to
dictionary-encoded form

We override the “fall back” logic when writing a ColumnChunk from an
DictionaryArray so that reading such data back is more efficient

All of these things together have produced some excellent performance results
that we will detail below.

The other class of optimizations we implemented was removing an abstraction
layer between the low-level Parquet column data encoder and decoder classes and
the Arrow columnar data structures. This involves:

We show both single-threaded and multithreaded read performance. The test
machine is an Intel i9-9960X using gcc 8.3.0 (on Ubuntu 18.04) with 16 physical
cores and 32 virtual cores. All time measurements are reported in seconds, but
we are most interested in showing the relative performance.

First, the writing benchmarks:

Writing DictionaryArray is dramatically faster due to the optimizations
described above. We have achieved a small improvement in writing dense
(non-dictionary) binary arrays.

Then, the reading benchmarks:

Here, similarly reading DictionaryArray directly is many times faster.

These benchmarks show that parallel reads of dense binary data may be slightly
slower though single-threaded reads are now faster. We may want to do some
profiling and see what we can do to bring read performance back in
line. Optimizing the dense read path has not been too much of a priority
relative to the dictionary read path in this work.

Memory Use Improvements

In addition to faster performance, reading columns as dictionary-encoded can
yield significantly less memory use.

In the dict-random case above, we found that the master branch uses 405 MB of
RAM at peak while loading a 152 MB dataset. In v0.12.1, loading the same
Parquet file without the accelerated dictionary support uses 1.94 GB of peak
memory while the resulting non-dictionary table occupies 1.01 GB.

Note that we had a memory overuse bug in versions 0.14.0 and 0.14.1 fixed in
ARROW-6060, so if you are hitting this bug you will want to upgrade to 0.15.0
as soon as it comes out.

Conclusion

There are still many Parquet-related optimizations that we may pursue in the
future, but the ones here can be very helpful to people working with
string-heavy datasets, both in performance and memory use. If you’d like to
discuss this development work, we’d be glad to hear from you on our developer
mailing list dev@arrow.apache.org.

Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.