arrow

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

The arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for working with Parquet (read_parquet(), write_parquet()) and Feather (read_feather(), write_feather()) files, as well as lower-level access to Arrow memory and messages.
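As a minimal sketch of the file functions named above, the following round-trips a small data frame through both formats. It assumes the arrow package and the C++ library are already installed; the data frame and the temporary file paths are made up for illustration.

```r
library(arrow)

# Small illustrative data frame (hypothetical data).
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)

# Write to a temporary Parquet file and read it back.
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(df, parquet_path)
from_parquet <- read_parquet(parquet_path)

# Do the same with a Feather file.
feather_path <- tempfile(fileext = ".feather")
write_feather(df, feather_path)
from_feather <- read_feather(feather_path)
```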

Installation

On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to install the C++ library first. See the Arrow project installation page to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You’ll need to install libparquet-dev on Debian and Ubuntu, or parquet-devel on CentOS; either package also installs the Arrow C++ library automatically as a dependency. On other Linux distributions, you must install the C++ library from source.

If you install the arrow package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call arrow::install_arrow() for version- and platform-specific guidance on installing the Arrow C++ library.

When installing from source, if the R and C++ library versions do not match, installation may fail. If you’ve previously installed the libraries and want to upgrade the R package, you’ll need to update the Arrow C++ library first.

Installing a development version

To use the development version of the R package, you’ll need to install it from source, which requires the additional C++ library setup. On macOS, you may install the C++ library using Homebrew:

# For the released version:
brew install apache-arrow
# Or for a development version, you can try:
brew install apache-arrow --HEAD

On Windows, you can download a .zip file with the arrow dependencies from the rwinlib project, and then set the RWINLIB_LOCAL environment variable to point to that zip file before installing the arrow R package. That project contains released versions of the C++ library; for a development version, Windows users may be able to find a binary by going to the Apache Arrow project’s Appveyor, selecting an R job from a recent build, and downloading the build\arrow-*.zip file from the “Artifacts” tab.
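One way to set the variable is from within R before installing, as sketched below. The zip path is a hypothetical placeholder, and the sketch assumes the source install is started from the same R session so that the build process inherits the variable.

```r
# Hypothetical location of the downloaded zip file; replace with the real path.
arrow_zip <- "C:/Users/me/Downloads/arrow.zip"

# Point RWINLIB_LOCAL at the zip for the rest of this R session.
Sys.setenv(RWINLIB_LOCAL = arrow_zip)

# Then install the R package from source in the same session, for example:
# install.packages("arrow", type = "source")
```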

Linux users can get a released version of the library from our PPAs, as described above. If you need a development version of the C++ library, you will likely need to build it from source. See “Development” below.

Once you have the C++ library, you can install the R package from GitHub using the remotes package. From within an R session, run remotes::install_github("apache/arrow/r").

You can specify a particular commit, branch, or release to install by including a ref argument to install_github(). This is particularly useful to match the R package version to the C++ library version you’ve installed.
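A minimal sketch of pinning the install to a ref follows. The wrapper name install_arrow_at() and the example tag are illustrative assumptions, not part of arrow or remotes; only install_github() and its ref argument come from the remotes package.

```r
# Hypothetical helper (not part of arrow or remotes): installs the R package
# from the apache/arrow GitHub repository at a given commit, branch, or tag.
install_arrow_at <- function(ref) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    stop("Install the remotes package first: install.packages(\"remotes\")")
  }
  remotes::install_github("apache/arrow/r", ref = ref)
}

# Example call; the tag name is illustrative, so substitute the ref that
# matches the C++ library version you have installed.
# install_arrow_at("apache-arrow-0.15.0")
```

Wrapping the call in a function keeps the ref explicit at the call site, which makes it harder to install a mismatched version by accident.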

Developing

If you need to alter both the Arrow C++ library and the R package code, or if you can’t get a binary version of the latest C++ library elsewhere, you’ll need to build the C++ library from source too.

Note that after any change to the C++ library, you must reinstall it and run make clean or git clean -fdx . to remove any cached object code in the r/src/ directory before reinstalling the R package. This is only necessary if you make changes to the C++ library source; you do not need to manually purge object files if you are only editing R or Rcpp code inside r/.

Once you’ve built the C++ library, you can install the R package and its dependencies, along with additional dev dependencies, from the git checkout.
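This step might look like the sketch below when run from R. It assumes the working directory is the r/ subdirectory of the checkout and that devtools is used as the tooling; the helper name install_from_checkout() is hypothetical, and the project's own instructions may use different commands.

```r
# Hypothetical helper: installs the package in `pkg` (by default the current
# directory, assumed to be r/ inside the git checkout) with its dependencies.
install_from_checkout <- function(pkg = ".") {
  # devtools is an assumed choice of tooling for this sketch.
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  # Install the package's dependencies, including development ones.
  devtools::install_dev_deps(pkg)
  # Build and install the package from the local source tree.
  devtools::install(pkg)
}

# Example call from within r/:
# install_from_checkout()
```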

Editing Rcpp code

The arrow package uses some customized tools on top of Rcpp to prepare its C++ code in src/. If you change C++ code in the R package, you will need to set the ARROW_R_DEV environment variable to TRUE (optionally, add it to your ~/.Renviron file to persist across sessions) so that the data-raw/codegen.R file is used for code generation.
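For the current session, the variable can be set from R as in this small sketch. The dev_mode name is illustrative only, and how the code generation step reads the variable is not shown here.

```r
# Enable the development code generation path for this R session.
Sys.setenv(ARROW_R_DEV = "TRUE")

# Confirm the value the session now sees (dev_mode is an illustrative name).
dev_mode <- identical(Sys.getenv("ARROW_R_DEV"), "TRUE")

# To persist across sessions, add this line to ~/.Renviron instead:
# ARROW_R_DEV=TRUE
```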