🏎💨vroom

vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
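As a minimal sketch of what this looks like in practice, the example below writes a tiny tab-separated file and reads it back with `vroom()`. The data frame `df` and the temporary file are made up for illustration. The downstream code is ordinary R, and a column is only materialized when it is touched.

```r
library(vroom)

# Illustrative data only: a tiny table with one character column.
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".tsv")
vroom_write(df, path, delim = "\t")

# vroom() indexes the file; lazily loaded columns are read on first access.
dat <- vroom(path, delim = "\t")

# Ordinary R code works unchanged; accessing `name` loads its values.
toupper(dat$name)
```

Nothing in the calling code refers to Altrep; the lazy loading happens inside the returned vectors.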

vroom also uses multiple threads for indexing, for materializing non-character columns, and for writing, which further improves performance.
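The thread count can also be set per call. The sketch below assumes the `num_threads` argument of `vroom()` and `vroom_write()`; the value 2 and the use of the built-in `mtcars` data are purely illustrative.

```r
library(vroom)

path <- tempfile(fileext = ".tsv")

# Write using at most two threads.
vroom_write(mtcars, path, delim = "\t", num_threads = 2)

# Read the file back, again limiting vroom to two threads.
cars <- vroom(path, delim = "\t", num_threads = 2)
dim(cars)
```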

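| package    | version | time (sec) | speedup | throughput    |
|------------|---------|-----------:|--------:|--------------:|
| vroom      | 1.1.0   |       1.14 |   58.44 |   1.40 GB/sec |
| data.table | 1.12.8  |      11.88 |    5.62 | 134.13 MB/sec |
| readr      | 1.3.1   |      29.02 |    2.30 |  54.92 MB/sec |
| read.delim | 3.6.2   |      66.74 |    1.00 |  23.88 MB/sec |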

Features

vroom has nearly all of the parsing features of readr for delimited and fixed-width files.

Learning more

Benchmarks

The speed quoted above is from a real 1.48 GB dataset with 13,971,118 rows and 11 columns. See the benchmark article for full details of the dataset, and bench/ for the code used to retrieve the data and perform the benchmarks.

Environment variables

In addition to the arguments to the vroom() function, you can control the behavior of vroom with a few environment variables. Most users will not need to set these.

VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from an R connection. If unset, defaults to the R session’s temporary directory (tempdir()).

VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset, defaults to parallel::detectCores().

VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing. Regardless of this setting, the progress bar is disabled in non-interactive settings, in R notebooks, when running tests with testthat, and when knitting documents.

VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections (default is 128 KiB).

VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files (default: 1000).
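One way to set these variables is from within R using base R's `Sys.setenv()`, before the read or write they should affect. This is only a sketch: the values are illustrative rather than recommended, and the same variables could instead be set in the shell or in an `.Renviron` file.

```r
# Illustrative values only; set these before calling vroom() or vroom_write().
Sys.setenv(
  VROOM_THREADS = "2",                 # threads for indexing and parsing
  VROOM_CONNECTION_SIZE = "1048576",   # 1 MiB connection buffer, in bytes
  VROOM_WRITE_BUFFER_LINES = "5000",   # lines per buffer when writing
  VROOM_TEMP_PATH = tempdir()          # where connection temp files go
)

# Environment variables are always strings; confirm what is currently set.
Sys.getenv(c("VROOM_THREADS", "VROOM_CONNECTION_SIZE"))

# Unset a variable to return to the default behavior.
Sys.unsetenv("VROOM_WRITE_BUFFER_LINES")
```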

There is also a family of variables that control use of the Altrep framework. On versions of R where the Altrep framework is unavailable (R < 3.5.0), they are automatically turned off and the variables have no effect. Each variable accepts one of true, false, TRUE, FALSE, 1, or 0.

VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.

VROOM_USE_ALTREP_CHR

VROOM_USE_ALTREP_FCT

VROOM_USE_ALTREP_INT

VROOM_USE_ALTREP_BIG_INT

VROOM_USE_ALTREP_DBL

VROOM_USE_ALTREP_NUM

VROOM_USE_ALTREP_LGL

VROOM_USE_ALTREP_DTTM

VROOM_USE_ALTREP_DATE

VROOM_USE_ALTREP_TIME
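As a rough sketch, the Altrep variables can be toggled from R in the same way as any other environment variable. The particular combination below (numerics and dates on) is an arbitrary illustration, not a recommendation.

```r
# Altrep requires R >= 3.5.0; on older versions these variables have no effect.
altrep_available <- getRversion() >= "3.5.0"

# Opt in to Altrep for all numeric types and for date columns.
Sys.setenv(
  VROOM_USE_ALTREP_NUMERICS = "true",
  VROOM_USE_ALTREP_DATE = "1"
)

# Inspect the current settings; unset variables show up as empty strings.
Sys.getenv(c(
  "VROOM_USE_ALTREP_NUMERICS",
  "VROOM_USE_ALTREP_DATE",
  "VROOM_USE_ALTREP_LGL"
))
```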

RStudio caveats

RStudio’s environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so using at least that version is recommended.
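A version string can be compared against that threshold with base R's `numeric_version()`. The helper `has_altrep_fix()` below is hypothetical and not part of vroom or RStudio; it only illustrates the comparison.

```r
# Hypothetical helper: TRUE if the given RStudio version is 1.2.1335 or later.
has_altrep_fix <- function(version) {
  numeric_version(version) >= numeric_version("1.2.1335")
}

has_altrep_fix("1.2.1335")
has_altrep_fix("1.1.463")

# Inside RStudio, the running version can be obtained with the rstudioapi
# package (an extra dependency, shown here only as a comment):
# has_altrep_fix(as.character(rstudioapi::versionInfo()$version))
```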