Thursday, July 14, 2016

One more reason to use Feather

It has been quiet here for some time. Not because I have nothing to blog about, but because I have had no time to do it properly. One of those almost forgotten posts was about the Feather package (R) / module (Python).

Feather is a fast, lightweight binary file format for data frames, with R and Python implementations. The original RStudio announcement is here. And sure, the speed improvement is impressive. See my numbers for saving ~100 million probabilities:

# haplotype probs: 192 animals x 8 x 64000 markers

> format(object.size(probs), units="Mb")

[1] "750 Mb"

# saveRDS or save needs almost a minute to write probs to disk

> system.time(saveRDS(dprobs, file="DO192_probs.rds"))

user system elapsed

50.701 0.574 51.678

# write_feather needs 6-7 seconds

> system.time(write_feather(dprobs, "DO192_probs.feather"))

user system elapsed

1.344 1.051 6.272
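If you want to try the same comparison without a 750 Mb object, here is a minimal sketch on synthetic data. The table size and the names toy, rds_path and feather_path are made up for illustration, and the feather package is assumed to be installed; the timings will differ from the ones above.

```r
library(feather)  # assumed to be installed from CRAN

# Synthetic stand-in for the real data: 1000 rows x 200 numeric columns
# (hypothetical sizes, much smaller than the haplotype probabilities)
set.seed(1)
toy <- as.data.frame(matrix(runif(1000 * 200), nrow = 1000))

rds_path     <- tempfile(fileext = ".rds")
feather_path <- tempfile(fileext = ".feather")

# Time both writers; absolute numbers depend on disk, data size and compression
rds_time     <- system.time(saveRDS(toy, file = rds_path))
feather_time <- system.time(write_feather(toy, feather_path))

print(rds_time)
print(feather_time)

# Check that the Feather file round-trips to the same values
back <- as.data.frame(read_feather(feather_path))
same_values <- isTRUE(all.equal(toy, back, check.attributes = FALSE))
print(same_values)
```

Keep in mind that saveRDS compresses its output by default, which probably accounts for part of the gap; saveRDS(..., compress = FALSE) would be a fairer comparison of raw write speed.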

Feather looks even better when you compare it to traditional text formats like CSV. As David Smith explains on his blog, one of the reasons is that traditional formats are row-oriented, while R's internal storage is column-oriented.

Diagram credit: Hadley Wickham
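The column-oriented storage mentioned above is easy to see in base R. This is only a small illustration with a made-up data frame (df, score and row2 are hypothetical names), not a description of Feather's internals.

```r
# A tiny made-up data frame with mixed column types
df <- data.frame(id = 1:3,
                 score = c(0.1, 0.5, 0.9),
                 label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# A data frame is a list with one element per column
print(is.list(df))
print(length(df))        # number of columns, not rows

# Extracting a column returns one atomic vector of a single type
score <- df[["score"]]
print(is.atomic(score))

# A row mixes types, so extracting it gives another data frame (a list),
# assembled from one element of every column
row2 <- df[2, ]
print(is.list(row2))
```

A row-oriented text format such as CSV therefore has to be parsed and regrouped into columns on every read, while a column-oriented file can be mapped onto R's vectors more directly.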

I have one more reason to use Feather. If you have datasets with many columns (e.g. one column per gene in the human or mouse genome) and you need fast access to just one column (e.g. in a Shiny app), then Feather is ideal: the file is stored column by column, so a single column can be located and read without loading the rest of the table.
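A minimal sketch of that single-column access, assuming a version of the feather package whose read_feather() has the columns argument. The table, the gene names and the file path are all made up for illustration.

```r
library(feather)  # assumed to be installed from CRAN

# Hypothetical wide table: 100 samples x 2000 "genes"
set.seed(42)
expr <- as.data.frame(matrix(rnorm(100 * 2000), nrow = 100))
names(expr) <- paste0("gene", seq_len(2000))

path <- tempfile(fileext = ".feather")
write_feather(expr, path)

# Read only the column we need; the other 1999 columns are not returned
one <- read_feather(path, columns = "gene1234")
print(dim(one))

# The values match the corresponding column of the original table
matches <- isTRUE(all.equal(one[["gene1234"]], expr[["gene1234"]]))
print(matches)
```

In a Shiny app the read_feather() call would sit inside a reactive expression, with the column name coming from user input, so each request touches only one column of the file.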