R-based computing with big data on disk

Description

useR!2017: R-based computing with big data on disk

Keywords: big data, reproducibility, data aggregation, bioinformatics, imaging

Webpages: https://github.com/kuwisdelu/matter, http://bioconductor.org/packages/release/bioc/html/matter.html

A common challenge in many areas of data science is the proliferation of large and heterogeneous datasets, stored in disjoint files and specialized formats, which exceed the available memory of a computer. It is often important to work with these data on a single machine, e.g., to quickly explore the data or to prototype alternative analysis approaches on limited hardware. Current solutions for working with such data on disk on a single machine in R involve either wrapping existing file formats and structures (e.g., NetCDF, HDF5, databases) or converting them to very simple flat files (e.g., bigmemory, ff).

Here we argue that it is important to enable more direct interactions with such data in R. Direct interactions avoid the time and storage cost of creating converted files. They minimize the loss of information that can occur during conversion, and therefore improve the accuracy and reproducibility of the analytical results. They also best leverage the rich resources of the more than 10,000 packages already available in R.

We present matter, a novel paradigm and package for direct interactions with complex, larger-than-memory data on disk in R. matter provides transparent access to datasets on disk, and allows us to build a single dataset from many smaller data fragments in custom formats, without reading them into memory. This is accomplished by means of a flexible data representation that allows the structure of the data in memory to differ from its structure on disk. For example, what matter presents as a single, contiguous vector in R may be composed of many smaller fragments from multiple files on disk.
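The fragment-based representation described above can be sketched roughly as follows. This is a hypothetical, minimal example, not taken from the abstract itself; the argument names (`paths`, `datamode`, `offset`, `extent`) follow the matter package documentation but may differ between package versions.

```r
## Sketch: present two binary files on disk as one contiguous R vector,
## without reading them into memory. Argument names are assumptions based
## on the matter documentation and may vary by version.
library(matter)

## write two small binary files to stand in for pre-existing data fragments
f1 <- tempfile()
f2 <- tempfile()
writeBin(as.double(1:5), f1)
writeBin(as.double(6:10), f2)

## expose both fragments as a single logical vector of length 10
x <- matter_vec(paths = c(f1, f2),
                datamode = "double",
                offset = c(0, 0),   # byte offset of each fragment in its file
                extent = c(5, 5))   # number of elements in each fragment

x[1:10]   # elements are read from disk only when accessed
mean(x)   # many familiar R operations work transparently
```

Because subsetting and summary statistics pull only the needed bytes from disk, the same code works whether the underlying data fit in memory or not.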
This allows matter to scale to large datasets, stored in large stand-alone files or in large collections of smaller files.

To illustrate the utility of matter, we will first compare its performance to bigmemory and ff using data in flat files, which all three approaches can access easily. In tests on simulated datasets larger than 1 GB, and on common analyses such as linear regression and principal component analysis, matter consumed the same amount of memory or less, and completed the analyses in comparable time. It was therefore as efficient as, or more efficient than, the available solutions.

Next, we will illustrate the advantage of matter in a research area that works with complex formats. Mass spectrometry imaging (MSI) relies on imzML, a common open-source format for data representation and sharing across mass spectrometry vendors and workflows. Results of a single MSI experiment are typically stored in multiple files. An integration of matter with the R package Cardinal allowed us to perform statistical analyses of all the datasets in a public Gigascience repository of MSI datasets, ranging from under 1 GB up to 42 GB in size. All of the analyses were performed on a single laptop computer. Due to the structure of imzML, these analyses would not have been possible with the existing alternative solutions for working with larger-than-memory datasets in R.

Finally, we will demonstrate applications of matter to large datasets in other formats, in particular the text data that arise in genomics and natural language processing, and will discuss approaches to using matter when developing new statistical methods for such datasets.
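The matter-backed MSI workflow mentioned above might look roughly like the following. This is a hedged sketch, not code from the abstract: the file name "example" is a placeholder, and the `readImzML()` arguments shown follow the Cardinal documentation but may differ between Cardinal versions.

```r
## Sketch: analyze an imzML dataset with Cardinal, with the binary spectra
## left on disk via matter. Assumes a dataset stored as "example.imzML"
## plus "example.ibd" in the working directory; function arguments are
## assumptions based on the Cardinal documentation.
library(Cardinal)

## parse the .imzML metadata and attach the .ibd binary data on disk,
## rather than reading all spectra into memory
msi <- readImzML("example", folder = getwd(), attach.only = TRUE)

## downstream statistical analyses then pull spectra from disk on demand,
## e.g. dimension reduction:
## pca <- PCA(msi, ncomp = 2)
```

Since only the spectra needed at each step are read from disk, the same analysis script scales from sub-gigabyte test files to the 42 GB datasets described above on a single laptop.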