A few weeks ago, the stream package was released on CRAN. It allows you to do real-time analytics on data streams. This can be very useful if you are working with large datasets which are already hard to fit into RAM completely, let alone to build a statistical model on without running into RAM problems.

Most standard statistical algorithms require access to all data points and make several passes over the data, which makes them less suited for use in R on big datasets.

Streaming algorithms, on the other hand, are characterised by:

- making a single pass over the data,

- using a limited amount of storage space and RAM,

- working in a limited amount of time,

- being ready to use the model at any time.
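To make the single-pass idea concrete, here is a minimal base-R sketch of Welford's online algorithm for the mean and variance. This is not part of the stream package; `stream_stats` is just an illustrative name. It updates the statistics one observation at a time in O(1) memory, and the model (the current mean/variance) is ready at any point of the stream.

```r
# Welford's single-pass algorithm: updates mean and variance one
# observation at a time, using O(1) memory regardless of stream length.
# (Illustrative sketch; not an API of the stream package.)
stream_stats <- function() {
  n <- 0; mu <- 0; M2 <- 0
  list(
    update = function(x) {
      n <<- n + 1
      delta <- x - mu
      mu <<- mu + delta / n
      M2 <<- M2 + delta * (x - mu)
    },
    mean = function() mu,
    var  = function() if (n > 1) M2 / (n - 1) else NA_real_
  )
}

s <- stream_stats()
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
for (xi in x) s$update(xi)
s$mean()  # same as mean(x)
s$var()   # same as var(x)
```

At no point does the algorithm need more than one observation plus three numbers in memory, which is exactly the property that makes such algorithms attractive for data that does not fit in RAM.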

The stream package is currently focussed on the clustering algorithms available in MOA (http://moa.cms.waikato.ac.nz/details/stream-clustering/) and also eases interfacing with some clustering algorithms already available in R which are suited for data stream clustering. Classification algorithms based on MOA are on the todo list.

The stream package allows you to easily extend the use of the models with different data sources. These can be SQL sources, Hadoop, Storm, Hive, simple csv files, flat files or other connections, and it is quite easy to extend it towards other connections. As an example, the following code available at this gist (https://gist.github.com/jwijffels/5239198) allows you to connect to an ffdf from the ff package, so that you can do clustering on ff objects.

Below, you can find a toy example showing streaming clustering in R based on data in an ffdf.
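The gist above wraps an ffdf in a proper data stream source for the stream package. Stripped of the package specifics, the underlying idea is just a stateful source you poll for the next n points. The sketch below shows that idea in base R on an ordinary data frame; `make_chunked_source` and the `get_points` name are only illustrative (the stream package defines its own DSD classes and get_points generic).

```r
# A minimal base-R sketch of a chunked data source: each call to
# get_points() returns the next n rows, like a data stream generator.
# (The real stream package wraps this idea in a DSD class.)
make_chunked_source <- function(data) {
  pos <- 0
  function(n = 100) {
    if (pos >= nrow(data)) return(NULL)      # stream exhausted
    idx <- (pos + 1):min(pos + n, nrow(data))
    pos <<- max(idx)
    data[idx, , drop = FALSE]
  }
}

get_points <- make_chunked_source(iris[, 1:4])
chunk <- get_points(50)   # first 50 observations
nrow(chunk)               # 50
```

A streaming clustering algorithm would repeatedly call such a source, update its model on each chunk, and discard the chunk afterwards.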

In a recent post by Revolution Analytics (link & link), in which Revolution benchmarked their closed-source generalized linear model approach against SAS, Hadoop and open source R, they seemed to point out that no 'easy' open source R solution exists for building a Poisson regression model on large datasets.

This post shows that fitting a generalized linear model to large data in open source R *is* easy and just works.

For this, we recently included bigglm.ffdf in package ffbase to integrate it more closely with package biglm. That was pretty easy, as the help of the chunk function in package ff already shows how to do it, and the code in the biglm package is readily available for some simple modifications.
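Why is it enough to see the data one chunk at a time? For a linear model, the sufficient statistics X'X and X'y can be accumulated chunk by chunk, and the coefficients solved from them at the end. The base-R sketch below (`chunked_lm` is a made-up name for illustration) demonstrates this; note that biglm itself uses an incremental QR decomposition instead, which is numerically safer, and bigglm repeats such passes within the IRLS iterations for a GLM.

```r
# Chunked least squares: accumulate the cross-products X'X and X'y
# over chunks, then solve once. Only one chunk is in memory at a time.
# (biglm uses an incremental QR instead; this is just the core idea.)
chunked_lm <- function(formula, data, chunksize = 100) {
  XtX <- NULL; Xty <- NULL
  for (s in seq(1, nrow(data), by = chunksize)) {
    rows <- s:min(s + chunksize - 1, nrow(data))
    X <- model.matrix(formula, data[rows, ])
    y <- model.response(model.frame(formula, data[rows, ]))
    if (is.null(XtX)) {
      XtX <- crossprod(X); Xty <- crossprod(X, y)
    } else {
      XtX <- XtX + crossprod(X); Xty <- Xty + crossprod(X, y)
    }
  }
  drop(solve(XtX, Xty))
}

coef_chunked <- chunked_lm(Sepal.Length ~ Sepal.Width + Petal.Length, iris)
coef_full    <- coef(lm(Sepal.Length ~ Sepal.Width + Petal.Length, iris))
all.equal(unname(coef_chunked), unname(coef_full))  # TRUE (up to tolerance)
```

The same one-chunk-at-a-time pattern is what makes bigglm.ffdf work on an ffdf that lives on disk rather than in RAM.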

Let's show how it works on some data which is readily available here.

The following code shows some features (laf_open_csv, read.csv.ffdf, table.ff, binned_sum.ff, biglm.ffdf, expand.ffgrid and merge.ffdf) of package ffbase and package ff, which can be used in a standard setting where you have your large data, want to profile it, look at some bivariate statistics and build a simple regression model to predict or understand your target.

It imports a flat file into an ffdf, shows some univariate statistics, does a fast group by and builds a linear regression model.

This course is a hands-on course covering the basic toolkit you need to use R efficiently for data analysis tasks. It is an intermediate course aimed at users who have the knowledge from the course 'Essential tools for R' and who want to go further to improve and speed up their data analysis tasks.

The following topics will be covered in detail:

The apply family of functions and basic parallel programming for them, vectorisation, regular expressions, string manipulation functions and commonly used functions from the base package, as well as other useful packages for data manipulation.
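As a small taste of the first topic, the snippet below contrasts the apply family with plain vectorisation, and shows a regular expression for string manipulation; all of it is base R.

```r
# The apply family versus vectorisation: both compute per-element
# results, but vectorised arithmetic avoids per-call overhead.
x <- 1:10
res_apply <- vapply(x, function(i) i^2 + 1, numeric(1))
res_vec   <- x^2 + 1
identical(res_apply, res_vec)  # TRUE

# A regular expression example: strip a prefix and keep the digits.
ids <- c("id_001", "id_042", "id_117")
as.integer(sub("^id_", "", ids))  # 1 42 117
```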

Making a basic reproducible report using Sweave and knitr including tables, graphs and literate programming

If you want to build your own R package to distribute your work, you need to understand S3 and S4 methods, the basics of how generics work, as well as R environments, what namespaces are and how they are useful. This will be covered to help you start up and build an R package.
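To give a flavour of what S3 dispatch looks like, here is a minimal example; `describe` is a made-up generic for illustration, not a function from any package.

```r
# A minimal S3 example: a generic dispatches on the class attribute
# of its first argument via UseMethod().
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) sprintf("a vector of length %d", length(x))
describe.lm <- function(x, ...)
  sprintf("a linear model with %d coefficients", length(coef(x)))

describe(1:5)                     # "a vector of length 5"
describe(lm(dist ~ speed, cars))  # "a linear model with 2 coefficients"
```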

The ff package is a great and efficient way of working with large datasets.

One of the main reasons why I prefer to use it above other packages that allow working with large datasets is that it is a complete set of tools.

When comparing it to the other open source 'big data' packages in R:

- It is not restricted to working mainly with numeric data matrices, like the bigmemory set of packages, but allows relatively easy access to character vectors through factors.

- It has efficient data loading functionality for flat/csv files, and you can interact with SQL databases as explained here.

- You don't need the sqldf package to constantly move your data in RAM-limited-sized chunks to and from an SQLite database.

- It has higher level functionality (e.g. the apply family of functions, sorting, ...) which package mmap seems to lack.

- And it allows you to work with datasets which do not fit in memory, which the data.table package does not allow.

If you disagree, do comment.

In my daily work, I frequently found that some bits were still missing in package ff to make it suited for easy day-to-day analysis of large data. Apparently I was not alone, as Edwin de Jonge had already started package ffbase (available at http://code.google.com/p/fffunctions/). For a predictive modelling project BNOSAC did for https://additionly.com/, we have made some nice progress on the package.

The package ffbase now contains quite some extensions that are really useful. It brings a lot of the functionality of R's base package to large datasets through package ff.

These functions all work on either ff objects or objects of class ffdf from the ff package.

Next to that, there are some extra goodies allowing faster group-by aggregation - not restricted to the ff package alone (fast groupwise aggregations: bySum, byMean, binned_sum, binned_sumsq, binned_tabulate).
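The idea behind these groupwise aggregations is that partial sums per group can be computed chunk by chunk and then combined, so the full data never has to sit in RAM at once. The base-R sketch below (`chunked_groupsum` is an illustrative name, not the ffbase implementation) emulates that semantics and checks it against a single in-memory tapply.

```r
# Base-R emulation of a groupwise sum computed in chunks: partial
# sums per group are accumulated chunk by chunk and then combined,
# which is the idea behind binned_sum/bySum in ffbase.
chunked_groupsum <- function(x, group, chunksize = 1000) {
  totals <- numeric(0)
  for (s in seq(1, length(x), by = chunksize)) {
    idx <- s:min(s + chunksize - 1, length(x))
    part <- tapply(x[idx], group[idx], sum)
    # merge the partial sums of this chunk into the running totals
    for (g in names(part)) {
      totals[g] <- if (g %in% names(totals)) totals[g] + part[[g]] else part[[g]]
    }
  }
  totals
}

set.seed(1)
x <- runif(5000)
g <- sample(letters[1:4], 5000, replace = TRUE)
res <- chunked_groupsum(x, g)
expected <- tapply(x, g, sum)
max(abs(res[names(expected)] - as.numeric(expected)))  # ~0, up to floating point
```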

This makes it a lot easier to work with the ff package and large data.

Let me show some R code illustrating what this means, using a simple dataset of the Heritage Health Prize competition of only 2.6 million records, available for download here (http://www.heritagehealthprize.com/c/hhp/data). It is a dataset with claims.

The code is for objects of class ff/ffdf and is based on version 0.6 of the package, available for download here.