Linear models: Pushing down and speeding up

R Analytics Blog – 2017 / 10 / 21

Let’s start with a few observations about linear models. The usual discussion posits a target variable y of
length N and a design matrix X with p columns (of full rank, say) and N rows. The linear model equation can be written as

y = Xβ + ε,

and the least squares solution is obtained by projecting y orthogonally onto the image of X, giving the almost iconic formula

(X'X)⁻¹ X'y

for the estimated parameters.

The most common case – unless you are a geneticist –
is that X is ‘long’, i.e. we have N >> p. We can test some properties of our least squares solution in this case in a little simulation:
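A minimal sketch of such a simulation (the dimensions are arbitrary choices, not those of the original post):

```r
set.seed(42)
N <- 1e5; p <- 100                       # 'long' X: N >> p
X <- matrix(rnorm(N * p), N, p)
beta <- rnorm(p)
y <- drop(X %*% beta + rnorm(N))

## least squares via the normal equations ...
betaHat <- solve(crossprod(X), crossprod(X, y))

## ... agrees with R's QR-based fit
all.equal(drop(betaHat), unname(lm.fit(X, y)$coefficients))
```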

Classical treatments of the least squares solution focus on decomposing X in a suitable manner to address both the crossproduct and the inversion. This somewhat obscures the fact that there are two distinct data steps:

a ‘small data’ step: Even with our hypothetical p = 1000 coefficients, the inversion only takes a few seconds

a ‘big data’ step: The crossproduct is what we need to worry about much more for typical large X, since forming X'X touches all N rows (see the timing sketch below)!
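A rough timing sketch of the two steps, reusing the simulated X and y from above (absolute numbers depend heavily on your machine and BLAS; only the scaling matters):

```r
system.time(XtX <- crossprod(X))         # 'big data' step: touches all N rows, cost ~ N * p^2
system.time(Xty <- crossprod(X, y))
system.time(betaHat <- solve(XtX, Xty))  # 'small data' step: a p x p problem, independent of N
```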

Chunking and parallelising

Our example case (large X with N >> p) is ideal for chunking and parallelising. Chunking is made easy by packages like
biglm and speedglm that provide the infrastructure for handling chunked data and for updating estimates with new chunks.
Also note that the crossproducts
are ‘embarrassingly parallel’, i.e. they can be calculated via an appropriate ‘lapply’-ification:
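A sketch of what such an ‘lapply’-ification could look like: the per-chunk crossproducts are simply added up (the chunk boundaries here are arbitrary):

```r
idxChunks <- split(seq_len(nrow(X)),
                   cut(seq_len(nrow(X)), breaks = 8, labels = FALSE))

parts <- lapply(idxChunks, function(idx) {
  Xc <- X[idx, , drop = FALSE]
  list(XtX = crossprod(Xc), Xty = crossprod(Xc, y[idx]))
})

XtX <- Reduce(`+`, lapply(parts, `[[`, "XtX"))
Xty <- Reduce(`+`, lapply(parts, `[[`, "Xty"))
betaHat <- solve(XtX, Xty)
```

Replacing lapply() with parallel::mclapply() (or a foreach / future backend) distributes the chunk crossproducts over cores or nodes.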

To improve things further, one needs to look more closely at the structure of X. One well-known approach is implemented in the MatrixModels-package, which makes it possible to exploit sparsity in X, for instance when X contains many dummy variables.
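As a small illustration of the sparsity point (a sketch using Matrix::sparse.model.matrix, on which the MatrixModels fitting functions build; the factor f is made up):

```r
library(Matrix)

f  <- factor(sample(letters, 1e5, replace = TRUE))   # a factor with many levels
Xd <- model.matrix(~ f)                               # dense: one dummy column per level
Xs <- sparse.model.matrix(~ f)                        # sparse: stores only the non-zeros
c(dense = object.size(Xd), sparse = object.size(Xs))  # compare memory footprints
```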

Special structures in X – star schema

The situation for which I have not immediately found an adequate package, and which I want to discuss in a bit more detail here, is when X is constructed from some kind of database ‘star schema’. As an example, we can take the fuel price data discussed here.

We start by loading the required packages and by defining a small benchmarking function, which will not only measure execution times but also give a rough idea of memory usage:
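The exact helper from the original post is not reproduced here; the following is a sketch along the same lines (the name benchMark and the use of data.table for the fuel price tables are my assumptions):

```r
library(Matrix)
library(data.table)

benchMark <- function(expr) {
  invisible(gc(reset = TRUE))                 # reset the 'max used' counters
  tm  <- system.time(res <- force(expr))      # evaluate the expression and time it
  mem <- sum(gc()[, 6L])                      # '(Mb)' column of 'max used': rough peak memory
  cat(sprintf("elapsed: %.1f s, peak memory: ~%.0f MB\n", tm["elapsed"], mem))
  invisible(res)
}
```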

Exploiting the star schema: Lifted matrices

So, how to make this smaller and faster?

The main observation is that the parts of the design matrix which come from the joined dimension matrices have a very simple form. For example, look at the variables that depend on time only: they can be collected in a small matrix matTime with one row per time aggregate, and the join to the fact table is represented by a (sparse) ‘index matrix’ indTime. This matrix has a 1 in position (ro, co) exactly when TimeAggID[ro] == co and zeros otherwise. It is easy to check that the matrix product of indTime and matTime is the lift / join of matTime to the fact table.
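A minimal sketch of how the two objects could be constructed; the time dimension table timeDim (one row per TimeAggID) and its columns are my assumptions, not the original code:

```r
matTime <- sparse.model.matrix(~ weekday + hour, data = timeDim)   # small: one row per time aggregate

indTime <- sparseMatrix(i = seq_len(nrow(pricesAll)),              # fact-table rows
                        j = pricesAll$TimeAggID,                   # dimension rows they point to
                        x = 1,
                        dims = c(nrow(pricesAll), nrow(timeDim)))
```

The check then confirms that the product is exactly the lifted matrix: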

all(indTime %*% matTime == matTime[pricesAll$TimeAggID, ])  ## TRUE

This is a very useful representation, as

matTime will be comparatively small,

crossproducts with the index matrix 'indTime' amount to a simple summing over groups, and need to be executed only once, regardless of the size of matTime (see the small check after this list).

Furthermore, index matrix algebra is already implemented in the Matrix-package.
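A quick check of the group-summing point, using indTime from the sketch above (it assumes that every TimeAggID actually occurs in the fact table):

```r
v <- rnorm(nrow(pricesAll))
all.equal(as.vector(crossprod(indTime, v)),           # crossproduct with the index matrix ...
          as.vector(rowsum(v, pricesAll$TimeAggID)))  # ... equals a plain group-wise sum
```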

Let’s call these objects ‘lifted matrices’. We can actually define them as a simple extension of the classes in Matrix:
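A minimal sketch of such a class (the slot layout is my assumption, not the original definition): it stores the small dimension-level matrix together with the integer index of the join, and extends the virtual class ‘Matrix’ so that Dim / Dimnames describe the lifted, fact-table-level dimensions.

```r
setClass("liftedMatrix",
         contains = "Matrix",                 # virtual mother class of the Matrix package
         representation(mat   = "Matrix",     # small dimension-level matrix
                        index = "integer"))   # maps each fact-table row to a dimension row

liftedMatrix <- function(mat, index) {
  if (!is(mat, "Matrix")) mat <- Matrix(mat)
  index <- as.integer(index)
  new("liftedMatrix", mat = mat, index = index,
      Dim = c(length(index), ncol(mat)),
      Dimnames = list(NULL, colnames(mat)))
}
```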

The beauty of the ‘Matrix’ package (and the S4 system it is built on) is that we now only need to implement the algebra for this new class as a set of S4 methods, adding to (and, in one case for the moment, overriding) the existing method definitions in Matrix:
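A minimal sketch of two such methods (the original post defines more of the algebra; signatures are my assumptions): crossprod() with a fact-level matrix first sums the rows over groups and then performs one small crossproduct, while %*% with a coefficient vector is computed at dimension level and lifted afterwards.

```r
setMethod("crossprod", signature(x = "liftedMatrix", y = "matrix"),
          function(x, y = NULL, boolArith = NA, ...) {
            ySum <- rowsum(y, group = x@index)            # group-wise sums: one pass over the fact rows
            grp  <- as.integer(rownames(ySum))            # dimension rows that actually occur
            crossprod(x@mat[grp, , drop = FALSE], ySum)   # small: ncol(mat) x ncol(y)
          })

setMethod("%*%", signature(x = "liftedMatrix", y = "numeric"),
          function(x, y) (x@mat %*% y)[x@index, , drop = FALSE])  # compute small, then lift
```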

With these definitions, it is now easy to define the ‘lifted’ (or ‘pushed down’, depending on your point of view) version of our fuel price model above, by separately defining the components of the design which stem from the two dimension tables and (for the effect of CompE10) from the fact table itself.

I go through the full calculation here for demonstration purposes; I hope to put this into a more user-friendly format soon.
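Here is a hypothetical sketch of the pushed-down computation. For brevity it assembles the normal equations directly from the index matrices rather than from liftedMatrix objects (with the class's algebra fully fleshed out, the same blocks come out of crossprod() calls on the lifted parts). All names beyond pricesAll$TimeAggID are assumptions: a station key StationID, dimension tables timeDim / stationDim with their columns, a fact-level regressor CompE10 and the target price.

```r
## station-level design (no intercept, so that the combined design stays full rank)
matStation <- sparse.model.matrix(~ 0 + brand + highway, data = stationDim)
indStation <- sparseMatrix(i = seq_len(nrow(pricesAll)),
                           j = pricesAll$StationID, x = 1,
                           dims = c(nrow(pricesAll), nrow(stationDim)))

xFact <- Matrix(pricesAll$CompE10, ncol = 1, dimnames = list(NULL, "CompE10"))
y     <- pricesAll$price

## the only 'big' passes over the fact table: crossproducts with index matrices / fact columns
cTT <- crossprod(indTime)                    # diagonal: observations per time aggregate
cSS <- crossprod(indStation)                 # diagonal: observations per station
cTS <- crossprod(indTime, indStation)        # time x station counts
cTx <- crossprod(indTime, xFact);    cTy <- crossprod(indTime, y)
cSx <- crossprod(indStation, xFact); cSy <- crossprod(indStation, y)

## normal equations assembled block by block at dimension level
XtX <- rbind(
  cbind(crossprod(matTime, cTT %*% matTime),
        crossprod(matTime, cTS %*% matStation),
        crossprod(matTime, cTx)),
  cbind(crossprod(matStation, t(cTS) %*% matTime),
        crossprod(matStation, cSS %*% matStation),
        crossprod(matStation, cSx)),
  cbind(crossprod(cTx, matTime), crossprod(cSx, matStation), crossprod(xFact))
)
Xty <- rbind(crossprod(matTime, cTy), crossprod(matStation, cSy), crossprod(xFact, y))

betaHat <- solve(as.matrix(XtX), as.matrix(Xty))
```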

Well, of course, I have massaged the example a bit. The model we have been looking at is a pretty extreme case: almost all the explanatory variables are ‘lifts’ from one of the dimension tables, so the model is almost a product of two completely separate models. I actually had to sneak in the variable ‘CompE10’ (with all the problems of endogeneity and inconsistency this variable brings, were we really to try to interpret this model) in order to make the example a truly non-product one.

Still, this approach is worthwhile in many cases, and I will soon publish a more elaborate example, including how to deal with chunked and partitioned data, as well as how to allow for fixed effects.

About me

I am a consultant and project manager in marketing and business analytics. Having worked in the area for more than 15 years and having led the Data Science and Analytics teams at IRI Germany from 2009 to 2016, I am now again working as an independent consultant, focusing on applications of Big Data and AI in marketing.