Incanter and the GLM

I read somewhere that the Generalized Linear Model (GLM) is the “workhorse of statistics”, though I can no longer find the reference. It earns the name because it unifies regression for the exponential family of probability distributions, which includes the Gaussian, Binomial, and Poisson distributions. Instead of modeling the mean of the response variable directly, the GLM models a monotonic, differentiable transformation of the mean as a linear function of the predictor variables. This transformation is called the link function, and each distribution in the exponential family has its own canonical link. Once the distribution is specified, the model coefficients are determined via maximum likelihood estimation (MLE). In particular, iteratively reweighted least squares (IRLS) applied to the likelihood function has been shown to converge to the MLE.
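For reference, the model and the IRLS update can be written out in standard textbook notation. The symbols below ($g$ for the link, $V$ for the variance function, $X$ for the design matrix, $W$ for the diagonal weight matrix) are introduced here for illustration and are not taken from the original listings.

$$
g(\mu_i) = \eta_i = x_i^\top \beta, \qquad \mu_i = \mathrm{E}[y_i]
$$

$$
z_i = \eta_i + (y_i - \mu_i)\, g'(\mu_i), \qquad
w_i = \frac{1}{V(\mu_i)\, g'(\mu_i)^2}, \qquad
\beta^{(t+1)} = \left(X^\top W X\right)^{-1} X^\top W z
$$

Here $W = \mathrm{diag}(w_1, \dots, w_n)$ and $z$ is the so-called working response. With the identity link and $V(\mu) = 1$ (the Gaussian case), the weights are all 1 and $z = y$, so a single step reduces to ordinary least squares.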

To implement the GLM in Clojure/Incanter, we first need to implement the IRLS algorithm. If we assume that we know the link function (and its inverse, derivative, and the weight function), then IRLS is implemented as follows:

In the above code, we define the update step as an internal function that maps the current coefficients to the updated ones. We then iterate over an infinite sequence of updates until the Euclidean distance between successive iterates is less than epsilon.
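The convergence loop described above can be sketched in isolation from the statistics. This is only an illustration in plain Clojure: the names `euclidean-distance`, `iterate-until-converged`, and `sqrt2-step` are hypothetical, and a toy update step (the Babylonian square-root iteration) stands in for the IRLS update.

```clojure
;; Hypothetical helper names; none of them come from Incanter.
(defn euclidean-distance
  "Euclidean distance between two equal-length numeric vectors."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn iterate-until-converged
  "Applies `step` repeatedly starting from `init` and returns the first
  iterate whose distance from its predecessor is below `epsilon`."
  [step init epsilon]
  (->> (iterate step init)   ; lazy, infinite sequence of iterates
       (partition 2 1)       ; successive pairs (previous, current)
       (filter (fn [[prev curr]]
                 (< (euclidean-distance prev curr) epsilon)))
       first                 ; first pair that satisfies the condition
       second))              ; keep the current iterate of that pair

;; Toy update step: Babylonian iteration, which converges to (sqrt 2).
(defn sqrt2-step [[x]]
  [(/ (+ x (/ 2.0 x)) 2.0)])

(def root (first (iterate-until-converged sqrt2-step [1.0] 1e-12)))
```

Because the sequence is lazy, only as many updates are computed as are needed. Note that this sketch has no iteration cap, so a step function that never converges would loop forever.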

Next, we need to define the link functions and other associated functions of each member of the exponential family of distributions. I have shown Gaussian and Binomial distributions below:
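As a rough sketch of what such definitions might look like, the block below uses `defstruct` and `struct-map` with the identity link for the Gaussian family and the logit link for the Binomial family. The struct name and its keys (`:link`, `:inverse-link`, `:link-derivative`, `:weight`) are assumptions chosen for illustration, and the weight is taken to be 1 / (V(mu) * g'(mu)^2).

```clojure
;; Hypothetical struct and key names, chosen for this sketch.
(defstruct family :link :inverse-link :link-derivative :weight)

(def gaussian-family
  (struct-map family
    :link            (fn [mu] mu)      ; identity link
    :inverse-link    (fn [eta] eta)
    :link-derivative (fn [mu] 1.0)
    :weight          (fn [mu] 1.0)))   ; V(mu) = 1, g'(mu) = 1

(def binomial-family
  (struct-map family
    :link            (fn [mu] (Math/log (/ mu (- 1.0 mu))))        ; logit
    :inverse-link    (fn [eta] (/ 1.0 (+ 1.0 (Math/exp (- eta))))) ; logistic
    :link-derivative (fn [mu] (/ 1.0 (* mu (- 1.0 mu))))
    :weight          (fn [mu] (* mu (- 1.0 mu)))))  ; 1 / (V(mu) g'(mu)^2)

;; Usage: the inverse link should undo the link.
(def round-trip
  ((:inverse-link binomial-family) ((:link binomial-family) 0.3)))
```

Each family is just a map of functions, so adding, say, a Poisson family would mean supplying the log link and its companions in the same shape.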

I have used the struct-map technique from Clojure which gives me a sort of family type. Additional families would be specified here. Now, similar to R, we can pass the family type to a general GLM function and have one estimation technique (the IRLS defined above) for all families. The GLM function is shown:
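To make the idea concrete, here is a minimal end-to-end sketch of a family-driven fit. It is not the original listing: to stay self-contained it uses plain Clojure vectors and a small hand-written Gauss-Jordan solver instead of Incanter's matrix functions, and every name in it (`solve-linear`, `irls-step`, `glm`, the family keys) is hypothetical.

```clojure
;; Assumptions: x is a vector of row vectors (first column 1 for an intercept),
;; y is a vector of responses, and a family is a struct of plain functions.
(defstruct family :inverse-link :link-derivative :weight)

(def gaussian-family
  (struct-map family
    :inverse-link identity
    :link-derivative (fn [mu] 1.0)
    :weight (fn [mu] 1.0)))

(def binomial-family
  (struct-map family
    :inverse-link (fn [eta] (/ 1.0 (+ 1.0 (Math/exp (- eta)))))
    :link-derivative (fn [mu] (/ 1.0 (* mu (- 1.0 mu))))
    :weight (fn [mu] (* mu (- 1.0 mu)))))

(defn solve-linear
  "Solves A b = c by Gauss-Jordan elimination with partial pivoting."
  [a c]
  (let [n (count a)]
    (loop [m (mapv (fn [row ci] (conj (mapv double row) (double ci))) a c)
           col 0]
      (if (= col n)
        (mapv peek m)                       ; last column holds the solution
        (let [p    (apply max-key #(Math/abs (double (get-in m [% col])))
                          (range col n))    ; row with the largest pivot
              m    (assoc m col (m p) p (m col))
              prow (mapv #(/ % (get-in m [col col])) (m col))
              m    (vec (map-indexed
                         (fn [i row]
                           (if (= i col)
                             prow
                             (mapv - row (map #(* (row col) %) prow))))
                         m))]
          (recur m (inc col)))))))

(defn dot [u v] (reduce + (map * u v)))

(defn distance [a b]
  (Math/sqrt (reduce + (map (fn [p q] (let [d (- p q)] (* d d))) a b))))

(defn irls-step
  "One IRLS update: solves (X'WX) beta = X'Wz for the new coefficients."
  [fam x y beta]
  (let [eta  (mapv #(dot % beta) x)
        mu   (mapv (:inverse-link fam) eta)
        ;; working response z = eta + (y - mu) * g'(mu)
        z    (mapv (fn [e yi m] (+ e (* (- yi m) ((:link-derivative fam) m))))
                   eta y mu)
        w    (mapv (:weight fam) mu)
        p    (count beta)
        xtwx (vec (for [j (range p)]
                    (vec (for [k (range p)]
                           (reduce + (map (fn [row wi] (* wi (row j) (row k)))
                                          x w))))))
        xtwz (vec (for [j (range p)]
                    (reduce + (map (fn [row wi zi] (* wi (row j) zi)) x w z))))]
    (solve-linear xtwx xtwz)))

(defn glm
  "Fits coefficients under `fam`, starting from zeros. Returns nil if
  successive iterates are not within `epsilon` after `max-iter` updates."
  [fam x y & {:keys [epsilon max-iter] :or {epsilon 1e-8 max-iter 50}}]
  (let [beta0 (vec (repeat (count (first x)) 0.0))]
    (->> (iterate #(irls-step fam x y %) beta0)
         (take (inc max-iter))
         (partition 2 1)
         (filter (fn [[prev curr]] (< (distance prev curr) epsilon)))
         first
         second)))

;; Usage: the same function fits both families.
(def linear-fit   (glm gaussian-family [[1 0] [1 1] [1 2] [1 3]] [1 3 5 7]))
(def logistic-fit (glm binomial-family
                       [[1 0] [1 0] [1 1] [1 1] [1 1] [1 1]]
                       [0 1 0 1 1 1]))
```

The point of the design is that `glm` never inspects which distribution it is fitting; everything distribution-specific lives in the family passed in. A production version would use a proper linear algebra library and handle singular or separable data, which this sketch does not.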

Gregory Burd

Clojure data structures (list, map, set) all implement the Java Collections interfaces under the hood. There is one fairly simple way to extend any Java Collection so that it can be arbitrarily large: the Berkeley DB Java Edition Collections API. Berkeley DB implements a persistent, transactional B-Tree with a basic key/value API. Also included in Berkeley DB is a full implementation of the Collections API on top of the B-Tree. Using that API, your collection lives in the B-Tree, stored in files on the local disk, and only portions of the collection are in memory at any given time. Berkeley DB Java Edition maintains an LRU cache of the data you are operating on. Whatever fits into a pre-determined cache size (a percentage of the Java heap) is in memory; the rest is stored on disk. If Clojure/Incanter incorporated an option to use Berkeley DB Java Edition in this way, you could operate on as large a collection as you have storage space for.

Disclosure: My day job is Product Manager at Oracle for Berkeley DB, but don't let that bias you. The license for Berkeley DB Java Edition is liberal enough to use in conjunction with Clojure.