A research blog about programming languages, databases, provenance, and other randomly-selected topics.

Sunday, August 12, 2018

Recursion in R

I've been (trying to) learn a little about R, mostly in self-defense because I'm involved in a project with others using it.

R has lots of fun features, such as built-in support for vector and matrix operations and overloading, which make it well-suited to its domain, statistical programming and modeling.

However, there are also some things that are just plain weird, at least from the point of view of programming language design. Scoping and lazy evaluation interact in a subtle way, particularly with regard to recursion.

One of the things that supposedly distinguishes R from the language S on which it was based is lexical scope. That is, references to non-local variables are supposed to be resolved statically (based on lexical structure) rather than dynamically (as in early forms of LISP, or using non-capture-avoiding substitution in the λ-calculus).

So for example, if we define x to be 42 at the toplevel, define a function f that refers to x, then call f from within another function g that has a local variable or parameter also named x, then the right thing happens:

(NB: The "+" is just the R interpreter's way of telling you it knows that you're writing a multi-line function; the "[1]" is R telling us that the result is a vector and the first element is 44. In R, "everything is a vector.") So, that means that the following does the right thing, right?

> x <- 42> f <- function (y) {x + y}> x <- 17> f(2)[1] 19

Uh, no. You get 19. The reason is that the occurrence of x within f is a reference to the toplevel variable named "x", not a copy of the value it had when f was constructed. Lexical scope, Jim, but not as we know it.

This is actually essential to how recursion works in R. For example, consider everyone's favorite toy recursive function, factorial:

For this to work at all, it's essential that the "fact" inside the function body not be bound to the value fact had when the function was defined, because there is no such value. The toplevel variable fact only is defined after the function expression has finished evaluating.

To confirm that this is indeed what happens, suppose we have defined fact as above, and now do the following:

> oldfact <- fact> fact <- function (n) {2*n}> oldfact(10)[1] 180

That is, we save the original factorial function in a new variable oldfact, then redefinefact as a function that doubles its argument. Now calling oldfact results in one iteration of the original body of fact, followed by a call to the new body of fact on 9, so the end result is (2*9) * 10 = 180.

In other words, when you assign what looks like a "recursive" function in R to a variable, any subsequent redefinitions of that variable apply to any future calls to the original function.

Just to make the implications of that really clear, let's see how one could use this to "patch" the factorial function to explain its own behavior:

So now, calling fact still produces the original result value, but at every recursive call we can insert whatever code we want (for example, printing out the function name and argument n) before calling the old version of fact.

If the above examples make your skin crawl, luckily R is untyped, so it's straightforward to define fact (or any other simple recursive function) in a sane way, using your favorite recursion combinator:

By defining the recursion using closed functions (whose parameters are not accessible or reassignable from outside the function), the recursive definition is now safe from whatever the toplevel does.

Dealing with multiple recursion also seems possible, using R's "lists" (i.e. field-accessible data structures, what many other languages call records or structs), and is left as an exercise for the reader.