Style guide for R code

Now that I work for a large software company, my computer code is heavily scrutinized: there are style guides, rules for indenting, conventions for variable naming, etc. I’ve come around to it–it really does make my code a lot better. blah blah blah. But my company doesn’t really have formal rules for R because hardly any engineers use it. I’m working on writing said rules, and it seems incomplete not to include something about writing efficient code: avoiding nested loops, etc. Do you know of any good references on how to write good R code?

My reply:

1. I’m curious what the R experts say. My own stylistic preferences differ slightly from what appears to be the default in R packages: in particular, I like to indent 2 characters, but R seems to indent a lot more, which to my taste makes the code hard to read in a text editor. (My own stylistic preferences can be deduced from the examples of R code in my books.)

2. I know that there is general advice to avoid global variables–it’s better to pass information in function arguments. When writing Umacs we found this to be awkward and so we used global variables instead, but since then somebody explained to me how to do it all using local variables (without requiring a huge effort in passing lists of variables). Unfortunately I can’t remember now how I was going to do it. Maybe the idea was to put all the arguments in a list.
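A minimal sketch of the "put all the arguments in a list" idea, in case it helps. The names `settings` and `run_chain`, and the fields in the list, are made up for illustration; this is not how Umacs does it.

```r
# Hypothetical example: bundle the settings in one list instead of using globals.
settings <- list(n_iter = 100, step_size = 0.5, start = 0)

# The function reads everything it needs from its single argument.
run_chain <- function(settings) {
  x <- numeric(settings$n_iter)
  x[1] <- settings$start
  for (i in seq_len(settings$n_iter - 1)) {
    x[i + 1] <- x[i] + settings$step_size * rnorm(1)
  }
  x
}

set.seed(1)
chain <- run_chain(settings)

# Changing one setting for one call leaves the original list untouched.
short_chain <- run_chain(modifyList(settings, list(n_iter = 10)))
```

The point of the sketch is only that one list travels through the call, so nothing has to live in the global workspace and nothing has to be passed as a long string of separate arguments.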

3. I personally like to fill up arrays with NA’s when I set them up, so that if something goes wrong, I’ll get lots of NA’s in the result, and I can track back where the problem is.
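Here is roughly what that looks like, as a small sketch. The loop, the skipped iteration, and the names `n_sims` and `results` are invented for the example.

```r
# Hypothetical simulation loop; the "mistake" at s == 3 is deliberate.
n_sims <- 5
results <- array(NA, c(n_sims, 2), dimnames = list(NULL, c("mean", "sd")))

set.seed(2)
for (s in 1:n_sims) {
  if (s == 3) {
    next  # pretend this iteration was skipped by a bug
  }
  y <- rnorm(20)
  results[s, ] <- c(mean(y), sd(y))
}

# Rows that were never filled in still hold NA, which points to the problem.
unfilled <- which(!complete.cases(results))
```

Had the array been set up with zeros, the skipped row would look like a legitimate (if odd) result.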

4. I think R has a debugger but I’ve never used it. I probably should.

5. As you know, I’ve been moving toward the idea of simulating fake data for every problem as a test of the algorithm and code. I call this the self-cleaning oven principle: a good package should contain the means of its own testing. We haven’t yet done this with “arm” but we should.
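A bare-bones version of a fake-data check might look like the following. The model, the "true" values, and the four-standard-error threshold are all arbitrary choices for the sketch, with `lm()` standing in for whatever code is being tested.

```r
# Simulate data from known parameter values (all values here are made up).
set.seed(123)
n <- 1000
a_true <- 2
b_true <- 3
x <- runif(n)
y <- a_true + b_true * x + rnorm(n, sd = 0.5)

# Fit the model with the code under test (here, plain lm).
fit <- lm(y ~ x)
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))

# Are the true values within a few standard errors of the estimates?
recovered <- abs(est - c(a_true, b_true)) < 4 * se
```

If `recovered` comes back FALSE for some parameter, either the fitting code or the simulation code is wrong, and either way it is worth knowing before touching real data.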

6. I agree about avoiding nested loops–when it causes programs to be slow. On the other hand, sometimes a matrix implementation can just be mysterious, and I find it helpful to spell things out with loops. (Again, we discuss this in our book–we even have a footnote or two explaining why we have some loops.)
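To make the tradeoff concrete, here is the same small calculation written both ways. The matrix `X` and the vector `beta` are made-up inputs for the sketch.

```r
# Hypothetical predictors and coefficients.
set.seed(2)
X <- cbind(1, rnorm(5), rnorm(5))
beta <- c(1, 2, -1)

# Spelled out with nested loops: each fitted value is a sum over predictors.
y_hat_loop <- rep(NA, nrow(X))
for (i in 1:nrow(X)) {
  total <- 0
  for (k in 1:ncol(X)) {
    total <- total + X[i, k] * beta[k]
  }
  y_hat_loop[i] <- total
}

# The matrix version does the same thing in one line.
y_hat_matrix <- as.vector(X %*% beta)
```

For a case this simple the matrix version is clear enough, but with more indexes the loop version can be the one that a reader actually follows.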

7. I like to follow the general principle that lines of code should (almost) never be repeated. I’m always seeing students write scripts with cut and pasted code, and I’m always telling them to use a function and a loop instead.
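For instance, instead of pasting the same two lines once per group, something like the sketch below. The data and the names `groups` and `summarize_group` are hypothetical.

```r
# Hypothetical data: three groups that all get the same summary.
set.seed(3)
groups <- list(a = rnorm(10), b = rnorm(10, 1), c = rnorm(10, 2))

# The cut-and-paste version would repeat this for a, b, and c:
#   mean_a <- mean(groups$a); sd_a <- sd(groups$a)

# Instead, write the step once as a function...
summarize_group <- function(y) {
  c(mean = mean(y), sd = sd(y))
}

# ...and apply it in a loop.
summaries <- array(NA, c(length(groups), 2),
                   dimnames = list(names(groups), c("mean", "sd")))
for (g in names(groups)) {
  summaries[g, ] <- summarize_group(groups[[g]])
}
```

Adding a fourth group, or a third summary, then means changing one place rather than several.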

8. A silly little thing: with if () statements, I recommend always using braces (curly brackets), even if the conditional command is just one line. If you or someone else wants to modify a function, it’s much easier to do so if the braces are already there.

9. R functions are getting uglier and uglier. I’d say the typical R function is 90% “paperwork” (exception handling, passing of names, etc) and only 10% “meat” (to mix analogies). I attribute some of this to the S4 system of objects with sockets etc. For one thing, it’s typically no longer possible to see what a function does by typing its name. I don’t know what to say here, except to recommend not drinking the Kool-Aid: maybe you can try to keep your functions clean rather than putting all the effort into the paperwork. (Unfortunately, we didn’t really follow this advice with bayesglm: we made the mistake of adapting the existing glm function.)

10. Scalability is a big issue in R. Ideally any new function would be accompanied by a statement explaining how it scales as the inputs increase in size.

11. When summarizing the results of your output, I recommend working with “display()” (from the arm package) rather than “summary()”. The summary() function always seems to give a lot of crap, and we’ve tried to be cleaner and more focused with display(). One option is to set up functions for both so that users can typically use display(), with some extra information in summary().

12. It’s a good idea to graph inferences. Graphs aren’t just for raw data.

This seems like too much advice; maybe some of the above rules are unnecessary or can be written more generally.

In any case, if you’re writing guidelines, I recommend giving examples of the recommended approach and also the bad approach for each rule.

Perhaps others have suggestions too (or comments on my ideas)? Once you’ve written your guidelines, I hope you can publish them, with discussion, in a statistics journal so all can see. There may already be some R style guide that you can adapt and react to.

8 Comments

You are touching on language and style issues at the same time — either one is already a big topic. Maybe you want to bring this separately to the r-help (or maybe r-devel) lists for wider discussion.

As for style and coding conventions, years ago Henrik Bengtsson presented his ideas on 'R Coding Convention' at one of the Vienna 'DSC' conferences that center around R; you can see his current version and may take this as a further starting point.

For the narrow point 1, I decided after many years of using the ESS default in Emacs that I liked the setup used by R Core better; Emacs configuration snippets are in the 'R Extension' manual that ships with R.

1. I agree with the 2-column indent. Long function names in R combined with indentation (for example) for the function, a loop inside the function, and a conditional statement can mean that the real code begins on column 35 (out of 80 columns). For example:
<pre>
a = function(){
    for(i in 1:10) {
        # menu.letters is a placeholder name, not a real function
        columnTransform = menu.letters()
    }
}
</pre>
I often find the standard 80-column limit makes writing "pretty" R code difficult.

8. At one time I used a certain syntax for 'if' statements that was not conducive to stepping through and evaluating code line by line. My preferred syntax that does allow this is:
<pre>
if(condition) {
  code 1
} else {
  code 2
}
</pre>
The key here is to have the 'else' preceded by the right curly brace instead of being on a line by itself.

9. As Andrew said, it used to be that every function could be examined by typing its name, but now you're likely to get little more than "this is a generic function" or an error because the function is hidden away in a namespace. I find the overuse of namespaces to be a big thorn in my side when examining R code.

Other comments:

I. Some people claim that comments should start at column 40–supposedly easier to read. I *maybe* can see this used to be true for black & white code, but with modern fontification of code, I much prefer comments at the left side, but in a different color. My eyes are already at the left side to follow the code, so that's where I want my comments too.

II. Some languages (like Mathematica) have strong rules for function names (e.g. camel caps as in ThisIsSomeFunction), but R is a total mess. I've complained about this in a few places with this set of examples:
<pre>
row.names, rownames
browseURL, contrib.url, fixup.package.URLs
package.contents, packageStatus
mahalanobis, TukeyHSD
getMethod, getS3method
choose.files, file.choose
</pre>
Instead of getting better, things are actually getting worse as some people now use "_" in function names. Develop rules early for function names and variable names. Follow them.

III. R's use of vectors and object-oriented functions can create code that is just too damn cute for its own good. Remember that for most applications, coding time far exceeds processing time. Slightly inefficient code that is clear is often preferable to cute, efficient code. When your code is cute, add comments to the code that would be clear enough for your grandmother to understand. (Well, maybe not that extreme, but I've had to struggle with a lot of cute code over the years…)

IV. Personally, I always use a 'return' at the end of a function to be explicit about what is returned, either 'return(object)' or just 'return()'. Without 'return', it is not always clear what is happening.
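A small sketch of the difference, using made-up function names:

```r
# Implicit: the value of the last expression is what comes back.
center_implicit <- function(x) {
  x - mean(x)
}

# Explicit: return() states what is returned.
center_explicit <- function(x) {
  centered <- x - mean(x)
  return(centered)
}

# Called only for its side effect; a bare return() gives back NULL.
report_range <- function(x) {
  message("range: ", min(x), " to ", max(x))
  return()
}
```

Both centering functions return the same thing; the second just says so.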

V. Don't use the 'fix' or 'edit' functions–they wreck the formatting of the function.

I've never written R, but I can continue the discussion I was having with Andrew in person on this topic when dinner interrupted us.

That 90% paperwork (point 9, metaphor 1) is important.

For exceptions, the rule is simple. Check incoming arguments to public functions for consistency and document those requirements in the function doc. This segregates these tests to one place in the code. You then don't need to check calls to internal functions.

The advantage of checking inputs is simple — you fail early, so it's easy to debug. This is easier and more standard than filling arrays with not-a-number values (point 3), because you can return a message saying what the problem is. It's good manners in a program, although it may make the program more verbose (just like real good manners).
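In R, that pattern might look something like the sketch below. The functions `weighted_mean` and `sum_weighted` are hypothetical; `stop()` and `tryCatch()` are base R.

```r
# Hypothetical public function: validate arguments once, at the entry point.
weighted_mean <- function(x, w) {
  if (!is.numeric(x) || !is.numeric(w)) {
    stop("'x' and 'w' must be numeric")
  }
  if (length(x) != length(w)) {
    stop("'x' and 'w' must have the same length")
  }
  if (any(w < 0) || sum(w) == 0) {
    stop("'w' must be non-negative and sum to a positive value")
  }
  sum_weighted(x, w)
}

# Internal helper: no checks, it trusts its caller.
sum_weighted <- function(x, w) {
  sum(x * w) / sum(w)
}

ok <- weighted_mean(c(1, 2, 3), c(1, 1, 2))

# A bad call fails right away with a message that names the problem.
msg <- tryCatch(weighted_mean(1:3, 1:2),
                error = function(e) conditionMessage(e))
```

The checks sit in one place, and the failure message says what was wrong rather than leaving a trail of missing values to trace back.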

I find the biggest difference in style is at the algorithm level, not at the syntax level. I speak the languages I use well enough not to be thrown off by weird syntax. But some folks write algorithms oddly. For instance, some people like single return style, which often involves tortured loop exit conditions. Or everything's done with continuations.

Well written code doesn't need inline comments. Inline comments are written for programmers. Inline comments get out of synch with the code, whereas the code never lies. I just ignore comments in code I'm reading — they're rarely both correct and useful. You should be able to tell from the variable and function names what's going on. If there's a confusing operation, call it out in a well-named method and then you don't need to doc it (or implement it again elsewhere).

Function/API doc is written for users. That should explain what the algorithm does and how to use it.

Global variables are the devil's work (point 2). They're so tempting, yet so evil. The first reason is that they introduce namespace conflicts if you try to put two programs together that coincidentally share a global with the same name. The second problem with globals (or object variables used as globals if that was the solution) is that they make it hard to reason about what's going on in loops. If things get passed globally, anything in a program might have set them. Good style conventions should make programs easy to modify and debug.

I think that there is a practical issue that no-one mentioned. 99.99% of the time I am coding up R for data analysis under extreme time pressure. I have code that will run out of the box for fairly involved stuff, but only if I make a few modifications–something that takes 5 minutes. If, however, I were to try to generalize that code into a function, it would take about 30 minutes. The correct answer is easy: just generalize it into a function. But often the choice is between getting an answer to a problem now or much later. And usually I cannot wait.

A related situation I face very often is that I write fairly involved code that does one job really well. But four experiments later I figure out a really "cute" (and transparent and fast) solution. Do I take the time to fix my old clunky code? There is no payoff, the paper is published.

Spread out over the 8-9 years that I have been using R, I have a large collection of code in various stages of elegance.

As far as I can see, there is no way for me to take the time to go back and redo everything consistently. Of course, one might say that single-use code can be thrown away, or just neglected. But sometimes one needs old data, or the data is not that old and it needs to be integrated with new data.

If someone has a clever (or even not so clever) strategy for reducing this problem, I'm very keen to hear it.

1. I don't care personally (I use whatever my editor defaults to), but there is a style guide that R Core uses for this sort of thing. More or less, it's what Emacs defaults to using (which is to use two spaces, not tabs at all, for indenting).

2. The notion of a global variable doesn't really hold in R. The closest thing you can really get is pollution of GlobalEnv by a package, but this actually happens all the freakin' time (and is somewhat necessary). For packages this is probably not recommended unless you have a very specific reason.

If you want to avoid them but still collect a variety of functions, you can hide them in a wrapper function of course. I tend to use S4 objects for my state holding—imagine a sampler. It has a state at time t, represented by some state object as well as some parameters, represented by a parameter object. There's an iterate function defined on (state,parameters) (I'm using S4 so I can do multiple dispatch) that returns t+1. That way a for loop discards my state and an lapply (or sapply) gets me a recorded state. An sapply() with an internal for loop buys me sampling with thinning (though I typically do this as an optional argument to iterate() ).
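A rough sketch of that design, under several assumptions: the class names `State` and `Params`, the slots, and the random-walk update are invented here, and the recorded path is built with a plain loop over a list rather than the lapply/sapply arrangement described above.

```r
library(methods)

# Hypothetical state and parameter objects.
setClass("State", representation(x = "numeric"))
setClass("Params", representation(step = "numeric"))

# iterate() dispatches on both arguments (multiple dispatch).
setGeneric("iterate", function(state, params, ...) standardGeneric("iterate"))

# One random-walk step: takes the state at time t, returns the state at t + 1.
setMethod("iterate", signature("State", "Params"),
  function(state, params, ...) {
    new("State", x = state@x + rnorm(1, sd = params@step))
  })

set.seed(4)
params <- new("Params", step = 0.1)

# Overwriting the state in a for loop keeps only the latest value...
state <- new("State", x = 0)
for (i in 1:50) {
  state <- iterate(state, params)
}

# ...while storing each state in a list records the whole path.
path <- vector("list", 50)
current <- new("State", x = 0)
for (i in 1:50) {
  current <- iterate(current, params)
  path[[i]] <- current
}
draws <- sapply(path, function(s) s@x)
```

All the state lives in the objects being passed around, so nothing needs to be stored globally.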

4. It's pretty minimal. There's a debug package that helps a bit. R 2.6 also has some things to make errors more useful, especially when using S4 dispatch.

5. Yes. The Lisp community has taken this idea one step further and developed a testing system based around the idea of random inputs.

6. Why? If you have an outer loop and an inner loop why should you go to lengths to hide that fact? It is what it is. This is different, of course, from factoring out an inner loop so that it can be used in other places, but factoring an inner loop for the purpose of hiding nested loops is just clutter.

8. Personally, I don't care. For a simple return value I usually do not use braces. Also, to address a point someone brought up above:

ifelse and if…else… are actually separate constructs (S-PLUS also has ifelse1 to make things even more complicated). ifelse is not a "cute" construct for if…else…; it is the PARALLEL version of if…else…. He could have written his ifelse statement as

<pre>
b = if(a==1) 2 else 3
</pre>

just as easily because the if…else… construct is really simply sugar for a function call so it has a return value.

If you are using ifelse it probably looks something like

<pre>
b = ifelse(runif(1000) < .5,2,3)
</pre>

which gives you a vector of 1000 2s and 3s. Since most people don't know about it, they tend to use

<pre>
b = rep(3,1000)
b[runif(1000) < .5] = 2
</pre>

9. Trying to use S4 in the same way as a Java-style object system is pretty much doomed to failure. Done properly, this can actually reduce the amount of book-keeping required by users (who are a different group of people from developers). I think the flowCore package in Bioconductor is a good example of this, but I'm biased (and someday, should write up why things are the way they are). You can do similar things with S3 objects, which we've done in the flowUtils package to handle some XML parsing in an elegant way.

In general, my problem with most of the R style guides I see is that they were apparently written by either C or Java programmers. There are certain things that hold no matter what, but I am concerned that this leads to thinking that assumes that R is a language like C/C++ or Java, which will a) frustrate users when they encounter a situation where that is not true and b) ignore vast swathes of the language, which usually corresponds to "the powerful bits." (Pop quiz: how do you call functions recursively in R? If you answered "by calling the function name again" you are wrong.) It also has a tendency to perpetuate assorted folklore about the language, which is even more annoying (e.g. "for loops are slow," which is not really the case. Typically, updating a massive list is slow. Go ahead, time it. My earlier example about thinning is actually a good one because you are only updating a single state object.)
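The comment does not give the answer to the pop quiz; a likely candidate is base R's Recall(), which calls whatever function is currently running. The sketch below shows that, plus one way to "time it" for loops. The function names and the size `n` are arbitrary, and growing a vector with c() is used here as a stand-in for "updating a massive list."

```r
# Recall() calls the currently executing function, so the recursion
# does not depend on the name the function happens to be bound to.
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * Recall(n - 1)
}
renamed <- factorial_rec
rm(factorial_rec)
result <- renamed(5)  # the recursion does not need the old name

# Two for loops doing the same work: one regrows its result on every pass,
# the other fills a preallocated vector.
grow <- function(n) {
  out <- c()
  for (i in 1:n) {
    out <- c(out, i^2)
  }
  out
}
prealloc <- function(n) {
  out <- rep(NA, n)
  for (i in 1:n) {
    out[i] <- i^2
  }
  out
}

n <- 20000
time_grow <- system.time(a <- grow(n))
time_prealloc <- system.time(b <- prealloc(n))
```

Comparing `time_grow` with `time_prealloc` on your own machine is the easiest way to see whether it is the loop or the repeated copying that costs the time.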

I'm well aware of the niceties of vectorized 'ifelse', but I've noticed that some R code uses it because (I assume) it is more compact or just a habit. Cute, but less portable. I switch back and forth between R and S-Plus and frequently write code in a style that I don't like, but that is more portable. (Note that 'ifelse' is portable between R and S-Plus, but not if I was converting the code to some other languages.)

Of course inline comments are written for programmers, but saying that "code never lies" is nearly as pompous as saying "it's a feature, not a bug". Bugs make code lie all the time.

Further, even good code can't always explain WHY the code is written thus, only WHAT it is doing. (Sure, the specifications might say WHY, but it is much more useful to have the WHY included as a comment where it will stay attached to the code than in the specifications that too often are lost once the project is complete.)

Examples:
<pre>
# The 'read.table' function is a lot faster than
# 'importData', and I wish I could use it here,
# but it chokes on files with the character '233'
</pre>

<pre>
# 'dist' works fine with values above 100, but
# Joe decided to cap it at 100
</pre>

Lastly, desktop and internet search tools work much, much better with comments than the actual code.

I should've recommended Hunt and Thomas's "Pragmatic Programmer" in the last post. It's an entertaining, wonderfully non-dogmatic set of great advice for writing and maintaining code. It's not about any particular language. Much of their advice won't even make sense if you haven't worked with a team of programmers larger than yourself and a student/prof/colleague.

Fixing old code without changing its function is called refactoring. The key to scheduling refactoring is amortization. If you do something clunky once, OK. If you do it twice, you should start to worry. By the third time you use something, fix it up so it's easier to use the fourth and subsequent times. This is often applied to functions. You can cut-and-paste once, but the second time you're tempted, you should refactor to a function.

Also, it's a very different world writing code for others to use and code for yourself to use. There are some aspects of using your own code later that are like others using your code, but it's still vastly different.

Some people have good memories. They can throw every paper they ever print into piles on the floor of their office and still find them. I have a terrible spatial memory and use a filing cabinet. But the real problem is when you work with someone else and they need to find something in your pile. That's when sloppiness becomes unacceptable even the first time around (unless it's well encapsulated).

In any case, it helps to follow a set of standard practices (use the idioms of your language — don't make them up) and naming conventions because it makes it easier for you and others to understand the code. Once you get used to doing this, it actually saves time, as it cuts out decisions about how to do things you might otherwise have to think about.