Paleoecology Blogs

R Tips I Wish I Had Learned Earlier – Using Functions

This post is part of a series expanding on suggestions from Twitter users and contributors on reddit.com’s rstats forum about tips and tricks they wish they had learned earlier.

Writing functions in R isn’t something people start doing right away, and it’s something many people actively avoid: functions often require rewriting code and, in some cases, force you to think hard about how your data looks and behaves in the context of your analysis. When you’re starting a project, or a piece of analysis, it can seem easier to just cut and paste chunks of code, replacing key variable names. In the long term, that kind of cutting and pasting can cause serious problems in your analysis.

It’s funny: we have no problem using functions in our day-to-day programming — lm, plot, sqrt. These functions simplify our data analysis, we use them over and over, and we’d never think of retyping the raw code underneath. But when it comes to our own analysis, we’re willing to cut and paste, letting fairly simple sets of operations run into hundreds of lines, just because we don’t want to (or don’t know how to) add a few simple lines of code to make operations simpler.

Putting repeated code into functions offers other benefits as well:

Opportunities to make further improvements using lapply, parallel applications, and the other *apply functions.

Fewer variables in memory, and faster processing over time.

So let’s look at what a function does: A function is a piece of code – a set of instructions – that takes in variables, performs an operation on those variables, and then returns a value.
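As a minimal sketch of that definition (the function name and the conversion are my own illustration, not from the post), here is a function that takes in a variable, performs an operation on it, and returns a value:

```r
# A toy function: takes in a variable, operates on it, returns a value.
# The name and the temperature conversion are illustrative only.
celsius_to_f <- function(temp.c) {
  temp.c * 9 / 5 + 32  # the last expression evaluated is returned
}

celsius_to_f(100)  # 212
```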

For example, the function lm takes in information about two variables, as a formula, and then calculates the linear model relating them. The actual function is 66 lines of code (you can just type lm into your console to see it), but it includes calls to another function, lm.fit, that is 67 lines long, and that in turn calls other functions. I’m sure we can all agree that nobody wants to retype all of that every time they fit a linear model.
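The original post’s code block did not survive in this copy, so what follows is a hypothetical reconstruction of the kind of cut-and-paste analysis being described: three datasets (climate, location, and pollen), with the same operations pasted three times and variable names swapped. The data, the bad.sites vector, and the sqrt transform are all invented for illustration; note the repeated section is 9 lines, matching the count given below.

```r
# Made-up datasets standing in for the post's climate, location and pollen data.
set.seed(42)
sites    <- paste0("site", 1:10)
climate  <- matrix(runif(40), nrow = 10, dimnames = list(sites, NULL))
location <- matrix(runif(20), nrow = 10, dimnames = list(sites, NULL))
pollen   <- matrix(runif(40), nrow = 10, dimnames = list(sites, NULL))
bad.sites <- c("site2", "site7")  # assumed set of sites to drop

# The same general operations, cut and pasted three times (9 lines):
good.loc      <- !(rownames(climate) %in% bad.sites)
climate.clip  <- climate[good.loc, ]
climate.trans <- apply(climate.clip, 1, sqrt)
climate.mean  <- rowMeans(climate.trans)
location.clip <- location[good.loc, ]
location.mean <- rowMeans(location.clip)
pollen.clip   <- pollen[good.loc, ]
pollen.trans  <- apply(pollen.clip, 1, sqrt)
pollen.mean   <- rowMeans(pollen.trans)
```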

This is a simple example, and it doesn’t exactly make sense (there’s no reason to average the rows). But, it is clear that for all three datasets (climate, location and pollen) we are doing the same general set of operations.

So, what are the things we are doing that are common?

We’re removing a common set of sample sites.

We’re applying a function by rows (except for location).

We’re averaging by row.

So, this seems like a decent candidate for building a function. Right now we’re at 9 lines of code, so keep that in mind.

We want to pass in a variable, so we’ll call it x, and we’re applying a function across rows before averaging, so we can pass in a function name as y. Location is a bit tricky: we don’t actually transform it, but there is a function called identity that simply returns the argument passed to it, so we can use that. Lastly, we take the rowMeans and return the result. Let’s try to construct the function:
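The function itself was also lost from this copy of the post; here is a reconstruction consistent with the surrounding description. The names clip_mean, x, y, good.loc, and x.trans all appear in the text, while bad.sites, the datasets, and the sqrt transform are my assumptions. The definition plus the three calls come to 9 lines, as the text states:

```r
# Made-up data so this sketch runs on its own.
set.seed(42)
sites    <- paste0("site", 1:10)
climate  <- matrix(runif(40), nrow = 10, dimnames = list(sites, NULL))
location <- matrix(runif(20), nrow = 10, dimnames = list(sites, NULL))
pollen   <- matrix(runif(40), nrow = 10, dimnames = list(sites, NULL))
bad.sites <- c("site2", "site7")

# Reconstruction of clip_mean: drop bad sites, apply y by rows, average.
clip_mean <- function(x, y) {
  good.loc <- !(rownames(x) %in% bad.sites)
  x.clip   <- x[good.loc, ]
  x.trans  <- apply(x.clip, 1, y)
  rowMeans(x.trans)
}
climate.mean  <- clip_mean(climate, sqrt)
location.mean <- clip_mean(location, identity)  # identity: no transform
pollen.mean   <- clip_mean(pollen, sqrt)
```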

So we’re still at 9 lines of code, but to me this reads much more cleanly. We could go even further by combining the lines below the good.loc assignment into something like this:

rowMeans(apply(x[good.loc,], 1, y))

which would turn our 9 lines of code into 7 fairly clean lines.
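Put together, the condensed version might look like this (again a sketch: the datasets and bad.sites are invented, and only the names clip_mean, x, y, and good.loc come from the text). The definition plus the three calls is now 7 lines:

```r
# Made-up data so the sketch runs on its own.
set.seed(42)
sites    <- paste0("site", 1:10)
climate  <- matrix(runif(40), nrow = 10, dimnames = list(sites, NULL))
location <- matrix(runif(20), nrow = 10, dimnames = list(sites, NULL))
pollen   <- matrix(runif(40), nrow = 10, dimnames = list(sites, NULL))
bad.sites <- c("site2", "site7")

# The condensed function: clip, transform and average in one expression.
clip_mean <- function(x, y) {
  good.loc <- !(rownames(x) %in% bad.sites)
  rowMeans(apply(x[good.loc, ], 1, y))
}
climate.mean  <- clip_mean(climate, sqrt)
location.mean <- clip_mean(location, identity)
pollen.mean   <- clip_mean(pollen, sqrt)
```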

The other advantage of sticking this all into a function is that the variables created inside clip_mean (e.g., x.trans) exist only within the function. The only thing that gets passed out at the end is the result of the rowMeans call. If we had assigned that rowMeans result to a variable (using <-) inside the function, we wouldn’t even have it at the end, since you need to pass data out of the function, either with the return command or by making the last line a bare variable or function call. So, even though we’ve still got 9 lines of code, we now have only 8 variables in memory instead of 13. That’s a big help, because the more variables you have, the harder debugging becomes.
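You can check this scoping behaviour directly: after calling the function, its internal variables are gone. A quick demonstration (data and bad.sites invented, as above; run in a fresh session):

```r
# Internal variables (good.loc, x.trans) live only inside the function.
clip_mean <- function(x, y) {
  good.loc <- !(rownames(x) %in% bad.sites)
  x.trans  <- apply(x[good.loc, ], 1, y)
  rowMeans(x.trans)
}

bad.sites <- "site1"
climate   <- matrix(1:20, nrow = 5,
                    dimnames = list(paste0("site", 1:5), NULL))
climate.mean <- clip_mean(climate, sqrt)

exists("x.trans")   # FALSE: it only ever existed inside clip_mean
exists("good.loc")  # FALSE
```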

So, to summarize: you’ve got cleaner code, fewer variables in memory, and you now look like a total pro.
If you have any tips or suggestions for working with functions please leave them in the comments.