Comparing Transformation Styles: attach, transform, mutate and within

There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!

Here’s a data frame with one variable:

> mydata <- data.frame(x = 1:5)
> mydata
x
1 1
2 2
3 3
4 4
5 5

Since the variable x exists only in mydata, to transform x, I must somehow tell R it is stored in mydata. The simplest way to do that is using dollar format: mydata$x. I’ll make a copy of the data first so we can do the transformation several ways:

That works, but I had to type more characters for the “mydata.new” part than I did for the transformation itself. So let’s look at approaches that save us that trouble. One widely used approach is to use the attach function. This function makes a copy of a data frame’s variables in a temporary area that is attached to your search path as separate variables or vectors. That’s nice because you can refer to them simply by their names like “x” instead of “mydata$x”. However, the attach function is tricky to use. Here’s the most common mistake made by beginners:

There are no error messages, but the variables are not in the data frame! The attach function allows you to use short names to refer to variables in a data frame, but it does not change where new variables are written. So x2 and x3 are simply in my workspace:

The variable x2 got created and put into mydata.new. However, when the attempt to create x3 was run, variable x2 could not be found. This is due to the fact that the attached version of the data is a copy that was done in the past, it is not a live connection. Therefore, to refer to simply “x2” you would have to attach mydata.new again. You could also get around this problem by using dollar format in the second equation:

That worked, but having to keep track of when you do and don’t need dollar format seems more trouble than it’s worth. In addition, the fact that attach actually makes a copy of the data means that it wastes both time and memory.

The transform function lets you use short variable names on both sides of the equation, and it does not need to make a copy of the data set. Let’s just square x to see how it works.

Notice that when calling the transform function, new variable names like x2 are actually the names of arguments, and the formulas are the values of those arguments. As a result, the equals sign is used instead of the assignment operator “<-”.

Eliminating the tedious repetition of “mydata$…” makes the formulas easier to enter, read and debug. However, the transform function has a problem: it is unable to use a variable that it just created. For example:

We see that when attempting to create x3 from x2, the variable x2 is not found. It will not exist until the call to transform is complete. In our simple example, x2 may be merely an intermediate step, and we could avoid this problem by calculating x3 directly with one formula: x3 = (x ^ 2) + 100. However, if we really need x2 to exist later as a variable, we would have to run transform twice, once to create x2 and again to create x3 from it.

In the above code, note the comma between the two equations. Since transform uses equations as the values of tranform’s arguments, all equations must be followed by commas, except for the last one, which is followed by the final close parenthesis.

Hadley Wickham’s plyr package has a very useful function, mutate. It’s very similar to the base transform function but it can use variables that it just created:

However, mutate does have a limitation: it cannot re-create a variable that it just created. So you can use its new variables only on the right-hand side of your equations. In this next example, rather than create x3, I’ll continue to use the name x2:

As you can see, mutate kept only the first transformation to x2, ignoring the addition of 100. You might think that reusing the same variable name would be a rare occurrence, but if you are recoding a variable using the ifelse function (albeit inefficiently) this situation can arise often. (Avoid that by nesting multiple calls to ifelse, which is also more efficient.)

Finally, we come to the within function. It uses variables by their short names, saves new variables inside the data frame using short names, and it allows you to use new variables anywhere in calculations. It is built into base R, and it works like this:

Notice that we’re back to using the assignment operator “<-” and commas are not used between formulas. Multiple formulas must be enclosed in {braces}. Also note that the variables appear in the data frame in reverse order. Variable x3 appears before x2, even though the formula for x2 appeared first.

When I reuse the variable name x2 rather than create a new variable, x3, I still get the right answer:

Since the within function does this example so well, why use anything else? The mutate function shares syntax with plyr’s summarise function and their combination provides great flexibility when doing transformations or getting summary statistics by groups. Because of this, I use mutate to do this type of task and remember to not transform a variable that I just created!

That covers the main ways to transform variables in R. I hope that by understanding the limitations of each, you’ll avoid common pitfalls and be a more productive R user.

Related

To leave a comment for the author, please follow the link and comment on his blog: r4stats.com » R.