Why do we need the reference class?

The importance of data structures in algorithm implementation and design patterns has been stressed many times in textbooks. In this section, I explain why we need to combine the idea of the reference class with data structures in R.

The reference class

The first question is: what is the reference class?

As an R user, you probably know that everything in R is an object. This includes variables such as vectors, lists, and data frames, and even functions (closures).

Almost all variables are assigned or passed by value (this is termed pass-by-value). Functions count as variables here too, because a function name can be rebound to another function.
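A minimal sketch of such an assignment; the concrete values are assumptions chosen only for illustration:

```r
x <- c(1, 2, 3)
y <- x          # y receives the value of x

y[1] <- 100     # modifying y does not affect x

print(x)        # 1 2 3
print(y)        # 100 2 3
```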

In an assignment such as y <- x, R semantically copies the value of x to y, so the two are distinct variables that occupy different pieces of memory. In practice, however, no new memory is allocated for y at the moment y <- x runs. R does a “fake” pass-by-reference (explained later): both names share the same memory until the value of x (or y) is changed, and only then is new memory allocated. Either way, x and y behave as if they were in different places in memory.
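Function calls follow the same by-value semantics. A minimal sketch, assuming a hypothetical function func that modifies its argument val:

```r
# Hypothetical function that changes its argument.
func <- function(val) {
  val <- val + 1   # changes the local val only
  return(val)
}

x <- 1
z <- func(x)

print(x)   # 1, the caller's x is untouched
print(z)   # 2
```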

When a function such as func modifies its argument val and is called as z <- func(x), R copies the variable x to a new local variable val, changes the value of val, and then returns it by value! (The local val is copied to the new variable z, and val is destroyed when the function exits.) So, conceptually, the call copies twice: once when passing and once when returning.

Note that if the value of val is not changed inside func, that is, if func only reads its argument (for example, it just returns val + 1), R has no reason to copy x when passing it: val simply shares the memory of x for the duration of the call.
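A minimal sketch of a function that only reads its argument; the name func follows the text, and the body is an assumption for illustration:

```r
# Hypothetical non-modifying function: val is read but never assigned to.
func <- function(val) {
  return(val + 1)
}

x <- 1
z <- func(x)

print(x)   # 1
print(z)   # 2
```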

By contrast, under true pass-by-reference, the assignment y <- x would make the variable y an alias (the term is a “reference”) of x, which means that x and y share the same memory. Once you change the value of either of the two, the other changes automatically.

If a function were pass-by-reference, the argument val would share the same memory as x when you run func(x), and any change that func makes to val would also change the global x.

So with pass-by-reference, no new memory is allocated; the original variable simply gains an alias. When you change the value through the alias, the original changes as well.

The reference class follows the same rule: any object (instance) of such a class is always passed by reference.
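Base R's environments are one kind of object that already behaves this way, so they can serve as a minimal sketch of the aliasing described above. The names x, y, and func mirror the text; storing the number in a field called val is an assumption for illustration:

```r
x <- new.env()
x$val <- 1

y <- x                      # y is an alias of x, not a copy

# Hypothetical function that modifies the object it receives.
func <- function(e) {
  e$val <- e$val + 1
  invisible(NULL)
}

func(x)

print(x$val)                # 2, changed by the function
print(y$val)                # 2, the alias sees the same change
```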

Data structures

R is becoming a more and more sophisticated language, and R users rely on it for serious work in data science. So the question is: “Are you really happy with only pass-by-value?” or, put differently, “Is pass-by-value alone sufficient?”

The answer is NO. In data science we now do much more with R than was originally expected, and this work needs ideas from data structures, whose implementation in turn needs pass-by-reference!

So the dream is that “we hope that we can input something into some function and the function will change its value!”

Or, for some people, “we really miss the pointers or references in Fortran, C, C++, etc.”

Even R itself says no: Reference Classes, built on top of the S4 class system, ship with R. Later, the lighter and faster R6 class system became available in the R6 package. I use R6 classes to implement the data structures.

Suppose that you want to design and implement some algorithm, and that the natural solution is recursion: you pass a variable into a function, and the function calls itself and passes the same variable along. Such an algorithm requires that changes made to the variable inside the recursion remain visible to the caller. Consider, for example, traversing a binary search tree. If we want a copy of the traversed elements, it is desirable to pass a container (vector, list, data.frame, etc.) by reference into the recursive traverse function. Each time an element of the tree is reached, its value is copied into the container.

Now you see that data structures and algorithms do require the pass-by-reference feature!
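The difference can be sketched with a small hand-built tree. Everything here (the Collector class, the node helper, and the two traverse functions) is hypothetical and written only for this illustration; it is not taken from any package:

```r
library(R6)

# Hypothetical reference container.
Collector <- R6Class("Collector",
  public = list(
    items = NULL,
    add = function(x) {
      self$items <- c(self$items, x)
      invisible(self)
    }
  )
)

# Small binary search tree stored as nested lists.
node <- function(value, left = NULL, right = NULL) {
  list(value = value, left = left, right = right)
}
tree <- node(5, node(3, node(1), node(4)), node(8))

# By value: each call appends to its own local copy of `out`.
traverse_value <- function(n, out) {
  if (is.null(n)) return(invisible(NULL))
  traverse_value(n$left, out)
  out <- c(out, n$value)
  traverse_value(n$right, out)
  invisible(NULL)
}

# By reference: every call appends to the same Collector object.
traverse_ref <- function(n, out) {
  if (is.null(n)) return(invisible(NULL))
  traverse_ref(n$left, out)
  out$add(n$value)
  traverse_ref(n$right, out)
  invisible(NULL)
}

vec <- c()
traverse_value(tree, vec)
print(length(vec))   # 0, the caller's vector is untouched

col <- Collector$new()
traverse_ref(tree, col)
print(col$items)     # 1 3 4 5 8
```

The by-value version could be repaired by returning the container from every call and reassigning it, but that changes the shape of the recursion; the reference container leaves the traversal logic alone.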

The R6 class is truly the reference class

To investigate the R6 class, and in particular to confirm that it really does pass-by-reference, consider the example below.

We define a new R6 reference class RClass. It has a finalize function, which runs when the system removes an instance of the class (that is, frees or collects the memory allocated for it). Here finalize prints a message showing that the instance is being deleted. We trigger garbage collection manually with the gc function, because R is “lazy”…

Our experiment is designed as follows. If an instance of RClass were passed by value into the function ftmp, then i) the global tmp1 would not be changed; and ii) the new variable tmp inside the function would be removed when the function exits, so we would see a message saying “obj 1 deleted!”, because the value held by tmp is set to one. Now let’s check.
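The experiment can be sketched as follows. The names RClass, ftmp, tmp, and tmp1 come from the text; the field name val, its default of zero, and placing finalize in the private list (as recent R6 versions recommend) are assumptions:

```r
library(R6)

RClass <- R6Class("RClass",
  public = list(
    val = NULL,
    initialize = function(val = 0) {
      self$val <- val
    }
  ),
  private = list(
    # Runs when an instance is garbage collected.
    finalize = function() {
      cat("obj", self$val, "deleted!\n")
    }
  )
)

# Sets the value of whatever instance it receives to one.
ftmp <- function(tmp) {
  tmp$val <- 1
  invisible(NULL)
}

tmp1 <- RClass$new()
ftmp(tmp1)
invisible(gc())    # force garbage collection right after the call

print(tmp1$val)    # 1
```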

We see that no memory space is freed, which means that inside the function, no new variable was created. The value of the global tmp1 has been changed to one successfully.

So our conclusion is that R6 is capable!

The binary search tree example

Here is one example from the R6DS package. A binary search tree is quite efficient for sorting, searching, and traversing its elements. A search takes \(O(\log n)\) time if the tree is well balanced.

When building the binary search tree, the “<” and “=” operators must be defined and supplied to the generator of the RBST class. For simplicity, we just compare numbers.
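The idea of supplying the two comparison operators at construction time can be sketched with a small stand-in class. MiniBST and its methods insert and has are hypothetical and written only for this illustration; they are not the RBST interface, whose exact signatures are given in the R6DS manual:

```r
library(R6)

# Hypothetical minimal tree; NOT the RBST API.
MiniBST <- R6Class("MiniBST",
  public = list(
    root = NULL,
    initialize = function(lessthan, equal) {
      private$lessthan <- lessthan
      private$equal <- equal
    },
    insert = function(x) {
      self$root <- private$insert_at(self$root, x)
      invisible(self)
    },
    has = function(x) {
      n <- self$root
      while (!is.null(n)) {
        if (private$equal(x, n$value)) return(TRUE)
        n <- if (private$lessthan(x, n$value)) n$left else n$right
      }
      FALSE
    }
  ),
  private = list(
    lessthan = NULL,
    equal = NULL,
    # Returns the subtree rooted at n with x inserted.
    insert_at = function(n, x) {
      if (is.null(n)) return(list(value = x, left = NULL, right = NULL))
      if (private$equal(x, n$value)) return(n)   # ignore duplicates
      if (private$lessthan(x, n$value)) {
        n$left <- private$insert_at(n$left, x)
      } else {
        n$right <- private$insert_at(n$right, x)
      }
      n
    }
  )
)

# The "<" and "=" operators for plain numbers.
lessthan <- function(x, y) x < y
equal <- function(x, y) x == y

bst <- MiniBST$new(lessthan, equal)
for (v in c(5, 3, 8, 1, 4)) bst$insert(v)

print(bst$has(4))   # TRUE
print(bst$has(7))   # FALSE
```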

OK, the tree is now built. When you read the manual of the RBST class, you will see that we can traverse the tree by calling the traverse function of the class. Each node or element can be manipulated by the callback function that is passed into traverse. But the real dream is to pass something by reference into the recursive traverse and do something with it inside the recursion.

You can pass any instance of some reference class into the function to make the manipulation of the data much more flexible. And you should see how important the package can be…
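A minimal sketch of this pattern, assuming a hypothetical accumulator class Stats and a hand-written traverse function that forwards extra arguments to the callback (neither is taken from R6DS):

```r
library(R6)

# Hypothetical accumulator that keeps a running count and sum.
Stats <- R6Class("Stats",
  public = list(
    count = 0,
    total = 0,
    update = function(x) {
      self$count <- self$count + 1
      self$total <- self$total + x
      invisible(self)
    }
  )
)

# Small binary search tree stored as nested lists.
node <- function(value, left = NULL, right = NULL) {
  list(value = value, left = left, right = right)
}
tree <- node(5, node(3, node(1), node(4)), node(8))

# In-order traversal; anything in `...` is forwarded to the callback.
traverse <- function(n, callback, ...) {
  if (is.null(n)) return(invisible(NULL))
  traverse(n$left, callback, ...)
  callback(n$value, ...)
  traverse(n$right, callback, ...)
  invisible(NULL)
}

stats <- Stats$new()
traverse(tree, function(item, acc) acc$update(item), stats)

print(stats$count)   # 5
print(stats$total)   # 21
```

Because stats is a reference object, the callback can update it at every node without the traversal having to return anything.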

But of course, the example is somewhat “stupid”, as you can solve the same problem by using a global variable. However, global variables are best avoided in many other cases. For example, when you share your code with others, it is preferable that the function offers the only interface, so that users do not need to know the names of the global variables or create them manually.