Object-Oriented Programming in Data Science with R

Since R is mostly a functional language and data science work lends itself to being expressed in functional form, you can get by just fine without learning about object-oriented programming.

Personally, I mostly follow a functional programming style (although often not a pure one, i.e. one without side effects, because of limited RAM). Expressing mathematical concepts in a functional way is quite natural, in my opinion.

However, object-oriented programming (OOP) offers a lot of benefits in certain use cases. The Python data science community embraces OOP, possibly because of its stronger background in computer science, as opposed to the math/stats background of the R community. While I think that OOP is sometimes taken too far (I do not want to write numpy.matmul(a, b) to do matrix multiplication, I prefer A %*% B :), I also think that there is a lot to like about it. OOP helps to hide complexity, e.g. by encapsulating the complexity of a prediction algorithm.

In this post I want to show you how to use the S3 class system to load data from different sources into R and how to implement a class myPredictionAlgorithm with a fit() and predict() method using R6 as a class system.

Object-oriented programming in R

As already mentioned, R has multiple systems to implement object-oriented programming. In order of complexity, starting from the simplest, they are:

S3 classes,

S4 classes,

Reference classes (~ R5) and

R6 class system.

In contrast to ‘classic’ message-passing object-oriented languages like Python, C++ or Java, S3 uses so-called generic-function OOP. Message-passing OOP involves sending messages (= methods) to an object, which then tries to find an appropriate function to call (Hadley Wickham, ‘Advanced R’). S3 generic-function OOP is actually quite similar to operator overloading: a generic function, say print(), decides which method to call, such as print.myClass(), based on the class of its argument. S3 has no formal class definition. S4 and Reference Classes are both more formal than S3. R6 is what most programmers coming from, say, Python expect an OO system to look like.

S3 generics

A generic function is a function whose functionality depends on the object it is used on. print() is one of the best examples that shows the power of generic functions:
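As a small illustration of that behaviour, the sketch below calls print() on two built-in object types and on an object of a made-up class. The class name ‘myClass’ and its name field are assumptions for the example only.

```r
# print() picks a method based on the class of its argument
x = c(1.5, 2.5)
df = data.frame(id = 1:2, value = x)
print(x)   # numeric vector: handled by print.default()
print(df)  # data frame: handled by print.data.frame()

# Hypothetical class 'myClass' with its own print method
print.myClass = function(x, ...) {
  cat("<myClass: ", x$name, ">\n", sep = "")
  invisible(x)
}

obj = structure(list(name = "demo"), class = "myClass")
print(obj)
## <myClass: demo>
```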

In S3 we can create a class simply by setting the class attribute. In our create_connection() function the class depends on the object type, which will be either ‘local’ or ‘database’. Now we define a generic function extract() to load our data into R depending on the class of the connection object:
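A minimal sketch of what such a constructor and generic could look like. The argument names type and path, the list layout of the connection object and the file name "data.csv" are assumptions. The generic is called extract() here to match the ‘extract.class_name’ naming convention.

```r
# Hypothetical constructor: the class of the returned object is taken from `type`
create_connection = function(type = c("local", "database"), path) {
  type = match.arg(type)        # only 'local' or 'database' are accepted
  object = list(path = path)
  class(object) = type          # setting the class attribute creates the S3 class
  object
}

# Generic function: UseMethod() dispatches on the class of `object`
extract = function(object, ...) {
  UseMethod("extract")
}

con = create_connection("local", path = "data.csv")
class(con)
## [1] "local"
```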

We define a generic function with UseMethod(). When we call extract(), R checks the class attribute of the object we passed. Based on the class, it searches for a function following the naming convention ‘extract.class_name’. If it finds one, it will use it; otherwise it will use our default implementation:
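The dispatch described above can be sketched as follows. The method bodies are assumptions: the ‘local’ method reads a CSV file, the ‘database’ method is only a placeholder, and the default method raises an error. To keep the block self-contained, the connection object is built directly with structure() and points to a temporary file.

```r
# Generic: dispatches on the class attribute of `object`
extract = function(object, ...) {
  UseMethod("extract")
}

# Method for class 'local': read a CSV file from disk
extract.local = function(object, ...) {
  read.csv(object$path, ...)
}

# Method for class 'database': placeholder only, a real version would run a query
extract.database = function(object, ...) {
  stop("database extraction is not implemented in this sketch")
}

# Fallback used when no class-specific method exists
extract.default = function(object, ...) {
  stop("no extract() method for class '", class(object)[1], "'")
}

# Usage: write a small temporary CSV file and load it through the generic
path = tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), path, row.names = FALSE)
con = structure(list(path = path), class = "local")
dat = extract(con)   # calls extract.local()
```

Calling extract() on an object without a matching method, e.g. extract(42), ends up in extract.default().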

You could also rename our function extract() to load(), but the convention is to use the same name as in our call to UseMethod(). This allows you to easily find all specific implementations of the generic function using methods():
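A short sketch of that call. The generic and its methods are redefined here as stubs (hypothetical bodies) only so that there is something to list.

```r
# Stub generic and methods so the call below has something to find
extract = function(object, ...) UseMethod("extract")
extract.local = function(object, ...) "from a local file"
extract.database = function(object, ...) "from a database"
extract.default = function(object, ...) stop("unsupported connection")

# List every method available for the generic extract()
m = methods("extract")
print(m)
```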

We can also use methods() to list all methods implemented for a given class:

methods(class = "local")

## [1] extract
## see '?methods' for accessing help and source code

So essentially we created the following class hierarchy (adapted from: Thomas Mailund, Advanced Object-Oriented Programming in R):

The abstract class ‘extractData’ defines an interface ‘extract’. We do not explicitly create the abstract class, we only define its method extract() using a generic function. Using this generic function, we implement concrete classes called ‘local’, ‘database’ and ‘default’ by writing corresponding extract.class functions.

In the S3 class system inheritance works by specifying a character vector as class attribute like so:

object = 1
class(object) = c("C", "B", "A")
class(object)

## [1] "C" "B" "A"

The first element in the class attribute vector is the most specialized and the last the most general.
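To see what this ordering means for dispatch, here is a small sketch with a hypothetical generic describe(). Class ‘C’ has no method of its own, so R falls through to the method for ‘B’, which in turn hands over to the method for ‘A’ via NextMethod().

```r
# Hypothetical generic used only to demonstrate the dispatch order
describe = function(x, ...) UseMethod("describe")

describe.A = function(x, ...) "A"

describe.B = function(x, ...) {
  parent = NextMethod()   # continue with the next class in the vector, here 'A'
  paste("B >", parent)
}

object = 1
class(object) = c("C", "B", "A")

describe(object)          # no describe.C exists, so describe.B is used
## [1] "B > A"
inherits(object, "A")
## [1] TRUE
```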

R6

With a few notable exceptions (e.g. the data.table package), data is effectively immutable in R: modifying an object creates a modified copy (copy-on-modify). R6 is a class system that breaks this principle by allowing mutable objects with reference semantics. This allows us to create methods that actually modify an object in place instead of making a copy. R6 can be seen as an improved version of the reference class system (R5), so I will not cover R5.

Inside methods, public members are accessed using self$ and private members using private$. The R6 introduction vignette suggests having methods return invisible(self) if you want them to be chainable. Private members can be accessed only by methods defined in the class or its sub-classes.
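A minimal sketch of a myPredictionAlgorithm class with fit() and predict() methods, assuming the R6 package is installed. The linear model fitted with lm() is only a stand-in for whatever prediction algorithm you want to encapsulate, and the field names formula and model are assumptions.

```r
library(R6)

myPredictionAlgorithm = R6Class(
  "myPredictionAlgorithm",
  public = list(
    formula = NULL,
    initialize = function(formula) {
      self$formula = formula
    },
    # Fit the model and store it in a private field
    fit = function(data) {
      private$model = lm(self$formula, data = data)
      invisible(self)   # returning self makes the method chainable
    },
    # Predict with the stored model
    predict = function(newdata) {
      if (is.null(private$model)) stop("call fit() before predict()")
      stats::predict(private$model, newdata = newdata)
    }
  ),
  private = list(
    model = NULL        # not accessible from outside the object
  )
)

# Usage: fit() modifies the object in place, so the calls can be chained
algo = myPredictionAlgorithm$new(mpg ~ wt)
preds = algo$fit(mtcars)$predict(mtcars[1:3, ])
```

Because the fitted model lives in private, algo$model returns NULL from the outside; only the methods of the class can reach it.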

Unfortunately, there is no way to enforce types of fields in R6 except by implementing checks manually (e.g. as a checker class).
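One way to implement such a manual check, sketched here with an R6 active binding rather than a separate checker class, is to route every assignment to a field through a function that validates the value. The class name Learner and the field learning_rate are made up for the example.

```r
library(R6)

Learner = R6Class(
  "Learner",
  public = list(
    initialize = function(learning_rate = 0.1) {
      self$learning_rate = learning_rate   # goes through the check below
    }
  ),
  private = list(
    .learning_rate = NULL                  # actual storage for the value
  ),
  active = list(
    learning_rate = function(value) {
      if (missing(value)) return(private$.learning_rate)   # read access
      stopifnot(is.numeric(value), length(value) == 1)     # manual type check
      private$.learning_rate = value
    }
  )
)

l = Learner$new(0.5)
l$learning_rate = 0.01                     # valid: a single number

# Assigning a character value fails the check and raises an error
res = tryCatch({
  l$learning_rate = "fast"
  "accepted"
}, error = function(e) "rejected")
```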