Why use the R Language?

What is R, and S?

This used to be called “An Introduction to the S Language”. R is a dialect of the S language, and has come to be — by far — the dominant dialect.

S started as a research project at Bell Labs a few decades ago; it is a language that was developed for data analysis, statistical modeling, simulation and graphics. It is, however, a general-purpose language with some powerful features, and it could (and does) have uses far removed from data analysis.

R should be used for many of the tasks that spreadsheets are currently used for. If a task is non-trivial to do in a spreadsheet, then it would almost always be more productively (and safely) done with R. “Spreadsheet Addiction” talks about problems with spreadsheets and how R is often a better tool.

Why the R Language?

R is not just a statistics package, it’s a language.

R is designed to operate the way that problems are thought about.

R is both flexible and powerful.

The Importance of Being a Language

In this context “package” has the specific meaning of software that gives you a set number of choices of what to do. It is not at all the same as “R package” which is discussed later.

Though the distinction between a package and a language is subtle, that subtle difference has a massive impact. With a package you can perform some set number of tasks — often with some options that can be varied. A language allows you to specify the performance of new tasks.

Your retort may be, “But I won’t want to create a new form of regression.” Yes, R does allow you to create new forms of regression (and many people have), but R also allows you to easily perform the same sort of standard regression on your 5 datasets (or maybe it is 500 datasets).

The key is abstraction. You easily see that your 5 regressions are really the same — there is merely different data involved with each. In your mind you have abstracted the specific tasks so that they all look similar. Once you see the abstraction, it is simple to teach R the abstraction. Languages are all about abstraction.
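The point can be made concrete with a small sketch. The data below are made up for illustration; the idea is that one expression captures the abstraction "regress y on x", and applying it to 5 or 500 datasets is the same amount of code.

```r
# Hypothetical example: a list of data frames, each with columns x and y
datasets <- list(
  data.frame(x = 1:5, y = c(2, 4, 6, 8, 10)),
  data.frame(x = 1:5, y = c(1, 3, 5, 7, 9))
)

# One call expresses "fit the same regression" for every dataset
fits <- lapply(datasets, function(d) lm(y ~ x, data = d))

# Pull the slope out of each fit
slopes <- sapply(fits, function(fit) coef(fit)["x"])
```

Whether `datasets` holds 2 data frames or 500, the code does not change.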

The Way We Think

One of the goals of S (and hence R), and one that I think has largely been successful, is that the language should mirror the way that people think. A simple example: suppose we think that weight is a function of (dependent on) height and girth. The R formula to express this is:

weight ~ height + girth

The + is not + as in addition, but + as in “and”.
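Such a formula is used directly when fitting a model. Here is a minimal sketch with invented data (four observations, purely for illustration):

```r
# Made-up data: weight thought of as a function of height and girth
heights <- c(60, 65, 70, 72)
girths  <- c(30, 34, 38, 40)
weights <- c(120, 150, 180, 195)

# The formula reads the way the problem is stated:
# weight depends on height and girth
fit <- lm(weights ~ heights + girths)

coef(fit)  # an intercept plus one coefficient per explanatory variable
```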

Another feature of R is that it is vector-oriented: objects are generally treated as a whole, as humans tend to think of the situation, rather than as a collection of individual numbers. Suppose that we want to change the heights from inches to centimeters. In R the command could be:

height.cm <- 2.54 * height.inches

Here height.inches is an object that contains some number of heights, one or millions. R hides from the user that this is a series of multiplications and instead behaves the way we think: whatever is in inches, multiply by 2.54 to get centimeters.
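Run as a complete snippet (with a few example heights), the conversion looks like this; there is no explicit loop anywhere:

```r
# Three example heights in inches
height.inches <- c(60, 65, 70)

# One vectorized expression converts all of them at once
height.cm <- 2.54 * height.inches

height.cm  # 152.4 165.1 177.8
```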

Experience with C or Fortran can ironically make it harder to use R efficiently. The C-before-R gang tend to translate the problem into “programming” rather than thinking about the problem in the “natural” way.

A Moveable Feast

Flexibility and power abound in R. For instance, it is easy to call C and C++ functionality from R. R does not insist that everything is done in its language, so you can mix tools — picking the best tool for each particular task.

The pieces of code that are written in the R language are always available to the user, so a minor change to the task usually requires only a minor change to the code — a change that can be carried out in a minor amount of time.

Why R for data analysis?

R is not the only language that can be used for data analysis. Why R rather than another? Here is a list:

interactive language

data structures

graphics

missing values

functions as first class objects

packages

community

Data analysis is inherently an interactive process: what you see at one stage determines what you want to do next. Interactivity is important. Language is important. The two together, an interactive language, are more than the sum of their parts. But there is a downside: compromises between interactive use and programming use are the cause of some user trauma.

R has a fantastic mechanism for creating data structures. Obviously if you are doing data analysis, you want to be able to put your data into a natural form. You don’t have to warp your data into a particular structure because that is all that is available.
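A brief sketch of what "natural form" can mean: a data frame holds related columns of possibly different types, and a list can bundle arbitrarily shaped components together. The names and values below are invented for illustration.

```r
# A data frame: one row per person, columns of different types
people <- data.frame(
  name   = c("Ann", "Bob"),
  height = c(65, 70),
  weight = c(130, 160)
)

# A list bundles heterogeneous pieces: data, a date, free-form notes
study <- list(
  data  = people,
  date  = as.Date("2020-01-15"),
  notes = "heights in inches"
)

study$data$height  # the data keep their natural shape
```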

Graphics should be central to data analysis. Humans are predominantly visual; we don't intuitively grasp numbers the way we do pictures. It is easy to produce graphs for exploring data, and the default graphs can be tweaked to get publication-quality graphs.
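As a sketch of both points, the snippet below (with made-up heights and weights) draws a default scatterplot and then adds a few tweaks: a title, axis labels, and point styling. It writes to a temporary PDF file so it runs without a screen device.

```r
# Made-up data for illustration
heights <- c(60, 65, 70, 72)
weights <- c(120, 150, 180, 195)

pdf(file.path(tempdir(), "scatter.pdf"))  # draw to a file device

# plot(heights, weights) alone gives a usable default;
# the extra arguments tweak it toward publication quality
plot(heights, weights,
     main = "Weight vs height",
     xlab = "Height (in)", ylab = "Weight (lb)",
     pch = 19, col = "steelblue")

dev.off()
```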

Real data have missing values. Missing values are an integral part of the R language. Many functions have arguments that control how missing values are to be handled.
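For example, with the mean: by default missingness propagates, and the na.rm argument controls whether missing values are dropped.

```r
x <- c(3, NA, 7, 5)

mean(x)                # NA: missing values propagate by default
mean(x, na.rm = TRUE)  # 5: na.rm = TRUE drops the missing value
```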

Functions, like mean and median, are objects that you can use like data. You can easily change your analysis to use the median (or some strange estimate you make up on the spot) rather than the mean.
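A minimal sketch of swapping one estimate for another; the `center` helper is a made-up name for illustration:

```r
# A function that takes another function as an argument
center <- function(x, fun = mean) fun(x)

x <- c(1, 2, 3, 100)

center(x)          # 26.5: the mean, pulled up by the outlier
center(x, median)  # 2.5: swap in the median without rewriting anything
```

Any function with the same shape, including one you make up on the spot, can be passed in the same way.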

R has a package system that makes it extremely easy for people to add their own functionality so it is indistinguishable from the central part of R. And people have. There are thousands of packages that do all sorts of extraordinary things.

The R community is very strong, and quite committed to improving data analysis.

The Preferred Medium

Given its qualities, R has become the preferred computing environment for a large part of the statistical community. When a new statistical method is invented, chances are it will be implemented first in R.

In March 1999 John Chambers, one of the originators of S at Bell Labs, was presented with the ACM Software System Award. The citation stated, “S has forever altered the way people analyze, visualize, and manipulate data.” Previous winners of this award include Unix, TeX and the World-Wide Web. John is now a member of R Core (the group that produces R).