Creating Charts and Graphs with GNU R

As the R homepage explains, "R is a language and environment for
statistical computing and graphics. It is a GNU project which is similar
to the S language and environment which was developed at Bell
Laboratories..." Although it is not as widely known or utilized as one
might expect, R is a powerful tool which provides a wide variety of
statistical techniques. One of its major strengths is its graphics
capabilities. With R, even the statistically challenged can easily
produce publication-quality graphs and charts, in a variety of image
formats (JPEG, Postscript, and PDF, among others). In this tutorial, I
will give a short introduction to R with a focus on its graphing
capabilities.

Obtaining and Installing R

R is Free Software and runs on a variety of
platforms (Unix, Windows, and MacOS). At the time of writing, the
current version of R is 2.2.1. For more in-depth information about
installation, see the R Installation
and Administration Manual. Here we will cover installation on Linux.

Although you can install R on a Linux system the old-fashioned way,
either from source code or from pre-compiled binaries, it is much easier
with a package manager, since it will spare you from having to worry
about dependencies. Here I will show you how to install R on a Red Hat
Linux distribution (Fedora Core 3, to be specific) using yum (Yellowdog
Updater, Modified).

For starters, you will need root access. Once you are root, you will
have to let yum know where to find the necessary installation files.
Change your working directory to the yum repository directory:

[root]# cd /etc/yum.repos.d/

The required files are available from one of CRAN's mirror
sites. (Just in case you're wondering, CRAN stands for
"Comprehensive R Archive Network".) Add a file named CRAN.repo
to the directory /etc/yum.repos.d/. The file should have the
following contents (substitute the URL for your preferred mirror site):

(Note: When I recently tried re-installing the program, no public key
was available for the main R installation file on the Berkeley mirror.
To work around this problem, I instructed yum to ignore the public key
by setting the gpgcheck flag to false (zero). You should first
try running yum with this line removed. If the installation fails, you
can then add it back and try again.)

Once this has been done, yum will know about CRAN and can do all the
work of installing R for you. Just run the following command, and the
rest should happen automatically (output of command omitted to save
space).

[root]# yum install R

If you already have R installed, you can use yum to ensure that you have
the latest version by running this command instead:

[root]# yum update R

Getting Started

R can be run in one of two modes: interactively or non-interactively
(running R code from a saved file). For the rest of the tutorial, we
will assume that you are running R in interactive mode. A brief
discussion of how R can be run non-interactively is provided towards the
end.

You can start an interactive R session by running the program from the
commandline. When you do so, the program will automatically produce
output that looks like this:

[stuart]$ R
R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.2.1 (2005-12-20 r36812)
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>

R is now running in interactive mode. This means that you can use it in
the style of a calculator, where you type instructions which are
automatically executed. However, in R, commands are not executed until
you hit the return key. Below is a sample interactive session with R,
where R is used to do some simple arithmetic. Try duplicating this
session. Just remember not to type the right bracket (>) that every
line starts with. It is R's interactive command prompt, and not R code.
(The hash marks and everything following them are comments for the
reader's use and can be omitted, since they will be ignored by R.)
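
A session along the lines described above might look like the following sketch. It is shown without the > prompt so that it can be pasted directly into R; the particular expressions and the variable name x are my own choices for illustration.

```r
2 + 3        # addition
10 - 4       # subtraction
6 * 7        # multiplication
9 / 2        # division
2 ^ 10       # exponentiation
sqrt(16)     # square root, using a built-in function
x <- 5       # assign a value to the variable x
x * 2        # use the variable in an expression
```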

The same operations can be carried out on vectors of values. In the first
line of the following R code, we create a vector of values in sequential
order and assign it to the variable v (short for vector). We
then view the contents of v before showing a simpler way of
creating a vector of sequential values. Finally, we show that mathematical
operations performed on a vector apply to all of the items within that
vector.
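
As a minimal sketch of these steps (the values 1 to 5 are arbitrary, and the prompt is again omitted):

```r
v <- c(1, 2, 3, 4, 5)   # combine sequential values into a vector
v                       # view the contents of v
v <- 1:5                # the colon operator builds the same sequence
v * 2                   # the multiplication is applied to every element
v + 10                  # so is the addition
```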

When you are ready to end your R session, you will need to type
q() in order to quit. When you do so, R will ask you whether
you want to save your workspace image. The workspace image holds the
objects you created during the session; your command history is saved
alongside it. If you say yes, hidden files (.RData and .Rhistory,
respectively) will be saved in the
directory where you originally started R. The next time you run R from
this directory, it will load these hidden files and retain the data and
history from your last session. If you say no, nothing will be saved.

> q()
Save workspace image? [y/n/c]: y
[stuart]$

Importing and Manipulating Data in R

A critical feature for any statistical package is some means of
obtaining data from outside sources: text files, spreadsheets,
relational databases, etc. For more complete documentation of the data
importation and exportation facilities of R, see the R data
manual. Here we will show the basics of importing tabular data into
R.

To see how this works, let's look at a spreadsheet with some sample
data. We'll use the results of the 2004 European Union Parliament
elections (from wikipedia.org), which are saved as an Open Office
spreadsheet in EU-2004.sxc. When viewed
within Open Office, it should look like this:

As you can see, this spreadsheet consists of ten rows, a header row and
nine rows of data. For each row, there are four columns of data: the
acronym of a political party, its full name, the number of votes it
received in the 2004 election, and the number of seats won.

We will save this data from Open Office into a plain text tabular format
for ease of importation into R. To make the parsing of the data as
trouble-free as possible, we will use both column separators (commas)
and column delimiters (double quotes) (EU-2004.csv).

Once the data has been exported, it can be read by R using the function
read.table(), which minimally takes a single argument, the name
of the data file to be read. In addition, we will explicitly specify the
column separator with the option sep. If the data file contains
header information (i.e., a row that labels each column of data), as
ours does, the option header should also be set to
TRUE in order to ensure that the first line isn't treated as
data. If the data is not saved into a variable, R will simply print the
imported table for inspection, as illustrated below:
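
Since EU-2004.csv is not reproduced here, the following sketch first writes a small stand-in file and then reads it back. The column names (abbr, name, votes, seats) and all of the values are invented placeholders, not the actual election results.

```r
# Placeholder stand-in for EU-2004.csv; parties and numbers are invented.
csv_lines <- c('"abbr","name","votes","seats"',
               '"AAA","Party A",3000,30',
               '"BBB","Party B",2000,20',
               '"CCC","Party C",1000,10')
csv_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_file)

# Not assigned to a variable, so R prints the table.
read.table(csv_file, sep = ",", header = TRUE)

# Assigned to a variable for later use.
eu <- read.table(csv_file, sep = ",", header = TRUE)
```

With the real file, the call would presumably be read.table("EU-2004.csv", sep = ",", header = TRUE), run from the directory that contains the file.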

Once it's been imported, tabular data can be easily manipulated within
R. For example, if we wanted to know the total number of EU seats or the
total number of votes, we simply pass one column's worth of data to a
function that calculates the total. There are two ways of doing this:
bracket notation and dollar sign notation.

Bracket notation allows you to specify particular rows and columns by
position. Once a table has been read and assigned to a variable, rows and
columns can be accessed by following the variable with
[row(s), column(s)], where row(s) and
column(s) are either single integers or vectors of integers. Both
usages are illustrated below:
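
Here is a minimal sketch of bracket notation. The data frame eu is a hand-built placeholder with invented values, standing in for the imported table.

```r
# Placeholder table; the parties and numbers are invented.
eu <- data.frame(abbr  = c("AAA", "BBB", "CCC"),
                 votes = c(3000, 2000, 1000),
                 seats = c(30, 20, 10))

eu[1, 3]           # a single cell: row 1, column 3
eu[1:2, c(1, 3)]   # rows 1 to 2, columns 1 and 3
eu[, 3]            # an empty row index selects every row of column 3
eu[2, ]            # an empty column index selects every column of row 2
```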

An alternative to bracket notation is dollar sign notation, which takes
advantage of the headers in tabular data. Rather than specifying columns
by position, we simply specify the name of the column as it appears in
the header, as illustrated below:
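
A minimal sketch of dollar sign notation, again on a placeholder table with invented column names and values:

```r
# Placeholder table; the parties and numbers are invented.
eu <- data.frame(abbr  = c("AAA", "BBB", "CCC"),
                 votes = c(3000, 2000, 1000),
                 seats = c(30, 20, 10))

eu$seats   # the column labeled "seats" in the header
eu$abbr    # the column labeled "abbr"
```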

Although bracket notation and dollar sign notation provide equivalent
results, it is generally preferable to use dollar sign notation, since
it relies on the labels within a data set, which are less likely to
change than the relative position of the columns (which may get shifted
around if new columns are added).

Now that we are able to extract a specific column, obtaining totals is
fairly trivial, and requires nothing more than using the column values
as input to the function sum(), as shown below:
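
For instance, with a placeholder table (invented values and column names), the totals could be obtained as follows:

```r
# Placeholder table; the parties and numbers are invented.
eu <- data.frame(abbr  = c("AAA", "BBB", "CCC"),
                 votes = c(3000, 2000, 1000),
                 seats = c(30, 20, 10))

total_seats <- sum(eu$seats)   # total of the seats column
total_votes <- sum(eu$votes)   # total of the votes column
total_seats
total_votes
```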

With these data manipulation basics, we are now in a position to import
data into R for statistical analysis. In the following section, we will
see how to create various types of graphs using various data sets saved
in tabular format.

Pie Chart

A pie chart is
commonly used to display the relative proportions of different values in
a data set (more technically, the frequency of the levels of a
categorical variable). For example, if we wanted to show how many seats
the various political parties won in the 2004 European Union Parliament
elections, we could display the information in the form of a pie chart.
The R code in the interactive session shown below provides a first-pass
attempt at this:

You can run this code within R yourself. The first line will import the
data from the external data file discussed in the previous section. The
second line accesses the column of data with the number of seats using
dollar sign notation and saves it into a variable called seats.
That variable is then passed to the function pie(), which does
the work of creating the graph. Once you type the last line, R will open
a new popup window containing a graph that should look like this:
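
A first-pass attempt of this kind might be sketched as follows. To keep the example self-contained, the first line builds a small placeholder table (invented values) rather than importing the data file.

```r
eu <- data.frame(abbr = c("AAA", "BBB", "CCC"), seats = c(30, 20, 10))  # placeholder data
seats <- eu$seats   # extract the seats column
pie(seats)          # draw the pie chart
```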

This graph is not very useful as it stands. The most obvious shortcoming
is that the various slices of the pie are unlabeled. Labels can be
assigned using the option labels, which we set using the column
of abbreviations from the data set, as shown in the third line. In
addition, a title is added using the option main (where
\n stands for a line break). Note that when you hit return on
an unfinished line, it continues on the next line starting with a
different command prompt, the plus sign (+) rather than the normal right
bracket (>).
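
A sketch of the improved call is given below, using a placeholder table with invented values. The title text is my own; the labels and main options are the ones described above.

```r
eu <- data.frame(abbr = c("AAA", "BBB", "CCC"), seats = c(30, 20, 10))  # placeholder data
seats <- eu$seats
pie(seats, labels = as.character(eu$abbr),
    main = "Seats Won\n(placeholder data)")
```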

The result of running the R code above is a graph that significantly
improves upon the previous one:

Although they are frequently used in the popular press, pie charts are
unpopular among statisticians. If you consult the R help page for
pie() by typing ?pie, you will see the following
warning in the "Notes" section: "Pie charts are a very bad way of
displaying information. The eye is good at judging linear measures and
bad at judging relative areas. A bar chart or dot chart is a preferable
way of displaying this type of data."

Bar Graph

A bar graph or bar
chart has rectangular bars of lengths that represent the quantity or
frequency of data values. (A histogram is a
particular type of bar graph, one that displays only the frequency of
data values, rather than their quantity.) The bars can be horizontally
or vertically oriented.

Bar graphs are produced in much the same way as pie charts, although the
command used to create them, barplot(), differs slightly from
pie(). The main difference is that the bar labels are set with
names.arg rather than labels.
(Also note that we have explicitly converted the abbreviations into
character data, since they would otherwise be treated as factors for
analysis and therefore displayed as numbers.) Below, we make a bar graph
of the same data that we used for our pie chart in the last section:
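
A minimal sketch of such a call, on a placeholder table with invented values (the title is also my own):

```r
eu <- data.frame(abbr = c("AAA", "BBB", "CCC"), seats = c(30, 20, 10))  # placeholder data

# names.arg supplies one label per bar; as.character() guards against
# the abbreviations being stored as a factor.
bar_positions <- barplot(eu$seats, names.arg = as.character(eu$abbr),
                         main = "Seats Won\n(placeholder data)")
```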

The main problem with this bar graph is that some of the bar labels are
too wide and are therefore omitted by R. There are a number of different
ways to solve this problem. The easiest solution is to shrink the bar
labels, which can be done by explicitly instructing R to use a smaller
font for them. This is done with the option cex.names, which
determines the font size for the bar labels. We will shrink them to 70%
(0.7) of the default size.
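
With placeholder data (invented values), the call with smaller bar labels might look like this:

```r
eu <- data.frame(abbr = c("AAA", "BBB", "CCC"), seats = c(30, 20, 10))  # placeholder data

# cex.names = 0.7 scales the bar labels to 70% of the default size.
bar_positions <- barplot(eu$seats, names.arg = as.character(eu$abbr),
                         cex.names = 0.7,
                         main = "Seats Won\n(placeholder data)")
```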

This produces a bar graph in which the bars are light gray and individually labeled:

Note that it is much easier to observe small differences with a bar graph
than with a pie chart. For example, it is quite easy to see that the
GUE-NGL (Confederal Group of the European United Left/Nordic Green Left)
obtained more seats than the ID (Independence and Democracy) in the bar
graph, whereas in the pie chart it is difficult, if not impossible, to
discern this fact.

Scatter Plots

A scatterplot is a
graph used in statistics to visually display and compare two or more
sets of related quantitative/numerical data by displaying a finite
number of data points (observations) in a space defined by two scales,
which are placed on the horizontal and vertical axes, the well-known
x-axis and y-axis, respectively.

To use a linguistic example, let's look at word frequency distributions.
More specifically, let's look at the relationship between the total
number of words in a text (the token count) and the number of
unique words in a text (the type count). For example, consider a
sentence such as "Boys will be boys". It has four words, but only three
of them are unique, since boys occurs twice. In other words, it
consists of four tokens, but only three types.
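
The distinction can be checked in a couple of lines of R. This is only an illustration of the two counts, assuming that words are separated by single spaces and that case is ignored; it is not the script used to process the corpus.

```r
sentence <- "Boys will be boys"
tokens <- strsplit(tolower(sentence), " ")[[1]]   # lowercase, then split on spaces

token_count <- length(tokens)           # every word counts
type_count  <- length(unique(tokens))   # each distinct word counts once
token_count
type_count
```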

Using a simple Python script (type-token.py), I have calculated the
type and token count for every text in a large collection of folk tales
published in the Wantok newspaper of Papua New Guinea. The
results can be found in corpus-counts.csv. These folk tales
were originally published in Tok Pisin (an
English-based creole) and later translated into English by Thomas Slone
in 1001
Papua New Guinean Nights. Below, we produce a scatterplot of these
values with the token count on the x-axis and the type count on the
y-axis.
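
Because corpus-counts.csv is not reproduced here, the following sketch uses a handful of invented token and type counts in a placeholder data frame (tp, with hypothetical column names tokens and types) simply to show the shape of the call.

```r
# Placeholder counts; these are NOT the values from the folk tale corpus.
tp <- data.frame(tokens = c(200, 500, 1000, 2000, 3000),
                 types  = c(80, 130, 190, 250, 300))

# Token count on the x-axis, type count on the y-axis.
plot(tp$tokens, tp$types,
     main = "Tok Pisin (placeholder data)",
     xlab = "Tokens", ylab = "Types")
```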

If we place the two side-by-side, we can eyeball the two graphs and see
that, among other things, the overall number of types appears to be
lower in Tok Pisin. In other words, the two graphs provide evidence that
Tok Pisin has a more restricted vocabulary than English.

The two scatterplots are difficult to compare directly because R
automatically fits the axis ranges to the data set being plotted, and
therefore uses different ranges for each graph. For Tok Pisin,
the range is roughly 0 to 3000 for the x-axis and 0 to 300 for the
y-axis, whereas for English it is roughly 0 to 2500 for the x-axis and 0
to 500 for the y-axis. We can put both graphs on the same scale by
explicitly setting the ranges with the options xlim and
ylim. (Graphs omitted to save space.)
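
A sketch of this approach is shown below. The two data frames hold invented placeholder counts, and the shared limits are computed from them rather than typed in by hand.

```r
# Placeholder counts; these are NOT the values from the folk tale corpus.
tp <- data.frame(tokens = c(200, 500, 1000, 2000, 3000),
                 types  = c(80, 130, 190, 250, 300))
en <- data.frame(tokens = c(200, 500, 1000, 1800, 2500),
                 types  = c(100, 190, 300, 420, 500))

# One pair of limits that covers both data sets.
x_range <- c(0, max(tp$tokens, en$tokens))
y_range <- c(0, max(tp$types, en$types))

plot(tp$tokens, tp$types, xlim = x_range, ylim = y_range,
     main = "Tok Pisin", xlab = "Tokens", ylab = "Types")
plot(en$tokens, en$types, xlim = x_range, ylim = y_range,
     main = "English", xlab = "Tokens", ylab = "Types")
```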

A better way of visualizing the same information would be to place both
scatterplots on a single graph. Below, we instruct R to plot the two
data sets on one graph and to distinguish the two using color: Tok Pisin
in red and English in blue. The color for a plot is set using the
col option. Note that the first plot is done using the
high-level function plot(), whereas the second is done using
the low-level function points(). Although these two functions
have the same syntax, they have different uses. plot() will
create a new graph; points() simply adds data points to a
pre-existing graph.
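
The combination of plot() and points() might be sketched as follows, again with invented placeholder counts. The axis limits are set explicitly so that the points added second are not cut off.

```r
# Placeholder counts; these are NOT the values from the folk tale corpus.
tp <- data.frame(tokens = c(200, 500, 1000, 2000, 3000),
                 types  = c(80, 130, 190, 250, 300))
en <- data.frame(tokens = c(200, 500, 1000, 1800, 2500),
                 types  = c(100, 190, 300, 420, 500))

# plot() starts a new graph with the first data set in red.
plot(tp$tokens, tp$types, col = "red",
     xlim = c(0, 3000), ylim = c(0, 500),
     xlab = "Tokens", ylab = "Types")

# points() adds the second data set to the existing graph in blue.
points(en$tokens, en$types, col = "blue")
```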

Because we have used color to distinguish the plots for the two data
sets, we need a legend that explains the color scheme. This can be done
with the legend() command. The location of the legend is
specified by its x,y coordinates. The elements in the legend are
specified as points by setting the plotting character option
pch to the default plotting character, which is 1.
The colors of the points are determined by the option col,
which takes a list of two colors. The text for the red and blue points
is provided by the option legend.
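
Putting these options together, a legend could be added as in the sketch below. The data are invented placeholders, and the legend position (0, 500) is simply the top left corner of this particular graph.

```r
# Placeholder counts; these are NOT the values from the folk tale corpus.
tp <- data.frame(tokens = c(200, 500, 1000, 2000, 3000),
                 types  = c(80, 130, 190, 250, 300))
en <- data.frame(tokens = c(200, 500, 1000, 1800, 2500),
                 types  = c(100, 190, 300, 420, 500))

plot(tp$tokens, tp$types, col = "red",
     xlim = c(0, 3000), ylim = c(0, 500),
     xlab = "Tokens", ylab = "Types")
points(en$tokens, en$types, col = "blue")

# x,y position, then the text, color, and plotting character of each entry.
legend_colors <- c("red", "blue")
legend_text   <- c("Tok Pisin", "English")
legend(0, 500, legend = legend_text, col = legend_colors, pch = 1)
```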

The resulting graph, shown below, is much easier to read, and the trends
in the two scatterplots are much more visually striking.

Line Graphs

A line graph is like a scatterplot, except that its data points
are connected by a line. Since line graphs are commonly used to
visualize activity in the stock market, we will show how to produce a
line graph of the Dow
Jones Industrial Average (DJIA) during a twenty-year period (1985 to
2005). The data is saved as a tab-delimited text file, dow-jones.csv, and comes from djindexes.com.

In the following sample R code, we read the data and plot it with the
year on the x-axis and the Dow Jones on the y-axis. Note that the
command plot() is run with the same syntax as before, the only difference
being the option type, which is explicitly set to l to
obtain a line graph. A title is added with main, and the x-axis
is labeled with xlab and the y-axis with ylab.
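
Since dow-jones.csv is not reproduced here, the sketch below uses R's built-in Nile data set (annual flow of the river Nile, 1871 to 1970) as a stand-in. The options are the ones described above; only the data and the label text differ.

```r
years <- 1871:1970          # one value per year in the Nile data set
flow  <- as.numeric(Nile)   # built-in data, used here as a stand-in

# type = "l" connects the data points with a line.
plot(years, flow, type = "l",
     main = "Annual Flow of the River Nile",
     xlab = "Year", ylab = "Flow")
```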

The resulting graph shows a steady increase in the DJIA until 2000, when there is an
abrupt drop followed by a gradual recovery. This is, of course, the
infamous bursting of the dotcom bubble and
the eventual recovery from it.

A Few R Essentials

Saving Graphs as External Image Files

In the examples provided in this tutorial, we have used the R
environment to display graphs in its popup window. But sooner or later,
you will want to save a graph into an external file for incorporation
into a document (such as this one). By default, R writes all
graphs to its popup window, but this is not the only available
device (the technical term for the destination of any graphics
drawn by R). R supports a variety of image formats. To see the available
options, type ?device within an R session. The options include
postscript, PDF, PNG, and JPEG (among others). It is even possible to
have R generate the commands required to draw a graph in LaTeX.

Below, we will show how to export a graph into a JPEG image file. The
main trick is to call the function jpeg(). It has only one
required argument, which is the filepath to the JPEG file. Once the
function jpeg() has been called, all subsequent graphing will
be done inside of the specified JPEG file. Therefore, the following code
will create a line graph in a file named test.jpg. (The file
will be saved in the directory where R was originally run.)

When the graph is finished, it is important to close the file with the
command dev.off(). This ensures that the image is completely written to
disk and that later plotting commands are not sent to the same file.
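
The whole sequence might be sketched as follows. The built-in Nile data set is used as a stand-in for the Dow Jones data, and the sketch assumes that your R installation was built with JPEG support.

```r
jpeg("test.jpg")   # open test.jpg as the current graphics device

# Everything drawn from here on goes into the file, not a popup window.
plot(1871:1970, as.numeric(Nile), type = "l",
     main = "Annual Flow of the River Nile",
     xlab = "Year", ylab = "Flow")

dev.off()          # close the device so the file is written out completely
```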

As you can see, the basics of creating graphs in external image files
are fairly straightforward. For more in-depth information, see the Graphics
section of the official R introduction.

Running R Non-Interactively (Scripting)

So far, we have run R interactively, but it is also possible to run it
non-interactively by storing R code in a separate file and redirecting
it to R. For example, we can save the code from the Dow Jones line graph
into a file (dow-jones.r):

There are two things to note about the formatting. First, there are no
right brackets, since these are only used as command prompts by R in
interactive mode. Second, whitespace can be used to give commands a more
readable formatting.

The script can now be run from a Unix commandline as follows:

[stuart]$ R --no-save < dow-jones.r

There are two things to observe about how the R code is run above.
First, the contents of dow-jones.r are
sent to R using redirection
(the left angle bracket). Second, as you will recall, when you quit R in
interactive mode, you were asked whether to save the session. When a
script is run non-interactively, this question must be answered in advance
with a commandline option, either --save or
--no-save; otherwise, an error results. (The option
--vanilla conveniently wraps a number of options into one;
for more information, see Invoking
R.)

Where to Find Raw Data

When learning to use R, you will need raw data sets on which you can
test various features of the language. Fortunately, R comes with an
assortment of built-in data sets. These are an eclectic bunch, ranging
in nature from Nile, which provides data about water flow in
the river Nile from 1871 to 1970, to cars, which provides the
speed of cars and braking distance from the 1920s. It also includes such
gems as Titanic, which provides data concerning who did and did
not survive the wreck of that ill-fated ocean
liner.

To see the full list of these data sets, simply type data().
To obtain more information about a particular data set, you can type the
name of the data set preceded by a question mark (e.g.,
?AirPassengers). Most data sets also come with example R code,
which you can study to improve your understanding of the language.
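
As a brief sketch, the following lines list some of the built-in data sets and peek at the three mentioned above. The listing is stored in a variable here so that it can be inspected without opening a pager.

```r
available <- data(package = "datasets")   # information on the built-in data sets
head(available$results[, "Item"])         # names of the first few data sets

head(cars)       # first rows: speed and stopping distance
summary(Nile)    # summary statistics for the annual flow values
dim(Titanic)     # Titanic is a four-way table

# In an interactive session, ?AirPassengers opens the help page for that data set.
```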

Where to Go From Here

R provides the means to perform a wide variety of statistical
techniques. Because it is the tool of choice for many professional
statisticians, it has been thoroughly tested, and its functionality is
very comprehensive. Your knowledge of statistics
will most likely be the main limitation on what you can do with it.
