How to use these annotations

first learning

If you are new to R and first reading the book, then you should probably mostly ignore my comments. However, when you are confused by something in the book, you can look to see if there is a comment on that page that pertains to what you are confused about.

revising

On further reading, these comments are more likely to be of use. Some are clarifications, some are extensions.

Page by page comments

Page 10

Page 11

distribution

I’m not a lawyer, but I think the phrasing about redistribution is not right. I think it should say “change and redistribute” rather than “change or redistribute”.

If what you do never leaves your entity, then you can do absolutely whatever you want. That is the free as in speech part. Legalities only come into play if what you do is made available to others. It is a common misunderstanding that you are restricted in what you do within your own world.

runs anywhere

The book highlights that R runs on many operating systems. It fails to make clear that the objects that it creates on the operating systems are all the same. You can start a project on a Linux machine at work, continue it while you commute with your Mac laptop, and then finish it on your Windows machine at home. No problem.

Page 12

The book should tell you not to be afraid of new words. New words like “vector”. You don’t need to make friends with them right away, but don’t be scared off.

(technical) Unhappily the word “vector” in R has several meanings — so it is unfortunate that it is the first new word. The meaning used throughout the book is the most common meaning. See The R Inferno (Circle 5.1) for the gory details.

Page 13

statistics

Pretty much everywhere in the book where it says “statistics” I would prefer “data analysis” instead. Statistics, in many people’s minds, is formal and academic, not like what they do. More people can feel comfortable doing data analysis than statistics.

In addition to the fear factor, there really is a (slight) difference between data analysis and statistics. I think data analysis is more important even though I’m trained as a statistician.

fields of study

There are additional fields of study where R is used that are not considered to be data hotbeds, such as music and literature. The flexibility of R becomes very important for data in non-traditional forms.

Page 23

vectors

If you are new to R, you shouldn’t expect yourself to understand this discussion. Just let it sink in over time.

Page 24

assignment operator

Always put spaces around the assignment operator. That makes the code much more readable.

The book tells you on page 63 that you can use = as well. You will see both used. They are mostly the same (differences are explained in The R Inferno, Circle 8.2.26). I agree with the book’s approach to use <- but really you can use either.
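A classic illustration of why the spaces matter (a made-up snippet; the variable name is arbitrary):

```r
x <- 10
if (x<-3) "assigned"     # parses as x <- 3: assigns 3 to x, which is not what was meant
x                        # now 3
if (x < -3) "negative" else "not negative"   # the comparison that was intended
```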

Page 28

RStudio

A nice feature of the RStudio workspace view is that it categorizes the objects.

Page 29

Windows pathnames (technical)

The book implies that you cannot write Windows pathnames with backslashes. Actually you can, you just need to put a double backslash where you want a backslash. Hence it is easier and (often) less confusing to use slashes rather than backslashes.
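A quick illustration (the path here is made up):

```r
# a backslash must be doubled inside an R string literal
winpath <- "C:\\temp\\data.csv"   # the string actually holds single backslashes
cat(winpath, "\n")                # prints C:\temp\data.csv
# forward slashes work on Windows too, and need no escaping
unixy <- "C:/temp/data.csv"
nchar(winpath) == nchar(unixy)    # both strings are the same length
```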

Page 30

loading objects (technical)

It is possible to use attach instead of load. If you load an object, then it is put into your global environment. If you attach an object, it is put separately on the search list. If you modify an object that has been attached, then the modified version goes into your global environment.
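A small sketch of the load route (using a temporary file so the example is self-contained; attach is shown only in a comment since it changes the search list):

```r
tf <- tempfile(fileext = ".RData")
x <- 1:5
save(x, file = tf)   # write the object to a file
rm(x)                # remove it from the current environment
load(tf)             # x is back in the environment you called load from
x
# attach(tf)         # would instead put the file's objects on the search list
```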

Page 32

vectorization

There are different forms of vectorization, and the book doesn’t make that explicit. Vectorization can be put into three categories:

vectorization along vectors

summary

vectorization across arguments

Functions like sum and mean are vectorized in the sense that they take a vector and summarize it. This is done in pretty much all languages; it is not special.

Vectorization as it is commonly spoken of in R is vectorization along vectors. For example the addition operator as seen on page 24. This is the form of vectorization that is so useful and powerful in R.

You should not expect the third form of vectorization in R. However, it does exist in a few functions. The sum and mean functions do summary-type vectorization:

> sum(1:3)
[1] 6
> mean(1:3)
[1] 2

The sum function also does vectorization across arguments:

> sum(1, 2, 3)
[1] 6

That is basically anomalous. The mean function is more typical by not doing this form of vectorization:

> mean(1, 2, 3) # WRONG
[1] 1

Unfortunately you don’t get an error or a warning in this case. Do not expect this form of vectorization.

Page 33

error message

Getting error messages can be frightening for a while. But it’s not the end of the world. Relax.

Page 36

names (technical)

In fact it is possible to get any name that you want, but you probably don’t want to.

return (technical)

Actually return is not a reserved word, but you should treat it as if it were.

Page 37

F and T

I wish to emphasize the advice in the book:

never abbreviate TRUE and FALSE to T and F

avoid using T and F as object names

Page 42

library

The book suggests (with a slight revision on page 361) to load packages with the library function. Some of us prefer require instead of library for this use. The best use of library is without arguments — this gives you a list of available packages.

contributed packages

I think the authors might be being a little too polite in their description of the quality of contributed packages.

I find base R to be phenomenally clean code — it is hard to find commercial code that is less buggy. The quality of contributed packages varies widely. A few are up to the standards of base R, some are quite good, I’m sure there are a few dreadful ones.

With contributed packages you need to be more cautious than when only using base R functionality. Or perhaps I should say that you always need to be vigilant, but if you are using contributed packages, there is a larger chance that a problem is due to a package rather than to a mistake of your own.

Without inspecting the code, I know of two clues to suggest a package is of good quality:

widely used

good documentation

A widely used package — such as those highlighted in the book — is an indication that a lot of problems with the code have been fixed or didn’t exist in the first place.

Many people use the test of the cleanliness of restaurant restrooms to infer the cleanliness of the kitchen. Likewise, carefully written documentation is likely to be a sign of clean code.

Page 46

exponentiation (technical)

It is not a good idea to use ** to mean exponentiation — it is not out of the question for that to go away. Stick to using the ^ operator.

Page 49

log and exp

The sentence a little below mid-page about creating the vector inside exp should say inside the log function.

Page 52

infinity

The last sentence on the page should say 10^309 and 10^310 rather than 10^308 and 10^309.

Page 54

table 4-3

You are unlikely to use any of these except for is.na, which you may use quite a lot.

Page 55

types of vectors

All of the types of vectors listed may have missing values (NA).
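For example:

```r
# each atomic type has missing values
c(1.5, NA)          # numeric with a missing value
c(TRUE, NA)         # logical
c("a", NA)          # character
is.na(c(1.5, NA))   # FALSE TRUE
```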

Page 56

integer versus double

One of the nice things about R is that you hardly ever need to worry about whether something is stored as an integer or a double.

Page 104

one-dimensional arrays (technical)

Regular vectors are not dimensional at all in the technical sense, but we think of them as being one-dimensional. But there really are one-dimensional arrays. They are almost like plain vectors but not quite.

Page 106

playing with attributes

For large objects you often won’t like the response you get when you do:

> attributes(x)

Often better is to just look at what attributes the object has:

> names(attributes(x))

Page 109

extracting values from matrices

The flexibility of subscripting matrices (and data frames) as vectors is a curse as well as a blessing.

If you want to do:

> x[-2,]

and you do:

> x[-2]

then you will get an entirely different result. This can be a hard mistake to find — a few pixels difference on your screen can have a big impact.
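A small illustration with a made-up matrix:

```r
x <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns
x[-2, ]   # drops row 2: the 3 values of the first row
x[-2]     # drops element 2 of the underlying vector: 5 values
```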

Page 113

first.matrix

The example on this page assumes that first.matrix is as it was first created, not as it has been modified in the intervening exercises.

Page 114

matrix operations

So adding numbers by row is easy. How to add them by column? One way is to use the rep function to create a vector with as many elements as the matrix has (assuming the vector being replicated has length equal to the number of columns), so that the replicated values land in the desired positions.
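A sketch of that approach, with made-up objects (m is a matrix, v holds one value per column):

```r
m <- matrix(0, nrow = 2, ncol = 3)
v <- c(10, 20, 30)
# each = nrow(m) repeats each value of v down a whole column
m + rep(v, each = nrow(m))   # adds 10 to column 1, 20 to column 2, 30 to column 3
```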

Page 116

inverting a matrix

The reason that the command to invert a matrix is not intuitive is because it is seldom the case that (explicitly) inverting a matrix is a good idea.

Page 117

vectors as arrays (technical)

Actually vectors, in general, are not arrays at all. The difference is of little consequence, however.

third array dimension (technical)

I call the items in the third dimension of an array “slices” rather than “tables”. I’m not aware of any standardized nomenclature. I don’t think “tables” is such a good choice because there are other meanings of “table” in R.

array filling (technical)

I’m not able to follow the sentence in the book describing how arrays are filled. How I think of it is that the first subscripts vary fastest (no matter how many dimensions are in the array).

Page 119

rows and columns (technical)

Maybe my brain went on strike, but I think that “rows” and “columns” are reversed in the first paragraph on the page.

Page 120

data frame structure

Note that all the vectors that make up the columns need to be the same length.

data frame structure (technical)

It is possible for a “column” of a data frame to be a matrix, in which case the number of rows needs to match.

data frame length

Note that the length of a data frame is different from the length of the equivalent matrix. The length of the data frame is the number of columns, while the length of the matrix is the number of columns times the number of rows.
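For example:

```r
d <- data.frame(a = 1:4, b = letters[1:4])
length(d)             # 2, the number of columns
length(as.matrix(d))  # 8, rows times columns
```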

Page 122

character versus factor

The book suggests always making sure that data frames hold character vectors instead of factors in order to reduce problems. The other main route to avoid frustration is to always assume that there are factors.

The thing you don’t want to do is assume that what is really a factor is a character vector.

naming variables

If in the middle of the page where it says “In the previous section” you don’t know what they are talking about, not to worry — you’re not alone.

as with matrices

I’m not clear on the reference to matrices at the very bottom of the page.

Page 124

data frame subscripting

You can get a column of a data frame using either the $ or [ form of subscripting. But there is a difference: $ gives you the bare column, while [ gives you a data frame containing that column.
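For example, with a made-up data frame:

```r
d <- data.frame(a = 1:3, b = 4:6)
d$a        # a plain vector: 1 2 3
d[["a"]]   # the same vector
d["a"]     # a one-column data frame
```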

Page 130

pieces of a list

I prefer calling the pieces of a list “components” rather than “elements”. One reason is that a component of a list can be another list, and hence not very elementary.

Page 139

The functions that you write are essentially the same as the inbuilt functions. They are first-class citizens.

Page 152

functional programming

You can very effectively use R without having a clue what “functional programming” means. The important idea behind functional programming is safety — the data that you want to use is almost surely the data that really is being used.

Page 153

calculation example

The object names were obviously changed midstream. fifty should be half and hundred should be full.

Page 157

generic functions (technical)

A detail that only occasionally really matters is that the argument names in methods should match the argument name in the generic. You don’t want to have the argument called x in the generic but object in a method.

Page 171

looping without loops

Using apply functions is really hiding loops rather than eliminating them.

Page 172

number of apply functions

Not that it matters, but I count 8 apply functions in the base package in version 2.15.0. There are also a reasonably large number of apply functions in contributed packages.

Page 188

error checking (technical)

Another way to write the check for out of bounds values is:

stopifnot(all(x >= 0 & x <= 1))

This will create an appropriate error message if there is a violation.

This will take multiple conditions separated by commas. So you can have checks like:

stopifnot(is.matrix(x), is.data.frame(y))

to make sure that x is a matrix and y is a data frame.

Page 190

technical tip (technical)

The first sentence starts:

In fact, functions are generic …

It should read:

In fact, some functions are generic …

Page 192

factor to numeric

The book gives the efficient method of converting a factor to numeric:

as.numeric(levels(x))[x]

The slightly less efficient but easier to remember method is:

as.numeric(as.character(x))

Don’t forget the as.character — it matters.
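An example of why it matters:

```r
f <- factor(c(10, 20, 30))
as.numeric(f)                # 1 2 3 -- the underlying level codes, not the values
as.numeric(as.character(f))  # 10 20 30
```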

problems with factors (technical)

Circle 8.2 of The R Inferno starts with a number of items about factors.

Page 193

documentation quality

Unfortunately, I think the authors are painting too rosy a picture of the quality of R documentation. There probably is some great documentation for any task or issue that you have, but you may have a significant search on your hands to find that great document.

Page 194

help files

It takes practice to learn how to use help files well. It doesn’t help that sections of the help files are in the wrong order (in my opinion). The “See also” and “Examples” should be near the top, “Details” should be at the bottom.

The examples often are the most important part. The book implies that all examples are reproducible. Not all are, but many are.

You don’t need to understand the whole of a help file the first time around. The goal should be to improve your understanding of the function.

Page 199

Stack Overflow

It is possible to subscribe via RSS to R tags.

Page 200

cards

With the cards I’m used to, the command to create cards should include 2:10 rather than 1:9.

Page 202

session info

The book says that it is sometimes helpful to include the results of sessionInfo() in questions. I would change that from “sometimes” to “often”.

Page 210

reading in data

The start of Circle 8.3 in The R Inferno has a number of items about problems reading data in.

Page 216

changing directories

If you are using the RGui, there is a “change dir” item in the File menu.

Page 221

three subset operators

The [[ operator always gets one component. The result is often not a list.

In contrast the [ operator can get any number of items and (except for dropping) gives you back the same type of object.
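For example, with a made-up list:

```r
lst <- list(a = 1:3, b = "hello")
lst[["a"]]   # the component itself: an integer vector
lst["a"]     # a list of length 1 containing that component
lst[1:2]     # still a list, with both components
```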

Page 226

removing duplicates

The book shows the removal of duplicates using both logical subscripts and negative numeric subscripts. Be careful with the latter of these: