The interpreter understands more than arithmetic operations.
That last command told it to use (or “call”) the function seq().

Most of “learning R” involves getting to know a whole lot of functions, the
effect of each function’s arguments (e.g. the input values 1 and 100), and
what each function returns (e.g. the output vector).

R as Calculator

A good place to begin learning R is with its built-in mathematical functionality.

Arithmetic operators

Try +, -, *, /, and ^ (for raising to a power).

>5/3

[1] 1.666667

Logical tests

Test equality with == and inequality with <=, <, !=, >, or >=.

>1/2==0.5

[1] TRUE

Math functions

Common mathematical functions like sin, log, and sqrt exist alongside some universal constants.

>sin(2*pi)

[1] -2.449294e-16
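The tiny nonzero result above is floating-point rounding error, not a meaningful value. As a brief sketch of how this interacts with the logical tests from the previous section, the comparison below uses base R's all.equal(), a function this lesson does not otherwise introduce:

```r
# Exact comparison fails because the computed value is about -2.4e-16, not 0.
exact <- sin(2 * pi) == 0

# all.equal() compares within a small tolerance. Wrap it in isTRUE(),
# because it returns a description of the difference when values differ.
close_enough <- isTRUE(all.equal(sin(2 * pi), 0))

exact         # FALSE
close_enough  # TRUE
```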

Programming idioms

Common computer programming functions like rep, sort, and range are also available.

>rep(2,5)

[1] 2 2 2 2 2
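A short sketch of the other two functions named above; the vector heights is a made-up example, not part of the lesson's data:

```r
heights <- c(3, 1, 2)      # hypothetical values for illustration

sorted <- sort(heights)    # 1 2 3
extent <- range(heights)   # 1 3 (minimum, then maximum)

sorted
extent
```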

Parentheses

Sandwiching something with ( and ) has two possible meanings.

Group sub-expressions by parentheses where needed.

>(1+2)/3

[1] 1

Call functions by typing their name and comma-separated arguments between parentheses.

Environment

In the RStudio IDE, the environment tab displays any variables added to R’s
vocabulary in the current session. In a brand new session, the R interpreter
already recognizes many things, despite the environment being “empty”.

With an “empty” environment, the interpreter still recognizes:

any number

any string of characters

nearly universal operators (e.g. + and /)

operators specific to R (e.g. $ and %*%)

functions in “base R”

To reference a number or function just type it in as above.
To reference a string of characters, surround them in quotation marks.

>'ab.cd'

[1] "ab.cd"

Without quotation marks, the interpreter checks for things in the environment
named ab.cd and doesn’t find anything:

>ab.cd

Error in eval(expr, envir, enclos): object 'ab.cd' not found

Question

Is it better to use ' or "?

Answer

Neither one is better. You will often encounter stylistic choices
like this, so if you don’t have a personal preference try to mimic existing
styles.

Assignment

You can expand the vocabulary known to the R interpreter by creating a new
variable. Using the symbol <- is referred to as assignment: the output of
any command to the right of <- gets the name given on its left.

x <- seq(0, 100)

You’ll notice that nothing prints to the console, because we assigned the output to a variable.
We can print the value of x by evaluating it without assignment.

Assigning values to new variables (to the left of a <-) is the only time you
can reference something previously unknown to the interpreter. All other
commands must reference things already in the interpreter’s vocabulary.

Once assigned to a variable, a value becomes known to R and you can refer to it in other commands.
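For instance, once x exists it can be printed or passed to other functions. The calls to length() and mean() are only illustrations, chosen because they are part of base R:

```r
x <- seq(0, 100)

x            # evaluating the name alone prints all 101 values
length(x)    # 101
mean(x)      # 50
```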

Editor

The console is for evaluating commands you don’t intend to keep or reuse.
It’s useful for testing commands and poking around. The environment
represents the state of a current session. The editor reads and writes
files–it is where you head to compose your R scripts.

R scripts are simple text files that contain code you intend to run again and
again; code to process data, perform analyses, produce visualizations, and even
generate reports. The editor and console work together in the RStudio IDE, which
gives you multiple ways to send parts of the script you are editing to the
console for immediate evaluation. Alternatively you can “source” the entire
script or run it from a shell with Rscript.

Open up “worksheet.R” in the editor, and follow along by replacing
the ... placeholders with the code here. Then evaluate just this line
(Ctrl+Enter on Windows, ⌘+Enter on Mac OS).

vals <- seq(1, 100)

Our call to the function seq could have been much more explicit. We could give
the arguments by the names that seq is expecting.

vals <- seq(from = 1, to = 100)

Run that code by moving your cursor anywhere within those two lines and clicking
“Run”, or by using the keyboard shortcut Ctrl-Return or ⌘ Return.

Question

What’s an advantage of naming arguments?

Answer

One advantage is that you can put them in any order. A related
advantage is that you can then skip some arguments, which is fine to do if each
skipped argument has a default value. A third advantage is code readability,
which you should always be conscious of while writing in the editor.
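The first two advantages can be sketched with seq() itself. The variable names here are arbitrary, and length.out is one of seq()'s documented optional arguments:

```r
a <- seq(1, 100)               # positional arguments
b <- seq(to = 100, from = 1)   # named arguments, in reverse order
identical(a, b)                # TRUE

# Naming length.out lets us skip the `by` argument that sits before it.
quarters <- seq(from = 1, to = 10, length.out = 4)
quarters                       # 1 4 7 10
```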

Readability

Code readability in the editor cuts both ways: sometimes verbosity is useful,
sometimes it is cumbersome.

The seq() function has an alternative form available when only the from and
to arguments are needed.

The : operator should be used whenever possible because it replaces a common,
cumbersome function call with a brief, intuitive syntax. Likewise, the assign
function duplicates the functionality of the <- symbol, but is never used
when the simpler operator will suffice.
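As a small sketch of both comparisons (the variable names short, long, and y are placeholders chosen for this example):

```r
short <- 1:100                     # the : operator
long <- seq(from = 1, to = 100)    # the equivalent, more verbose call
all(short == long)                 # TRUE

assign("y", 1:3)   # same effect as  y <- 1:3
y                  # 1 2 3
```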

Function documentation

How would you get to know these properties and the names of a function’s
arguments?
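One way is the built-in help: typing ?seq in the console opens the function's help page. The sketch below shows a programmatic alternative. It assumes you want the arguments of seq.default, the method that (to the best of our knowledge) handles ordinary numeric calls to the generic seq:

```r
# Interactively, ?seq or help("seq") opens the documentation.
# formals() returns a function's arguments and their defaults.
arg_names <- names(formals(seq.default))
arg_names   # includes "from", "to", and "by"

# args() prints the same information as a function header.
args(seq.default)
```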

Load Data

We will use the function read.csv() to load data from a Comma Separated Value
file. The essential argument for the function to read in data is the path to the
file; other optional arguments adjust how the file is read.

Additional file types can be read in using read.table(); in fact, read.csv()
is a simple wrapper for the read.table() function having set some default
values for some of the optional arguments (e.g. sep = ",").
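To illustrate the relationship, the sketch below writes a tiny made-up CSV to a temporary file and reads it both ways. The file contents and variable names are invented for this example:

```r
# Write a small, hypothetical CSV file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("id,weight", "1,40", "2,48"), path)

via_csv <- read.csv(path)
via_table <- read.table(path, header = TRUE, sep = ",")

# For a simple file like this one, both calls give the same data frame.
all.equal(via_csv, via_table)   # TRUE
```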

Type read.csv( into the console and then press tab to see what arguments
this function takes. Hovering over each item in the list will show a description
of that argument from the help documentation about that function. Specify the
values to use for an argument using the syntax name = value.

The read.csv command assumes the first row in the file contains
column names. Look at ?read.csv to see the default header = TRUE argument.
What exactly that means is described down in the “Arguments” section.

Use the assignment operator <- to load data into a variable for
subsequent operations.

animals <- read.csv(file = 'data/animals.csv')

After reading in the “animals.csv” file, you can explore what types of data are
in each column with the str function, short for “structure”.

Missing data, as interpreted by the read.csv function, is controlled by the
na.strings argument. Override the default value of 'NA' with the empty
string, '', to properly interpret the “species_id” and “sex” columns.

You can also specify multiple things to be interpreted as missing values, such
as na.strings = c("missing", "no data", "< 0.05 mg/L", "XX").
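Since the contents of “animals.csv” are not reproduced here, the following sketch uses a tiny stand-in file with invented rows to show the effect of na.strings. Only the column names species_id, sex, and weight come from this lesson:

```r
# A hypothetical three-row file with blank species_id and sex fields.
path <- tempfile(fileext = ".csv")
writeLines(c(
    "id,species_id,sex,weight",
    "1,DM,M,40",
    "2,,F,48",
    "3,PF,,7"
), path)

default_read <- read.csv(path)                   # blanks stay as empty strings
blank_as_na <- read.csv(path, na.strings = "")   # blanks become NA

is.na(blank_as_na$species_id)   # FALSE  TRUE FALSE
is.na(blank_as_na$sex)          # FALSE FALSE  TRUE
str(blank_as_na)                # shows the type of each column
```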

Data Types

A data frame is clearly a table, but what exactly is a table in the R
environment? The str command gave an indication that each field has its own
data type.

Type        Example
double      3.1, 4, Inf, NaN
integer     4L, 0L, 999L
character   'a', '4', '👏'
logical     TRUE, FALSE
missing     NA
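The typeof function reports these types directly. A quick sketch using the example values from the table:

```r
typeof(3.1)    # "double"
typeof(4)      # "double": a whole number without the L suffix is still a double
typeof(4L)     # "integer"
typeof("a")    # "character"
typeof(TRUE)   # "logical"
typeof(NA)     # "logical": a plain NA is stored as a logical missing value
```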

Data structures

A data frame is a compound object, built from one or more objects that hew to
the basic data types. Like all data frames, “animals” is a “list”.

>is.list(animals)

[1] TRUE

The “list” is one of three one-dimensional data structures you will regularly
encounter.

Lists

Factors

Vectors

Vectors

Vectors are an array of values of the same type. Create a vector by combining
elements of the same type together using the function c().

counts <- c(4, 3, 7, 5, 2)

All elements of a vector must be the same type, so when you attempt to combine
different types they will be coerced to the most flexible type.

>c(1,2,"c")

[1] "1" "2" "c"
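A brief sketch of the coercion order, from least to most flexible type. The variable name mixed is arbitrary:

```r
mixed <- c(TRUE, 2L)     # logical is coerced to integer
mixed                    # 1 2

typeof(c(TRUE, 2L))      # "integer"
typeof(c(2L, 3.5))       # "double"
typeof(c(1, 2, "c"))     # "character"
```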

Factors

A factor is a vector that can contain only predefined values, and is used to
store categorical data. Factors are like integer vectors, but possess a levels
attribute that assigns names to however many discrete categories are specified.

Use factor() to create a vector with predefined values, which are often
characters or “strings”.
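A minimal sketch of factor(). The values and level order below are an assumption, reconstructed from the console output shown later in this lesson rather than taken from the original worksheet:

```r
# Assumed values; the levels argument fixes the set and order of categories.
education <- factor(
    c("college", "highschool", "college", "middle", "middle"),
    levels = c("middle", "highschool", "college")
)

levels(education)       # "middle" "highschool" "college"
as.integer(education)   # 3 2 3 1 1 (the integer codes behind the labels)
```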

Parts and Subsets

Any single part of a data structure is always accessible, either by its name or
by its position, using double square brackets: [[ and ]].

Position

>counts[[1]]

[1] 4

>counts[[3]]

[1] 7

Names

Parts of an object may also have a name. The names can be given when you are creating a vector or afterwards using the names() function.
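The examples that follow use a data frame named df whose definition does not appear in this text. The sketch below is one plausible construction, assumed from the output printed in those examples. The named vector is an additional made-up illustration of naming parts at creation:

```r
# Names given while creating a vector (hypothetical values).
named <- c(first = 4, second = 3)
named[["second"]]   # 3

# Assumed construction of df: a factor column and the counts vector.
education <- factor(
    c("college", "highschool", "college", "middle", "middle"),
    levels = c("middle", "highschool", "college")
)
counts <- c(4, 3, 7, 5, 2)
df <- data.frame(education, counts)

names(df)   # "education" "counts"
```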

>df[['education']]

[1] college highschool college middle middle
Levels: middle highschool college

names(df) <- c('ed', 'ct')

>df[['ed']]

[1] college highschool college middle middle
Levels: middle highschool college

Question

This use of <- with names(x) on the left is a little odd. What’s going on?

Answer

We are overwriting an existing variable, but one that is accessed
through the output of the function on the left rather than the global
environment.
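One way to see this is to call the underlying “replacement function” directly. This is a sketch for illustration only, with invented data frames df1 and df2; ordinary code should use the names(x) <- value form:

```r
df1 <- data.frame(education = c("college", "middle"), counts = c(4, 5))
df2 <- df1

# The usual form.
names(df1) <- c("ed", "ct")

# Roughly what R does behind the scenes: call the function `names<-`
# and assign its result back to the same variable.
df2 <- `names<-`(df2, c("ed", "ct"))

identical(df1, df2)   # TRUE
```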

For a multi-dimensional object, give one index per dimension, separated by
commas.

>df[[3,'ed']]

[1] college
Levels: middle highschool college

It’s fine to mix names and indices when selecting parts of an object.

Subsets

Multiple parts of a data structure are similarly accessed using single square
brackets: [ and ]. What goes between the brackets, to specify the positions
or names of the desired subset, may be of multiple forms.
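A sketch of the most common forms, using the counts vector defined earlier (repeated here so the example stands alone):

```r
counts <- c(4, 3, 7, 5, 2)

by_position <- counts[1:3]          # a range of positions: 4 3 7
by_exclusion <- counts[-1]          # everything except position 1: 3 7 5 2
by_condition <- counts[counts > 3]  # a logical test: 4 7 5

by_position
by_exclusion
by_condition
```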

Functions

Functions package up a batch of commands. There are several reasons to develop
functions in R for data analysis:

reuse

readability

modularity

consistency

Writing functions to use multiple times within a project prevents you from
duplicating code, a real time-saver when you want to update what the function
does. If you see blocks of similar lines of code through your project, those are
usually candidates for being moved into functions.

Anatomy of a function

Like all programming languages, R has keywords that are reserved for important
activities, like creating functions. Keywords are usually very intuitive; the
one we need is function.

function(...) {
    ...
    return(...)
}

Three components:

arguments: control how you can call the function

body: the code inside the function

return value: controls what output the function gives

We’ll make a function to extract the first row of its argument, which we give a
name to use inside the function:

function(z) {
    result <- z[1, ]
    return(result)
}

Note that z doesn’t exist until we call the function, which merely contains
the instructions for how any z will be handled.

Finally, we need to give the function a name so we can use it like we used c()
and seq() above.

first <- function(z) {
    result <- z[1, ]
    return(result)
}

>first(df)

ed ct
1 college 4

Question

Can you explain the result of entering first(counts) into the console?

Answer

The function caused an error, which prompted the interpreter to
print a helpful error message. Never ignore an error message. (A “warning”
does not stop your code, but read it before deciding it is safe to ignore.)
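To inspect the message without halting a script, one option is base R's tryCatch(), which this lesson does not otherwise cover. A sketch:

```r
first <- function(z) {
    result <- z[1, ]
    return(result)
}
counts <- c(4, 3, 7, 5, 2)

# counts is a plain vector with no rows or columns, so z[1, ] fails.
# tryCatch() captures the error and returns its message as a string.
message_text <- tryCatch(
    first(counts),
    error = function(e) conditionMessage(e)
)
message_text   # in an English locale: "incorrect number of dimensions"
```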

Flow Control

The R interpreter’s “focus” flows through a script (or any section of code you
run) line by line. Without additional instruction, every line is processed from
the top to bottom. “Flow control” is the generic term for causing the
interpreter to repeat or skip certain lines, using concepts like “for loops” and
“if/else conditionals”.

Flow control happens within blocks of code isolated between curly braces { and
}, known as “statements”.

if (...) {
    ...
} else {
    ...
}

The keyword if must be followed by a logical test which determines, at runtime, what to do next.
The R interpreter goes to the first statement if the logical value is TRUE and to the second statement if it’s FALSE.

An if/else conditional would allow the first function to avoid the error thrown by calling first(counts).
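A minimal sketch of that idea, assuming is.vector() is an acceptable test for “not a table”. The small data frame df here is a stand-in with invented rows:

```r
first <- function(z) {
    if (is.vector(z)) {
        result <- z[1]      # vectors have a single dimension
    } else {
        result <- z[1, ]    # data frames and matrices have rows
    }
    return(result)
}

counts <- c(4, 3, 7, 5, 2)
df <- data.frame(ed = c("college", "highschool"), ct = c(4, 3))

first(counts)   # 4
first(df)       # the first row of df
```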

Linear Models

Regression of a “response” variable against discrete and continuous “predictors”
is fundamental to statistical data analysis. The lm function, which is an
abbreviation for “linear model”, provides the simplest kind of regression in R.

Fitting a regression requires two inputs:

data

a data.frame with independent observations

model

a type of R expression called a formula

Specify a formula by naming a response variable left of a “~” and any number of
predictors to its right.

>y~a

y ~ a

Formula mini-language

Writing formulas requires understanding a very simple syntax for including
predictors and specifying which ones interact.

A few simple examples of increasingly complicated formulas:

y ~ a gives one predictor

y ~ a + b gives two predictors

y ~ a:b is only the interaction of a and b

y ~ a*b gives all three predictors
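You can ask R which predictors a formula expands to. The helper labels_for below is a name invented for this sketch; it relies on the terms() function from the stats package, which is loaded by default:

```r
# Hypothetical helper: list the predictor terms a formula expands to.
labels_for <- function(f) attr(terms(f), "term.labels")

labels_for(y ~ a)       # "a"
labels_for(y ~ a + b)   # "a" "b"
labels_for(y ~ a:b)     # "a:b"
labels_for(y ~ a * b)   # "a" "b" "a:b"
```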

Fitting models

Match your formula variables to the column names of a data frame, and pass the
formula and data.frame as the first two arguments to lm, for “linear
model”.

fit <- lm(weight ~ hindfoot_length, animals)

>summary(fit)

Factors in linear models

Data types matter in statistical modelling. For the predictors in a linear
model, the most important distinction is whether a variable is a factor.

fit <- lm(weight ~ species_id, animals)

>summary(fit)

The difference between 1 and 24 degrees of freedom in the last two models—with
one predictor each—is due to species_id being a factor.
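The same effect can be sketched with R's built-in iris dataset, used here only as a stand-in because the animals data is not reproduced in this text. A numeric predictor uses one degree of freedom, while a factor uses one fewer than its number of levels:

```r
# iris ships with R: 150 rows, with Species a factor of 3 levels.
fit_numeric <- lm(Sepal.Length ~ Petal.Length, data = iris)
fit_factor <- lm(Sepal.Length ~ Species, data = iris)

is.factor(iris$Species)   # TRUE

# Degrees of freedom: predictor first, then residuals.
anova(fit_numeric)$Df     # 1 148
anova(fit_factor)$Df      # 2 147
```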

Exercises

Exercise 1

Exercise 2

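In versions of R before 4.0.0, all character data is read in to a data.frame as
factors by default; from 4.0.0 onward, character columns stay as character
unless you ask otherwise. Use the read.csv() argument stringsAsFactors to
control this behavior explicitly, then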
subsequently modify the sex column in animals to make it a factor. Remember
that columns of a data.frame are identified to the R interpreter with the $
operator, e.g. animals$sex.

Exercise 3

Use the typeof function to inspect the data type of counts, and do the same
for another variable to which you assign a list of numbers. Why are they
different? Use c to combine counts with the new variable you just created
and inspect the result with typeof. Does c always create vectors?

Exercise 5

Exercise 6

The keywords else and if can be combined to allow flow control among more
than two statements, as below. Expand the first function once again to
differentiate between dat provided as a matrix and as a data.frame. It’s
up to you what the “first” element of a matrix should be!

Solution 3

x <- list(3, 4, 5, 7)

>typeof(counts)

[1] "double"

>typeof(x)

[1] "list"

>typeof(c(counts,x))

[1] "list"

The variable x has a data type of list, so R does not restrict its elements
to a particular type as it does for vectors. The result of combining a list and
vector is a list, because the list is the more flexible data structure.

Solution 6
