Understanding Basic Data Types in R

To make the most of the R language, you’ll need a strong understanding of the
basic data types and data structures and how to operate on them.

This is important to understand because these are the objects you will manipulate
on a day-to-day basis in R. Dealing with object conversions is one of the most
common sources of frustration for beginners.

Everything in R is an object.

R has 6 atomic vector types. This workshop covers the five listed below and
does not discuss the raw type.

character

numeric (real or decimal)

integer

logical

complex

By atomic, we mean the vector only holds data of a single type.

character: "a", "swc"

numeric: 2, 15.5

integer: 2L (the L tells R to store this as an integer)

logical: TRUE, FALSE

complex: 1+4i (complex numbers with real and imaginary parts)

R provides many functions to examine features of vectors and other objects, for
example

class() - what kind of object is it (high-level)?

typeof() - what is the object’s data type (low-level)?

length() - how long is it? What about two-dimensional objects?

attributes() - does it have any metadata?

# Example
x <- "dataset"
typeof(x)

[1] "character"

attributes(x)

NULL

y <- 1:10
y

[1] 1 2 3 4 5 6 7 8 9 10

typeof(y)

[1] "integer"

length(y)

[1] 10

z <- as.numeric(y)
z

[1] 1 2 3 4 5 6 7 8 9 10

typeof(z)

[1] "double"
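The difference between the high-level view of class() and the low-level view of typeof() is easiest to see side by side. The following is a minimal sketch; the variable names are chosen only for illustration. It also shows what length() reports for a two-dimensional object.

```r
# class() gives the high-level label; typeof() gives the storage type.
x <- 2.5   # a double-precision number
y <- 2L    # an integer (note the L suffix)

class(x)   # "numeric"
typeof(x)  # "double"

class(y)   # "integer"
typeof(y)  # "integer"

# For a two-dimensional object such as a matrix, length() counts every
# cell, while dim() reports the number of rows and columns.
m <- matrix(1:6, nrow = 2)
length(m)  # 6
dim(m)     # 2 3
```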

R has many data structures. These include

atomic vector

list

matrix

data frame

factors

Atomic Vectors

A vector is the most common and basic data structure in R and is pretty much the
workhorse of R. Technically, vectors can be one of two types:

atomic vectors

lists

although the term “vector” most commonly refers to the atomic types, not to lists.

The Different Vector Modes

A vector is a collection of elements that are most commonly of mode character,
logical, integer or numeric.

You can create an empty vector with vector(). (By default the mode is
logical. You can be more explicit as shown in the examples below.) It is more
common to use direct constructors such as character(), numeric(), etc.

vector()  # an empty 'logical' (the default) vector

logical(0)

vector("character", length = 5)  # a vector of mode 'character' with 5 elements

[1] "" "" "" "" ""

character(5)  # the same thing, but using the constructor directly

[1] "" "" "" "" ""

numeric(5)  # a numeric vector with 5 elements

[1] 0 0 0 0 0

logical(5)  # a logical vector with 5 elements

[1] FALSE FALSE FALSE FALSE FALSE

You can also create vectors by directly specifying their content. R will then
guess the appropriate mode of storage for the vector. For instance:

x <- c(1, 2, 3)

will create a vector x of mode numeric. These are the most common kind, and
are treated as double precision real numbers. If you want to explicitly create
integers, you need to add an L to each element (or coerce to the integer
type using as.integer()).
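Both routes to an integer vector can be compared directly. This is a small sketch with illustrative variable names:

```r
x <- c(1, 2, 3)      # plain numbers are stored as doubles
typeof(x)            # "double"

y <- c(1L, 2L, 3L)   # the L suffix requests integers
typeof(y)            # "integer"

z <- as.integer(x)   # explicit coercion from double to integer
typeof(z)            # "integer"

identical(y, z)      # TRUE: both approaches give the same vector
```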

The function is.na() indicates the elements of the vectors that represent
missing data, and the function anyNA() returns TRUE if the vector contains
any missing values:

x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)

[1] FALSE TRUE FALSE FALSE TRUE

is.na(y)

[1] FALSE FALSE FALSE FALSE FALSE

anyNA(x)

[1] TRUE

anyNA(y)

[1] FALSE
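Because is.na() returns a logical vector, it combines naturally with other base functions. The sketch below shows one way to count, locate and drop missing values:

```r
x <- c("a", NA, "c", "d", NA)

sum(is.na(x))     # number of missing values: 2
which(is.na(x))   # positions of the missing values: 2 5
x[!is.na(x)]      # only the non-missing elements: "a" "c" "d"
```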

Other Special Values

Inf is infinity. You can have either positive or negative infinity.

1/0

[1] Inf

NaN means Not a Number. It’s an undefined value.

0/0

[1] NaN
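These special values can be told apart with is.finite(), is.nan() and is.na(). A brief sketch, using an illustrative vector named vals:

```r
vals <- c(1, -1/0, 0/0, NA)
vals              # 1 -Inf NaN NA

is.finite(vals)   # TRUE FALSE FALSE FALSE
is.nan(vals)      # FALSE FALSE TRUE FALSE
is.na(vals)       # FALSE FALSE TRUE TRUE (NaN also counts as missing)
```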

What Happens When You Mix Types Inside a Vector?

R will create a resulting vector with a mode that can most easily accommodate
all the elements it contains. This conversion between modes of storage is called
“coercion”. When R converts the mode of storage based on its content, it is
referred to as “implicit coercion”. For instance, can you guess what the
following do (without running them first)?

xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)
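Once you have made your guesses, typeof() is one way to check them. The sketch below uses separate variable names (v1, v2, v3, chosen here for illustration) so that each result stays available:

```r
v1 <- c(1.7, "a")
typeof(v1)   # "character": the number becomes the string "1.7"

v2 <- c(TRUE, 2)
typeof(v2)   # "double": TRUE becomes 1

v3 <- c("a", TRUE)
typeof(v3)   # "character": TRUE becomes the string "TRUE"
```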

You can also control how vectors are coerced explicitly using the
as.<class_name>() functions:

as.numeric("1")

[1] 1

as.character(1:2)

[1] "1" "2"
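Explicit coercion does not always succeed. The sketch below shows a few cases, including one where R cannot make sense of the input and returns NA. The warning from as.numeric() is suppressed here only to keep the output quiet.

```r
# Coercion that makes sense succeeds
as.integer(c("1", "2", "3"))           # 1 2 3
as.logical(c("TRUE", "false", "T"))    # TRUE FALSE TRUE

# Strings that are not numbers become NA (as.numeric() also warns)
bad <- suppressWarnings(as.numeric(c("1.5", "a")))
bad                                    # 1.5 NA

# Converting a double to an integer drops the fractional part
as.integer(3.9)                        # 3
```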

Object Attributes

Objects can have attributes. Attributes are part of the object. These include:

names

dimnames

dim

class

attributes (contain metadata)

You can also glean other attribute-like information such as length (works on
vectors and lists) or number of characters (for character strings).

length(1:10)

[1] 10

nchar("Software Carpentry")

[1] 18
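Attributes can be inspected and set directly. The following is a minimal sketch; the vector v and the custom attribute "unit" are hypothetical names used only for illustration.

```r
v <- c(first = 1, second = 2, third = 3)

names(v)                 # "first" "second" "third"
attributes(v)            # a list holding the names attribute

attr(v, "unit") <- "cm"  # hypothetical custom metadata
attr(v, "unit")          # "cm"

length(v)                # 3
nchar(names(v))          # 5 6 5
```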

Matrix

In R, matrices are an extension of the numeric or character vectors. They are not
a separate type of object but simply an atomic vector with dimensions: the
number of rows and columns.

m <- matrix(nrow = 2, ncol = 2)
m

     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA

dim(m)

[1] 2 2

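You can check that matrices are atomic vectors with dimensions by using class() and typeof(). (In R 4.0.0 and later, class() reports both "matrix" and "array" for a matrix; earlier versions report only "matrix".)

m <- matrix(c(1:3))
class(m)

[1] "matrix" "array"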

typeof(m)

[1] "integer"

While class() shows that m is a matrix, typeof() shows that fundamentally the matrix is an integer vector.

Matrices in R are filled column-wise.

m <- matrix(1:6, nrow = 2, ncol = 3)
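Printing the matrix makes the column-wise fill order visible: the values 1 to 6 run down the first column before moving to the next.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

m[, 1]   # first column: 1 2
m[1, ]   # first row: 1 3 5
```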

Other ways to construct a matrix

m <- 1:10
dim(m) <- c(2, 5)

This takes a vector and transforms it into a matrix with 2 rows and 5 columns.

Another way is to bind columns or rows using cbind() and rbind().

x <- 1:3
y <- 10:12
cbind(x, y)

     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12

rbind(x,y)

  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:

mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)
mdat

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13

Elements of a matrix can be referenced by specifying the index along each dimension (e.g. “row” and “column”) in single square brackets.

mdat[2,3]

[1] 13
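Leaving one index empty selects a whole row or column. A short sketch that reuses the same mdat matrix:

```r
mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)

mdat[2, 3]               # a single element: 13
mdat[2, ]                # the whole second row: 11 12 13
mdat[, 3]                # the whole third column: 3 13
mdat[, 3, drop = FALSE]  # the same column, kept as a 2 x 1 matrix
```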

List

In R, lists act as containers. Unlike atomic vectors, the contents of a list are
not restricted to a single mode and can encompass any mixture of data
types. Lists are sometimes called generic vectors, because the elements of a
list can be any type of R object, even lists containing further lists. This
property makes them fundamentally different from atomic vectors.

A list is a special type of vector. Each element can be a different type.

Create lists using list() or coerce other objects using as.list(). An empty
list of the required length can be created using vector().

x <- list(1, "a", TRUE, 1+4i)
x

[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i

x <- vector("list", length = 5)  # empty list
length(x)

[1] 5

The content of elements of a list can be retrieved by using double square brackets.

x[[1]]

NULL

Vectors can be coerced to lists as follows:

x <- 1:10
x <- as.list(x)
length(x)

[1] 10

What is the class of x[1]?

What about x[[1]]?

Elements of a list can be named (i.e. lists can have the names attribute).

Lists can be extremely useful inside functions. Because functions in R can return only a single object, you can “staple” together lots
of different kinds of results into a single list that a function can return.

A list does not print to the console like a vector. Instead, each element of the
list starts on a new line.

Elements are indexed by double brackets. Single brackets will still return
another list. If the elements of a list are named, they can be referenced by the $ notation (e.g. xlist$data).
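The points above can be sketched with a small function that returns several results in one named list. The function describe_vector() and its element names are hypothetical, written only for this illustration.

```r
# Hypothetical helper: "staples" several results into one named list.
describe_vector <- function(v) {
  list(n = length(v), average = mean(v), limits = range(v))
}

res <- describe_vector(c(2, 4, 6, 8))

res$average        # 5, accessed with the $ notation
res[["limits"]]    # 2 8, accessed with double brackets
class(res["n"])    # "list": single brackets return another list
class(res[["n"]])  # "integer": double brackets return the element itself
```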

Data Frame

A data frame is a very important data structure in R. It’s pretty much the de facto
data structure for most tabular data and what we use for statistics.

A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a “rectangular” list).

Data frames can have additional attributes such as rownames(), which can be
useful for annotating data, like subject_id or sample_id. But most of the
time they are not used.

Some additional information on data frames:

Usually created by read.csv() and read.table(), i.e. when importing the data into R.

Assuming all columns in a data frame are of the same type, a data frame can be converted to a matrix with data.matrix() (preferred) or as.matrix(). Otherwise type coercion will be enforced and the results may not always be what you expect.

You can also create a new data frame with the data.frame() function.

Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.

Rownames are often automatically generated and look like 1, 2, …, n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset.
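The examples later in this section refer to a data frame named dat. The sketch below builds one with data.frame() that is consistent with the outputs shown there: ten rows, with a third column y holding 11 to 20. The id and x columns are assumptions chosen for illustration.

```r
# Assumed definition: only the third column (y = 11:20) is implied by the
# outputs shown in this section; id and x are illustrative.
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

nrow(dat)    # 10
ncol(dat)    # 3
dat[1, 3]    # 11

# Columns of the same type convert cleanly to a matrix
data.matrix(dat[, c("x", "y")])
```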

Useful Data Frame Functions

dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)

nrow() - number of rows

ncol() - number of columns

str() - structure of data frame - name, type and preview of data in each column

names() or colnames() - both show the names attribute for a data frame

sapply(dataframe, class) - shows the class of each column in the data frame
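The functions listed above can be tried on any small data frame. The sketch below uses a hypothetical data frame named survey; stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Hypothetical example data
survey <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  weight    = c(4.2, 5.1, 3.8),
  treated   = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

dim(survey)             # 3 3
nrow(survey)            # 3
ncol(survey)            # 3
names(survey)           # "sample_id" "weight" "treated"
colnames(survey)        # same as names() for a data frame
sapply(survey, class)   # "character" "numeric" "logical"
str(survey)             # name, type and preview of each column
```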

See that it is actually a special list:

is.list(dat)

[1] TRUE

class(dat)

[1] "data.frame"

Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets (similar to a matrix).

dat[1,3]

[1] 11

As data frames are also lists, it is possible to refer to columns (which are elements of such a list) using the list notation, i.e. either double square brackets or a $.

dat[["y"]]

[1] 11 12 13 14 15 16 17 18 19 20

dat$y

[1] 11 12 13 14 15 16 17 18 19 20

The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to the diversity of data types they can contain.

Dimensions   Homogeneous     Heterogeneous
1-D          atomic vector   list
2-D          matrix          data frame

Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data frames or other types of objects). Lists can also contain elements of any length, so a list does not necessarily have to be “rectangular”. However, for a list to qualify as a data frame, the length of each element has to be the same.
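The equal-length requirement can be checked directly. In this sketch, the names ragged and rect are illustrative; tryCatch() is used only to capture the error instead of stopping the script.

```r
# A list may hold elements of different lengths
ragged <- list(a = 1:3, b = c("x", "y"))
lengths(ragged)   # a: 3, b: 2

# Converting it to a data frame fails because the lengths differ
result <- tryCatch(as.data.frame(ragged), error = function(e) "error")
result            # "error"

# With equal lengths, the conversion works
rect <- list(a = 1:3, b = c("x", "y", "z"))
df <- as.data.frame(rect)
dim(df)           # 3 2
```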

Column Types in Data Frames

Knowing that data frames are lists, can columns be of different type?

What type of structure do you expect to see when you explore the structure of the iris data frame? Hint: Use str().

# The Sepal.Length, Sepal.Width, Petal.Length and Petal.Width columns are all
# numeric types, while Species is a Factor.
# Lists can have elements of different types.
# Since a Data Frame is just a special type of list, it can have columns of
# differing type (although, remember that type must be consistent within each column!).
str(iris)