
The word "data" is a general purpose word denoting a collection of measurements. "Data points" refer to individual instances of data. A "data set" is a well-structured set of data points. Data points can be of several "data types," such as numbers, or text, or date-times. When we collect data on similar objects in similar formats, we bundle the data points into a "variable." We could give a variable a name such as 'age,' which could represent the list of ages of everyone in a room. The data points associated with a variable are called the "values" of the variable. These concepts are foundational to understanding data science. There is some quirkiness in the way variables are treated in the R programming language.

The Wiktionary defines data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another.

The Wiktionary defines datum as a measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from 1 to 10 by ones, 1 to 100 by 0.1, or simply true or false, on or off, yes, no, or maybe, etc.; and as a fact known from direct observation.

For our purposes, the key components of these definitions are that data are observations that are measured and communicated in such a way as to be intelligible to both the recorder and the reader. So, you as a person are not data, but recorded observations about you are data. For example, your name when written down is data; a digital recording of you speaking your name is data; and a digital photograph of your face or a video of you dancing is data.

Rather than call a single measurement by the formal word "datum," we will use what the Wikipedia calls a data point. We may talk about a single data point or several data points. Just remember that when we talk of "data," what we mean is a set of aggregated data points.

The Wiktionary, unhelpfully, defines a data set as a "set of data." Let us define a data set as a collection of data points that has been observed on similar objects and formatted in similar ways. Thus, a compilation of the written names and the written ages of a room full of people is a data set. In computing, a data set is stored in a file on a disk. Storing the data set in a file makes it accessible to analysis.

As illustrated earlier, data can exist in many forms, such as text, numbers, images, audio, and video. People who work with data have taken great care to very specifically define different data types. They do this because they want to compute various operations on the data, and those operations only make sense for particular data types. For example, addition is an operation we can compute on integer data types (2+2=4), but not on text data types ("two"+"two"=???). Concatenation is an operation we can compute on text. To concatenate means to put together, so: concatenate(two, two) = twotwo. For the purposes of this introduction, we will just concern ourselves with simple numeric and simple text data types and leave more complex data types—like images, audio, and video—to more advanced courses. Data scientists use the various data types from mathematics, statistics, and computer science to communicate with each other.
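The point that an operation only makes sense for particular data types can be tried directly in R, the language used later in this chapter. This is a minimal sketch; the object names (total, joined, attempt) are illustrative only.

```r
# Addition is defined for numeric values.
total <- 2 + 2

# Concatenation is defined for text; paste0() joins strings with no separator.
joined <- paste0("two", "two")

# Adding two text values is an error in R; try() catches the error
# so the rest of the script can continue.
attempt <- try("two" + "two", silent = TRUE)

print(total)                           # 4
print(joined)                          # "twotwo"
print(inherits(attempt, "try-error"))  # TRUE
```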

We will introduce just the most commonly used data types in Mathematics. There are many more, but we'll save those for more advanced courses.

Integers - According to the Wikipedia, integers are numbers that can be written without a fractional or decimal component, and fall within the set {..., −2, −1, 0, 1, 2, ...}. For example, 21, 4, and −2048 are integers; 9.75, 5½, and √2 are not integers.

Rational Numbers - According to the Wikipedia, rational numbers are those that can be expressed as the quotient or fraction p/q of two integers, with the denominator q not equal to zero. Since q may be equal to 1, every integer is a rational number. The decimal expansion of a rational number always either terminates after a finite number of digits or begins to repeat the same finite sequence of digits over and over. For example, 9.75, 2/3, and 5.8144144144… are rational numbers.

Real Numbers - According to the Wikipedia, real numbers include all the rational numbers, such as the integer −5 and the fraction 4/3, plus all the irrational numbers such as √2 (1.41421356... the square root of two), π (3.14159265...), and e (2.71828...).

Imaginary Numbers - According to the Wikipedia, imaginary numbers are those whose square is less than or equal to zero. For example, √-25 is an imaginary number and its square is -25. An imaginary number can be written as a real number multiplied by the imaginary unit i, which is defined by its property i² = −1. Thus, √-25 = 5i.
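As a small illustration of the imaginary-number example, R can carry out the same calculation once the value is stored as a complex number. This is only a sketch; the object name z is illustrative.

```r
# sqrt(-25) on an ordinary number returns NaN with a warning,
# so the value is first converted to the complex data type.
z <- sqrt(as.complex(-25))

print(z)    # 0+5i
print(z^2)  # -25+0i
```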

Data scientists understand that the kind of mathematical operations they may perform depends on the data types reflected in their data.

We will introduce just the most commonly used data types in statistics, as defined in the Wikipedia. There are a few more data types in statistics, but we'll save those for more advanced courses.

Nominal - Nominal data are recorded as categories. For this reason, nominal data is also known as categorical data. For example, rocks can be generally categorized as igneous, sedimentary and metamorphic.

Ordinal - Ordinal data are recorded as the rank order of scores (1st, 2nd, 3rd, etc.). An example of ordinal data is the result of a horse race, which says only which horses arrived first, second, or third but includes no information about race times.

Interval - Interval data record not just the order of the data points, but also the size of the intervals in between data points. A highly familiar example of interval scale measurement is temperature with the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the temperature difference between the freezing and boiling points of water. The zero point, however, is arbitrary.

Ratio - Ratio data are recorded on an interval scale with a true zero point. Mass, length, time, plane angle, energy and electric charge are examples of physical measures that are ratio scales. Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value. For example, the Kelvin temperature scale has a non-arbitrary zero point of absolute zero.

Data scientists know that the kind of statistical analysis they will perform is determined by the kinds of data types they will be analyzing.

Bit - A bit (a contraction of binary digit) is the basic unit of information in computing and telecommunications; a bit represents either 1 or 0 (one or zero) only. This kind of data is sometimes also called binary data. When 8 bits are grouped together we call that a byte. A byte can have values in the range 0-255 (00000000-11111111). For example, the byte 10110100 = 180.
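The byte example can be checked in R with the base function strtoi(), which reads a string of digits in a given base. A minimal sketch; byte_value and largest_byte are illustrative names.

```r
# Interpret the string of eight bits as a base-2 number.
byte_value <- strtoi("10110100", base = 2)
print(byte_value)    # 180

# The largest value that eight bits can hold.
largest_byte <- strtoi("11111111", base = 2)
print(largest_byte)  # 255
```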

Hexadecimal - Bytes are often represented as Base 16 numbers. Base 16 is known as Hexadecimal (commonly shortened to Hex). Hex uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F (or alternatively a–f) to represent values ten to fifteen. Each hexadecimal digit represents four bits, thus two hex digits fully represent one byte. As we mentioned, byte values can range from 0 to 255 (decimal), but may be more conveniently represented as two hexadecimal digits in the range 00 to FF. A two-byte number would also be called a 16-bit number. Rather than representing a number as 16 bits (0010101011110011), we would represent it as 2AF3 (hex) or 10995 (decimal). With practice, computer scientists become proficient in reading and thinking in hex. Data scientists must understand and recognize hex numbers. There are many websites that will translate numbers from binary to decimal to hexadecimal and back.
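Instead of a website, the same conversions can be sketched in R with strtoi() and sprintf(). The object names are illustrative.

```r
# Hexadecimal string to decimal value.
hex_value <- strtoi("2AF3", base = 16)
print(hex_value)    # 10995

# Decimal value back to an upper-case hexadecimal string.
back_to_hex <- sprintf("%X", hex_value)
print(back_to_hex)  # "2AF3"

# Two hex digits cover exactly one byte: FF is 255.
one_byte_max <- strtoi("FF", base = 16)
print(one_byte_max) # 255
```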

Boolean - The Boolean data type encodes logical data, which has just two values (usually denoted "true" and "false"). It is intended to represent the truth values of logic and Boolean algebra. It is used to store the evaluation of the logical truth of an expression. Typically, two values are compared using logical operators such as .eq. (equal to), .gt. (greater than), and .le. (less than or equal to). For example, b = (x .eq. y) would assign the boolean value of "true" to "b" if the value of "x" was the same as the value of "y," otherwise it would assign the logical value of "false" to "b."
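In R the comparison operators are written ==, > and <= rather than .eq., .gt. and .le., and the Boolean data type is called "logical." A minimal sketch of the assignment described above, with illustrative values for x and y:

```r
x <- 7
y <- 7

# Store the result of the comparison "x equal to y".
b <- (x == y)

print(b)          # TRUE
print(x > y)      # FALSE
print(x <= y)     # TRUE
print(typeof(b))  # "logical"
```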

Alphanumeric - This data type stores sequences of characters (a-z, A-Z, 0-9, special characters) in a string--from a character set such as ASCII for western languages or Unicode for Middle Eastern and Asian languages. Because most character sets include the numeric digits, it is possible to have a string such as "1234". However, this would still be an alphanumeric value, not the integer value 1234.
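The difference between the string "1234" and the integer 1234 can be seen in R. This is a sketch with illustrative object names; the L suffix is R's way of writing an integer constant.

```r
text_value   <- "1234"   # a string of four characters
number_value <- 1234L    # an integer

print(typeof(text_value))    # "character"
print(typeof(number_value))  # "integer"
print(nchar(text_value))     # 4

# Arithmetic only works after converting the string to a number.
converted <- as.integer(text_value) + 1L
print(converted)             # 1235
```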

Integers - This data type has essentially the same definition as the mathematical data type of the same name. In computer science, however, an integer can either be signed or unsigned. Let us consider a 16-bit (two byte) integer. In its unsigned form it can have values from 0 to 65535 (2^16 - 1). However, if we reserve one bit for the sign, then in the usual two's complement encoding the range becomes -32768 to +32767 (-8000 to +7FFF in hex).
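The limits of a 16-bit integer follow from simple powers of two. The sketch below assumes the usual two's complement encoding for signed integers; the object names are illustrative. It also shows that R's own integer type is a 32-bit signed integer.

```r
bits <- 16

unsigned_max <- 2^bits - 1        # all 16 bits used for magnitude
signed_min   <- -2^(bits - 1)     # two's complement lower limit
signed_max   <- 2^(bits - 1) - 1  # two's complement upper limit

print(unsigned_max)  # 65535
print(signed_min)    # -32768
print(signed_max)    # 32767

# R stores integers in 32 bits, so its largest integer is 2^31 - 1.
print(.Machine$integer.max)  # 2147483647
```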

Floating Point - This data type is a method of representing real numbers in a way that can support a wide range of values. The term floating point refers to the fact that the decimal point can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. In scientific notation, the given number is scaled by a power of 10 so that it lies within a certain range, typically between 1 and 10, with the decimal point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter's moon Io is 152853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047 × 10^5 seconds. Floating-point representation is similar in concept to scientific notation. The base part of the number is called the significand (or sometimes the mantissa) and the exponent part of the number is unsurprisingly called the exponent.
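R can print a number in scientific notation, which makes the significand and the exponent visible. A minimal sketch using the Io example; the object names are illustrative.

```r
io_period <- 152853.5047  # seconds

# "%.9e" asks for scientific notation with nine digits after the decimal point.
sci <- sprintf("%.9e", io_period)
print(sci)  # "1.528535047e+05"

# Split the text at "e" to separate the significand from the exponent.
parts <- as.numeric(strsplit(sci, "e")[[1]])
print(parts[1])  # significand: 1.528535047
print(parts[2])  # exponent: 5
```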

The two most common ways in which floating point numbers are represented are either in 32-bit (4 byte) single precision, or in 64-bit (8 byte) double precision. Single precision devotes 24 bits (about 7 decimal digits) to its significand. Double precision devotes 53 bits (about 16 decimal digits) to its significand.
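R stores its "double" values in 64-bit double precision, and the built-in .Machine list reports the size of the significand. The sketch below also shows a well-known consequence of the limited precision.

```r
# Number of bits in the significand of a double.
print(.Machine$double.digits)  # 53

# 0.1, 0.2 and 0.3 cannot be stored exactly in binary,
# so an exact comparison fails.
exact_match <- (0.1 + 0.2 == 0.3)
print(exact_match)             # FALSE

# all.equal() compares with a small tolerance instead.
close_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))
print(close_match)             # TRUE
```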

List - This data type is used to represent complex data structures. In its most simple form, it has a key-value pair structure. For example, think of a to-do list:

Key    Value
1      Get haircut
2      Buy groceries
3      Take shower

Lists can become and often do become very complex. The keys do not have to be numeric, but could be words, such as "one," "two," and "three." The values do not have to be a single data point. The value could be a series of numbers, or a matrix of numbers, or a paragraph. For example the first key in a list could be "Romeo and Juliet," and the first value in the list could be the entire play of Romeo and Juliet. The second key in the list could be "Macbeth," and the second value in the list could be the entire play of Macbeth. Finally, a value in a list could even be another list. At this point do not go down the rabbit hole of "a list within a list within a list . . ." We will leave that to graduate students in computer science.
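The to-do list and the idea of a list inside a list can be sketched with R's list() function. The object names and the sample numbers are illustrative only.

```r
# A simple key-value list: the keys are "1", "2", "3".
todo <- list("1" = "Get haircut",
             "2" = "Buy groceries",
             "3" = "Take shower")

# Look up a value by its key.
print(todo[["2"]])  # "Buy groceries"

# A value can be a series of numbers, or even another list.
nested <- list(tasks = todo, scores = c(90, 85, 77))
print(nested[["tasks"]][["1"]])  # "Get haircut"
print(length(nested$scores))     # 3
```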

Data scientists understand the importance of how data is represented in computer science, because it affects the results they are generating. This is especially true when small rounding errors accumulate over a large number of iterations.
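A small sketch of how rounding errors accumulate: adding 0.1 ten times does not give exactly 1, because 0.1 has no exact binary representation. The object name total is illustrative.

```r
total <- 0
for (i in 1:10) {
  total <- total + 0.1  # each addition carries a tiny rounding error
}

print(total == 1)                   # FALSE
print(isTRUE(all.equal(total, 1)))  # TRUE, equal within a small tolerance
print(1 - total)                    # the size of the accumulated error
```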

There are at least 24 data types in the R language.[1] We will just introduce you to the 9 most commonly used data types. As you will see they are a blend of the data types that exist in Mathematics, Statistics, and Computer Science. Just what a Data Scientist would expect. The nine are:

NULL - for something that is nothing

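logical - for something that is either TRUE or FALSE (on or off; 1 or 0)

character - for text made up of letters, digits, or other symbols

integer - for whole numbers

double - for real numbers stored in floating point

complex - for complex numbers that have both real and imaginary parts (e.g., square root of -1)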

date - for dates only (internally represented as the number of days since 1970-01-01, with negative values for earlier dates)

POSIX - for dates and times (in the POSIXct form, internally represented as the number of seconds since the start of 1970-01-01 UTC)

list - for storing complex data structures, including the output of most of the built-in R functions

You can get R to tell you what type a particular data object is by using the typeof() command. If you want to know what a particular data object was called in the original definition of the S language [2] you can use the mode() command. If you want to know what object class a particular data object belongs to, which is what R uses to decide how functions such as print() treat it, you can use the class() command. For the purposes of this book, we will mostly use the typeof() command.
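The three commands can give different answers for the same object. A minimal sketch with two illustrative objects, an integer and a date:

```r
count <- 5L
print(typeof(count))  # "integer"
print(mode(count))    # "numeric"
print(class(count))   # "integer"

# A date is stored as a number of days, but carries the class "Date".
start <- as.Date("1970-01-10")
print(typeof(start))      # "double"
print(mode(start))        # "numeric"
print(class(start))       # "Date"
print(as.numeric(start))  # 9 days after 1970-01-01
```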

Just a note about lists in R. R likes to use the list data type to store the output of various procedures. We generally do not perform statistical procedures on data stored in list data types, with one big exception. In order to do statistical analysis on lists, we need to convert them to tables with rows and columns. R has a number of functions to move data back and forth between table-like structures and list data types. The exception we just referred to is called the data.frame list object. List objects of the class data.frame store rows and columns of data in such a specifically defined way as to facilitate statistical analysis. We will explain data frames in more detail below.

Data scientists must know exactly how their data are being represented in the analysis package, so they can apply the correct mathematical operations and statistical analysis.

Let us start by noting the opposite of a variable is a constant. If we declare that the symbol "X" is a constant and assign it a value of 5, then X=5. It does not change; X will always be equal to 5. Now, if we declare the symbol "Y" to be a variable, that means Y can have more than one value (see the Wiktionary entry for "variable"). For example, in the mathematical equation Y^2 = 4 (Y squared equals 4), the variable Y can either have the value of 2 or -2 and satisfy the equation.

Imagine we take a piece of paper and make two columns. At the top of the first column we put the label "name" and the top of the second column we put the label "age." We then ask a room full of 20 people to each write down their name and age on the sheet of paper in the appropriate columns. We will end up with a list of 20 names and 20 ages. Let us use the label "name" to represent the entire list of 20 names and the label "age" to represent the entire list of 20 ages. This is what we mean by the term "variable." The variable "name" has 20 data points (the list of 20 names), and the variable "age" has 20 data points (the list of 20 ages). A variable is a symbol that represents multiple data points which we also call values. Other words that have approximately the same meaning as "value" are measurement and observation. Data scientists use these four terms (data point, value, measurement, and observation) interchangeably when they communicate with each other.

The word "variable" is a general purpose word used in many disciplines. However, various disciplines also use more technical terms that mean approximately the same thing. In mathematics another word that approximates the meaning of the term "variable" is vector. In computer science, another word that approximates the meaning of the term "variable" is array. In statistics, another word that approximates the meaning of the term "variable" is distribution. Data scientists will often use these four words (variable, vector, array, and distribution) interchangeably when they communicate with each other.

Let us think again of the term data set (defined above). A data set is usually two or more variables (and their associated values) combined together. Once our data is organized into variables, combined into a data set, and stored in a file on a disk, it is ready to be analyzed.

The R programming language is a little quirky when it comes to data types, variables, and data sets. In R we sometimes use the term "vector" instead of "variable." When we combine and store multiple vectors (variables) into a data set in R, we call it a data frame. When R stores vectors into a data frame, it assigns a role to indicate how the data will be used in subsequent statistical analyses. So in R data frames, for example, the "character" data type is assigned the role of Factor. (R does this by default in versions before 4.0.0; in later versions it happens only when stringsAsFactors = TRUE is passed to data.frame().) The "logical" and "date/time" data types keep their own labels in the output of str(), such as logi, Date, and POSIXct. The "double" data type is assigned the role of num and the "integer" data type is assigned the role of int. (The "complex" data type is assigned the role of "cplx," but don't worry about that now.) These roles correspond to the statistical data types as follows: Factor = nominal, int = ordinal, and num = interval. (We usually transform the ratio data type into an interval data type before doing statistical analysis. This is normally done by taking the logarithm of the ratio data. More on this in later chapters.) We can discover the roles each variable will play within a data frame by using the structure command in R: str(). We will explain what "factors" are in later chapters.

This assignment should be done in a group of 3 or 4 students. The groups need to be composed of different people from the previous two homework groups. All should interact with the R programming language. The group can help each other both learn the concepts and figure out how to make R work. Practice with R by trying out different ways of using the commands that are described below.

If you don't explicitly specify a data type through the as.* commands, R tries to figure out what data type you intended. It does not always read your mind correctly. Play around with R, assigning some values to some variables and then use the typeof() command to see the automatic assignments of data types that R made for you. Then see if you can convert a value from one data type to another.
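As a starting point for this exercise, here is a short sketch of R's automatic choices and a few as.* conversions. The object names are illustrative.

```r
plain_number <- 5     # no suffix: R stores it as a double
whole_number <- 5L    # the L suffix asks for an integer
some_text    <- "5"   # quotes make it a character string
flag         <- TRUE  # a logical value

print(typeof(plain_number))  # "double"
print(typeof(whole_number))  # "integer"
print(typeof(some_text))     # "character"
print(typeof(flag))          # "logical"

# Converting from one data type to another.
print(as.integer(some_text))       # 5
print(as.character(plain_number))  # "5"
print(as.integer(flag))            # 1

# A conversion that makes no sense yields NA (with a warning, silenced here).
not_a_number <- suppressWarnings(as.numeric("abc"))
print(is.na(not_a_number))         # TRUE
```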

The R language is based on an object-oriented programming language. Thus, things in R are called objects. So, when we assign a value to the letter "X," in R we would say we have assigned a value to the object "X." Objects in R may have different properties from each other, depending on how they are used. For this exercise, we will concern ourselves with objects that behave like variables. Those types of objects are called vector objects. So, when we talk—in the language of data science—about the variable "X," in R we could call it the vector "X." As you remember, a variable is something that varies. Let's create a character vector in R and assign it three values. We will use the concatenate c() command in R. Let's also create an integer vector using the same concatenate command.

> name <- c("Maria", "Fred", "Sakura")
> typeof(name)
> name
> age <- as.integer(c(24, 19, 21))
> typeof(age)
> age

Both vectors now have three values each. The character string "Maria" is in the first position of the vector "name," "Fred" is in the second position, and "Sakura" is in the third position. Similarly, the integer 24 is in the first position of the vector "age," 19 is in the second position, and 21 is in the third position. Let's examine each of these individually.
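One way to examine the values individually is to index each vector by position with square brackets. This sketch repeats the two assignments so that it runs on its own.

```r
name <- c("Maria", "Fred", "Sakura")
age  <- as.integer(c(24, 19, 21))

print(name[1])       # "Maria"
print(name[2])       # "Fred"
print(age[3])        # 21
print(length(name))  # 3 values in the vector
```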

If we had observed the actual names and ages of three people so that name[1] corresponded to age[1], we would have a data set that looks like the following.

Name      Age
Maria     24
Fred      19
Sakura    21

Let us put our data set into an R data frame object. We need to think of a name for our data frame object. Let's call it "project." After we put our data set into the data frame, we will inspect it using R's "typeof," "class," "ls," and "structure" commands, str(). Remember, upper and lower cases are meaningful.
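The commands for this step might look like the sketch below. It is an assumption about the intended code, not a copy of it. The argument stringsAsFactors = TRUE is included so that the character vector is turned into a Factor regardless of the R version; R 4.0.0 and later do not do this by default.

```r
name <- c("Maria", "Fred", "Sakura")
age  <- as.integer(c(24, 19, 21))

# Combine the two vectors into a data frame called "project".
project <- data.frame(name, age, stringsAsFactors = TRUE)

print(typeof(project))  # "list"
print(class(project))   # "data.frame"
print(ls(project))      # "age" "name"
str(project)            # observations, variables, and their roles
```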

The typeof() function told us we had created a list object. The class() function told us it is a special type of list object known as a data.frame. The ls() function tells us what "key-value" pairs exist inside our list object. Please don't worry too much about all of that detail right now. What is important is what the str() function tells us.

The structure command tells us we have three observations and two variables. That is great. It tells us the names of the variables are $name and $age. This tells us that when we put a data set into an R data frame list object, we need to reference the variable WITHIN the data frame as follows: project$name and project$age. The structure command also tells us that project$name was assigned the role of a "Factor" variable and that project$age was assigned the role of "int." These correspond to the "nominal" and "ordinal" data types that statisticians use. R needs to know the role variables play in order to perform the correct statistical functions on the data. One might argue that the age variable is more like the statistical interval data type than the statistical ordinal data type. We would then have to change the R data type from integer to double. This will change its role to "num" within the data frame.

Rather than change the data type of project$age, it is a good practice to create a new variable, so the original is not lost. We will call the new variable project$age.n, so we can tell that it is the transformed project$age variable.

> project$age.n <- as.double(project$age)
> str(project)

We can now see that project$age and the project$age.n variables play different roles in the data frame, one as "int" and one as "num." Now, confirm that the complete data set has been properly implemented in R by displaying the data frame object.

> project
    name age age.n
1  Maria  24    24
2   Fred  19    19
3 Sakura  21    21

Now let's double check the data types.

> typeof(project$name)
> typeof(project$age)
> typeof(project$age.n)

Whoops! We see some of the quirkiness of R. When we created the variable "name," it had a data type of "character." When we put it into a data frame not only did R assign it the role of a "Factor" but it also changed its data type to "integer." What is going on here? This is more than you want to know right now. We will explain it now, but you really don't have to understand it until later.

Because all statistical computations are done on numbers, R gave each value of the variable "name" an arbitrary integer number. It calls these arbitrary numbers levels. It then labeled these levels with the original values, so we would know what is going on. So under the covers, project$name has the values: 2 (labeled "Maria"), 1 (labeled "Fred"), and 3 (labeled "Sakura"). The numbers follow the alphabetical order of the labels. We can convert project$name back into the character data type, but we won't be able to perform statistical calculations on it.
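The integer codes and their labels can be inspected directly. The sketch below builds the factor on its own with factor(), which is an assumption made to keep the example self-contained; the same commands apply to project$name when it is stored as a Factor.

```r
name_factor <- factor(c("Maria", "Fred", "Sakura"))

print(levels(name_factor))      # "Fred" "Maria" "Sakura"
print(as.integer(name_factor))  # 2 1 3
print(typeof(name_factor))      # "integer"

# Converting back to the character data type recovers the original text.
name_text <- as.character(name_factor)
print(name_text)                # "Maria" "Fred" "Sakura"
print(typeof(name_text))        # "character"
```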