R Introduction – Importing Data into R

Introduction

Today we will talk about importing data into R so that we can analyze it. There
are many ways to do that, depending on the type of data. Some of the most
common types are:

CSV (comma-separated values)

Tabular data that uses separators other than the comma

Tab-separated data

XLS files (Microsoft Excel)

Text lines from a file

HTML, XML, JSON

And many others (HDF5, SPSS, Stata)

Let's look at how we can import them into R.

CSV and other tabular formats

CSV, or Comma Separated Values, is one of the most widely used formats for
storing data in tabular form. In this format, each line is a row and commas
separate the columns. The table may or may not have a header with column names.

Let's use a CSV file with information on air passengers, which you can download by clicking here.

After that, let's import this CSV into a data frame in R, using the read.csv
function.
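The original call is not reproduced here, so the following is only a minimal sketch. It assumes a tiny made-up file in place of the air passengers CSV; the path, column names and values are hypothetical.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV in a temp location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,month,passengers",
             "1949,Jan,112",
             "1949,Feb,118",
             "1949,Mar,132"), csv_path)

# header = TRUE: the first line of the file holds the column names.
passengers <- read.csv(csv_path, header = TRUE)
str(passengers)

# header = FALSE: the first line is read as data and R assigns
# generic column names (V1, V2, ...).
no_header <- read.csv(csv_path, header = FALSE)
names(no_header)
```

With your own data, replace csv_path with the path to the file you downloaded.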

We use the header=TRUE argument because, in this CSV file, the first line
contains the column names, that is, its header. If we used header=FALSE, the
values on the first line would not become the column names but the first row
of the data frame, and R would give the columns generic names. We can also use
read.table to read data in tabular format, including CSVs. In this function, we
need to specify the separator through the sep argument. The separator must be
given between quotation marks.
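As a rough illustration of the sep argument, the sketch below reads the same made-up table twice, once separated by commas and once by semicolons. The files and their contents are hypothetical, created only for this example.

```r
# Hypothetical sample files: identical data, different separators.
comma_path <- tempfile(fileext = ".csv")
writeLines(c("year,month,passengers",
             "1949,Jan,112",
             "1949,Feb,118"), comma_path)

semi_path <- tempfile(fileext = ".txt")
writeLines(c("year;month;passengers",
             "1949;Jan;112",
             "1949;Feb;118"), semi_path)

# The separator is passed as a quoted string.
from_comma <- read.table(comma_path, header = TRUE, sep = ",")
from_semi  <- read.table(semi_path,  header = TRUE, sep = ";")

# For tab-separated data, the separator would be sep = "\t".
print(from_comma)
print(from_semi)
```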

Sometimes the file that contains the data will have some instructions or
other information on the first lines. In these cases, you can use the skip
argument to indicate the number of lines you want to skip. Let's use the same
file from the previous example, which I edited to include some lines of text at
the beginning. Download it by clicking here.
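A minimal sketch of skip follows, assuming a made-up file with two lines of notes before the header. The file contents are hypothetical and stand in for the edited download.

```r
# Hypothetical file: two lines of free text, then the actual table.
skip_path <- tempfile(fileext = ".csv")
writeLines(c("Air passengers data",
             "Values below are illustrative only",
             "year,month,passengers",
             "1949,Jan,112",
             "1949,Feb,118"), skip_path)

# skip = 2 drops the first two lines before the header is read.
skipped <- read.csv(skip_path, header = TRUE, skip = 2)
print(skipped)
```

The value of skip has to match the number of extra lines in your own file, so it is worth opening the file in a text editor first.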

There are other useful arguments for read.table. With na.strings, you define
which strings should be interpreted as NA. With row.names and col.names, you
can define the names of the rows and columns. You can also define whether
strings should be converted to factors, with the stringsAsFactors argument. For
the complete list of arguments, click here.
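These arguments can be combined in one call. The sketch below assumes a made-up file without a header, in which the word "missing" marks an absent value; the data and the column names are hypothetical.

```r
# Hypothetical file: no header line, and "missing" marks an absent value.
opts_path <- tempfile(fileext = ".csv")
writeLines(c("1949,Jan,112",
             "1949,Feb,missing",
             "1949,Mar,132"), opts_path)

flights <- read.table(opts_path,
                      sep = ",",
                      header = FALSE,
                      col.names = c("year", "month", "passengers"),
                      na.strings = "missing",    # read "missing" as NA
                      stringsAsFactors = TRUE)   # text columns become factors

str(flights)
```

Because "missing" is read as NA, the passengers column can still be parsed as numbers instead of text.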

Reading Excel files

For Excel files, there are a few options. The first one is to use Excel to save
the tabular data in CSV format. Excel supports this and has a good interface
for it.

But to read the XLS file directly, we can also use a package named "xlsx".

First, install it with install.packages("xlsx"). Then, load it with
library(xlsx). To load the data, you use the read.xlsx function, entering the
name of the file to be read and the index or name of the sheet.

With this package, you can also define the starting row (startRow), the ending
row (endRow), whether the table has a header (header), and the class of each
column (colClasses), among others. To see the complete list of arguments, you
can access the package manual by clicking here.
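As a hedged sketch of read.xlsx, the example below first writes a small made-up workbook with write.xlsx from the same package and then reads it back. It assumes the xlsx package and its Java dependency are installed, and it does nothing otherwise; the data and the file path are hypothetical.

```r
excel_df <- NULL

# Only run when the xlsx package (which needs Java) is available.
if (requireNamespace("xlsx", quietly = TRUE)) {
  # Hypothetical workbook created just for this example.
  xlsx_path <- tempfile(fileext = ".xlsx")
  sample_df <- data.frame(year = c(1949, 1949),
                          passengers = c(112, 118))
  xlsx::write.xlsx(sample_df, xlsx_path,
                   sheetName = "Sheet1", row.names = FALSE)

  # Read the first sheet; the first row holds the column names.
  excel_df <- xlsx::read.xlsx(xlsx_path, sheetIndex = 1, header = TRUE)
  print(excel_df)
}
```

With your own file, you would pass its path instead of xlsx_path and pick the sheet with sheetIndex or sheetName.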

Reading Text Files

You may also want to analyze text lines from a file. A text file can contain,
for example, tweets or Facebook posts. To get this text data into R, you use
the readLines function. Just supply the name of the file or a connection.
You can also define the number of text lines to be read, with the n argument.
You can download the test file to try this out by clicking here:

text_vector <- readLines("text_lines.txt")
## Warning in readLines("text_lines.txt"): incomplete final line found on
## 'text_lines.txt'
print(text_vector)
## [1] "This is an example with text lines. This is the first one"
## [2] "The second one, with a little more text"
## [3] "Line of text 3 incoming"
## [4] "Just one more to finish"
text_vector_2 <- readLines("text_lines.txt", n = 2)
print(text_vector_2)
## [1] "This is an example with text lines. This is the first one"
## [2] "The second one, with a little more text"

Reading a webpage

You can also read a webpage to extract useful information from it. First, you
create a connection using the url function. Then, you use readLines on that
connection, just as we did to read a text file in the previous example.
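The two steps can be sketched as follows. So that the example runs without network access, it assumes a small made-up HTML file on disk and points url at it through a file:// address; for a real page you would pass its http:// or https:// address instead. The page contents are hypothetical.

```r
# Hypothetical local page, used here in place of a real website.
page_path <- tempfile(fileext = ".html")
writeLines(c("<html>",
             "<head><title>Example page</title></head>",
             "</html>"), page_path)

# Build a file:// URL from the path (handles Unix and Windows style paths).
page_path <- normalizePath(page_path, winslash = "/")
page_url <- if (startsWith(page_path, "/")) {
  paste0("file://", page_path)
} else {
  paste0("file:///", page_path)
}

# Step 1: create the connection. Step 2: read it line by line.
con <- url(page_url)
page_lines <- readLines(con)
close(con)

print(page_lines)
```

Each element of page_lines is one line of the page source, so the usual string functions (grepl, sub, and so on) can be used to pull out the parts you need.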