6 Webscraping with rvest

You may come across data online that are relevant to your interests or research (for example, past students at Penn have scraped data from sex offender registries and sporting information from Wikipedia). Not all online data are in a tidy, downloadable format such as a .csv or .rda file. Here we’ll learn how to grab data from a webpage - as our example we’ll be scraping data on movie ticket sales.

For our purposes we will be using the package rvest. This package makes it relatively easy to scrape data from websites, especially when that data is already in a table on the page as our data will be.

If you haven’t done so before, make sure to install rvest.

install.packages("rvest")

And every time you start R, if you want to use rvest you must tell R so by using library().

library(rvest)
#> Loading required package: xml2

We will be scraping movie ticket data from the website The-Numbers. This site has daily information on how much money each movie in theaters made that day. The data includes the name of the movie, the number of theaters it played in, how much it made that day, how much it made since it started playing, and how many days it has been in theaters. Conveniently, this is all found in a single table on that page.

Here is a screenshot of data from July 4th, 2018 and here is a link to that page.

6.1 Scraping one page

In later lessons we’ll learn how to scrape an entire year of data from this site. For now, we’ll focus on just getting data from July 4th, 2018.

The first step to scraping a page is to read in that page’s information to R using the function read_html() from the rvest package. The input for read_html() is the URL of the page we want to scrape. In a later lesson, we will manipulate this URL to be able to scrape data from many pages.
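For the July 4th, 2018 page, the call looks like the following. Note that the exact URL is an assumption here - use whatever address the page has when you visit it.

```r
library(rvest)

# Read the page's HTML into R. This URL is assumed to be the
# daily box office chart for July 4th, 2018 - adjust it if the
# site's address differs.
read_html("https://www.the-numbers.com/box-office-chart/daily/2018/07/04")
```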

When running the above code, it returns an XML Document. The rvest package is well suited for interpreting this and turning it into something we already know how to work with. To be able to work on this data, we need to save the output of read_html() into an object which we’ll call movie_data since that is our end goal.
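Saving the output is just an assignment, using the same assumed URL as above:

```r
library(rvest)

# Save the page's HTML into an object so we can keep working with it
movie_data <- read_html("https://www.the-numbers.com/box-office-chart/daily/2018/07/04")
```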

We now need to select only the part of the page which has the relevant information - in this case the data in the table.

Right click somewhere in the table and then click “Inspect Element”.

This will open up a tab on the screen that allows you to see the building blocks of that page. When you move your cursor over parts of this tab, the parts of the page it relates to will be highlighted in blue. We want all of the data from the table, so move your cursor to the line that starts with “&lt;table”. Doing so will highlight the entire table in blue.

Right click the “table” area and click Copy, then CSS Selector. That will copy what we need. Essentially this says which part of the page is the table and allows us to grab only that part from the XML Document we made earlier.

We will use the function html_nodes() to grab the part of the page (based on the CSS selectors) that we want. The input for this function is first the object made from read_html() (which we called movie_data) and then we can paste the text we copied from the website (putting it in quotes).
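The exact selector you copy depends on your browser and on the page’s current layout, so the quoted value below is only an illustration of the shape the call takes, not necessarily what you will paste:

```r
# The text in quotes is whatever CSS selector you copied from the
# browser - the selector shown here is an illustration and may not
# match the page exactly
movie_data <- html_nodes(movie_data, "#page_filling_chart > center > table")
```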

Note that when doing this in Google Chrome, you follow the same steps except click “Copy selector” rather than “CSS Selector”. The value copied also differs between Chrome and Firefox though the result is the same in our code.

Since we are getting data from a table, we need to tell rvest that the format of the scraped data is a table. We do this using html_table(), and our input in the () is the object made by the function html_nodes().

movie_data <- html_table(movie_data)

By default, rvest returns a list. We prefer to work with data.frames so we’re going to grab the first element in this list, which is the data we want. For lists, the single square brackets [] we used with vectors and data.frames return another list; to pull out the element itself we need double square brackets [[]].

movie_data <- movie_data[[1]]

Take a look at the webpage and compare it to the data set you’ve now created. All the values should now match.

We have now successfully scraped a website! The movie_data object is a data.frame object that we are familiar with from looking at the Chicago and UCR data so we can subset and manipulate it like we’ve done before.

6.2 Cleaning the webscraped data

Let’s check what the max value is in the “Gross” column which says how much the movie made on that day.

max(movie_data$Gross)
#> [1] "$9,646,015"

So the most money a movie made is about $9.6 million. Is that right? We can check either the website or the data using View() to see if there are any more successful movies (conveniently, the table is already sorted by how much the movie made). No! The most successful movie made $11.5 million, not $9.6 million. So why did max() say the top value is $9.6 million? Let’s take another look at the values in the “Gross” column.
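One way to look at them is head(), which prints the first few values of the column. On this data they come back as quoted strings with dollar signs and commas, which is the clue to what is going on:

```r
# Print the first few values of the Gross column
head(movie_data$Gross)
```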

The values are not actually numeric type. If a value is numeric in R it would only have numbers, not dollar signs or commas like we see here. It also would not be in quotes, R’s way of saying “this value is a character type”. So what we have to do is turn these values into numeric type.

The way to convert a character type into a numeric type is the function as.numeric(). Let’s take a look at the very first value in that column, “$11,501,395”.
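Trying as.numeric() directly on that value shows the problem:

```r
as.numeric("$11,501,395")
#> Warning: NAs introduced by coercion
#> [1] NA
```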

Running as.numeric() on that value returns NA because it doesn’t know how to handle the dollar sign and comma. If we remove those, it will work as expected.

as.numeric("11501395")
#> [1] 11501395

We can use gsub(), which we learned earlier in Chapter 5, to delete the dollar sign and commas from our values. After that we can use as.numeric() to fix that column. (Alternatively, we could use the function parse_number() from the readr package, but this is a good example of using regular expressions.)

Remember that the syntax of gsub() is

gsub("find", "replace", string_to_fix)

First we will remove the commas. We want to use gsub() to find all commas and replace them with nothing (deleting them). To indicate nothing we just use quotes with nothing inside them.

gsub(",", "", "$11,501,395")
#> [1] "$11501395"

Now to do the dollar sign. Remember that in gsub() and grep() the $ is a special character indicating that whatever precedes it is the last character. To tell R we want the $ literally, we put two backslashes before it - this is how to escape any special character, not just the dollar sign.

gsub("\\$", "", "$11,501,395")
#> [1] "11,501,395"

Each of these gsub()s works alone. We need to combine them to remove both the dollar sign and the commas. We have two choices for this. The first choice is to use the | operator, which tells gsub() to replace the value on the left or right side of the | symbol (or both if both are present).

gsub(",|\\$", "", "$11,501,395")
#> [1] "11501395"

The second choice is to do them separately and save the results into an object that we use in the second gsub(). Let’s first save our value “$11,501,395” into an object we call x and then run both gsub() statements we wrote earlier, using x (without quotes since it is an object) as our value and saving the results back into x. For those not very comfortable with regular expressions, this is the better way of doing it as you can deal with simpler gsub() expressions than above.
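Written out, the two-step version looks like this:

```r
x <- "$11,501,395"
x <- gsub(",", "", x)   # first delete the commas
x <- gsub("\\$", "", x) # then delete the dollar sign
x
#> [1] "11501395"
```

Either form can then clean the whole column at once, e.g. movie_data$Gross <- as.numeric(gsub(",|\\$", "", movie_data$Gross)).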

An alternative is to change the column names, using names() and gsub() together to remove the spaces. Running names() returns the name of every column. If we assign something to names(), it will actually change the column names to whatever we assign. What we want to do is use gsub() to replace all spaces in the column names with something else - either with nothing or with a value R understands as part of a name, such as an underscore.
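Using a tiny stand-in data.frame (the column name “Daily Gross” here is just an illustration, not necessarily what the site uses), the renaming looks like this:

```r
# A stand-in data.frame with a space in its column name; on the real
# scraped data you would run the gsub() line on movie_data instead
example_data <- data.frame(`Daily Gross` = "$11,501,395", check.names = FALSE)

# Replace every space in the column names with an underscore
names(example_data) <- gsub(" ", "_", names(example_data))
names(example_data)
#> [1] "Daily_Gross"
```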