Easy Web Scraping With Rvest: Exercises

The Internet is full of interesting data; there’s no doubt about it. Some sites, such as Twitter, provide structured access through an API, around which some neat R packages have been built. In this exercise set, we practice much more general techniques for extracting (scraping) data directly from web pages, using the rvest package.
Note that a basic understanding of the elements of HTML and XML, such as tags and their attributes, is useful for becoming an effective web scraper. A useful tool for identifying relevant tags quickly is SelectorGadget, which is available as an extension for the Chrome browser. Regular expression skills will also come in handy.

Exercise 1
Install and load the rvest package. Use read_html() to read in this webpage, which lists and links to lecture notes for the MIT course Introduction to Algorithms, as an R object. Name the object ln_page.
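As a rough sketch of this step, the snippet below parses a small inline HTML string instead of the real course page. The markup in sample_html is a hypothetical stand-in; with the actual page you would pass its URL to read_html() instead.

```r
# install.packages("rvest")  # run once if rvest is not installed yet
library(rvest)

# Hypothetical stand-in for the course page (assumption, not the real markup).
sample_html <- '<html><body>
  <h1>Lecture Notes</h1>
  <a href="lecture-notes/lec1.pdf">Lecture 1</a>
</body></html>'

# read_html() accepts a URL, a file path, or a literal HTML string.
ln_page <- read_html(sample_html)
class(ln_page)
```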

Exercise 2
Using html_nodes(), extract all links from ln_page and save them as ln_links. It might be helpful to first read up on HTML links on w3schools.com.
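One possible way to approach this, shown on a hypothetical inline page rather than the real one: links are <a> tags, so the CSS selector "a" picks them all up. In rvest 1.0.0 and later, html_elements() is the newer name for html_nodes(); both work.

```r
library(rvest)

# Hypothetical stand-in for the course page (assumption).
sample_html <- '<html><body>
  <a href="/courses/algorithms/">Course home</a>
  <a href="lecture-notes/lec1.pdf">Lecture 1</a>
  <a href="lecture-notes/lec2.pdf">Lecture 2</a>
</body></html>'
ln_page <- read_html(sample_html)

# Select every <a> element on the page.
ln_links <- html_nodes(ln_page, "a")
length(ln_links)
```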

Exercise 3
Now extract the text of the links to ln_links_text, and the href attribute, which defines where the links lead, to ln_links_path. What is the structure of the objects you extracted?
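A minimal sketch of the extraction, again on a hypothetical inline page: html_text() returns the visible text of each node, and html_attr() returns the value of a named attribute.

```r
library(rvest)

# Hypothetical stand-in for the course page (assumption).
sample_html <- '<html><body>
  <a href="/courses/algorithms/">Course home</a>
  <a href="lecture-notes/lec1.pdf">Lecture 1</a>
  <a href="lecture-notes/lec2.pdf">Lecture 2</a>
</body></html>'
ln_links <- html_nodes(read_html(sample_html), "a")

ln_links_text <- html_text(ln_links)          # visible link text
ln_links_path <- html_attr(ln_links, "href")  # where each link leads

# Inspect the structure of both objects.
str(ln_links_text)
str(ln_links_path)
```

Links without an href attribute come back as NA from html_attr(), which is worth keeping in mind for later filtering.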

Exercise 4
Turns out both are simply character vectors. Knowing that the lecture notes are all in PDF format, use your regex skills to extract only the paths that lead to PDF documents, and save them to ln_links_path_pdf. Print some of the paths to the console.
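The filtering could look something like the following, using base R only. The paths in ln_links_path are made up for illustration; the pattern simply keeps entries that end in ".pdf".

```r
# Hypothetical paths standing in for the scraped href values (assumption).
ln_links_path <- c(
  "/courses/algorithms/",
  "lecture-notes/lec1.pdf",
  "lecture-notes/lec2.pdf",
  "about.html",
  NA
)

# Keep only entries ending in ".pdf"; NA entries never match and are dropped.
ln_links_path_pdf <- grep("\\.pdf$", ln_links_path,
                          value = TRUE, ignore.case = TRUE)
head(ln_links_path_pdf)
```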

Exercise 5
Notice that the paths are relative, not absolute. If we want to download the PDFs, we need to prepend the website’s base URL to the paths. A nice challenge is to use regex so that the solution generalizes to any relative web path.
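One way to sketch this, assuming the paths are root-relative (they start with "/"): pull the scheme and host out of the page URL with a regex, then paste it in front of each path. Both page_url and the paths below are hypothetical.

```r
# Hypothetical page URL and root-relative paths (assumptions).
page_url <- "https://www.example.edu/courses/algorithms/lecture-notes/"
ln_links_path_pdf <- c("/courses/algorithms/lec1.pdf",
                       "/courses/algorithms/lec2.pdf")

# Capture "scheme://host" and drop everything after it.
site_root <- sub("^(https?://[^/]+).*$", "\\1", page_url)

# Prepend the site root to each root-relative path.
ln_links_url_pdf <- paste0(site_root, ln_links_path_pdf)
ln_links_url_pdf
```

This covers only paths that begin with a slash. For paths relative to the current page, xml2::url_absolute() may be a more general alternative to a hand-written regex.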

Exercise 6
Use a loop and R’s download.file() function to download at least two of the PDFs. Notice that you first need to decide what the files will be called on your hard drive (the destfile argument) and, of course, set your working directory.
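A possible shape for the loop is sketched below. So that it runs offline, two temporary placeholder files stand in for the remote PDFs and are addressed through file:// URLs; that setup is purely an assumption for illustration. With the real lecture notes, ln_links_url_pdf would hold the absolute http(s) URLs. The sketch also writes into an explicit dest_dir rather than changing the working directory.

```r
# Placeholder "remote" files so the sketch runs without a network (assumption).
src_dir <- file.path(tempdir(), "remote")
dir.create(src_dir, showWarnings = FALSE)
src_files <- file.path(src_dir, c("lec1.pdf", "lec2.pdf"))
for (f in src_files) writeLines("placeholder", f)
src_files <- normalizePath(src_files, winslash = "/")
ln_links_url_pdf <- paste0("file:///", sub("^/", "", src_files))

# Folder the downloads are written to.
dest_dir <- file.path(tempdir(), "lecture_notes")
dir.create(dest_dir, showWarnings = FALSE)

for (u in ln_links_url_pdf[1:2]) {
  # Name each local file after the last component of its URL.
  destfile <- file.path(dest_dir, basename(u))
  # mode = "wb" keeps binary files such as PDFs intact on Windows.
  download.file(u, destfile = destfile, mode = "wb", quiet = TRUE)
}
list.files(dest_dir)
```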

Exercise 7
Even though you will now be busy studying algorithms, you still don’t want to miss out on new exercise sets on R-exercises.com. So, why not write a script that checks the date of the last post? Using rvest, extract the .entry-time HTML nodes.
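A rough sketch of the idea, using a hypothetical inline snippet in place of the live site: the leading dot in ".entry-time" is a CSS class selector, so it matches any element carrying that class. The markup and dates below are invented, and the sketch assumes posts are listed newest first.

```r
library(rvest)

# Hypothetical blog markup (assumption, not the site's actual HTML).
blog_html <- '<html><body>
  <article><time class="entry-time">March 5, 2017</time></article>
  <article><time class="entry-time">February 27, 2017</time></article>
</body></html>'

blog_page <- read_html(blog_html)

# Select every element with class "entry-time" and read its text.
entry_times <- html_text(html_nodes(blog_page, ".entry-time"))

# Assuming newest-first ordering, the first entry is the latest post.
last_post <- entry_times[1]
last_post
```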
