Becoming a Data Scientist

Part 3 Caught in a Web Scraping Maze: httr and rvest in R

At about this point, I started to think: if all these people are creating their own web scrapers, why can’t I? How hard can it be to pull some links off a page anyway…

So I went back to Google and inspected the elements on the page to see if I could identify the URLs of the search results. Using the Inspect Element tool in Chrome, I found the tags tied to the URLs of the search results.

Inspect element

They are deeply embedded in a div within a div within a div within… you get the point.

Alright, it’s getting late, so I’m going to try to cut to the chase.

Since I knew that scraping Google search results was different from scraping HTML content (with rvest), I started by Googling “scrape google results R”, and a result about httr came up. I installed the httr package, then ran the example script. Cue drumroll!

httr script

…aaaand error. xpathSApply does not exist! Some searching revealed that it’s a function in the XML package, and since I don’t work much with XML data, this was a good chance to get my feet wet with it.

So I installed the XML package and tried again. Any luck this time?

httr script (with XML)

Mmmm… sort of? At least it pulled a URL, but not really the right one. I tried running their exact script, but it still didn’t yield usable URLs. (I just realized that in my script, I didn’t put “+” between the words of my search. But when I tried it just now, I got the same results as below.)

I could tell there were some differences between the output and what I saw through Inspect Element, but at first glance, this output looked fairly reasonable, so I moved forward.

As a first test, I looked to see what I would get if I pulled the text out of the <a> tags.

html_nodes("a") %>% html_text()

Hm, interesting. So it seems the only links I can reach are the standard ones that appear on every Google search page.
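For comparison, here is a minimal sketch of the difference between pulling the visible link text and pulling the link targets. The HTML snippet and the example.com URLs are made up for illustration and are not real Google output; html_attr("href") is the rvest call that returns the URL instead of the text.

```r
library(rvest)

# Hypothetical stand-in for a fetched page (not real Google output)
page <- read_html('<html><body>
  <a href="https://example.com/one">First link</a>
  <a href="https://example.com/two">Second link</a>
</body></html>')

links <- page %>% html_nodes("a")

# Visible text of each <a> tag
link_text <- links %>% html_text()

# Target URL of each <a> tag
link_urls <- links %>% html_attr("href")

print(link_text)
print(link_urls)
```

If the goal is the URLs of the results, html_attr("href") is usually the more useful of the two calls.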

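Then I tried a bunch of different calls to see what kinds of selectors html_nodes takes. (Can it take a class name? It is supposed to, since html_nodes accepts CSS selectors such as ".classname", but every class I tried came back empty.)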

character(0)
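A likely explanation, sketched under assumptions: html_nodes takes CSS selectors, so a class name needs a leading dot and an id needs a leading #. In the made-up snippet below, the id results and the class names hit and ad are hypothetical. The last call shows that a selector matching nothing in the downloaded HTML gives character(0), which is one plausible reason for the empty result above.

```r
library(rvest)

# Hypothetical markup; the id and class names are invented for this sketch
page <- read_html('<div id="results">
  <div class="hit"><a href="https://example.com/a">A</a></div>
  <div class="hit"><a href="https://example.com/b">B</a></div>
  <div class="ad"><a href="https://example.com/ad">Ad</a></div>
</div>')

# Class selector: links inside elements with class "hit"
hit_urls <- page %>% html_nodes(".hit a") %>% html_attr("href")

# Id selector: every div nested under the element with id "results"
result_divs <- page %>% html_nodes("#results div")

# A class that is absent from the parsed HTML matches nothing
no_match <- page %>% html_nodes(".not-on-this-page") %>% html_text()

print(hit_urls)
print(length(result_divs))
print(no_match)
```

So an empty result does not necessarily mean the selector syntax is wrong; it can also mean the class seen in the browser is not in the HTML that R received.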

Nada. Alright, let’s try a different approach. I tested one of the examples described in the rvest documentation, pulling data from the A-Team page on Box Office Mojo.

A-Team: html_nodes("center")

A-Team: html_nodes("center") %>% html_nodes("td")

Woo hoo! I love it when a plan comes together… at least in this case. At least it looks like the script works, and my woes are coming from the Google search results in particular.

I tried calling the divs from the Google search results page, but the results were odd. Some of the first-level divs were present in the output, but some divs in the output I couldn’t find through Inspect Element.

teamwork %>% html_nodes("div")

And then when I looked for the id of the first-level div that (eventually) contains the URLs, it wasn’t in the output. (Below is the first-level div containing the URLs.)

Inspect element: First level divs

I tried using XML-specific calls, but encountered similar results.

xpathSApply(teamwork, "//div")

Even when I drilled down into those divs, it went down… the first one? I’m not sure where it went.

xpathSApply(teamwork, "//div//div//div")
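One possible reading of that output: in XPath, // means "any descendant", so //div//div//div matches every div that has at least two div ancestors, wherever it sits in the page, rather than walking down one particular branch. Here is a small sketch with the XML package on a made-up three-level snippet (the ids outer, middle, and inner are hypothetical):

```r
library(XML)

# Hypothetical nested markup, parsed from a string instead of a live page
doc <- htmlParse(
  '<div id="outer"><div id="middle"><div id="inner">
     <a href="https://example.com/x">X</a>
   </div></div></div>',
  asText = TRUE
)

# Every div in the document, in document order
all_div_ids <- xpathSApply(doc, "//div", xmlGetAttr, "id")

# Only divs that sit inside at least two other divs
deep_div_ids <- xpathSApply(doc, "//div//div//div", xmlGetAttr, "id")

# Targeting a div by id is more predictable than stacking //div
inner_urls <- xpathSApply(doc, "//div[@id='inner']//a", xmlGetAttr, "href")

print(all_div_ids)
print(deep_div_ids)
print(inner_urls)
```

On a real page with many sibling branches, //div//div//div can return nodes from all over the document, which may be why the output looked like it went down an unexpected path.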

So I’m really not sure why I can’t drill down to the data I want, but it feels like something is blocking my way, as if the Google search results page is doing something special to hide its key info. But I have no idea.

I think what I’ll try next (another day) is to download the page source and scrape the saved file. I should at least be able to do that. It means I’ll still have to download the page source of about 10 pages of search results by hand for my project, but maybe I could also write a Python script to pull the page source for me? Most of the time, the purpose of these scraping programs is to track daily, minute-by-minute (or second-by-second!) changes in pages on the web. My goal, though, is to take a snapshot of what discussions of teamwork look like in America and Korea, so capturing the data is a one-time thing, and that kind of solution could suit my purposes. But for now that’s future Alyssa’s problem!
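The save-the-page-source idea could look roughly like this in R. This is only a sketch: the file here is a temporary stand-in written by the script itself, and the "h3 a" selector and example.com URLs are assumptions for illustration, not the real structure of a saved Google results page.

```r
library(rvest)

# Stand-in for a page source saved from the browser (hypothetical content)
saved_page <- tempfile(fileext = ".html")
writeLines(
  '<html><body>
     <h3><a href="https://example.com/result-1">Result 1</a></h3>
     <h3><a href="https://example.com/result-2">Result 2</a></h3>
   </body></html>',
  saved_page
)

# read_html accepts a local file path as well as a URL
result_urls <- read_html(saved_page) %>%
  html_nodes("h3 a") %>%
  html_attr("href")

print(result_urls)
```

With a real saved page, saved_page would be the path to the downloaded file, and the selector would need to match whatever Inspect Element shows for the result links.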

3 thoughts on “Part 3 Caught in a Web Scraping Maze: httr and rvest in R”

Hi Alyssa! I’m in the same trouble. I need to scrape Google for research purposes. I haven’t found a definitive solution, but I can give you my two cents:
– The output you get when scraping the Google search page with R is different from what you see in the browser because Google loads its content in two phases: first it synchronously loads the basic page, which itself contains a script that asynchronously loads the encrypted search results via AJAX. If you look closely, the search query and some other parameters in the URL come after the #, so they’re not passed to the server but handled via JavaScript. This is because you are using the Google Instant version of the engine, which has /webhp? as its endpoint. Instead, you should use the one with /search? as its endpoint and put all search parameters in the query string (after the ?), not after the #. This way it will load the page synchronously in one piece and you can scrape it.
– But as you said, this method is against the Google TOS (which is bad, since I’m doing research that will hopefully be published), so I tried the Google Custom Search API. There is actually a way to tell Google to search the whole web using the API, not just your website. In the control panel where you create your Search Engine, you can specify in the field “website where you want to search” -> “Search the entire web but give priority to the specified sites” (loosely translated from Italian).
Hope it helps!