The package: postlightmercury

Basically, you sign up for free, get an API key, and with that you can send it URLs that it then parses for you. This is pretty clever if you are scraping a lot of different websites and don’t want to write a web parser for each and every one of them.

Here is how the package works

Installation

Since the package is on CRAN, it’s very straightforward to install:
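The usual CRAN routine is all it takes (nothing package-specific is assumed here):

# Install from CRAN and load the package
install.packages("postlightmercury")
library(postlightmercury)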

Parse more than one URL:

You can also parse more than one URL. Instead of one, let’s try giving it three URLs: two about Gangnam Style and one about sauerkraut. With all that dancing, proper nutrition is, after all, important.
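The call looks roughly like this (the three URLs below are placeholders standing in for the actual articles, and my_api_key is assumed to hold the key you got when signing up):

# Placeholder URLs standing in for the three articles
urls <- c("https://www.example.com/gangnam-style-dethroned",
          "https://www.example.com/gangnam-style-dance",
          "https://www.example.com/sauerkraut-nutrition")

# Parse all three pages in one call; each article becomes a row in the result
df <- web_parser(page_urls = urls, api_key = my_api_key)

Below is the beginning of the parsed text for the first of the three articles: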

## [1] "By Mark Savage BBC Music reporter Image copyright Schoolboy/Universal Republic Records Image caption Gangnam Style had been YouTube's most-watched video for five years Psy's Gangnam Style is no longer the most-watched video on YouTube.The South Korean megahit had been the site's most-played clip for the last five years.The surreal video became so popular that it \"broke\" YouTube's play counter, exceeding the maximum possible number of views (2,147,483,647), and forcing the company to rewr..."

And that is basically what the package does! 🙂

Under the hood: asynchronous API calls

Originally I wrote the package using the httr package, which I normally use for my everyday API-calling business.

But after reading about the crul package on R-bloggers and how it can handle asynchronous API calls, I rewrote the web_parser() function so it uses the crul package.

This means that instead of calling each URL sequentially, it calls them in parallel. This makes a real difference if you want to call a lot of URLs and can speed up your analysis significantly.

The web_parser function looks like this under the hood (look for where the magic happens):

web_parser <- function(page_urls, api_key){

  if(missing(page_urls)) stop("One or more urls must be provided")
  if(missing(api_key)) stop("API key must be provided. Get one here: https://mercury.postlight.com/web-parser/")

  ### THIS IS WHERE THE MAGIC HAPPENS
  async <- lapply(page_urls, function(page_url){
    crul::HttpRequest$new(url = "https://mercury.postlight.com/parser",
                          headers = list(`x-api-key` = api_key))$get(query = list(url = page_url))
  })

  res <- crul::AsyncVaried$new(.list = async)
  ### END OF MAGIC

  output <- res$request()
  api_content <- lapply(output, function(x) x$parse("UTF-8"))
  api_content <- lapply(api_content, jsonlite::fromJSON)
  api_content <- null_to_na(api_content)
  df <- purrr::map_df(api_content, tibble::as_tibble)

  return(df)
}

As you can see from the code above, I create a list, async, that holds the three different URL calls. I then add these to the res object. When I ask res for the results, it fetches the data in parallel if there is more than one URL. That is pretty smart!

You can use this basic template for your own API calls if you have a function that routinely calls several URLs sequentially.
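A stripped-down sketch of that pattern might look like the function below. The endpoint URL, the header name and the get_urls name are placeholders for your own API; the crul calls simply mirror the ones in web_parser() above.

# Sketch of a generic asynchronous GET wrapper built on crul
# (endpoint and header are placeholders - swap in your own API's details)
get_urls <- function(page_urls, api_key){

  # Build one unsent request per URL
  reqs <- lapply(page_urls, function(page_url){
    crul::HttpRequest$new(url = "https://api.example.com/endpoint",
                          headers = list(`x-api-key` = api_key))$get(query = list(url = page_url))
  })

  # Bundle the requests and send them all off in parallel
  res <- crul::AsyncVaried$new(.list = reqs)
  output <- res$request()

  # Return the raw response bodies as text - parse them however your API requires
  lapply(output, function(x) x$parse("UTF-8"))
}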

Note: In this case the “surrounding conditions” are all the same. But you can also do asynchronous requests that call different endpoints. Check out the crul package documentation for more on that.
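As a quick sketch with made-up URLs, the same AsyncVaried mechanism also works when the requests point at completely different endpoints:

# Two unsent requests against different (made-up) endpoints
req1 <- crul::HttpRequest$new(url = "https://api.example.com/articles")$get()
req2 <- crul::HttpRequest$new(url = "https://api.example.org/videos")$get(query = list(id = "123"))

# Bundle and send them in parallel, just like before
res <- crul::AsyncVaried$new(.list = list(req1, req2))
res$request()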