A Software and Technology Blog by John Robinson

Use Node.js to Extract Data from the Web for Fun and Profit

Need to automate pulling some data from a web page? Or want to mash up some unstructured data from a blog post with another data source? No API for getting at the data… !@#$@#$… No problem… web scraping to the rescue. What is web scraping?… you may be asking… Web scraping consists of programmatically retrieving the contents of web pages (typically with no browser involved) and extracting data from them.

In this article, I’m going to show you a pretty powerful set of tools for scraping web content quickly and easily using Javascript and Node.js.

I recently needed to crawl a (modestly) large number of pages and sift through them looking for patterns. It had been a really, really long time since I’d done anything similar, and I was… to put it mildly… out of touch with the tools available.

I have a guilty pleasure. I admit it… I really like Node.js. Node.js is a framework for writing Javascript applications outside of a web browser. In the spirit of Atwood’s Law, it has a number of powerful facilities for writing networked applications. Not only can you use Node.js to build server-side webserver/websocket code, but I’ve found that I like to use it for my random scripting needs. So I set off looking at what web scraping libraries were available for Node.js and found Cheerio. Cheerio is a library for Node.js that allows you to construct a DOM out of a given piece of HTML (with no browser required) and run jquery-like CSS queries against it. Brilliant!

Given the advent of CSS-driven styling… what is just about the only organizing factor within web pages out in the wild? CSS, of course. These days, folks more often than not use CSS classes to style the various structural aspects of their web pages. Don’t get me wrong, this isn’t exactly a magic bullet… we’re still dealing with web pages, which are typically a big ball of unstructured mess. But for me, CSS selectors have proven to be a pretty powerful tool for quickly and easily identifying abstract features within HTML content. My typical workflow for web scraping is to first examine the structure of the pages in question using the Chrome developer tools or Firebug, looking for CSS selectors that can be used to target the information I’d like to extract. Next, let’s look at some Node-based web scraping in action.

If you don’t already have Node.js installed, or haven’t updated it in a while, download and install it from here. The Node installer will not only install Node itself, but will also install a program called npm. The node package manager, npm, can be used to quickly and easily download and install published Node libraries. We’ll use npm to install the Cheerio (web scraping) library, which you can do with the following command in your console.

npm install cheerio

Once Cheerio has been installed, we can get down to business. First let’s look at some Javascript code written for Node.js that lets us download the contents of an arbitrary webpage.

This snippet defines a little utility function that can be used to asynchronously download an arbitrary URL (using HTTP GET) and invoke a callback function when it’s done, passing in the web page contents. The next snippet shows how we can use this to download the contents of a web page and write it out to the console. Note: Please refer to the download.js sample in the source code. To run this sample from the included source code, simply run the following command from your console.

This will download the contents of the specified URL and print it to the console. Now that we have a way to download some data from the web, let’s look at how we can use Cheerio to extract interesting information from it.

Before you can get down to web scraping, you have to do a bit of research and experimentation to understand the structural layout of your target web pages, so that you can home in on the information that you’re interested in. As a concrete example, we’ll be trying to extract the URLs of the primary images on the same web page from the previous snippet. Load up the web page (url) referenced above in your browser and try to find a way to uniquely identify the primary images in the post, with the goal of extracting the image URLs. You can use the Chrome developer tools (easy) or just view the source (harder)…. Got it? Let’s look at how I did it with a little more code. Note: Please refer to the squirrel.js sample in the source code.

After importing the cheerio module, we can use the download function described earlier to load the contents of our target web page. Once we have the data, the load method on the cheerio object is used to parse the HTML contents; it returns an object that contains a DOM representation of the content and has methods for doing jquery-like CSS queries against it. (Note: I assigned this to a variable named “$” so it looks more like jquery.) In the target web page, I noted that the primary images were each wrapped in a div with the class “artSplitter”, and that the images themselves have the class “blkBorder”. In order to uniquely select these elements out of the DOM, I used this CSS selector query.

$("div.artSplitter > img.blkBorder")

This expression returns a list of all image tag objects that match this selector. We then use the jquery-like each method to enumerate these selected image tags and write the src attribute of each out to the console. Pretty cool… Let’s look at another example. Note: Please refer to the echo.js sample in the source code.

In this example, the target website is echojs.com. I want to extract the links, along with the names of the posters, and output them to the console in markdown syntax. We zero in by first searching for the article tag using the following query.

$("article")

Then, for each matching article node, a subquery using the find method is used to select an anchor tag (a) that is wrapped by an h2 tag, as follows:

var link = $(e).find("h2>a");

Similarly, we can do another subquery to isolate the poster’s name with the following query.

var poster = $(e).find("username").text();

I hope you’ve enjoyed this brief tour of using Node.js and Cheerio for web scraping. Please refer to the Cheerio documentation for more details. While probably not well suited for large-scale scraping, Cheerio is a pretty powerful tool for those familiar with frontend Javascript and jQuery development.


18 Comments.

No. Cheerio does not interpret the Javascript on the pages loaded. It only processes the static HTML, which was fine for my purposes. However there are projects that allow you to run a scriptable “headless” browser that can be automated to do crawling activities and can do such things as evaluating Javascript and taking screenshots if you need that level of emulation. Of course this comes with more overhead, so there are pros and cons to this approach.

There’s a cool python scraping tool – scrapy. It takes a short while to get the hang of it, but after the learning curve I find it very useful. What about Node.js? In terms of flexibility & extensibility in handling gigantic scraping jobs… does it compare to scrapy?

Great article, but how can you put the fetched content straight into the website stream from the node server, instead of the console? It seems that, being async, node won’t output the result unless it’s in the console.

Yep. The combination of jsdom and jquery came up during my research as a possible alternative. Cheerio seemed to get the stronger vote of confidence on performance, so I went that way. It would be great if someone having experience with both could comment on their experiences.

node.io does scraping and job management nicely. I’ve a sample project here https://github.com/vsbabu/nodejs-timeseries that runs a scraper with some config, gets data and puts into an SQLite DB. My code is very little, which is because all the hard work is done by node.io.

I used to use Python + Beautifulsoup + urllib2 before – works very well. The one above was a side project to learn node.js