Web Crawling with Node.js #2: Building the Page Object

Welcome to part 2 of the series on crawling the web with Node.js. In this article we’re going to look at what valuable content we can grab from a page. The most important pieces when writing a crawler are obviously the links, because without them our crawler wouldn’t know where to go next.

The data I’m going to extract from a page is not necessarily what you’ll want; it really depends on what you want from the project. Maybe you only want the content of specific tags, or just the status codes. I’ll put up some examples so you can see what’s possible and decide what makes sense for your purpose.

Listing all links on a page with Node.js

Extracting all the links from a page is valuable for two things:

knowing where a page/document links to

finding out which page to crawl next

When crawling an entire domain of pages, you probably want to see which other documents your current document links to. The essential lines for that are listed below; if you have request and cheerio installed in the same folder, you can give it a shot:

Note: some links might not have a title, since images are not converted to text by cheerio 😉 We could surely figure out something smarter, e.g. using the image’s name, title or alt attribute for that, maybe even in a separate field in our data.

Now that you have the links, you can either save them as separate objects/rows in a table or together with the source object. I would recommend creating an object that resembles the following, since it lets you keep track of (at least the first) source of the link. For now, we’ll track whether we have already crawled a page with a boolean; if you expect pages to change at some point, you might want to switch to a time-based format.
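A possible shape for such an object (the field names and values here are my own suggestions, not prescribed):

```javascript
// Hypothetical link object — adjust the fields to your own needs.
var linkObject = {
  id: 'a3f9c2d4e1b84f70',          // unique ID (from your DB, or a module like hat)
  url: 'http://example.com/about', // the link target
  title: 'About us',               // link text; may be empty for image links
  source: 'http://example.com/',   // (first) page we found this link on
  crawled: false                   // boolean for now; a timestamp if pages change
};
```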

To generate an ID, you can either rely on a database that creates unique indexes for you, like MySQL/MariaDB with auto-incremented IDs or MongoDB’s/RethinkDB’s _id field, or you can pick a module like hat.

To interactively test the regex, I used regex101.com, which is an awesome site, that while you type and try out your regex gives you an explanation of what strings it would match.

In case a link is relative, we need to resolve it against the URL we are currently crawling (technically without the query parameters, if there are any, but we’ll leave those alone for now).

To test whether a link belongs to the same domain as the current one, we can use a regular expression. For writing regular expressions in general, I can recommend regex101.com, which interactively shows you whether your RegEx matches a string, or which portion of it. It also explains what the different characters in your RegEx mean, right in the sidebar.

For example, if you want to crawl both http and https links, you might write https?, which results in the explanation:

^ assert position at start of the string
http matches the characters http literally (case sensitive)
s? matches the character s literally (case sensitive)
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
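Put into code, a same-domain check could look like this (a sketch with the domain hard-coded for illustration):

```javascript
// Does an absolute link stay on the current domain? Allow both http and https.
var base = 'example\\.com'; // dot escaped so it matches literally
var sameDomain = new RegExp('^https?:\\/\\/(www\\.)?' + base + '(\\/|$)');

console.log(sameDomain.test('https://example.com/about')); // true
console.log(sameDomain.test('http://www.example.com'));    // true
console.log(sameDomain.test('http://other.org/'));         // false
```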

We get the base variable from the URL that is currently being crawled; it is extracted like this:
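A sketch of that extraction, assuming a simple split-based approach (two splits, as described below — the variable names are my own):

```javascript
// Extract the base domain from the URL being crawled, then escape its dots
// so the result can be dropped into a RegExp.
var currentUrl = 'http://example.com/blog/post-1';

var base = currentUrl.split('/')[2];       // "example.com"
var escaped = base.split('.').join('\\.'); // "example\.com" as a string
console.log(base, escaped);
```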

The second split escapes the . in the base domain.tld; otherwise it would act as a RegExp special character instead of the literal . (\. in the RegExp). On second look, there is also an npm module for this, escape-string-regexp, but I just needed this one character for now.

The contents of the page

To save the actual page, I would probably build my pageObject like the following:
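One possible shape (all field names here are my own suggestions):

```javascript
// Hypothetical pageObject — keep whatever your project actually needs.
var pageObject = {
  url: 'http://example.com/blog/post-1', // address of the crawled page
  statusCode: 200,                       // HTTP status from the response
  title: 'My first post',                // contents of the <title> tag
  content: '<html>...</html>',           // raw body (or only the text you care about)
  links: [],                             // link objects extracted from this page
  crawledAt: new Date()                  // when we fetched it
};
```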