Here's my first attempt at a web crawler. I was able to do it in about 70 lines of code. The full code is included at the bottom with plenty of comments breaking it down and explaining each step.

Maybe in future posts I'll have it connect to a database and store keywords/URLs, and then a basic web app that lets you type in those keywords that query the database and return the URL (this is closer to how real search engines work). But let's start with the web crawler first.

How it works

The web crawler (or spider) is pretty straight forward. You give it a starting URL and a word to search for. The web crawler will attempt to find that word on the web page it starts at, but if it doesn't find it on that page it starts visiting other pages. Pretty basic, right? Like the Python and Java implementation, there are a few edge cases we need to handle such as not visiting the same page, or dealing with HTTP errors, but those aren't hard to implement.

Let's get started!

Pre-requisites

You need to have Node.js and npm (Node Package Manager) installed on your computer since we'll be using it to execute the JavaScript. You can verify you have both installed by running node --version and npm --version in a shell or command line. On a Windows machine it looks like this:

It's okay if your versions are a little newer. I have slightly older versions, but new ones should work just as well. If you don't have these packages installed, you can get both by heading over to the node.js downloads page which should install both Node.js and npm at the same time.

Creating the web crawler in JavaScript

Let's remind ourselves what we're trying to do:

Get a web page

Try to find a word on a given web page

If the word isn't found, collect all the links on that page so we can start visiting them

Let's start coding each of those steps up. I'll be using Atom as my text editor to write the code, but there are other great text editors out there such as Sublime Text or Notepad++. I've used all three at various points in my life and you can't go wrong with any one of them.

Go ahead and create an empty file we'll call crawler.js and add these three lines:

We use the library request to visit the page and then execute a callback after we get the response. The callback is the anonymous function that looks like function(error, response, body) {...}

In that function body (the {...} above) we examine the response status code and print it to the console. A 200 OK indicates that everything went ok.

We use cheerio to parse the page body and assign it to the variable $. You may recognize this convention if you're used to jQuery. cheerio let's us use much of the functionality of jQuery to parse the page. Therefore, we can write $('title').text() to select the HTML element such as <title>Page title</title> and display the text within it.

Run the code by typing node crawler.js

Now let's write some helper functions to do the rest of the work.

Parsing the page and searching for a word

Checking to see if a word is in the body of a web page isn't too hard. Here's what that function looks like:

We use the JavaScript function indexOf to check for occurences of a substring in a given string. Note that indexOf is case sensitive, so we have to convert both the search word and the web page to either uppercase or lowercase.

Collecting links on a web page in JavaScript

There are two types of links we'll come across on webpages. Hyperlinks can come as relative paths or absolute paths.

You'll notice that relative paths won't ever lead us away from the domain that we start on. Absolute paths could take us anywhere on the internet. That distinction is important when you're building the web crawler. Do you want your crawler to stay on the existing website (in this case arstechnica.com) and search only pages within that domain, or is it acceptable to adventure outside to other websites such as condenast.com? The code below will gather all of the relative hyperlinks as well as all the absolute hyperlinks for a given page:

If we modify the existing crawler.js to execute this function after fetching the page and then run it, we get something like:

Putting it all together

We'll need a place to put all the links that we find on every page. In this simple web crawler we just use an array that we call pagesToVisit.

The last thing we need to do is keep track of which pages we've visited (so we don't visit them more than once). In the Java version of the web crawler we used a Set<String> (specifically a HashSet<String>) that we called pagesVisited and added the URL to that set when we visited it. Before visiting a page, we make sure that the URL is not already in that set. If it is, we skip it.

Remember when we had to decide to follow absolute links or relative links (or both)? In the complete crawler below I decided to only follow relative links. Because of that we have to do some extra work to recreate the links. As an example, given /technology we need to turn it into http://www.arstechnica.com/technology. We use the URL library to achieve this.