Crawling an entire Domain / Website

For testing purposes I have created a simple set of HTML pages that should resemble a generic website. It contains a number of pages, and we want our crawler to go through them and find every page, wherever it is linked. That means when our crawler hits a page, it should keep track of the links it finds and then only proceed to pages it has not crawled yet.

In this example we’re just going to keep track of those URLs in plain arrays. If you have a lot of data, or if you want to shard the process across multiple workers, you probably want a message queue backed by Redis, ActiveMQ, or the like instead.
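The bookkeeping described above can be sketched in a few lines. This is a minimal, self-contained illustration of the array-based approach, not the full crawler: the `links` map stands in for the links cheerio would extract from the fetched HTML, and all page names are made up for the example.

```javascript
// One array for pages we have already crawled, one queue of pages
// still to visit. Both are plain arrays, as discussed above.
const crawled = [];
const queue = ['/index.html'];

// Stand-in for real link extraction: which pages each page links to.
// (In the actual crawler this comes from parsing the HTML with cheerio.)
const links = {
  '/index.html': ['/about.html', '/contact.html'],
  '/about.html': ['/index.html', '/team.html'],
  '/contact.html': ['/index.html'],
  '/team.html': ['/about.html'],
};

while (queue.length > 0) {
  const url = queue.shift();
  if (crawled.includes(url)) continue; // already visited, skip it
  crawled.push(url);
  for (const link of links[url] || []) {
    // Only enqueue pages we have neither crawled nor queued yet.
    if (!crawled.includes(link) && !queue.includes(link)) {
      queue.push(link);
    }
  }
}

console.log(crawled); // every page reachable from /index.html, each exactly once
```

Even though pages link back and forth to each other, each URL ends up in `crawled` only once, which is exactly the property we need before pointing the crawler at real HTTP responses.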

To get the example below running on your machine, first install the following packages: `npm install lodash async cheerio request`.