Learning to Crawl - Building a Bare Bones Web Crawler with Elixir

I’ve been cooking up a side project recently that involves crawling through a domain, searching for links to specific websites.

While I’m keeping the details of the project shrouded in mystery for now, building out a web crawler using Elixir sounds like a fantastic learning experience.

Let’s roll up our sleeves and dig into it!

Let’s Think About What We Want

Before we start writing code, it’s important to think about what we want to build.

We’ll kick off the crawling process by handing our web crawler a starting URL. The crawler will fetch the page at that URL and look for links (<a> tags with href attributes) to other pages.

If the crawler finds a link to a page on the same domain, it’ll repeat the crawling process on that new page. This second crawl is considered one level “deeper” than the first. Our crawler should have a configurable maximum depth.

Once the crawler has traversed all pages on the given domain up to the specified depth, it will return a list of all internal and external URLs that the crawled pages link to.

To keep things efficient, let’s try to parallelize the crawling process as much as possible.

Hello, Crawler

Let’s start off our crawler project by creating a new Elixir project:

mix new hello_crawler

While we’re at it, let’s pull in two dependencies we’ll be using in this project: HTTPoison, a fantastic HTTP client for Elixir, and Floki, a simple HTML parser.

We’ll add them to our mix.exs file:

defp deps do
[
{:httpoison, "~> 0.13"},
{:floki, "~> 0.18.0"}
]
end

And pull them into our project by running a Mix command from our shell:

mix deps.get

Now that we’ve set up the laid out the basic structure of our project, let’s start writing code!

Crawling a Page

Based on the description given above, we know that we’ll need some function (or set of functions) that takes in a starting URL, fetches the page, and looks for links.

Let’s start sketching out that function:

def get_links(url) do
[url]
end

So far our function just returns a list holding URL that was passed into it. That’s a start.

We’ll use HTTPoison to fetch the page at that URL (being sure to specify that we want to follow 301 redirects), and pull the resulting HTML out of the body field of the response:

A page that links to itself, or a set of pages that form a loop, will cause our crawler to enter an infinite loop.

We’re not restricting crawling to our original host.

We’re not counting depth, or enforcing any depth limit.

Let’s address each of these issues.

Handling Relative URLs

Many links within a page are relative; they don’t specify all of the necessary information to form a full URL.

For example, on http://www.east5th.co/, there’s a link to /blog/. Passing "/blog/" into HTTPoison.get will return an error, as HTTPoison doesn’t know the context for this relative address.

We need some way of transforming these relative links into full-blown URLs given the context of the page we’re currently crawling. Thankfully, Elixir’s standard library ships with the fantastic URI module that can help us do just that!

Let’s use URI.merge and to_string to transform all of the links scraped by our crawler into well-formed URLs that can be understood by HTTPoison.get:

Now our link to /blog/ is transformed into http://www.east5th.co/blog/ before being passed into get_links and subsequently HTTPoison.get.

Perfect.

I find myself producing lots of ranging content from books and articles, like the one you're reading now, to open source software projects. If you like what I'm doing, nothing shows your support more than signing up for my mailing list.

Preventing Loops

While our crawler is correctly resolving relative links, this leads directly to our next problem: our crawler can get trapped in loops.

The first link on http://www.east5th.co/ is to /. This relative link is translated into http://www.east5th.co/, which is once again passed into get_links, causing an infinite loop.

To prevent looping in our crawler, we’ll need to maintain a list of all of the pages we’ve already crawled. Before we recursively crawl a new link, we need to verify that it hasn’t been crawled previously.

We’ll start by adding a new, private, function head for get_links that accepts a path argument. The path argument holds all of the URLs we’ve visited on our way to the current URL.

The context argument is a map that lets us cleanly pass around configuration values without having to drastically increase the size of our function signatures. In this case, we’re passing in the headers and options used by HTTPoison.get.

To populate context, let’s change our public get_links function to build up the map by merging user-provided options with defaults provided by the module:

It’s important to note that this doesn’t completely prevent the repeated fetching of the same page. It simply prevents the same page being refetched in a single recursive chain, or path.

Imagine if page A links to pages B and C. If page B or C link back to page A, page A will not be refetched. However, if both page B and page C link to page D, page D will be fetched twice.

We won’t worry about this problem in this article, but caching our calls to HTTPoison.get might be a good solution.

Restricting Crawling to a Host

If we try and run our new and improved WebCrawler.get_links function, we’ll notice that it takes a long time to run. In fact in most cases, it’ll never return!

The problem is that we’re not limiting our crawler to crawl only pages within our starting point’s domain. If we crawl http://www.east5th.co/, we’ll eventually get linked to Google and Amazon, and from there, the crawling never ends.

We need to detect the host of the starting page we’re given, and restrict crawling to only that host.

Thankful, the URI module once again comes to the rescue.

We can use URI.parse to pull out the host of our starting URL, and pass it into each call to get_links and handle_response via our context map:

Because we passed in the result of URI.parse into get_links, our url is now a URI struct, rather than a string. We need to convert it back into a string before passing it into HTTPoison.get and handle_response:

Limiting Our Crawl Depth

Once again, if we let our crawler loose on the world, it’ll most likely take a long time to get back to us.

Because we’re not limiting how deep our crawler can traverse into our site, it’ll explore all possible paths that don’t involve retracing its steps. We need a way of telling it not to continue crawling once it’s already followed a certain number of links and reached a specified depth.

Due to the way we structured out solution, this is incredibly easy to implement!

All we need to do is add another guard in our get_links function that makes sure that the length of path hasn’t exceeded our desired depth:

if continue_crawl?(path, context) and crawlable_url?(url, context) do

The continue_crawl? function verifies that the length of path hasn’t grown past the max_depth value in our context map:

As with our other opts, the default max_depth can be overridden by passing a custom max_depth value into get_links:

HelloCrawler.get_links("http://www.east5th.co/", max_depth: 5)

If our crawler tries to crawl deeper than the maximum allowed depth, continue_crawl? returns false and get_links returns the url being crawled, preventing further recursion.

Simple and effective.

Crawling in Parallel

While our crawler is fully functional at this point, it would be nice to improve upon it a bit. Instead of waiting for every path to be exhausted before crawling the next path, why can’t we crawl multiple paths in parallel?

Amazingly, parallelizing our web crawler is as simple as swapping our map over the links we find on a page with a parallelized map:

Instead of passing each scraped link into get_links and waiting for it to fully crawl every sub-path before moving onto the next link, we pass all of our links into asynchronous calls to get_links, and then wait for all of those asynchronous calls to return.

For every batch of crawling we do, we really only need to wait for the slowest page to be crawled, rather than waiting one by one for every page to be crawled.

Efficiency!

Final Thoughts

It’s been a process, but we’ve managed to get our simple web crawler up and running.

While we accomplished everything we set out to do, our final result is still very much a bare bones web crawler. A more mature crawler would come equipped with more sophisticated scraping logic, rate limiting, caching, and more efficient optimizations.

Crawling for Cash with Affiliate Crawler
– I've created a new tool called Affiliate Crawler that's designed to crawl through your written web content, looking for affiliate and referral marketing opportunities.