Real code: Testing a web crawler with RSpec

Published on October 17, 2017 by Peter Hawkins · ruby, testing

This is another post coming out of working on one of our products, Void. You often see articles teaching testing or TDD on a toy example, so I thought it was time to start writing about real-world testing, with real code.

Web crawler feature specification

In Void, when a user bookmarks a page or adds it to their reading list, we need to crawl the URL and fetch information about the web page, such as its title and a short description.

In the future we plan to fetch images and media using the page's Open Graph tags and some machine learning algorithms, but that's for another day; we haven't implemented any of that yet.

Architecture thoughts

When building this feature I wanted it to be a PORO (plain old Ruby object); I didn't want this critical piece of business logic to be coupled to the Rails framework. I named the object Crawler and placed it under the Void namespace to ensure it doesn't collide with other libraries I may use in the future.

For brevity I’ve skipped some of the finer-grained TDD dance. As you can see, the Crawler takes a web_page as its only argument and has a method called #start; let’s go ahead and see how this is tested.
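The shape of that interface can be sketched as a minimal skeleton (the real #start body is shown later in the post):

```ruby
# Minimal skeleton of the Crawler's public interface: a web_page in,
# a single #start method to kick off the crawl.
module Void
  class Crawler
    attr_reader :web_page

    def initialize(web_page)
      @web_page = web_page
    end

    def start
      # fetch web_page.url and populate its attributes
    end
  end
end
```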

Fast specs

The first thing you’ll notice is that I included spec_helper instead of rails_helper. This allows the tests to run without loading the Rails environment and speeds things up considerably (if this interests you, I urge you to check out Destroy All Software).

VCR

I use VCR to record and replay network requests in my tests. VCR allows you to ‘cache’ a request in your test and store the result on disk (I also commit the cassettes to git). This removes flaky, network-dependent tests, but you can also delete your VCR cassettes (the saved requests) and re-run your tests against the real network. This allows you to verify every so often that your tests still work in real-world scenarios.

With VCR set up you can make use of it in your tests; don’t forget to require the vcr_helper. Anything inside the VCR.use_cassette block will be recorded to a cassette with the name provided. The second time you run this spec it will use the recorded version instead of hitting the real network. N.B. if you’re doing TDD you sometimes end up saving a cassette with the wrong requests; during the initial development of a feature you may need to remove the cassette several times, or wait until the test is more fully formed before wrapping it in a VCR.use_cassette block.
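For reference, a typical vcr_helper might look something like this (the file name, cassette directory, and stubbing library are assumptions; adjust to your own setup):

```ruby
# spec/vcr_helper.rb — assumed path and configuration
require 'vcr'

VCR.configure do |config|
  # Where recorded cassettes are stored (and committed to git)
  config.cassette_library_dir = 'spec/cassettes'
  # Intercept HTTP requests via WebMock
  config.hook_into :webmock
end
```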

Stub web page

The Void::Crawler takes a web_page as its only argument. In Void I knew this would be an ActiveRecord model, but for the purpose of this test the business logic doesn’t care about models, so I’m using an OpenStruct as a stub. In the happy path the crawler only requires the URL of the web page, so that’s all I’ve implemented.

Test the public API

When testing I like to be able to refactor the internals of a feature and know it still works; this is why my tests for the Crawler only touch the public API of the object under test and are fairly simple.

```ruby
require 'spec_helper'
require 'vcr_helper'
require 'active_support/core_ext/string'
require 'void/crawler'
require 'ostruct'

describe Void::Crawler do
  describe "#start" do
    context "when the URL responds successfully" do
      it "crawls websites" do
        VCR.use_cassette :crawl_pooreffort do
          web_page = OpenStruct.new(
            url: "https://pooreffort.com/blog/postgresql-uuid-primary-keys-in-rails-5/"
          )

          Void::Crawler.new(web_page).start

          expect(web_page.title).to eq("PostgreSQL UUID primary keys in Rails 5 | poor effort")
          expect(web_page.description).to eq("In a recent project I have been using UUIDs as the primary key type with Rails 5 and PostgreSQL. This can be useful if your objects IDs are publicly exposed and you want to disguise the fact that they are a sequence, or how early on in the sequence they might be ;-)")
        end
      end
    end
  end
end
```

Parsing HTML titles

To solve the first test I needed to read the HTML <title> tag, and since I may want to make this title parsing smarter in the future I decided to split out a Void::HtmlTitle object with its own spec. Again, for brevity I’ll skip the TDD dance and just show HtmlTitle’s spec.

```ruby
# spec/lib/void/html_title_spec.rb
require 'spec_helper'
require 'active_support/all'
require 'void/html_title'

describe Void::HtmlTitle do
  describe "when html source has a title tag" do
    # Take note: in the test below I don’t need to provide an entire HTML file
    # or do a network request, I can just provide a simple string fixture
    let(:html) { '<title>pooreffort.com // unreal post</title>' }

    it "finds a useful description" do
      title = Void::HtmlTitle.new(html).title
      expect(title).to eq("pooreffort.com // unreal post")
    end
  end

  describe "when html source has no title tag" do
    let(:html) { '' }

    it "has no title" do
      expect(Void::HtmlTitle.new(html).title).to be_nil
    end
  end
end
```

The spec above is pretty straightforward, and the solution adheres nicely to the single responsibility principle: its only job is to look through a string of HTML and return the title.
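An implementation satisfying that spec can be as small as a regex over the string (a sketch; the real parsing could equally use an HTML parser such as Nokogiri):

```ruby
module Void
  class HtmlTitle
    # Case-insensitive, multiline match for the contents of <title>…</title>
    TITLE_TAG = %r{<title[^>]*>(.*?)</title>}mi

    def initialize(html)
      @html = html.to_s
    end

    # Returns the trimmed title text, or nil when no title tag is present
    def title
      match = TITLE_TAG.match(@html)
      match && match[1].strip
    end
  end
end
```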

I was now able to move back to the Crawler and implement the first part of the crawling: fetching the title from the HTML, which also requires an HTTP request to get the HTML. For simple network requests I’m currently using the HTTP gem, as I really like its straightforward API 👌

```ruby
require 'void/html_title'
require 'http'

module Void
  class Crawler
    attr_reader :web_page

    def initialize(web_page)
      @web_page = web_page
    end

    def start
      response = HTTP.get(web_page.url)
      title = Void::HtmlTitle.new(response.body.to_s).title

      # Since the object passed in is an OpenStruct/ActiveRecord model
      # I settled with using attr writers to set the new properties
      web_page.title = title
    end
  end
end
```

This shows the Crawler using the Void::HtmlTitle object. I’ve also got a Void::HtmlDescription object that attempts to read a meta description and, failing that, falls back to the first paragraph on the page, but due to the length of this article I’m not going to cover it in depth.

Dealing with failures

I started this feature with the happy path, I know there are lots of ways this code could fail but for my first implementation I just covered two:

Bad HTTP status codes – I didn’t want to store server error text if the site happens to return a 500 error while Void is crawling it.

SSL and general HTTP connection issues

I also wanted a way of tracking these errors so I could store the failures and stop attempting to crawl a site after N failures.

If the HTTP status code is in the 400-500 range a message is added to the failed_crawls array.

If there is an HTTP::ConnectionError or SSL error, rescue it and add a message to the failed_crawls array.
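Pieced together, the failure handling in #start might look something like this (a sketch: the failed_crawls attribute and message wording are assumptions, and the HTTP gem is assumed to be loaded):

```ruby
require 'openssl'

module Void
  class Crawler
    attr_reader :web_page

    def initialize(web_page)
      @web_page = web_page
    end

    def start
      response = HTTP.get(web_page.url)

      # Bad HTTP status codes: don't store server error text
      if response.status.to_i >= 400
        web_page.failed_crawls << "HTTP status #{response.status} for #{web_page.url}"
        return
      end

      web_page.title = Void::HtmlTitle.new(response.body.to_s).title
    rescue HTTP::ConnectionError, OpenSSL::SSL::SSLError => e
      # SSL and general HTTP connection issues
      web_page.failed_crawls << e.message
    end
  end
end
```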

Piecing this together with ActiveJob

Each time a new bookmark is created I queue a WebPageCrawlJob in Sidekiq that runs the Crawler, only instead of passing in an OpenStruct, I pass in the WebPage ActiveRecord model and save it after the crawl has completed.
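The job itself can stay tiny; something along these lines (a sketch; the class body is an assumption based on the description above, and ApplicationJob is stubbed here only so the example runs outside a Rails app):

```ruby
# ApplicationJob comes from Rails (ActiveJob); stubbed here so the
# sketch is self-contained outside a Rails app.
class ApplicationJob
  def self.queue_as(*); end
end

# app/jobs/web_page_crawl_job.rb — assumed path
class WebPageCrawlJob < ApplicationJob
  queue_as :default

  def perform(web_page)
    # Run the same PORO as in the specs, then persist the result
    Void::Crawler.new(web_page).start
    web_page.save!
  end
end
```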

This ties Void and the Crawler object together. For good measure I have a high-level test for this background job; it uses FactoryGirl and persists to the database, which might seem like overkill but it really helps give me confidence and peace of mind in my code.

As you can see, in this instance I am only testing the happy path to make sure everything is integrated together. I have tests for the sad path in Void::Crawler itself; there’s no need to repeat them here.

Conclusion

Features that call out to third-party services or use the network are easily tested using VCR.

Getting business logic like this under test gives me great confidence when deploying updates to Void. I can also easily write regression tests when the crawler hits bugs in production; these allow me to be certain that issues are fixed and never recur.

Do let me know in the comments if you’ve found this useful and would like more real-world testing posts. If you want to be kept up to date with progress on Void, please subscribe to the mailing list ✌️