The story about SEO and SPA

30 September 2015

UPDATE: Google announced on their blog they're deprecating their AJAX crawling scheme

The web crawling bot

Once upon a time there was a web crawling bot, also known as a spider. It looked around your website for new and updated pages to add to the Google, Bing or DuckDuckGo index. On your "old" and well-indexed website this was no problem at all. But after you rebuilt your website with one of the latest technologies, the single-page application (SPA), things suddenly became a bit different: the content is now rendered in the browser by JavaScript, which crawlers traditionally did not execute. Google therefore developed a scheme that lets search engines crawl and index this kind of content. If your SPA adopted this scheme, your content would show up in search results.

Supporting the AJAX crawling scheme

When the crawling bot tries to crawl your site, you need to tell it that your site is heavily based on JavaScript and that it implements the AJAX crawling scheme. To do this, add an exclamation mark after the hash in your URL. So http://www.example.com/#example becomes http://www.example.com/#!example.
When the bot sees the ! right after the hash in your pretty URL, it temporarily changes the URL to a very ugly one: http://www.example.com/?_escaped_fragment_=example. The bot sends its request to this URL, and it is our job to handle the _escaped_fragment_ parameter and serve the bot an HTML snapshot. The bot then uses this snapshot to index the content of the page.
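To make the pretty-to-ugly mapping concrete, here is a minimal sketch in TypeScript of the rewrite a crawler performs. The function name toUglyUrl is made up for this example, and the set of escaped characters is a simplified assumption (only %, #, &, + and space), not the full list from the scheme.

```typescript
// Hypothetical helper: maps a "pretty" #! URL to the "ugly" URL a crawler would request.
function toUglyUrl(prettyUrl: string): string {
  const marker = prettyUrl.indexOf("#!");
  if (marker === -1) {
    return prettyUrl; // no hashbang: nothing to rewrite in this sketch
  }
  const base = prettyUrl.slice(0, marker);
  const fragment = prettyUrl.slice(marker + 2);
  // Assumption: escape only the characters that would otherwise change the
  // meaning of the query string (%, #, &, + and space).
  const escaped = fragment.replace(/[%#&+ ]/g, (ch) =>
    "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
  // Append to an existing query string with "&", otherwise start one with "?".
  const separator = base.indexOf("?") !== -1 ? "&" : "?";
  return base + separator + "_escaped_fragment_=" + escaped;
}

console.log(toUglyUrl("http://www.example.com/#!example"));
```

With the example URL from above, this logs the ugly URL http://www.example.com/?_escaped_fragment_=example.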

If you don't use hashes in your URL you have to tell the bot via a meta tag that the application is implementing the scheme:

<meta name="fragment" content="!">

Serving an HTML snapshot

Once you've told the bot that you created a website with a lot of JavaScript, it is time to serve the bot an HTML snapshot. An HTML snapshot is a static version of your dynamically generated content.
To create this snapshot you'll need a headless browser (we use PhantomJS) to crawl your page.

The headless browser requests your dynamic page and loads all the necessary JavaScript and CSS, just like a normal browser would. Once the page has finished rendering, the resulting static HTML is captured as the snapshot for the bot.

Obviously you don't want to serve these snapshots to a real user, so how do we tell the difference between a bot and a real user? This is where the crawling scheme comes into play. The bot requests your webpage with the ugly version of the URL, so with the _escaped_fragment_ parameter in the query string. In your server-side code (we use ASP.NET MVC and/or WebAPI) you need to handle it somewhat like this:

// Requires: using System; using System.Net; using System.Web.Mvc;
// The action name Index is an assumption; use whichever action serves your SPA.
public ActionResult Index()
{
    // No _escaped_fragment_ parameter: a regular visitor, so return the index view
    // and give control to your SPA framework.
    if (Request.QueryString["_escaped_fragment_"] == null)
    {
        return View();
    }

    // The request contains _escaped_fragment_, so we return the created snapshot.
    try
    {
        // Remove the ?_escaped_fragment_= part to be able to create a snapshot with PhantomJS.
        // Note: for #! URLs you would replace it with "#!" instead of "".
        var url = Request.Url.AbsoluteUri.Replace("?_escaped_fragment_=", "");
        var result = CrawlPage(url); // CrawlPage requests our WebApi
        return Content(result); // result is the snapshot HTML that is served to the bot
    }
    catch (Exception)
    {
        return new HttpStatusCodeResult(HttpStatusCode.InternalServerError);
    }
}
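
The plain string replacement in the controller only covers the simplest case. As a rough sketch of a more general reverse mapping (ugly URL back to the pretty URL that the headless browser should load), something like the following TypeScript could work. The function name toPrettyUrl is hypothetical, and the sketch assumes that an empty _escaped_fragment_ value means the page opted in through the meta tag and has no hashbang.

```typescript
const ESCAPED_FRAGMENT = "_escaped_fragment_=";

// Hypothetical helper: returns the pretty URL to snapshot for a crawler request,
// or null when the request comes from a regular visitor.
function toPrettyUrl(requestUrl: string): string | null {
  const queryStart = requestUrl.indexOf("?");
  if (queryStart === -1) {
    return null; // no query string at all, so no _escaped_fragment_
  }
  const base = requestUrl.slice(0, queryStart);
  const params = requestUrl.slice(queryStart + 1).split("&");

  let fragment: string | null = null;
  const kept: string[] = [];
  for (const param of params) {
    if (param.indexOf(ESCAPED_FRAGMENT) === 0) {
      // Undo the percent-escaping the crawler applied to the fragment.
      fragment = decodeURIComponent(param.slice(ESCAPED_FRAGMENT.length));
    } else {
      kept.push(param); // keep all other query parameters untouched
    }
  }
  if (fragment === null) {
    return null; // regular visitor: let the SPA handle the request
  }
  const query = kept.length > 0 ? "?" + kept.join("&") : "";
  // Assumption: an empty value means the meta tag was used and there is no #! part.
  const hash = fragment === "" ? "" : "#!" + fragment;
  return base + query + hash;
}

console.log(toPrettyUrl("http://www.example.com/?_escaped_fragment_=example"));
```

A null result corresponds to the `return View()` branch in the controller, while any other result is the URL you would hand to the snapshot service.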

The CrawlPage method calls our SnapshotController, which is part of an external WebApi. This controller, together with PhantomJS, creates the snapshot and returns the HTML to our web application. A good example of all this can be found on GitHub.

It may take some time to get your head around this, but with this approach search engines that support the scheme can index your SPA website.

Do you have another approach to make your SPA crawlable? Or have any questions? Sound off in the comments below.