Google May Be Crawling AJAX Now – How To Best Take Advantage Of It

In October 2009, Google proposed a new standard for implementing AJAX on web sites that would help search engines extract the content. Now, there’s evidence this proposal is either live or is about to be. Read on for more details on the proposal, how it works, and why it might be past the proposal stage.

The Trouble With AJAX

Historically, search engines have had trouble accessing AJAX-based content and this proposal would enable Google (and presumably other search engines that adopted the standard) to index more of the web. The standard SEO advice for AJAX implementations has traditionally been to follow accessibility best practices. If you build the site with progressive enhancement or graceful degradation techniques so that screen readers can render the content, chances are that search engines can access the content as well. Last May, I outlined some of the crawlability issues with AJAX and options for search-friendly implementations.

One of the primary search engine problems with AJAX is that it generates URLs that contain a hash mark (#). Since hash marks are also used for named anchors within a page, search engines typically ignore everything in a URL beginning with one (called the URL fragment). So, for instance, Google would see the following two URLs as identical:

http://www.buffy.com/seasons.php

http://www.buffy.com/seasons.php#best=2
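Concretely, the de-duplication works as if the engine stripped the fragment before indexing. A one-line sketch (the function name is illustrative, not any real API):

```javascript
// Sketch: search engines index the URL with everything from the first
// "#" (the fragment) removed, so both example URLs collapse to one.
function indexableUrl(url) {
  var i = url.indexOf('#');
  return i === -1 ? url : url.slice(0, i);
}

console.log(indexableUrl('http://www.buffy.com/seasons.php#best=2'));
// http://www.buffy.com/seasons.php
```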

Google’s AJAX Proposal

With Google’s proposal, an AJAX-generated URL would use #! in place of the hash mark (#) alone. So, the second URL above would become http://www.buffy.com/seasons.php#!best=2. When Googlebot encounters the exclamation point after the hash mark, it would then request the URL from the server using a syntax that replaces the #! with ?_escaped_fragment_=.

Still with me? All this means is that when Googlebot encounters:

http://www.buffy.com/seasons.php#!best=2

it will request the following URL from the server:

http://www.buffy.com/seasons.php?_escaped_fragment_=best=2

Why, you ask? Well, because ?_escaped_fragment_= in the URL tells the server to route the request to a headless browser, which executes the AJAX code and renders a static page.

But, you might protest, I don’t want my URLs in the search results to look like that! Not to worry, Google requests the URL using that syntax, but then translates the ?_escaped_fragment_= back into #! when displaying it to searchers.
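The two-way mapping can be sketched as a pair of helper functions (the names are mine, not part of any real API; for simplicity this ignores the percent-encoding of special characters in the fragment, and the case where the URL already has a query string, which the proposal handles by appending with &):

```javascript
// Crawler side: turn a #! URL into the URL Googlebot actually fetches.
// Assumes the URL has no existing query string.
function toEscapedFragment(url) {
  var i = url.indexOf('#!');
  if (i === -1) return url; // not an AJAX URL in this scheme
  return url.slice(0, i) + '?_escaped_fragment_=' + url.slice(i + 2);
}

// Display side: turn the fetched URL back into the #! form shown to searchers.
function toHashBang(url) {
  var marker = '?_escaped_fragment_=';
  var i = url.indexOf(marker);
  if (i === -1) return url;
  return url.slice(0, i) + '#!' + url.slice(i + marker.length);
}

console.log(toEscapedFragment('http://www.buffy.com/seasons.php#!best=2'));
// http://www.buffy.com/seasons.php?_escaped_fragment_=best=2
```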

How Do I Implement This?

This implementation basically requires that you:

Modify your AJAX implementation so that URLs that contain hash marks (#) are also available via the hash mark/exclamation point (#!) combination (or, as I recommend below, that you replace the # versions entirely with the #! ones).

Configure a headless browser on your web server that processes the ?_escaped_fragment_= versions of the URLs, executes the JavaScript on the page and returns a static page.

Oh, you still have questions? I have answers! Well, and some questions of my own.

What about all those links? Is Google going to consolidate all links to the # version of the URL and attribute them to the #! version? It appears that the answer is no. Currently, all links to URLs that contain a hash mark are attributed to the URL before the fragment, and that will continue to be the case. And the canonical tag won’t work in this case, since Google doesn’t process the # version of the URL. So, returning to our earlier example, all links to http://www.buffy.com/seasons.php#best=2 are attributed to http://www.buffy.com/seasons.php.

Wait, do we need to start using #! instead of #? You likely don’t want to implement this in such a way that the # and #! URLs co-exist. Instead, you’ll want to replace # URLs with #! URLs. You can’t redirect search engine bots away from the # versions, of course, since the fragment never reaches the server (the same reason bots can’t crawl and index the AJAX URLs as is). This means that, as noted above, the pages won’t get credit for past links to the # version of the URLs. You should ensure that the #! version of the URLs is what displays in a visitor’s browser, though, so that any new links are to the (now indexable) #! versions. What about visitors coming from existing links to the # versions of the URLs? You’ll want to add code that transforms the # version of the URLs to the #! version (see below for more on that).

How do I create #! URLs in place of # URLs? That’s pretty straightforward. Just (I know, there’s no “just”) modify the AJAX code that creates URLs to output #! URLs instead of # URLs.
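In practice, the change can be as small as adjusting the function that builds your state URLs, plus a small helper to upgrade legacy # URLs for visitors arriving from old links. A hedged sketch (both function names are hypothetical, not from any library):

```javascript
// Emit #! URLs instead of # URLs when building AJAX state links.
function buildStateUrl(basePath, state) {
  // Before: return basePath + '#' + state;
  return basePath + '#!' + state; // the indexable form
}

// Upgrade an old # URL to its #! equivalent so visitors arriving
// from existing links land on the indexable version.
function upgradeLegacyUrl(url) {
  var i = url.indexOf('#');
  if (i === -1 || url.charAt(i + 1) === '!') return url; // nothing to do
  return url.slice(0, i) + '#!' + url.slice(i + 1);
}

console.log(upgradeLegacyUrl('http://www.buffy.com/seasons.php#best=2'));
// http://www.buffy.com/seasons.php#!best=2
```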

As noted above, for any existing AJAX pages that use #, you’ll want to redirect visitors to the new URLs that use #!. This won’t cause Google to transfer links from the # versions to the #! versions, but it will ensure that visitors see only the #! version and therefore that any new links are to that version, which will cause Google to start accruing PageRank for those pages. Obviously, you’ll want any new links to point to the versions of the pages that Google will index, so those pages have a better chance at ranking well.

My colleague Todd Nemet has a few suggestions for redirecting visitors from the # versions to the #! versions of the URLs.

JavaScript – You can use document.location, such as:

<script type="text/javascript">
document.location = "http://www.buffy.com/seasons.php#!best=2";
</script>

PHP – You can write a short PHP script, such as:

<?php
header("HTTP/1.1 301 Moved Permanently");
header("Location: http://www.buffy.com/seasons.php#!best=2");
?>

.htaccess – For Apache servers, you can use the NE flag in a rewrite rule, as shown below (although this really only works if you’re moving to the #! structure from a non-# URL):

RewriteCond %{QUERY_STRING} ^best=(.*)$
RewriteRule ^seasons.php$ /seasons.php#!%1? [R=301,NE]

Meta refresh – generally, a meta refresh isn’t recommended for redirects, as search engines do a better job of following 301s, but in this case, you’re only redirecting visitors. You can add code similar to the following to the <head> section of the original page:

<meta http-equiv="refresh" content="0; url=http://www.buffy.com/seasons.php#!best=2">

What’s this about a “headless browser”? The headless browser runs on your web server and processes requests for the ?_escaped_fragment_= versions of URLs. In Google’s original blog post, they suggested checking out HtmlUnit, an open source headless browser. The headless browser executes the JavaScript and renders a static page, then returns it to the requestor. I can hear your next question already — what does that rendered static page look like? Well, it should probably expose all of the content on the page. The two important things here are that Google will be able to get to the content and index it and that Google will have distinct URLs for indexing that content.

What does this mean for accessibility? This question came up when the Google engineers spoke at the Jane and Robot Search Developer Summit I put on just after SMX East, where this proposal was announced. This implementation doesn’t help the content render correctly on mobile devices that don’t support JavaScript or on screen readers. So when considering whether to implement this vs. another technique, think about your accessibility needs.

You’ll also want to make sure that the AJAX URLs aren’t simply popups, since you don’t want search engines to index a popup without the surrounding page content. Ensure that the headless browser creates a static page that includes all content from the page.

Any other problems with this idea? Beyond the accessibility issues (which I think shouldn’t be overlooked), the biggest consideration is probably that this method doesn’t work for search engines other than Google. So if you care about getting this content indexed by Bing and Yahoo!, you’ll want to explore other methods. Also, as you’ll see below, it seems like it may be live on Google, but a bit buggy. So, if you plan to implement it, you’ll have to rely on Google working out the kinks. You should also fully plan out the implementation. Did you previously add workarounds for AJAX issues in other ways that will now conflict with this method?

Also, even if you don’t use AJAX and don’t implement this technique, the potential for problems exists. It’s always been good practice not to configure your server to respond successfully to any arbitrary URL request. One reason is crawl efficiency: you can send search engine bots into an infinite crawl space if your server responds with an HTTP 200 for any URL. But note what has happened with the site below. The “real” URL is iankellysmusic.com/About/. But it seems a link exists on the web to iankellysmusic.com/About/#!. Google has followed that link and is interpreting it as this new AJAX technique.

This site has an additional issue in that the system is set up to automatically generate the title tag based on the text after the last slash in the URL. So, while the “real” URL creates a title tag of Ian Kelly | About, the above URL generates a title of Ian Kelly | ?_escaped_fragment_=. Not awesome.

Of course, the “real” URL is also in Google’s index:

Presumably, the real About page will be the one that ranks for relevant searches, as it should have more links, but why take the chance of having your one opportunity to engage with a potential audience through search be marred by a poor search results display? Not to mention that this provides an opportunity for competitive attacks.

You Said This Is Beyond the Proposal Stage?

Maybe. As noted above, it looks like Google may have begun crawling and indexing these URLs. A Google Groups thread points out that Google’s search index contains URLs with ?_escaped_fragment_= in them. Granted, things seem a little buggy. One poster pointed at these search results, which do show pages that seem to match this implementation, but the search results don’t display any URLs at all, and when I click a page, I get a confusing redirect error message in Flock and get sent back to the Google home page in Internet Explorer.

In other instances, Google is displaying the #! version of the URL, which indicates that this proposal is at least partially live (since Google wouldn’t ordinarily crawl a URL past the #).

I’ve found some results that display ?_escaped_fragment_=. Google said they would translate these URLs back into the #! version before displaying them, and it seems that some sites actually have this fragment in their URLs (which I would find surprising, but honestly, there’s not much on the web that can surprise me anymore). My guess is that someone saw the proposal and misinterpreted it to mean that the site pages should use the ?_escaped_fragment_= versions of the URLs, and coded them that way. When you go to the actual pages, the browser address bar does indeed display the URLs this way. Ensure that if you implement this solution, you always code the URLs with #!, as Google will make the transformation when fetching the URLs and then transform them back into the user-friendly versions when displaying them.

A Google search for inurl:?_escaped_fragment_= seems to be returning all variations of what I describe above: URLs that include #!, URLs that include ?_escaped_fragment_=, search results that list no URLs at all, and search results that, once clicked, return you to the same Google search results page you were just on. So has Google made this change live? Probably. Is it 100% ready for prime time? Probably not.

Should you start implementing? Depends. If you’re able to implement another method for search-friendly AJAX, I’d suggest going that route, since it will make your site available to a wider audience, including those using screen readers, mobile devices, and search engines other than Google. If you just aren’t able to implement another method and it’s either this method or nothing, well, then it’s better than nothing. Should you start now? Google doesn’t seem to have everything polished, but there is evidence that they’ve begun crawling these URLs. It seems inevitable that Google will support this implementation, so if you’re planning to use this method, you may as well get started.

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.

About The Author

Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.