Google Patent Granted on Polite Web Crawling

Your website may be invaded by robots at any time. If you’re lucky, that is – at least if you want people to visit you from places like Google or Yahoo or Bing. And if the visiting robots are polite.

In the early days of the Web, automated programs known as robots, or bots, were created to find information on the Web, and to create indexes of that information. They would do this regardless of whether you wanted them to visit your pages or not, and you had no way to tell them not to go through your web site.

Programs that automatically traverse the web can be quite useful, but have the potential to make a serious mess of things. Robots have been written which do a “breadth-first” search of the web, exploring many sites in a gradual fashion instead of aggressively “rooting out” the pages of one site at a time. Some of these robots now produce excellent indexes of information available on the web.

But others have written simple depth-first searches which, at worst, can bring servers to their knees in minutes by recursively downloading information from CGI script-based pages that can generate an infinite number of possible links. (Often robots can’t detect this!) Imagine what happens when a robot decides to “index” the contents of several hundred MPEG movies. Shudder.
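To see why breadth-first crawling with a visited set and a depth cap sidesteps such traps, here is a minimal sketch run against a hypothetical in-memory link graph (no real network requests; the function and graph are illustrative, not any search engine's actual crawler):

```python
from collections import deque

def crawl_bfs(link_graph, start, max_depth=3, max_pages=100):
    """Breadth-first crawl of an in-memory link graph.

    A visited set plus a depth cap keeps the crawler from recursing
    forever into script-generated pages that link to endless new URLs.
    """
    visited = set()
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in visited:
                queue.append((link, depth + 1))
    return order

# A tiny "crawler trap": each page /cgi?n links to /cgi?n+1, endlessly.
trap = {f"/cgi?{n}": [f"/cgi?{n+1}"] for n in range(1000)}
trap["/"] = ["/about", "/cgi?0"]
pages = crawl_bfs(trap, "/", max_depth=3)
```

A depth-first crawler on the same graph would chase the `/cgi?n` chain until it exhausted `max_pages`; the breadth-first version stops after a handful of pages because the depth cap cuts the chain off.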

A Google patent granted today describes how it might schedule the crawling of web sites so that it doesn’t bring servers to their knees.

It tells us that Google might use crawlers that are focused upon finding different kinds of content, such as a robot that looks specifically for images, another for news, one for shopping or sports, and yet another that is more general and will try to discover new URLs for pages, and new content on already discovered pages. These different types of robots might all decide to visit the same web site, or the same host server which might be home to more than one site. If too many robots try to visit the same server, they might use up all of the resources of the site being visited, and keep other people from seeing the pages on that host server.

This kind of fragmentation of web crawling, with different robots organized by the type of information sought, was intended to put less stress and less of a load on the web pages being crawled. But segmenting crawling by type created new problems in managing which crawlers visited different servers, and when.

The patent was originally filed back in 2003, and chances are very good that Google’s crawling of web sites has evolved in the years since this document was created. But the basic concepts, such as using multiple crawlers, scheduling the crawling of pages, and prioritizing different kinds of pages to be crawled based upon metrics like PageRank, might still be very similar almost 7 years later.

Search Engine Push Towards Page Speed

When someone visits your pages, if they have to wait while the content of the page appears, it’s possible that they may move on to another site. If they find a link to your page in Google’s (or Yahoo’s or Bing’s) search results, and click upon it, and nothing happens for a couple of seconds because your page is slow, they might click on another search result and visit another site.

I’ve written in the past on the topic of Does Page Load Time influence SEO?, partially about a patent from Yahoo that describes how they might use page speed as a ranking signal. Much of the discussion in the comments for that post describes other ways that the loading time of a page might negatively impact web sites, regardless of whether or not page speed is used as a ranking signal.

Google and Yahoo have been leading the way recently in trying to make the Web faster.

Google has released a number of tools and published articles that can help webmasters improve the speed with which their pages load.

One of Google’s tools is a Firefox add-on named Page Speed, which a webmaster can use to see how quickly a site loads in a browser, and to receive advice on changes that can be made to reduce that page load time. Yahoo has also created a Firefox add-on called YSlow, which runs a test on your site, looking at how quickly or slowly your pages load, and provides advice for improving page speed as well.

Google incorporated a Site Performance section in Google Webmaster Tools that can provide other information about the speed of your pages.

The Google Code pages also include a number of articles and tutorials on how to help “make the web faster.”

We don’t know if Yahoo implemented the patent discussed in my page load post above, but Google announced in April of this year that they will be Using site speed in web search ranking for at least some sites.

Why the emphasis on quicker pages?

One answer cited by the search engines is that quicker pages provide better user experiences. Another is that they help make the Web faster overall.

Another “reason” webmasters have to help speed up the Web is that the crawling of Web pages can be negatively impacted by sites that respond slowly.

The patent tells us that:

On the other hand, the load capacity of a web host is often limited by the web host’s hardware setup. When the simultaneous requests for load capacity from various web crawlers are above the maximum load capacity a web host can provide, it is almost certain that some of the competing web crawlers will receive slow service from the web host, and some requests may even fail.

Such phenomenon is sometimes referred to as “load capacity starvation” for a web crawler. Load capacity starvation prevents web crawlers from retrieving documents from a web host and passing them to an indexer in a timely fashion, which adversely affects both the web host and the freshness of search results generated by the search engine.

The concept of scheduling visits by crawlers in a manner that doesn’t starve the load capacity of a server is known as a politeness protocol.
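At its simplest, politeness can be sketched as a per-host rate limiter that spaces out successive requests to the same server. This minimal Python class is an illustration of that idea only; the class name, the one-second default, and the injectable clock are my own choices, not details from the patent or any search engine’s implementation:

```python
import time
from collections import defaultdict

class PoliteFetcher:
    """Enforce a minimum delay between successive requests to one host.

    min_delay is a hypothetical per-host politeness interval; a real
    crawler might derive it from server response times or robots.txt.
    """
    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        # Hosts we have never contacted can be fetched immediately.
        self.last_request = defaultdict(lambda: float("-inf"))

    def wait_time(self, host):
        """Seconds the crawler should still wait before hitting host."""
        elapsed = self.clock() - self.last_request[host]
        return max(0.0, self.min_delay - elapsed)

    def record_request(self, host):
        """Note that a request to host was just made."""
        self.last_request[host] = self.clock()
```

The clock is passed in as a parameter so the behavior can be tested with a fake clock instead of real sleeps, which is also how a scheduler would simulate crawl timetables offline.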

If you have a web site with a very large number of pages, or a smaller site that shares a web server with a number of other sites, the ability of your server to handle visits from crawlers may be limited by its hardware and responsiveness. For large and small sites alike, this can mean that your pages may not all get indexed, or may not be indexed frequently enough to capture regular changes, because a search engine may be polite when visiting and try not to overload the resources of your server.

A host load server balances a web host’s load capacity among multiple competing web crawlers of a search engine. The host load server establishes a lease for each pair of requesting web crawler and requested web host. The lease expires at a scheduled time. If the web crawler completes its mission of retrieving documents from the web host prior to the expiration of the lease, the host load server releases the load capacity allocated to the web crawler and makes it available for other competing web crawlers.

If the web crawler submits a request for renewing its lease with the web host at the scheduled time, the host load server allocates another share of load capacity to the web crawler. If the web crawler does not submit any request at the scheduled time, the host load server terminates the lease and releases the load capacity for other web crawlers.
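The lease mechanism described in those two paragraphs can be sketched as a small allocator. This is a minimal Python reading of the idea; the class name, the capacity of two slots, and the 60-second lease duration are illustrative choices of mine, not values from the patent:

```python
class HostLoadServer:
    """Sketch of the patent's lease idea: a web host's load capacity is
    divided among competing crawlers via time-limited leases."""

    def __init__(self, host_capacity):
        self.capacity = host_capacity  # max concurrent crawler slots
        self.leases = {}               # crawler name -> lease expiry time

    def request_lease(self, crawler, now, duration=60):
        """Grant (or renew) a lease if the host has spare capacity."""
        self._expire(now)
        if crawler in self.leases or len(self.leases) < self.capacity:
            self.leases[crawler] = now + duration
            return True
        return False  # host fully allocated: crawler must wait

    def release(self, crawler):
        """Crawler finished early; free its share for competitors."""
        self.leases.pop(crawler, None)

    def _expire(self, now):
        # Leases not renewed by their scheduled time are terminated,
        # releasing the load capacity for other web crawlers.
        self.leases = {c: t for c, t in self.leases.items() if t > now}
```

Usage follows the patent's narrative: an image crawler and a news crawler can each hold a lease on a two-slot host, a third crawler is refused until one releases or expires, and an expired lease frees the slot automatically on the next request.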

Conclusion

If you’re interested in digging into the fine details of how different kinds of crawlers (image crawlers, news crawlers, general web crawlers) might be scheduled to visit servers politely, the patent is worth exploring in depth.

Yeah, some of my sites are visited by robots like 4,000 times a day and by people only 100 times :P.
There was this one time when the stupid msn bot kept requesting a missing page all day, thousands and thousands of times. The server returned a 404 error, but the bot did not give up. It used 90 minutes of processor time instead of the usual 25. I had to block the bot.

With the importance of improving page load times for SEO, I’ve tried everything without too much improvement. I’ve used YSlow, caching plugins, upgraded to VPS hosting, combined CSS files, reduced the images loaded through CSS, and tried most available options, but my website’s page load times in Google Webmaster Tools keep going up.

What’s odd is that I have two different websites on the same VPS hosting account. They use the same plugins, WordPress theme, and everything. One of the sites has maybe three or four additional plugins for multi-author publishing.

The site with more plugins and traffic is slower than 55% of websites according to Google Webmaster Tools, while the other site is slower than 95%. I really can’t figure that one out. Any ideas?

Well, it’s good that Google actually cares about it. Google’s not the issue, but sometimes so many bots (image bots, spam bots, etc.) just keep crawling my sites. I’ve actually used a robots.txt that I’ve seen over at Sebastian’s Pamphlets. It might help. Just saying 🙂

But I digress. This is a nice change for big sites. But on small sites and on infrequently changed pages, this might slow down indexing by a large amount…

One example that I can think of – a Yahoo patent filing from a while back noted that Yahoo’s crawler might visit a page, and then revisit a minute or so later to see if any of the links change on those pages – if they do, those links might be ones pointing to rotating advertisements, and they might treat them a little differently than links that stay the same from one visit to another.

But your experience with the MSN bot that kept returning to a 404 page sounds pretty odd. I wonder if the scheduling program that MSN may use to track visits from bots was experiencing some kind of bug.

It’s not talked about a lot, but a politeness protocol was developed in the early days of web crawling to keep bots from overwhelming servers. I’ve seen a number of discussions on older Usenet pages from people complaining about their sites going down because of bots that weren’t very polite.

I would probably need to know both URLs if I were to have a chance to diagnose the problem that you are experiencing, but I can suggest a few different things for you to try.

Have you tried looking at page load speed on the slow site while turning one plugin off at a time? Do you have JavaScript on the pages of the site that loads external resources such as images, links, or RSS feeds, and might that be the culprit? Try turning those off by commenting them out or removing them temporarily, and then run the page speed plugins again.

Chances are that Google implemented something like this years ago, with careful scheduling of when they visit web pages and servers. I don’t think this process is necessarily something new, but rather a good description of how they may be, or may have been, crawling web pages, and some of the issues that they may face when doing so.

Sebastian’s Pamphlets is a very good resource for topics such as web crawling and bots, and I’d imagine that if you followed his suggestions regarding robots.txt, they have been helpful.

SEO by the Sea focuses upon SEO as the search engines tell us about it, from sources such as patents and white papers from the search engines. This information about SEO is tempered by years of experience from the author of the site, who has been doing SEO since the days when search engines started appearing on the Web.