SpiderSpotting: When A Search Engine, Robot Or Crawler Visits

Search engines send out what are called spiders, crawlers or robots to visit your site and gather web pages. These robots leave traces behind in your access logs, just as an ordinary person does. If you know what to look for, you can tell when a spider has come to call. That can save you worrying that you haven't been visited. You can tell exactly what a robot has recorded or failed to record. You can also spot robots that may be making a large number of requests, which can affect your page impression statistics or even burden your server.

Searching for Spiders

How do you identify a spider? Those from the major search engines can sometimes be identified from their host names. These often incorporate part of the search engine's name or the company's name. For example, one of WebCrawler's host names is spidey.webcrawler.com.

A better way of spotting spiders is to look for their agent names, or what some people call browser names. Spiders have their own names, just like browsers. For example, Netscape identifies itself by saying Mozilla. Alta Vista's spider says Scooter, while HotBot's spider is named Slurp.

Some resources for getting a list of host and agent names for the major search engines are listed below. However, it's useful to know how to spot any robot, because names can change and new robots can appear. So, let's take a look at how to spot robots when you don't know what to look for.

Be aware that in the examples below, spider names are from when this article was originally written, in 1997. The principles of spotting spiders remain the same, however.

Your Best Clue: robots.txt

Start your search with a review of requests for the robots.txt file. This is a file that tells robots what they may and may not index within a site. Not all spiders follow the robots.txt convention, but most do. Anything requesting this file is almost certainly a spider, robot or an agent.

By reviewing the requests, you can usually spot spiders from the major search engines by their host names, which in turn tells you the latest agent names. You'll probably be surprised to see how many smaller search engines, personal agents and other robots are also accessing your site.

I call this review of requests for the robots.txt file a crawler report. Here's an example of one:

I created the report by using log analysis software to analyze three months of log activity. The report lists requests first by agent name, then by host name, ranked by number of visits.
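If you don't have log analysis software handy, the same kind of report can be roughed out with a short script. The Python sketch below is only an illustration, not the tool used for the report above. It assumes your server writes the common "combined" log format and records resolved host names rather than bare IP addresses. The sample log lines, including the example.com and example.net hosts, are invented for demonstration, and the output is a simplified flat ranking of agent and host pairs.

```python
import re
from collections import Counter

# Assumption: "combined" log format, with a resolved host name in the first field.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def crawler_report(lines):
    """Count robots.txt requests per (agent, host), most frequent first."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        # Ignore any query string when checking the requested path.
        if m and m.group("path").split("?")[0] == "/robots.txt":
            counts[(m.group("agent"), m.group("host"))] += 1
    return counts.most_common()

# Hypothetical sample lines, made up for illustration only.
SAMPLE_LOG = [
    'spidey.webcrawler.com - - [10/Feb/1997:10:00:00 -0800] '
    '"GET /robots.txt HTTP/1.0" 200 120 "-" "WebCrawler/3.0 Robot libwww/5.0"',
    'crawler.example.com - - [11/Feb/1997:09:00:00 -0800] '
    '"GET /robots.txt HTTP/1.0" 200 120 "-" "Slurp/2.0"',
    'crawler.example.com - - [12/Feb/1997:09:00:00 -0800] '
    '"GET /robots.txt HTTP/1.0" 200 120 "-" "Slurp/2.0"',
    'dialup.example.net - - [12/Feb/1997:11:30:00 -0800] '
    '"GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/3.01 (Win95; I)"',
]

if __name__ == "__main__":
    for (agent, host), visits in crawler_report(SAMPLE_LOG):
        print(f"{visits:5d}  {agent}  {host}")
```

To run it against a real log, pass an open file instead of the sample list, for example crawler_report(open("access.log")), where access.log stands in for wherever your server keeps its log.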

As you can see, there are all sorts of robots visiting the site. Small search engines and even experimental search engines make visits, such as Stanford's BackRub (the predecessor to Google). Offline browsers also come to call, such as NetCarta's WebMapper. These are essentially personal spiders.

Naturally, the major search engines make appearances. It's easy to spot Infoseek as InfoSeek Sidewinder/0.9 and WebCrawler as WebCrawler/3.0 Robot libwww/5.0. But you need to know that HotBot is produced by Inktomi to match it to Slurp/2.0, or that Architext is the parent company of Excite to know that ArchitextSpider means that Excite has visited.

Even More Agent Names

Armed with agent names from the crawler report, you can go back and run a report for a specific search engine's spider. For example, I can look for Infoseek by searching for InfoSeek Sidewinder/0.9.

It's often a good idea to slightly broaden the search. By searching for *infoseek* with my log analysis software, I told the program to search the logs for any matches that have "infoseek" in the agent name. I also told the program to list all the host names from these visits. Here's the report:

From the report, you can see the usefulness of broadening the search. Agent names can change, which is why InfoSeek Sidewinder and Infoseek Robot 1.17 are also listed.
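A broadened search like *infoseek* is easy to imitate in a script. The Python sketch below uses the standard fnmatch module for the wildcard match and lists the host names seen under each matching agent name. It is an illustration under assumptions: the (agent, host) pairs are invented sample data, the pairing of agents with hosts is hypothetical, and both sides are lowercased so the match ignores case.

```python
from collections import defaultdict
from fnmatch import fnmatch

def hosts_by_agent(records, pattern):
    """Group host names under each agent name that matches a wildcard pattern."""
    matches = defaultdict(set)
    for agent, host in records:
        # Lowercase both sides so the match is case-insensitive on every platform.
        if fnmatch(agent.lower(), pattern.lower()):
            matches[agent].add(host)
    return {agent: sorted(hosts) for agent, hosts in matches.items()}

# Hypothetical (agent, host) pairs, made up for illustration only.
SAMPLE_RECORDS = [
    ("InfoSeek Sidewinder/0.9", "a-bbn.infoseek.com"),
    ("InfoSeek Sidewinder/0.9", "b-bbn.infoseek.com"),
    ("Infoseek Robot 1.17", "seven.eccosys.com"),
    ("Slurp/2.0", "crawler.example.com"),
]

if __name__ == "__main__":
    for agent, hosts in hosts_by_agent(SAMPLE_RECORDS, "*infoseek*").items():
        print(agent, hosts)
```

Searching for the exact name InfoSeek Sidewinder/0.9 would have missed the second agent in the sample. The wildcard catches both spellings.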

Helpful Host Names

In addition, listing host names helps you spot any uses of the Infoseek crawler by someone other than Infoseek. For example, the seven.eccosys.com host requests are probably from a company that has licensed Infoseek's technology for its own purposes. Look at the HotBot report to see something similar.

See the crawler1.anzwers.ozemail.net request? This is probably from HotBot's Australian partner, OzEmail. I want to exclude it from the report, so that I only get true HotBot requests.

I had another reason to want to know host names. I only began recording agent name information in mid-January 1997, but my statistics database went back to December. So, I needed to run reports filtered by host name in order to get a complete picture. In the case of Infoseek, it was easy to see that the host names all had the form of *-bbn.infoseek.com, so I entered this into my log analysis program in order to get only Infoseek requests. Here's a revised version of that earlier report:

Actually, when I ran this report, I broadened the search to *infoseek.com, to ensure I didn't miss anything. That's why a single request from topgun.infoseek.com appears.

But by also listing the agent names, I could see that there was a single request from Netscape 3.x. This is no doubt a human being from Infoseek who came to the site. By narrowing the filter to *-bbn.infoseek.com, as shown above, this non-spider request would be eliminated.
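The difference between the broad and narrow host filters can be shown in a few lines of Python. This is only a sketch: the (host, agent) pairs are invented, the letters in front of -bbn are made up, and pairing topgun.infoseek.com with a Netscape agent string is an assumption for the sake of the example.

```python
from fnmatch import fnmatch

def filter_by_host(records, host_pattern):
    """Keep only the (host, agent) pairs whose host matches the wildcard pattern."""
    return [
        (host, agent)
        for host, agent in records
        if fnmatch(host.lower(), host_pattern.lower())
    ]

# Hypothetical (host, agent) pairs, made up for illustration only.
SAMPLE_RECORDS = [
    ("a-bbn.infoseek.com", "InfoSeek Sidewinder/0.9"),
    ("b-bbn.infoseek.com", "Infoseek Robot 1.17"),
    ("topgun.infoseek.com", "Mozilla/3.01 (Win95; I)"),
    ("spidey.webcrawler.com", "WebCrawler/3.0 Robot libwww/5.0"),
]

broad = filter_by_host(SAMPLE_RECORDS, "*infoseek.com")
narrow = filter_by_host(SAMPLE_RECORDS, "*-bbn.infoseek.com")

if __name__ == "__main__":
    print(len(broad), "requests match the broad filter")
    print(len(narrow), "requests match the narrow filter")
```

In the sample, the broad pattern keeps the browser visit from inside the company, while the narrow pattern leaves only the spider hosts.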

Some Final Clues

OK, now you know what to look for. However, there are a few things that can throw you off. For example, some search engines can go nuts now and then. Really. Go back and look at the HotBot report, above. Look way down on the report, where actual page requests are listed. You can see that for several days, all HotBot did was request the home page from the site, and nothing more.

During this period, HotBot was upgrading its crawler technology. It went back into normal mode in late February. You can see the change, because suddenly the search engine requests a large number of different pages from the web site.

That's what a good search engine will do. It will visit your site on a regular basis and request a large number of pages, perhaps spread out over a period of a few days. This is the search engine being "polite" and trying not to overburden your server all at once.

So, be sure to look at what the spiders are actually requesting, rather than just adding up the requests. It can be surprising.
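One quick way to check this is to count distinct pages alongside total requests for each agent. The Python sketch below assumes you have already pulled (agent, path) pairs out of your logs. The sample data is invented, and ExampleBot/1.0 is a made-up agent name.

```python
from collections import defaultdict

def request_summary(records):
    """Return {agent: (total_requests, distinct_pages)} for (agent, path) pairs."""
    totals = defaultdict(int)
    pages = defaultdict(set)
    for agent, path in records:
        totals[agent] += 1
        pages[agent].add(path)
    return {agent: (totals[agent], len(pages[agent])) for agent in totals}

# Hypothetical (agent, path) pairs, made up for illustration only.
SAMPLE_RECORDS = [
    ("Slurp/2.0", "/"),
    ("Slurp/2.0", "/"),
    ("Slurp/2.0", "/"),
    ("Slurp/2.0", "/"),
    ("ExampleBot/1.0", "/"),
    ("ExampleBot/1.0", "/about.html"),
    ("ExampleBot/1.0", "/products.html"),
    ("ExampleBot/1.0", "/contact.html"),
]

if __name__ == "__main__":
    for agent, (total, distinct) in request_summary(SAMPLE_RECORDS).items():
        print(f"{agent}: {total} requests, {distinct} distinct pages")
```

Both agents in the sample make four requests, but only one of them fetches more than the home page. A raw request count alone would hide that.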

Another oddity can be caused by "instant indexing" search engines, such as Alta Vista, Infoseek and WebCrawler. These will instantly spider any page submitted to them (though WebCrawler, unlike the other two, takes much longer to add that page to its index).

Sometimes, people see that spiders from these search engines have retrieved a page or two and mistakenly believe everything has been added. In reality, the spider still needs to return to properly catalog the site. Watch your logs, and you'll know when it has.

More Resources

How Search Engines Work
Spiders are just one part of a search engine. This page within the site puts it all in context for you.

Search engine spiders and crawlers can skew page view statistics, which means that advertisers might pay for impressions that human beings never see. To solve this, the Interactive Advertising Bureau is now offering a list of robotic visitors.
