If you are wondering how to monetize your aspiring blog or website, need help with your PC, or have a question on any other topic, we are here to answer it and to give you hints and direction on the options available to you.


Friday, January 20, 2012

If you have a website and it is being crawled too often by Bing, Yahoo, or Live, this post describes how to reduce their crawl rate to acceptable levels.

Last week, we began receiving 500 and 503 errors from one of our affiliate stores. This had the undesirable side effect of placing our local instance of Apache's web server in an error state and thus taking our site offline for several hours each day.

We realized that our site was down, but did not know why. After reading through our server's log files, we discovered the 5xx errors. After researching these HTTP error codes, we found that we could not fix the errors directly. Instead, we had to correct the root cause.
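A quick way to spot this kind of problem is to tally the 5xx status codes in the access log. The following is a minimal sketch, assuming Apache's common or combined log format. The sample lines, IP addresses, and paths are made up for illustration and are not taken from our logs.

```python
import re
from collections import Counter

# Pulls the status code that follows the quoted request
# in a common/combined log line.
STATUS = re.compile(r'"[A-Z]+ \S+[^"]*" (?P<status>\d{3}) ')


def count_5xx(lines):
    """Return a Counter mapping each 5xx status code to its number of hits."""
    counts = Counter()
    for line in lines:
        match = STATUS.search(line)
        if match and match.group("status").startswith("5"):
            counts[match.group("status")] += 1
    return counts


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [13/Jan/2012:10:00:01 +0000] "GET /store/item1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.11 - - [13/Jan/2012:10:00:02 +0000] "GET /store/item2 HTTP/1.1" 503 312 "-" "Mozilla/5.0"',
    '192.0.2.12 - - [13/Jan/2012:10:00:02 +0000] "GET /store/item3 HTTP/1.1" 503 312 "-" "Mozilla/5.0"',
    '192.0.2.13 - - [13/Jan/2012:10:00:03 +0000] "GET /store/item4 HTTP/1.1" 500 298 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for status, hits in sorted(count_5xx(SAMPLE_LOG).items()):
        print(status, hits)
```

To run it against a real log, pass an open file object instead of SAMPLE_LOG, since the function only needs an iterable of lines.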

Searching through our affiliate's website, we found that their server returns these error codes when it receives too many requests from a particular IP. So, we went back to our log files and found that the Bing, Yahoo, and Live crawlers were all requesting many of our pages at the same time.
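To see which crawlers are piling on at the same moment, the log can be grouped by minute and by crawler. The sketch below again assumes the combined log format, where the user agent is the last quoted field. The crawler tokens (bingbot, msnbot, slurp, googlebot) and the sample lines are assumptions for illustration, so adjust them to whatever your own logs show.

```python
import re
from collections import Counter

# Assumed user-agent substrings (lowercase) and the label to report for each.
CRAWLER_TOKENS = {
    "bingbot": "Bing",
    "msnbot": "MSN/Live",
    "slurp": "Yahoo",
    "googlebot": "Google",
}

# Captures the timestamp down to the minute (17 characters, e.g.
# "13/Jan/2012:10:00") and the final quoted field, which is the user agent.
LINE = re.compile(r'\[(?P<minute>[^\]]{17})[^\]]*\].*"(?P<agent>[^"]*)"\s*$')


def crawler_hits_per_minute(lines):
    """Count requests per (minute, crawler label); non-crawler lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for token, label in CRAWLER_TOKENS.items():
            if token in agent:
                counts[(match.group("minute"), label)] += 1
                break
    return counts


# Hypothetical log lines, for illustration only.
CRAWL_SAMPLE = [
    '198.51.100.1 - - [13/Jan/2012:10:00:01 +0000] "GET /a HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '198.51.100.2 - - [13/Jan/2012:10:00:05 +0000] "GET /b HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '198.51.100.3 - - [13/Jan/2012:10:00:09 +0000] "GET /c HTTP/1.1" 503 100 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '198.51.100.4 - - [13/Jan/2012:10:01:02 +0000] "GET /d HTTP/1.1" 200 100 "-" "msnbot/2.0b"',
    '198.51.100.5 - - [13/Jan/2012:10:01:03 +0000] "GET /e HTTP/1.1" 200 100 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

if __name__ == "__main__":
    for (minute, label), hits in sorted(crawler_hits_per_minute(CRAWL_SAMPLE).items()):
        print(minute, label, hits)
```

A minute in which several crawlers each show dozens of hits is the pattern to look for.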

In order to fix our problem, we had to slow down these crawlers. Our first action was to add a crawl delay to our robots.txt file. Initially, we set this to 60 seconds.

Bing also offers crawl controls through its Webmaster Tools. In order to use them, we needed to sign in with a Windows Live ID. We did not have one, so we created one. That was very easy, and we were able to sign in to the site within minutes.

Next, we had to add our site to Bing Webmaster Tools. The Bing Webmaster home page has two sections. The first is for messages, and the second is for sites. We found the "Add Site" link and submitted our site's URL.

Unfortunately, it takes about 3 days before any statistics are displayed. So, we just waited.

Once we saw that Bing was crawling and indexing our pages, we were then able to reduce the crawl rate.

This was done by:

Signing into the Bing Webmaster Tools

Clicking on our site's URL listed in the Sites section.

That brought us to the Dashboard page.

At the top is a "Crawl" link, and we clicked on it.

The next page then provided a sub-menu.

We clicked on the "Crawl Settings" link and it brought us to a graphical "Crawl Rate" page.

We lowered our crawl rate to Minimum (by highlighting the boxes for each hour of the day).

And lastly, we pressed the "Save" link.

Within 2 days, the Bing, Yahoo, and Live crawlers were behaving properly, and all of our HTTP 5xx errors disappeared.

During this process, we learned five important things about crawlers:

The Google bot crawl rate is well behaved, and does not overwhelm your server

The Google crawler ignores the "Crawl-delay" directive in robots.txt

Bing only allows a maximum crawl-delay of 4 seconds

Once your site becomes large enough, the crawling bots can harm your site

Crawlers are tamable.

Note: To set a crawl delay in your robots.txt file, enter the two lines:

User-Agent: *
Crawl-delay: 4

at the top of the file.
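To check that a robots.txt file actually carries the crawl delay you intended, Python's standard urllib.robotparser module can parse it and report the value. This is only a sketch of how a well-behaved parser reads the file. Each crawler applies its own rules, and the "bingbot" agent name used here is an assumption for illustration.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_text, agent):
    """Parse robots.txt text and return the crawl delay seen by `agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.crawl_delay(agent)


# The directives on two separate lines, as described above.
ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 4
"""

if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "bingbot"))
```

The two directives must sit on separate lines. If they are run together on one line, this parser does not see a crawl delay at all and the function returns None.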

Even if you are not experiencing problems with your website, we suggest that you submit your site to Bing Webmaster Tools. Although the interface is slow, it provides a wide variety of information about your website which is a great complement to the Google Webmaster Tools.

2 comments:

It was our experience that the Bing Webmaster Tools were useless. Bing was crawling our site to such an extent that it was causing instability and crashes. Bing did not obey robots.txt, and every other "canned" solution provided for slowing the crawl was useless. Bing (and MSN bots) were slamming us with more than 40 bots crawling us at the same time. We finally had to resort to blocking all Bing IP ranges in htaccess. Within minutes, our site's stability returned. No, we don't show up in Bing's search results anymore, but between that and having a usable website, it was a no-brainer. I would also like to add that Microsoft was absolutely no help whatsoever. After the third identical "canned response", our decision became easier.