To crawl or not to crawl, that is BingBot’s question

If you are reading this column, there is a good chance you publish quality content to your web site, which you would like to get indexed by Bing. Usually, things go smoothly: BingBot visits your web site and indexes your content, which then appears in our search results and generates traffic to your site. You are happy, Bing is happy and the searcher is happy.

However, things do not always go so smoothly. Sometimes BingBot gets really excited about your quality content and ends up crawling your web site beyond all expectations, digging deeper and harder than you otherwise wanted. Sometimes you did everything you could to promote your quality content but BingBot still does not visit your site.

As much as robots.txt is the reference tool for controlling BingBot’s behavior, it is also a double-edged sword that may be interpreted in a way that disallows (or allows) much more than you initially thought. In this column, we will go through the most common robots.txt directives supported by Bing, highlighting a few of their pitfalls, as seen in real-life feedback over the past few months.

Where does BingBot look for my robots.txt file?

For a given page, BingBot looks for your robots.txt file at the root of the host serving that page. For example, to determine whether it is allowed to crawl a page on us.contoso.com (and at what rate), it fetches http://us.contoso.com/robots.txt.

Note that the host here is the full subdomain (us.contoso.com), not contoso.com or www.contoso.com. This means that if you have multiple subdomains, BingBot must be able to fetch a robots.txt file at the root of each one of them, even if all these files are identical. In particular, if a robots.txt file is missing from a subdomain, BingBot will not fall back to any other file on your domain; it will consider itself allowed to crawl anywhere on that subdomain. BingBot never “assumes” directives from a robots.txt file hosted elsewhere on the domain.
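As an illustrative sketch (the page path below is hypothetical), here is how the robots.txt location follows from a page URL, using Python's standard urllib — it always sits at the root of the exact host, subdomain included:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the exact host serving page_url."""
    parts = urlsplit(page_url)
    # The path, query, and fragment of the page are irrelevant:
    # robots.txt lives at the root of the host (subdomain included).
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://us.contoso.com/products/page.html"))
# http://us.contoso.com/robots.txt
```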

When does BingBot look for my robots.txt file?

Because it would cause a lot of unwanted traffic if BingBot tried to fetch your robots.txt file every single time it wanted to crawl a page on your website, it keeps your directives in memory for a few hours. Then, on an ongoing basis, it tries to fetch your robots.txt file again to see if anything changed.

This means that any change you put in your robots.txt file will be honored only after BingBot fetches the new version of the file, which could take a few hours if it was fetched recently.
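A minimal sketch of this caching behavior, assuming a hypothetical four-hour TTL (the article only says "a few hours") and a caller-supplied fetch function:

```python
import time
from urllib.robotparser import RobotFileParser

CACHE_TTL = 4 * 3600  # hypothetical "a few hours"; Bing does not publish the exact interval

class RobotsCache:
    """Keep parsed robots.txt directives in memory, refetching only after a TTL."""

    def __init__(self, fetch, ttl=CACHE_TTL):
        self._fetch = fetch      # fetch(host) -> robots.txt text (supplied by the caller)
        self._ttl = ttl
        self._cache = {}         # host -> (fetched_at, parser)

    def rules_for(self, host, now=None):
        now = time.time() if now is None else now
        hit = self._cache.get(host)
        if hit and now - hit[0] < self._ttl:
            return hit[1]        # still fresh: no extra traffic to the site
        rp = RobotFileParser()
        rp.modified()            # mark the parse time, else can_fetch() assumes nothing is allowed
        rp.parse(self._fetch(host).splitlines())
        self._cache[host] = (now, rp)
        return rp
```

Until the TTL expires, every crawl decision for that host is answered from memory, which is why an edited robots.txt file takes effect only after the next refetch.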

Which directives does BingBot honor?

If there is no specific set of directives for the bingbot or msnbot user agent, then BingBot will honor the default set of directives, defined with the wildcard user agent. For example:

User-Agent: *
Disallow: /useless_folder

In most cases, you want to tell all search engines the same thing: which URL paths they may crawl and which they may not. Maintaining a single default set of directives for all search engines is also less error-prone, and it is our recommendation.

What if I want to allow only BingBot?

In your robots.txt file, you can choose to define individual sections based on user agent. For example, if you want to authorize only BingBot while other crawlers are disallowed, you can do this by including the following directives in your robots.txt file:

User-Agent: *
Disallow: /

User-Agent: bingbot
Allow: /
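You can sanity-check a file like this with Python's standard urllib.robotparser. It only approximates Bing's exact user-agent matching, but it handles this simple case: bingbot matches its own section, and every other crawler falls through to the wildcard section.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /

User-agent: bingbot
Allow: /
"""

rp = RobotFileParser()
rp.modified()  # mark the fetch time, otherwise can_fetch() assumes nothing is allowed
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("bingbot", "http://www.contoso.com/page.html"))       # True
print(rp.can_fetch("SomeOtherBot", "http://www.contoso.com/page.html"))  # False
```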

A key rule to remember is that BingBot honors only one set of directives, picked in this order of priority:

1. The set defined for the bingbot user agent
2. The set defined for the msnbot user agent
3. The default set, defined for the wildcard (*) user agent

This rule has two main consequences in terms of what BingBot will be allowed to crawl (or not):

If you have a specific set of directives for the bingbot user agent, BingBot will ignore all the other directives in the robots.txt file. Therefore, if there is a default directive that should apply to BingBot as well, you must copy it to the bingbot section in order for BingBot to honor it.

If you have a specific set of directives for the msnbot user agent (but not for the bingbot user agent), BingBot will honor these. In particular, if you have old directives blocking MSNBot, you are also blocking BingBot altogether as a side effect. The most common example is:

User-agent: msnbot
Disallow: /
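The priority rule can be sketched as a simple selection function. The section layout here is a simplified assumption for illustration, not Bing's actual parser:

```python
def pick_section(sections):
    """Pick the single robots.txt section Bing honors, in priority order.

    `sections` maps a lowercased user-agent token to its list of directives.
    """
    for agent in ("bingbot", "msnbot", "*"):
        if agent in sections:
            return agent, sections[agent]
    return None, []  # no applicable section: everything is allowed

# An old file that blocks msnbot also blocks BingBot as a side effect:
sections = {"msnbot": ["Disallow: /"]}
print(pick_section(sections))  # ('msnbot', ['Disallow: /'])
```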

Does BingBot honor the Crawl-delay directive?

Yes, BingBot honors the Crawl-delay directive, whether it is defined in the most specific set of directives or in the default one – that is an important exception to the rule defined above. This directive allows you to throttle BingBot and set, indirectly, a cap to the number of pages it will crawl.

A common misconception is that Crawl-delay represents a crawl rate. It does not. Instead, it defines the size of a time window (from 1 to 30 seconds) during which BingBot will crawl your web site only once. For example, if your crawl delay is 5, BingBot slices the day into five-second windows and crawls at most one page in each of them, for a maximum of around 17,280 pages during the day.
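The window arithmetic works out as a simple back-of-the-envelope calculation:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def max_pages_per_day(crawl_delay: int) -> int:
    """Upper bound on pages crawled per day: at most one page per crawl-delay window."""
    return SECONDS_PER_DAY // crawl_delay

print(max_pages_per_day(5))   # 17280
print(max_pages_per_day(30))  # 2880
```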

This means the higher your crawl delay is, the fewer pages BingBot will crawl. As crawling fewer pages may result in less content getting indexed, we usually do not recommend it, although we also understand that different web sites have different bandwidth constraints.

Importantly, if your web site has several subdomains, each with its own robots.txt file defining a Crawl-delay directive, BingBot will manage each crawl delay separately. For example, if you have the following directive in both robots.txt files on us.contoso.com and www.contoso.com:

User-agent: *
Crawl-delay: 1

Then BingBot will be allowed to crawl one page at us.contoso.com and one page at www.contoso.com during each one-second window. Therefore, this is something you should take into account when setting the crawl delay value if you have several subdomains serving your content.
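One way to picture the per-host bookkeeping is a small throttle that tracks each host's window independently. This is a sketch, not Bing's actual scheduler:

```python
class PerHostThrottle:
    """Allow at most one fetch per crawl-delay window, tracked per host."""

    def __init__(self, crawl_delay: float):
        self.crawl_delay = crawl_delay
        self._last = {}  # host -> time of the last allowed fetch

    def may_fetch(self, host: str, now: float) -> bool:
        last = self._last.get(host)
        if last is not None and now - last < self.crawl_delay:
            return False  # still inside this host's window
        self._last[host] = now
        return True

t = PerHostThrottle(crawl_delay=1.0)
print(t.may_fetch("us.contoso.com", now=0.0))   # True
print(t.may_fetch("www.contoso.com", now=0.0))  # True  (separate window per host)
print(t.may_fetch("us.contoso.com", now=0.5))   # False (same window)
print(t.may_fetch("us.contoso.com", now=1.0))   # True  (next window)
```

Because each host gets its own window, two subdomains with Crawl-delay: 1 can each receive one page per second, doubling the total load on the underlying server.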

My robots.txt file looks good… what else should I know?

There are some other mechanisms available for you to control BingBot’s behavior. One of them is to define hourly crawl rates through the Bing Webmaster Tools (see the Crawl Settings section). This is particularly useful when your traffic is very cyclical during the day and you would like BingBot to visit your web site more outside of peak hours. By adjusting the graph up or down, you apply a positive or negative factor to the crawl rate automatically determined by BingBot, fine-tuning the crawl activity to be more or less intensive at a given time of the day, all controlled by you. It is important to note that a crawl delay set in your robots.txt file will override the settings made in Bing Webmaster Tools, so plan carefully to ensure you are not sending BingBot contradictory signals.

Give us Bing Site Explorer, it will really enhance all webmasters' interest. You guys have all the data; all you need to do is share it.

3 years ago

Steve876

Very useful article. I never knew this.

3 years ago

heripanusunan

I think only high-ranking sites will appear in Bing search results.

I got an email from Microsoft customer support:

"Please note that the indexed pages reported in Bing Webmaster Tools may sometimes differ from what is shown in the Bing SERP. This is because only the high-ranking pages are actually shown in Bing, not all of the indexed pages reported."

….???????

3 years ago

Haqy

I submitted my sitemap but its status is "Pending" and it is still waiting for approval. How long does it take for a sitemap to be approved?

2 years ago

LarkB

Does BingBot honor the Crawl-delay directive?

The answer should be no, they do not, nor do they obey robots.txt. We tried everything to slow Bing's bots down on our site; nothing worked, and emails to support got the typical MS "canned response" telling us to try the things we had already done and had described in the letter we sent.

We finally had to resort to blocking all of Bing's IP ranges in .htaccess to keep from getting slammed by Bing with in excess of 30 bots crawling at once. This was causing our site to become unstable and unresponsive. As soon as they were blocked, things returned to normal.

2 years ago

jsxubar

Disappointed with this article. All the content is about robots.txt and a little about crawl control in Bing Webmaster Tools. We as webmasters want to know how to measure a website's quality, and under what circumstances Bing will be more likely to index its content. Regarding crawl control: when I set it up, I can't set the crawl speed during 0am – 8am, which is my website's non-busy time, so crawl control is of no use to me. Maybe Google's automatic crawl-speed control is easier and more reasonable for webmasters.

2 years ago

A Brand

Bing crawls a page that has already been removed from our entire website! The Bing sitemap points to the removed page. It is very frustrating. I ran diagnostics and it says the page cannot be found and that the <H1> tag is missing.

I called my web developer and he asked for my Webmaster Tools password so he can look into it. He doesn't think there is any encoding error.

How can I tell Bing NOT to crawl a page that has already been removed from my site?

Please help.

2 years ago

Duane Forrester

@A Brand – if you want to ensure the page is not shown by Bing, you need to make sure when we call the URL, we see a 404 message. That will start the removal of the page from our index. Inside Webmaster Tools you can also flag the URL to tell us to remove it (secondary to the 404 option) and to tell us to not show the cached version we have on our servers.

If our tool is saying there is no <H1> tag, then it's not visible to us. It could still be in the code, but if it's hidden inside something our tool cannot crawl through, we won't see it – which is the same as it not being there at all.

2 years ago

SkyMogul

We have had the same experience as LarkB. BingBot does appear to do a better job of respecting Crawl-Delay than before. It does not appear to possess IP level intelligence though. We host many sites per IP using host headers, as many providers do. BingBot often launches devastating crawler swarms as he describes, attempting to crawl all the sites on a given IP in parallel. It would be nice if BingBot was aware of how much traffic it was throwing at a given IP or class C, self-limiting the number of sites being crawled simultaneously on a particular address or in a particular IP range.

2 years ago

screwnicolas

If I set the crawl rate at a faster level, is it then guaranteed that the site will be crawled more, and will that put more load on the server?