Does Crawlbot respect the robots.txt protocol?

Yes, by default Crawlbot adheres to a site’s robots.txt instructions, including the disallow and crawl-delay directives.
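
As a rough illustration of what these two directives mean to a crawler, the sketch below uses Python's standard `urllib.robotparser` module to evaluate a made-up robots.txt file. It is not Crawlbot's own implementation. The rules, the `example.com` URLs, and the `evaluate` helper are assumptions chosen for the example; only the "Diffbot" user-agent name comes from this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def evaluate(user_agent, url):
    """Return (allowed, crawl_delay_seconds) for the given agent and URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    # Check one path outside and one path inside the disallowed prefix.
    for path in ("/public/page", "/private/page"):
        allowed, delay = evaluate("Diffbot", "https://example.com" + path)
        print(path, "allowed" if allowed else "disallowed", "delay:", delay)
```

With these rules, a crawler that follows the protocol would skip anything under `/private/` and wait the stated number of seconds between requests to the rest of the site.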

In specific cases, typically when you have a partnership or agreement with the site being crawled, the robots.txt instructions can be overridden. This is often faster than waiting for the third-party site to update its robots.txt file.

To whitelist Crawlbot for a site, specify the “Diffbot” user-agent in the site’s robots.txt: