- What is Common Crawl?
Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.
- What can you do with a copy of the web?
The possibilities are endless, but people have used the data to improve language translation software, predict trends, track disease propagation, and much more.
- Can’t Google or Microsoft just do that?
Our goal is to democratize the data so everyone, not just big companies, can do high quality research and analysis.
- What terms is the data released under?
- What is the CCBot crawler?
CCBot is a Nutch-based web crawler that makes use of the Apache Hadoop project. We use Map-Reduce to process and extract crawl candidates from our crawl database. This candidate list is sorted by host (domain name) and then distributed to a set of spider (bot) servers.
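As a rough illustration of the "sort by host, then distribute" step, here is a minimal Python sketch. It is not the actual Hadoop job: the function name `partition_by_host`, the use of a CRC32 hash to pick a server, and the `num_servers` parameter are all assumptions made for the example. The answer above only says that candidates are sorted by host and handed to a set of spider servers.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, num_servers):
    """Group candidate URLs by host, then assign each host to one server."""
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        by_host[host].append(url)

    # One bucket per (hypothetical) spider server.
    buckets = [[] for _ in range(num_servers)]
    for host in sorted(by_host):
        # Assumption: hash the host name to choose a server. crc32 is stable
        # across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(host.encode("utf-8")) % num_servers
        buckets[index].extend(by_host[host])
    return buckets
```

Keeping every URL for a host on the same server is what would let a single machine enforce per-host rate limits without coordinating with the others.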
- How does the bot identify itself?
Our older bot identified itself with the following User-Agent string: CCBot/1.0 (+http://www.commoncrawl.org/bot.html). The current version identifies itself as CCBot/2.0.
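If you want to spot the bot in your server logs, a small check on the User-Agent header may be enough. The sketch below is an assumption-laden example: the helper name `ccbot_version` and the regular expression are ours, and it relies only on the `CCBot/<version>` prefix shown above. Keep in mind that any client can send any User-Agent string, so a match is a hint rather than proof.

```python
import re

# Matches the "CCBot/<version>" prefix, e.g. "CCBot/1.0" or "CCBot/2.0".
_CCBOT_PATTERN = re.compile(r"^CCBot/(\d+(?:\.\d+)*)")


def ccbot_version(user_agent):
    """Return the CCBot version string, or None if this is not CCBot."""
    match = _CCBOT_PATTERN.match(user_agent or "")
    return match.group(1) if match else None
```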
- Will your bot make my website slow for other users?
The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain. We have taken great care to ensure that our crawler will never cause web servers to slow down or become inaccessible to other users.
The crawler uses an adaptive back-off algorithm that rapidly slows down requests to your website if your web server is responding slowly. Our crawler will request up to 2 pages per second if your web server completely responds to the last three requests in under 250 ms.
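The back-off behavior can be pictured with a short Python sketch. This is not CCBot's implementation. The 250 ms threshold, the three-request window, and the 2 pages per second ceiling come from the answer above. The class name `AdaptiveDelay`, the doubling rule, the starting delay, and the 60-second cap are assumptions chosen for illustration.

```python
from collections import deque


class AdaptiveDelay:
    """Pick the wait before the next request from recent response times."""

    def __init__(self, fast_threshold=0.250, window=3,
                 min_delay=0.5, initial_delay=1.0, max_delay=60.0):
        self.fast_threshold = fast_threshold  # 250 ms, from the text
        self.min_delay = min_delay            # 0.5 s gap = 2 pages per second
        self.max_delay = max_delay            # assumed ceiling
        self.delay = initial_delay            # assumed starting point
        self.recent = deque(maxlen=window)    # last three response times

    def record(self, response_seconds):
        """Record one response time and return the updated delay in seconds."""
        self.recent.append(response_seconds)
        window_full = len(self.recent) == self.recent.maxlen
        all_fast = window_full and all(
            t < self.fast_threshold for t in self.recent
        )
        if all_fast:
            # Every recent response was fast: crawl at the maximum rate.
            self.delay = self.min_delay
        elif response_seconds >= self.fast_threshold:
            # Latest response was slow: back off quickly (assumed doubling).
            self.delay = min(self.delay * 2, self.max_delay)
        # Otherwise keep the current delay until the window is all fast again.
        return self.delay
```

With these assumed settings, a single slow response doubles the delay, and the crawler returns to full speed only after three consecutive fast responses.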
- How can I ask for a slower crawl if the bot is taking up too much bandwidth?
We obey the Crawl-delay parameter in robots.txt. By increasing that number, you tell CCBot to slow down its rate of crawling. For instance, to keep our crawler from requesting pages more than once every 2 seconds, add the following to your robots.txt file:
User-agent: CCBot
Crawl-Delay: 2
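One way to check that a Crawl-delay rule parses as you intend is to run it through a robots.txt parser before deploying it. The sketch below uses Python's standard `urllib.robotparser`, which is not CCBot's own parser, so it only approximates how the crawler reads the file. The helper name `crawl_delay_for` is ours.

```python
from urllib.robotparser import RobotFileParser

# Each directive must sit on its own line in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-Delay: 2
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "CCBot"))
```

A result of `None` means the parser found no delay for that user agent, which is usually a sign that the two directives ended up on the same line or in the wrong group.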
- How can I block this bot?
You can block the crawler through your robots.txt file, which uses the Robots Exclusion Protocol. Our bot’s Exclusion User-Agent string is: CCBot. Add these lines to your robots.txt file and our crawler will stop crawling your website:
User-agent: CCBot
Disallow: /
We will continue to check periodically whether the robots.txt file has been updated.
- How can I ensure this bot can crawl my site effectively?
We are working hard to add features to the crawl system and hope to support the sitemap protocol in the future.
- Does the bot support conditional gets/compression?
We do support conditional get requests. We also currently support the gzip encoding format.
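For readers unfamiliar with these two HTTP features, here is a minimal client-side sketch in Python. It shows the request headers involved in a conditional GET (`If-Modified-Since`, `If-None-Match`) and in gzip negotiation (`Accept-Encoding`). It illustrates the general mechanism only, not CCBot's code, and the helper names `build_conditional_request` and `decode_body` are hypothetical.

```python
import gzip
import urllib.request


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET request that asks for gzip and skips unchanged content."""
    headers = {"Accept-Encoding": "gzip"}
    if last_modified:
        # Server may answer 304 Not Modified if the page is unchanged.
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return urllib.request.Request(url, headers=headers)


def decode_body(raw, content_encoding):
    """Decompress the response body if the server sent it gzip-encoded."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw
```

A server that honors the validators can reply with a bodyless 304 response, which saves bandwidth on both sides. Gzip reduces transfer size for the pages that have changed.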
- Why is the bot crawling pages I don’t have links to?
The bot may have found your pages by following links from other sites.
- What is the IP range of the bot?
Older versions used the IPs 126.96.36.199 through 188.8.131.52. The current version crawls from Amazon AWS.
- Does the bot support nofollow?
Currently, we do honor the nofollow attribute as it applies to links embedded on your site. Note that the nofollow attribute value is not meant for blocking access to content or for preventing content from being indexed by search engines. Instead, site authors primarily use nofollow to prevent search engines such as Google from letting the source page’s PageRank affect the PageRank of linked targets. If we ever did ignore nofollow in the future, we would do so only for the purposes of link discovery and would never create any association between the discovered link and the source document.
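To make the link-level behavior concrete, the following Python sketch separates links marked `rel="nofollow"` from the rest using the standard `html.parser` module. It is an illustration of the attribute's meaning, not CCBot's link extractor. The names `LinkCollector` and `split_links` are assumptions.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect <a href> targets, split by whether rel contains nofollow."""

    def __init__(self):
        super().__init__()
        self.followable = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel is a space-separated, case-insensitive token list.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followable.append(href)


def split_links(html):
    """Return (followable_links, nofollow_links) found in the HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followable, collector.nofollow
```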
- What parts of robots.txt does the bot support?
We support Disallow as well as Disallow / Allow combinations. We also support the crawl-delay directive. We plan to support the sitemap directive in a future release.
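A Disallow / Allow combination can be tested locally before you publish it. The sketch below uses Python's standard `urllib.robotparser` as a stand-in. It is not CCBot's parser, and parsers can differ in how they resolve overlapping rules, so the example lists the more specific Allow rule first, an ordering that first-match and longest-match parsers should read the same way. The paths and the helper name `is_allowed` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: CCBot may fetch /public/ and nothing else.
ROBOTS_TXT = """\
User-agent: CCBot
Allow: /public/
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if this robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For instance, `is_allowed(ROBOTS_TXT, "CCBot", "http://example.com/private/page.html")` should come back false under this policy, while the same path under `/public/` should be allowed.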
- What robots meta tags does the bot support?
We support the NOFOLLOW meta-tag.
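The page-level tag looks like `<meta name="robots" content="nofollow">`. As a hedged illustration of how a crawler might detect it, the Python sketch below reads the robots meta tag with the standard `html.parser` module. The names `RobotsMetaParser` and `page_is_nofollow` are ours, and the code makes no claim about how CCBot itself parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        content = attr_map.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def page_is_nofollow(html):
    """Return True if the page asks robots not to follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "nofollow" in parser.directives
```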
- What do you do with the crawled content?
The crawl data is stored on Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for map-reduce processing in EC2.