Question 1

What is Common Crawl?

Accepted Answer

Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the Internet to Internet researchers, companies and individuals at no cost for the purpose of research and analysis.

Question 2

What can you do with Common Crawl data?

Accepted Answer

The possibilities are endless. People have used the data to improve language translation software, predict trends, track disease propagation, and much more.

The crawl data is stored on Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for Map-Reduce processing in EC2.

Question 3

Can't Google or Microsoft just do what Common Crawl does?

Accepted Answer

Our goal is to democratize the data so that everyone, not just big companies, can do high-quality research and analysis.

Question 4

Under what terms is Common Crawl data released?

Accepted Answer

As strong believers in Open Data, we apply as few restrictions as possible to the dataset.

The terms we add (primarily in an effort to prevent abusive or illegal usage) are described on our Terms of Use page.

Question 5

What is the Common Crawl CCBot crawler?

Accepted Answer

CCBot is a Nutch-based web crawler that makes use of the Apache Hadoop project.

We use Map-Reduce to process and extract crawl candidates from our crawl database.

This candidate list is sorted by host (domain name) and then distributed to a set of crawler servers.

Question 6

How does the Common Crawl CCBot identify itself?

Accepted Answer

CCBot identifies itself via its UserAgent string as:

CCBot/2.0 (https://commoncrawl.org/faq/)

Our older bot identified itself with the UserAgent string:

CCBot/1.0 (+https://commoncrawl.org/bot.html)

We may increment the version number in the future.
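For log analysis, the two strings above can be recognized with a small pattern match. This is only a sketch: the function name `ccbot_version` is made up for illustration, and the pattern assumes future versions keep the `CCBot/<major>.<minor>` shape, which the FAQ does not promise. A matching UserAgent string alone does not prove a request came from CCBot, since the header can be set by any client.

```python
import re

# Matches the product token in both strings quoted above. The version is
# left open because the FAQ says it may be incremented (assumed format).
CCBOT_TOKEN = re.compile(r"\bCCBot/(\d+)\.(\d+)\b")


def ccbot_version(user_agent):
    """Return (major, minor) if the header names CCBot, otherwise None."""
    match = CCBOT_TOKEN.search(user_agent)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(ccbot_version("CCBot/2.0 (https://commoncrawl.org/faq/)"))
print(ccbot_version("Mozilla/5.0 (X11; Linux x86_64)"))
```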

Question 7

How does CCBot fetch a web page?

Accepted Answer

CCBot is an automated crawler. It first checks robots.txt and, if crawling a page is allowed, fetches the page using an HTTP GET request.

It supports both HTTP/1.1 and HTTP/2, the latter only over TLS (https://). Connections over IPv4 and IPv6 are supported.

CCBot follows up to four consecutive HTTP redirects, or up to five when fetching robots.txt, in line with RFC 9309. Currently, JavaScript is not executed and cookies are not used.
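The redirect limits can be illustrated with a short sketch. It is not CCBot's code: `follow_redirects` and the `fetch` callable are hypothetical, and only the two limits (four and five) come from the answer above. `fetch(url)` is assumed to return a `(status, location)` pair so the logic can be tried without network access.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = {301, 302, 303, 307, 308}
MAX_PAGE_REDIRECTS = 4    # limit stated in the FAQ for ordinary pages
MAX_ROBOTS_REDIRECTS = 5  # limit stated in the FAQ for robots.txt


def redirect_limit(url):
    """Pick the limit based on whether the starting URL is /robots.txt."""
    is_robots = urlsplit(url).path == "/robots.txt"
    return MAX_ROBOTS_REDIRECTS if is_robots else MAX_PAGE_REDIRECTS


def follow_redirects(url, fetch):
    """Follow redirects starting at url.

    Returns (final_url, status), or (last_url, None) if the chain is
    longer than the limit. fetch is a hypothetical callable.
    """
    limit = redirect_limit(url)
    followed = 0
    while True:
        status, location = fetch(url)
        if status not in REDIRECT_STATUSES or not location:
            return url, status
        if followed == limit:
            return url, None  # give up: too many consecutive redirects
        url = urljoin(url, location)  # resolve relative Location values
        followed += 1
```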

Question 8

Will the Common Crawl CCBot make my website slow for other users?

Accepted Answer

The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.

We have taken great care to ensure that our crawler will never cause web servers to slow down or be inaccessible to other users.

The crawler uses an adaptive back-off algorithm that slows down requests to your website if your web server is responding with an HTTP 429 or 5xx status. By default our crawler waits a few seconds before sending the next request to the same site.
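The general idea of an adaptive back-off can be sketched as follows. The answer above does not describe the actual algorithm, so everything here is an assumption made for illustration: the class name `AdaptiveDelay`, the doubling factor, and the base and maximum delays.

```python
class AdaptiveDelay:
    """Per-site delay that grows on HTTP 429/5xx and shrinks on success.

    All numbers are illustrative assumptions, not CCBot's settings.
    """

    def __init__(self, base=5.0, factor=2.0, maximum=600.0):
        self.base = base
        self.factor = factor
        self.maximum = maximum
        self.current = base

    def update(self, status):
        """Record a response status and return the next delay in seconds."""
        if status == 429 or 500 <= status <= 599:
            # Server signals overload: back off, up to the maximum.
            self.current = min(self.current * self.factor, self.maximum)
        else:
            # Server is healthy: relax gradually, never below the base.
            self.current = max(self.current / self.factor, self.base)
        return self.current
```

A crawler loop would call `update(status)` after each response and wait that many seconds before the next request to the same site.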

Question 9

How can I ask for a slower crawl if the Common Crawl CCBot is taking up too much bandwidth?

Accepted Answer

We obey the Crawl-delay directive in robots.txt. Increasing that number tells CCBot to slow down its rate of crawling.
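A robots.txt rule of this kind can be checked locally before it is deployed. The sketch below uses Python's standard `urllib.robotparser` to read an example file; the value of 10 seconds is an arbitrary assumption, and the parser here is Python's, not the one CCBot uses.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the delay value is an arbitrary assumption.
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay that applies to CCBot, in seconds.
ccbot_delay = parser.crawl_delay("CCBot")
# No group matches other crawlers here, so no delay is defined for them.
other_delay = parser.crawl_delay("SomeOtherBot")
print(ccbot_delay, other_delay)
```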

Question 10

How can I block the Common Crawl CCBot?

Accepted Answer

You can configure your robots.txt file, which uses the Robots Exclusion Protocol, to block the crawler. Our bot's exclusion UserAgent string is: CCBot.
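As a sketch of what such a rule might look like, the example below blocks CCBot from an entire site and checks the result with Python's standard `urllib.robotparser`. The `example.com` URL is a placeholder, and the check reflects how Python's parser reads the file, which is assumed to agree with CCBot for a rule this simple.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content that disallows the whole site for CCBot only.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/some/page.html"  # placeholder URL
ccbot_allowed = parser.can_fetch("CCBot/2.0 (https://commoncrawl.org/faq/)", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)
print(ccbot_allowed, other_allowed)
```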

Question 11

Can I add my website to Common Crawl?

Accepted Answer

Common Crawl's dataset is a sample of the web: we generally do not archive an entire website, only a randomly selected subset of its pages. Our crawler supports the Sitemap Protocol and utilizes any Sitemap announced in the robots.txt file. If our crawler visits your website, a Sitemap helps it crawl the site more effectively; you can learn more about setting one up at https://www.sitemaps.org/

Question 12

Why am I getting connection errors or 5xx responses from index.commoncrawl.org?

Accepted Answer

Our CDX API endpoint is frequently abused and therefore heavily rate limited. If your client sends too many requests in a short period of time, your IP address may be temporarily blocked.

To avoid connection issues, always use HTTPS (https://index.commoncrawl.org). HTTP connections are not supported and may fail with browser or client errors.

If you receive HTTP 503 responses, please slow down your request rate. These are a sign you've exceeded the acceptable request rate. If your IP is temporarily blocked, please wait 24 hours before trying again.

Please sleep between calls to our API (including if you run your script repeatedly in a loop), don't run multiple threads at once on the same IP, and don't use proxy networks. You should also ensure that you are using a properly formulated UserAgent string (see RFC 9110).

You may wish to use our columnar index via Amazon Athena or Apache Spark if your query involves broad or large-scale filtering. These tools are better suited to high-volume access patterns and provide more flexibility for complex queries.

We also provide an official downloader client, cc-downloader, which is robust and polite. The cdx-toolkit project also serves as an example of good practices and politeness.

Please refer to https://status.commoncrawl.org/ to see the current load our systems are under.
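The advice above on sleeping between calls, slowing down on HTTP 503, and sending a proper UserAgent string can be sketched in a few lines of standard-library Python. This is an illustration, not an official client: the function names, the UserAgent value, the 2-second pause, and the retry count are all assumptions, and the sketch does not handle the 24-hour wait after an IP block.

```python
import time
import urllib.error
import urllib.request

# Hypothetical identifier; replace with your own project name and contact.
USER_AGENT = "example-research-script/0.1 (contact@example.org)"


def http_get(url):
    """Perform one GET with an explicit User-Agent; return (status, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        return error.code, b""


def polite_get(url, fetch=http_get, pause=2.0, max_attempts=4, sleep=time.sleep):
    """Fetch url from a single thread, pausing after every call.

    On HTTP 503 the wait doubles before each retry; other statuses are
    returned after the normal pause.
    """
    delay = pause * 2
    status, body = None, b""
    for _ in range(max_attempts):
        status, body = fetch(url)
        if status != 503:
            sleep(pause)  # space out the caller's next request
            return status, body
        sleep(delay)  # the server asked us to slow down
        delay *= 2
    return status, body
```

Calling `polite_get` sequentially for each query, with an https:// URL, keeps a script to one request at a time per IP address.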

Question 13

How can I ensure the Common Crawl CCBot can crawl my site effectively?

Accepted Answer

The crawler supports the Sitemap Protocol and utilizes any Sitemap announced in the robots.txt file.
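To confirm that a Sitemap is announced in a way parsers can read, a robots.txt file can be checked locally. The sketch below uses `site_maps()` from Python's standard `urllib.robotparser` (available since Python 3.8); the `example.com` URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content announcing one Sitemap (placeholder URL).
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# List of announced Sitemap URLs, or None when the file declares none.
sitemaps = parser.site_maps()
print(sitemaps)
```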

Question 14

Does the Common Crawl CCBot support conditional GET and/or compression?

Accepted Answer

We do support conditional GET requests. We also currently support the gzip, Brotli, and Zstandard encoding formats.
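For readers unfamiliar with these features, the sketch below shows the request headers a client typically sends for a conditional GET with compressed responses. It illustrates the HTTP mechanism only; the exact headers CCBot sends are not stated in this FAQ, and the helper names are hypothetical. The tokens `gzip`, `br`, and `zstd` are the registered content-coding names for the three formats mentioned above.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build headers for a conditional GET that accepts compressed bodies.

    etag and last_modified are validators saved from an earlier response.
    """
    headers = {"Accept-Encoding": "gzip, br, zstd"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def is_unchanged(status):
    """A 304 Not Modified response means the stored copy is still valid."""
    return status == 304


print(conditional_headers(etag='"abc123"'))
```

A server that honors these headers can answer with 304 and an empty body when the page has not changed, which saves bandwidth on both sides.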

Question 15

Why is the Common Crawl CCBot crawling pages I don't have links to?

Accepted Answer

The bot may have found your pages by following links from other sites.

Question 16

What is the IP range of the Common Crawl CCBot?

Accepted Answer

Our crawler is now run on dedicated IP address ranges with reverse DNS. This allows webmasters to verify whether a logged request stems from CCBot.
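One common way to use reverse DNS for verification is a forward-confirmed lookup: resolve the logged IP address to a hostname, check the hostname's domain, then resolve the hostname back and confirm it includes the same IP address. The sketch below assumes this approach. The hostname suffix is a required parameter because this FAQ does not state which domain CCBot's hosts use; `crawler.example.org` in the usage note is a placeholder, and the function names are hypothetical.

```python
import ipaddress
import socket


def reverse_lookup(ip):
    """Return the hostname that the IP address points to (PTR record)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Return the set of IPv4/IPv6 addresses the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, allowed_suffix,
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for one logged IP address."""
    try:
        target = ipaddress.ip_address(ip)
        hostname = reverse(ip).rstrip(".").lower()
        # Require a dot before the suffix so look-alike domains fail.
        if not hostname.endswith("." + allowed_suffix.strip(".").lower()):
            return False
        resolved = {ipaddress.ip_address(a) for a in forward(hostname)}
    except (OSError, ValueError):
        return False  # lookup failed or an address could not be parsed
    return target in resolved
```

For example, `is_verified_crawler("192.0.2.10", "crawler.example.org")` would perform both lookups against live DNS; the `reverse` and `forward` parameters exist so the logic can be tested without network access.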

Question 17

Does the Common Crawl CCBot support nofollow?

Accepted Answer

We currently honor the nofollow attribute as it applies to links embedded on your site.

It should be noted that this attribute value is not meant for blocking access to content or preventing content from being indexed by search engines; instead, it is primarily used by site authors to prevent search engines such as Google from having the source page’s PageRank impact the PageRank of linked targets.

If we ever did ignore nofollow in the future, we would do so only for the purposes of link discovery and would never create any association between the discovered link and the source document.
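What honoring nofollow means for link extraction can be sketched with Python's standard `html.parser`. This is not CCBot's parser: the class name `LinkExtractor` is hypothetical, and the sketch only looks at the `rel` attribute of `<a>` elements.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect <a href> targets, keeping rel="nofollow" links separate."""

    def __init__(self):
        super().__init__()
        self.followable = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel holds space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followable.append(href)


extractor = LinkExtractor()
extractor.feed('<a href="/a">A</a> <a rel="nofollow noopener" href="/b">B</a>')
print(extractor.followable, extractor.nofollow)
```

A crawler that honors the attribute would add only the `followable` list to its set of crawl candidates.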
