*Important news* for users of Common Crawl data: we are introducing CloudFront as a new way to access Common Crawl data as part of the Registry of Open Data on AWS.

Ten years ago(!) Common Crawl joined AWS’s Open Data Sponsorships program, hosted on S3, with free access to everyone. Since then, the dataset has expanded (by petabytes!) and our community of users has seen extraordinary growth. With growth comes change, therefore…

The following are new measures to accommodate the number and volume of requests for Common Crawl data, and to ensure efficient and stable access to the data:

1. Common Crawl data is now available on CloudFront under the URL base prefixes*:

  • https://data.commoncrawl.org/ or
  • https://ds5q9oxwqwsfj.cloudfront.net/

* Data is accessible via http:// or https://.

As of Monday, April 4, 2022:

2. Access to data from the Amazon cloud using the S3 API will be restricted to authenticated AWS users, and unsigned access to s3://commoncrawl/ will be disabled. See Q&A for further details.

3. To access data from outside the Amazon cloud, via HTTP(S), a new URL prefix – https://data.commoncrawl.org/ – must be used instead of https://commoncrawl.s3.amazonaws.com/.
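For download scripts that still reference the old S3 HTTP(S) prefix, migration amounts to swapping the base URL. A minimal Python sketch (the helper name `rewrite_url` is ours, not part of any official tooling):

```python
# Sketch: map an old-style S3 HTTP(S) URL to the new CloudFront base URL.
OLD_PREFIX = "https://commoncrawl.s3.amazonaws.com/"
NEW_PREFIX = "https://data.commoncrawl.org/"

def rewrite_url(url: str) -> str:
    """Swap the retired S3 prefix for the CloudFront prefix, if present."""
    if url.startswith(OLD_PREFIX):
        return NEW_PREFIX + url[len(OLD_PREFIX):]
    return url

print(rewrite_url("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2022-05/warc.paths.gz"))
# → https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/warc.paths.gz
```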

Q & A

Q: How can I identify whether my code is using unauthenticated S3 access?
A: Authenticated access is the default, but you should verify that your code is not configured otherwise; code fragments requesting unauthenticated access include (but are not limited to):

  • AWS CLI with the command-line option --no-sign-request:
    aws --no-sign-request s3 cp s3://commoncrawl/...
  • Python using boto3 and botocore.UNSIGNED:
    import boto3
    import botocore
    s3client = boto3.client('s3', config=botocore.client.Config(signature_version=botocore.UNSIGNED))
  • Hadoop or Spark (various programming languages): usage of AnonymousAWSCredentialsProvider
Q: Where can I find up-to-date documentation and examples that are consistent with these new protocols?
A: We will update our documentation and examples to read data either via HTTP(S) using the new CloudFront access, or via S3 and authenticated access. We will also update all data download links. It will take a few days until everything is updated. Thanks for your patience!
Q: Are range requests supported?
A: Yes, CloudFront supports range requests just as S3 does; see the CloudFront documentation about Range GETs.
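As an illustration, a range request against the CloudFront endpoint can be issued with the Python standard library alone. The helpers below are a sketch, not official Common Crawl tooling:

```python
from urllib.request import Request, urlopen

def range_request(url: str, start: int, end: int) -> Request:
    """Build a request for bytes [start, end] (inclusive) of a remote file."""
    return Request(url, headers={"Range": f"bytes={start}-{end}"})

def fetch_range(url: str, start: int, end: int) -> bytes:
    """Fetch a byte range; CloudFront replies 206 Partial Content when the
    range is honored."""
    with urlopen(range_request(url, start, end)) as resp:
        return resp.read()

# Example (hits the network): the first kilobyte of a path listing.
# head = fetch_range(
#     "https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/warc.paths.gz",
#     0, 1023)
```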
Q: What is the recommended access method from within AWS but in a different region (not us-east-1)?
A: We recommend that you run your computing workload in the same region (us-east-1) as the Common Crawl dataset whenever possible. If you have a specific, ongoing need to run computing workloads using Common Crawl in another AWS region, the AWS Open Data team would like to hear more about your use case ([email protected]).
Q: What are typical error messages indicating that unauthenticated access is used?
A: The HTTP response status code is 403 Forbidden. Note, however, that a restrictive IAM policy on the user’s side could also deny access to s3://commoncrawl/ via the S3 API.
Two examples of error messages related to unauthenticated access to s3://commoncrawl/:

> curl https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2022-05/warc.paths.gz
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message>...</Error>

> aws --no-sign-request s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2022-05/warc.paths.gz .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Common Crawl on AWS Public Data Sets

Amazon Web Services

Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services’ Public Data Sets. This is great news because it means that the Common Crawl data corpus is now much more readily accessible and visible to the public. This greater accessibility and visibility are a significant help in our mission of enabling a new wave of innovation, education, and research.

Amazon Web Services (AWS) provides a centralized repository of public data sets that can be integrated into AWS cloud-based applications. AWS makes available such significant large data sets as the mapping of the Human Genome and the US Census. Previously, such data was often prohibitively difficult to access and use. With the Amazon Elastic Compute Cloud, it takes a matter of minutes to begin computing on the data.

Demonstrating their commitment to an open web, AWS hosts public data sets at no charge for the community, so users pay only for the compute and storage they use for their own applications. What this means for you is that our data – all 5 billion web pages of it – just got a whole lot slicker and easier to use.

We greatly appreciate Amazon’s support for the open web in general, and we’re especially appreciative of their support for Common Crawl. Placing our data in the public data sets not only benefits the larger community; it also saves us money, which is crucial for a nonprofit in the early phases of its existence.

A huge thanks to Amazon for seeing the importance in the work we do and for so generously supporting our shared goal of enabling increased open innovation!