Crawl data is free to access by anyone from anywhere. The data is hosted by Amazon Web Services’ Open Data Sets Sponsorships program on the bucket s3://commoncrawl/, located in the US-East-1 (Northern Virginia) AWS Region. You may process the data in the AWS cloud or download it for free over HTTP with a good Internet connection.
The URLs to access a given data file are composed as follows:
- if using S3
s3://commoncrawl/path_to_file
- if using HTTPS
https://data.commoncrawl.org/path_to_file
or
https://ds5q9oxwqwsfj.cloudfront.net/path_to_file
Note: the first HTTPS URL (https://data.commoncrawl.org/) is an easy-to-remember CNAME pointing to the same CloudFront distribution as the second.
Whenever Common Crawl lists a file path on this website or in the announcement of new crawl data, we always give the full path that follows s3://commoncrawl/ (when using S3) or https://data.commoncrawl.org/ (when using HTTPS).
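For example, the WARC file used in the download examples further below can be reached under either of the following URLs:
s3://commoncrawl/crawl-data/CC-MAIN-2022-05/segments/1642320299852.23/warc/CC-MAIN-20220116093137-20220116123137-00000.warc.gz
https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/segments/1642320299852.23/warc/CC-MAIN-20220116093137-20220116123137-00000.warc.gz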
Accessing the Data Using S3 in the AWS Cloud
You may use the AWS Command Line Interface, but many AWS services (e.g., EMR) support the s3:// protocol/scheme directly, so you can simply specify your input as s3://commoncrawl/path_to_file. On Hadoop (outside EMR) it is recommended to use the S3A connector: just change the scheme to s3a://.
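As a rough sketch, assuming a Hadoop setup whose S3A connector is configured with valid AWS credentials, a Common Crawl path can then be listed or copied like any other s3a:// location (the HDFS target directory hdfs:///user/me/cc/ is just an illustration):
hadoop fs -ls s3a://commoncrawl/crawl-data/CC-MAIN-2022-05/
hadoop distcp s3a://commoncrawl/crawl-data/CC-MAIN-2022-05/segments/1642320299852.23/warc/CC-MAIN-20220116093137-20220116123137-00000.warc.gz hdfs:///user/me/cc/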
Two important notes about accessing the data using the S3 API:
- We strongly recommend accessing the data from the region where it is located (us-east-1 / Northern Virginia), i.e., running your computing workload in the us-east-1 region. If you need to run your workload outside us-east-1, please access the data using HTTP/HTTPS instead.
- Since April 2022, access to s3://commoncrawl/ via the S3 API is restricted to authenticated AWS users. However, even without an AWS account you can continue to use the data over HTTP/HTTPS; see the instructions below.
Accessing Common Crawl Data Using HTTP/HTTPS
If you want to download the data to your local machine or local cluster, you may use any HTTP download agent, as per the instructions below. It is not necessary to create an AWS account to access the data using HTTP/HTTPS.
File Listings
Listing paths or files in s3://commoncrawl/ for a given prefix (or “sub-directory”) is only possible using the S3 API, which requires an AWS account. We provide lists of file paths for all crawls and other data sets; see the links and instructions in the data set release notes. The listings can be used to fetch the data using S3 or HTTP by prepending s3://commoncrawl/ or https://data.commoncrawl.org/ to the file path.
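For example, to download the first WARC file of the January 2022 crawl via HTTPS without an AWS account, one can fetch the corresponding path listing, take its first entry and prepend the HTTPS prefix (a sketch, assuming curl, gzip and wget are available locally):
wget "https://data.commoncrawl.org/$(curl -s https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/warc.paths.gz | gzip -dc | head -1)"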
Examples
Using HTTP Download Agents
To download a file using an HTTP download agent, add the full path to the prefix https://data.commoncrawl.org/, e.g.
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/segments/1642320299852.23/warc/CC-MAIN-20220116093137-20220116123137-00000.warc.gz
The path without the prefix is listed in crawl-data/CC-MAIN-2022-05/warc.paths.gz; you can find the link in the announcement of the January 2022 crawl. We provide path listings for all datasets; links to the listings are shared in the dataset release announcements.
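To download several files from a listing, one possible approach (a sketch; the file name urls.txt and the limit of three files are arbitrary choices for illustration) is to prepend the prefix to each path and pass the resulting URL list to wget:
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/warc.paths.gz
gzip -dc warc.paths.gz | head -3 | sed 's|^|https://data.commoncrawl.org/|' > urls.txt
wget -i urls.txt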
Using the AWS Command Line Interface (CLI)
The AWS Command Line Interface is easy to install and use on most operating systems (Windows, macOS, Linux); please follow the installation instructions. Once installed, the CLI needs to be configured with AWS account credentials. Alternatively (and preferably), you can associate EC2 instances or ECS containers with an IAM role that allows access to S3.
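A minimal sketch of the interactive configuration:
aws configure   # prompts for AWS access key ID, secret access key, default region (e.g. us-east-1) and output format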
Once the AWS CLI is installed and configured, the command to copy a file to your local machine is
aws s3 cp s3://commoncrawl/path/to/file local/path/file
You may wish to inspect the data or file details before downloading. The command to list the contents of a path is:
aws s3 ls s3://commoncrawl/path/
For instance, to list all WARC files written by the news crawler in February 2022 including the file size, use the command:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/2022/02/
The command to download the first file in the listing above and store it in the current directory is:
aws s3 cp s3://commoncrawl/crawl-data/CC-NEWS/2022/02/CC-NEWS-20220201010205-00084.warc.gz ./
The AWS CLI supports recursive copying and pattern-based inclusion/exclusion of files. For more information, check the AWS CLI S3 user guide or call the command-line help. E.g., for help with the cp command, use:
aws s3 cp help
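For instance, to copy only the WARC files of the news crawl listing above into the current directory (a sketch using the standard --recursive, --exclude and --include options; note that this fetches every matching file of the month):
aws s3 cp s3://commoncrawl/crawl-data/CC-NEWS/2022/02/ . --recursive --exclude "*" --include "*.warc.gz"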
Range Requests to Fetch Single WARC Records
The Common Crawl URL indexes (CDX and columnar index) include the location of WARC records. For example, you may query the CDX index for the Common Crawl terms of use back in 2017:
curl 'https://index.commoncrawl.org/CC-MAIN-2017-34-index?url=https://commoncrawl.org/terms-of-use/&output=json'
{"urlkey": "org,commoncrawl)/terms-of-use", "timestamp": "20170824102900", "url": "https://commoncrawl.org/terms-of-use/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "JXY77DKLYTHWKLKALRFKJQ4LM23FMLEV", "length": "7058", "offset": "91300676", "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00090.warc.gz"}
Using the returned WARC file name, offset, and length, it is possible to fetch the WARC record from the archives by sending an HTTP range request for the bytes from $offset to ($offset+$length-1). Here is an example of how to fetch the terms-of-use record using curl and uncompress it on the fly:
curl -s -r91300676-$((91300676+7058-1)) \
  "https://data.commoncrawl.org/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00090.warc.gz" \
  | gzip -dc
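The two steps can be combined into a small script; the following sketch assumes jq is installed and simply takes the first capture returned by the index:
# query the CDX index and keep the first matching capture
record=$(curl -s 'https://index.commoncrawl.org/CC-MAIN-2017-34-index?url=https://commoncrawl.org/terms-of-use/&output=json' | head -n 1)
# extract the WARC location of the record
filename=$(echo "$record" | jq -r .filename)
offset=$(echo "$record" | jq -r .offset)
length=$(echo "$record" | jq -r .length)
# fetch exactly the record's bytes and uncompress them on the fly
curl -s -r "$offset-$((offset + length - 1))" "https://data.commoncrawl.org/$filename" | gzip -dc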
The AWS CLI can be used to send a range request via the S3 API:
aws s3api get-object --bucket commoncrawl \
  --range bytes=91300676-$((91300676+7058-1)) \
  --key crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00090.warc.gz \
  .../my_local.warc.gz