Common Crawl data is free to access by anyone from anywhere. The data is hosted by Amazon Web Services’ Open Data Sets Sponsorships program on the bucket s3://commoncrawl/, located in the US-East-1 (Northern Virginia) AWS Region. You may process the data in the AWS cloud or download it for free over HTTP with a good Internet connection.
The URLs to access a given data file are composed the following way:
- s3://commoncrawl/path_to_file if using S3
- https://data.commoncrawl.org/path_to_file if using HTTPS
Whenever Common Crawl lists a file path, on this website or in an announcement of new crawl data, it is always the full path that follows
s3://commoncrawl/ (when using S3) or
https://data.commoncrawl.org/ (when using HTTPS).
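The composition rule can be sketched in a few lines of shell. The helper names `cc_s3_url` and `cc_https_url` are hypothetical and not part of any Common Crawl tooling; the example path is the January 2022 path listing referenced elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helpers: build the two URL forms for a Common Crawl file path.
cc_s3_url()    { printf 's3://commoncrawl/%s\n' "$1"; }
cc_https_url() { printf 'https://data.commoncrawl.org/%s\n' "$1"; }

path="crawl-data/CC-MAIN-2022-05/warc.paths.gz"
cc_s3_url "$path"      # form to use with the S3 API
cc_https_url "$path"   # form to use with any HTTP client
```

Both forms address the same file; only the prefix differs.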
Accessing the Data Using S3 in the AWS Cloud
You may use the AWS Command Line Interface but many AWS services (e.g., EMR) support the
s3:// protocol/scheme and you may directly specify your input as
s3://commoncrawl/path_to_file. On Hadoop (not EMR) it’s recommended to use the S3A protocol: just change the protocol to s3a://.
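Hadoop's S3A connector addresses the same bucket with the `s3a://` scheme. A minimal sketch of the rewrite, assuming input locations are given in the `s3://` form; the helper name `to_s3a` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite an s3:// location for use with Hadoop's S3A connector.
# Only a leading "s3://" is replaced; the bucket and path stay unchanged.
to_s3a() { printf '%s\n' "$1" | sed 's|^s3://|s3a://|'; }

to_s3a "s3://commoncrawl/path_to_file"
```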
Two important notes about accessing the data using the S3 API:
- We strongly recommend accessing the data from the region where it is located (us-east-1 / Northern Virginia). Please run your computing workload in this region.
- Since April 2022 access to
s3://commoncrawl/ using S3 is restricted to authenticated AWS users. However, even without an AWS account you can continue to use the data over HTTP/HTTPS; see the instructions below.
Accessing Common Crawl Data Using HTTP/HTTPS
If you want to download the data to your local machine or local cluster, you may use any HTTP download agent, as per the instructions below. It is not necessary to create an AWS account to access the data using HTTP/HTTPS.
Listing paths or files in
s3://commoncrawl/ for a given prefix (or “sub-directory”) is only possible using the S3 API, which requires an AWS account. We provide lists of file paths for all crawls and other data sets. The listings can be used to fetch the data using either S3 or HTTP by adding the prefix s3://commoncrawl/ or
https://data.commoncrawl.org/ to the file path.
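Turning a path listing into download URLs can be sketched as a small filter. This assumes the listing contains one file path per line; the function name `cc_prefix_https` is hypothetical, and `path/to/file` is a placeholder rather than a real file.

```shell
#!/bin/sh
# Hypothetical filter: read file paths on stdin, write HTTPS URLs on stdout.
# Empty lines are dropped so that they do not turn into bare prefix URLs.
cc_prefix_https() {
  sed -e '/^$/d' -e 's|^|https://data.commoncrawl.org/|'
}

# Placeholder input standing in for one line of a listing.
printf '%s\n' 'path/to/file' | cc_prefix_https
```

With a downloaded listing, the same filter could be fed from `gzip -dc warc.paths.gz`.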
Using HTTP Download Agents
To download a file using an HTTP download agent, add the full path to the prefix https://data.commoncrawl.org/.
For example, the WARC file paths of the January 2022 crawl, without the prefix, are listed in crawl-data/CC-MAIN-2022-05/warc.paths.gz. You can find the link in the announcement of the January 2022 crawl. We provide path listings for all datasets; links to the listings are shared in the dataset release announcements.
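As an illustration, the path listing itself can be fetched over HTTPS. The sketch below assumes `curl` is installed; the function name `fetch_cc_file` is hypothetical. The block only defines the helper and prints the URL, so nothing is downloaded until the function is called.

```shell
#!/bin/sh
base="https://data.commoncrawl.org/"
listing="crawl-data/CC-MAIN-2022-05/warc.paths.gz"
url="${base}${listing}"

# Hypothetical helper: download one URL into the current directory.
# -f: fail on HTTP errors, -L: follow redirects, -O: keep the remote file name
fetch_cc_file() {
  curl -fLO "$1"
}

echo "$url"
```

Calling `fetch_cc_file "$url"` performs the download; `wget "$url"` should work equally well with wget.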
Using the AWS Command Line Interface (CLI)
The AWS Command Line Interface is easy to install and use on most operating systems (Windows, macOS, Linux); please follow the installation instructions. Once installed, the CLI needs to be configured to include AWS account credentials. Alternatively (and preferably), you can associate EC2 instances or ECS containers with an IAM role allowing access to S3.
Once the AWS CLI is installed and configured, the command to copy a file to your local machine is:
aws s3 cp s3://commoncrawl/path/to/file local/path/file
You may wish to look at the data or data file details before download. The command to do so is:
aws s3 ls s3://commoncrawl/path/
For instance, to list all WARC files written by the news crawler in February 2022 including the file size, use the command:
aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/2022/02/
The command to download the first file in the listing above and store it in the current directory will be:
aws s3 cp s3://commoncrawl/crawl-data/CC-NEWS/2022/02/CC-NEWS-20220201010205-00084.warc.gz ./
The AWS CLI supports recursive copying and pattern-based inclusion/exclusion of files. For more information, check the AWS CLI S3 user guide or call the command-line help. For example, for help with the cp command, use:
aws s3 cp help
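A minimal sketch of recursive, pattern-based copying, assuming the AWS CLI is installed and configured. The function name `copy_news_warcs` is hypothetical. The `--dryrun` option makes the CLI print the planned operations without transferring data; remove it to perform the copy.

```shell
#!/bin/sh
src="s3://commoncrawl/crawl-data/CC-NEWS/2022/02/"

# Hypothetical helper: copy only the *.warc.gz files below $src.
# Filters are applied in order and later ones take precedence, so
# excluding everything first and then including *.warc.gz keeps only those files.
copy_news_warcs() {
  aws s3 cp "$src" ./ --recursive --exclude "*" --include "*.warc.gz" --dryrun
}
```

The block only defines the function; run `copy_news_warcs` from a machine in us-east-1 to see which files would be copied.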