The Common Crawl corpus contains petabytes of data collected since 2008. It consists of raw web page data, metadata extracts, and text extracts.

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Web Services’ Open Data Sponsorships program. You can download the files free of charge over HTTP(S) or via S3.

As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.

  • [ARC] s3://commoncrawl/crawl-001/ – Crawl #1 (2008/2009)
  • [ARC] s3://commoncrawl/crawl-002/ – Crawl #2 (2009/2010)
  • [ARC] s3://commoncrawl/parse-output/ – Crawl #3 (2012)
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ – Summer 2013
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-48/ – Winter 2013
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-10/ – March 2014
    (all subsequent crawls are provided in the WARC format)
  • s3://commoncrawl/crawl-data/CC-MAIN-2014-15/ – April 2014
  • s3://commoncrawl/crawl-data/CC-MAIN-2014-23/ – July 2014
  • s3://commoncrawl/crawl-data/CC-MAIN-2014-35/ – August 2014
  • s3://commoncrawl/crawl-data/CC-MAIN-2014-41/ – September 2014
  • s3://commoncrawl/crawl-data/CC-MAIN-2014-42/ – October 2014
  • s3://commoncrawl/crawl-data/CC-MAIN-2014-49/ – November 2014
  • s3://commoncrawl/crawl-data/CC-MAIN-2014-52/ – December 2014
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-06/ – January 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-11/ – February 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-14/ – March 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-18/ – April 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-22/ – May 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-27/ – June 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-32/ – July 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-35/ – August 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-40/ – September 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2015-48/ – November 2015
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-07/ – February 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-18 – April 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-22 – May 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-26 – June 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-30 – July 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-36 – August 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-40 – September 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-44 – October 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-50 – December 2016
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-04 – January 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-09 – February 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-13 – March 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-17 – April 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-22 – May 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-26 – June 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-30 – July 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-34 – August 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-39 – September 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-43 – October 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-47 – November 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2017-51 – December 2017
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-05 – January 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-09 – February 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-13 – March 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-17 – April 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-22 – May 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-26 – June 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-30 – July 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-34 – August 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-39 – September 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-43 – October 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-47 – November 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2018-51 – December 2018
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-04 – January 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-09 – February 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-13 – March 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-18 – April 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-22 – May 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-26 – June 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-30 – July 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-35 – August 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-39 – September 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-43 – October 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-47 – November 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2019-51 – December 2019
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-05 – January 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-10 – February 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-16 – March/April 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-24 – May/June 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-29 – July 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-34 – August 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-40 – September 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-45 – October 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-50 – November/December 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-04 – January 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-10 – February/March 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-21 – May 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-25 – June 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-31 – July/August 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-39 – September 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-43 – October 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2021-49 – November/December 2021
  • s3://commoncrawl/crawl-data/CC-MAIN-2022-05 – January 2022
  • s3://commoncrawl/crawl-data/CC-MAIN-2022-21 – May 2022
  • s3://commoncrawl/crawl-data/CC-MAIN-2022-27 – June/July 2022
  • s3://commoncrawl/crawl-data/CC-MAIN-2022-33 – August 2022
  • s3://commoncrawl/crawl-data/CC-MAIN-2022-40 – September/October 2022
  • s3://commoncrawl/crawl-data/CC-MAIN-2022-49 – November/December 2022
  • s3://commoncrawl/crawl-data/CC-MAIN-2023-06 – January/February 2023
  • s3://commoncrawl/crawl-data/CC-MAIN-2023-14 – March/April 2023

For all crawls since 2013, the data is stored in the WARC file format and is accompanied by metadata (WAT) and extracted plaintext (WET) files. We also provide file path lists for the segments and for the WARC, WAT, and WET files; they can be found at CC-MAIN-YYYY-WW/[segment|warc|wat|wet].paths.gz.

By replacing s3://commoncrawl/ with https://data.commoncrawl.org/ in each of these paths, you can obtain the HTTPS URL of any file stored on S3. See the Access the Data page for further information and examples.
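
For example, the following minimal sketch fetches the WARC path list of one crawl over HTTPS and turns its first entry into a downloadable URL. The crawl name and file layout follow the pattern shown above; the third-party requests library is assumed to be available.

import gzip
import requests

BASE = "https://data.commoncrawl.org/"
CRAWL = "CC-MAIN-2023-14"  # any of the crawls listed above

# Fetch and decompress the list of WARC file paths for this crawl.
resp = requests.get(f"{BASE}crawl-data/{CRAWL}/warc.paths.gz")
resp.raise_for_status()
warc_paths = gzip.decompress(resp.content).decode("utf-8").splitlines()

# Each line is a path relative to the bucket root; prepend the HTTPS base
# (or s3://commoncrawl/) to obtain a downloadable URL.
print(len(warc_paths), "WARC files; first:", BASE + warc_paths[0])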

Data Format

Common Crawl currently stores the crawl data using the Web ARChive (WARC) format.
Before the switch to WARC in summer 2013, the crawl was stored in the ARC file format.
The WARC format allows for more efficient storage and processing of Common Crawl’s free multi-billion-page web archives, which can be hundreds of terabytes in size.
This document aims to give you an introduction to working with the WARC format and its companion files, specifically the difference between:

  • WARC files, which store the raw crawl data
  • WAT files, which store computed metadata for the data stored in the WARC
  • WET files, which store extracted plaintext from the data stored in the WARC

If you want all the nitty-gritty details, the best source is the WARC standard.
If you’re more interested in diving into code, we provide introductory examples in Java and Python that use the Hadoop and Spark frameworks to process WARC, WAT, and WET files (and, partially, ARC).

WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This includes not only the response body, what you would get if you downloaded the file, but also the HTTP headers, which can be used to glean a number of interesting insights.
In the example below, we can see that the crawler contacted http://news.bbc.co.uk/2/hi/africa/3414345.stm and received an HTML page in response. We can also see that the page was served by the Apache web server, that the response sets caching details, and that it attempts to set a cookie (shortened here for display).

Full WARC extract

WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: 
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length

HTTP/1.1 200 OK
Server: Apache
Vary: X-CDN
Cache-Control: max-age=0
Content-Type: text/html
Date: Sat, 02 Aug 2014 09:52:13 GMT
Expires: Sat, 02 Aug 2014 09:52:13 GMT
Connection: close
Set-Cookie: BBC-UID=...; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>
	BBC NEWS | Africa | Namibia braces for Nujoma exit
</title>
...
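
Records like the one above can be processed with any WARC-aware library. As a minimal, illustrative sketch (assuming the third-party Python library warcio and a WARC file that has already been downloaded locally; the filename below is hypothetical):

from warcio.archiveiterator import ArchiveIterator

# Hypothetical local copy of a gzipped WARC file taken from warc.paths.gz.
warc_path = "CC-MAIN-20230403000000-00000.warc.gz"

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only 'response' records carry the archived HTTP responses.
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        body = record.content_stream().read()  # raw payload bytes
        print(status, url, len(body), "bytes")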

WAT Response Format

WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, you can use one of the many JSON pretty print tools available.

The HTTP response metadata is most likely to be of interest to Common Crawl users. The skeleton of the JSON format is outlined below.

Envelope
  WARC-Header-Metadata
    WARC-Target-URI [string]
    WARC-Type [string]
    WARC-Date [datetime string]
    ...
  Payload-Metadata
    HTTP-Response-Metadata
      Headers
        Content-Language
        Content-Encoding
        ...
      HTML-Metadata
        Head
          Title [string]
          Link [list]
          Metas [list]
        Links [list]
      Headers-Length [int]
      Entity-Length [int]
      ...
    ...
  ...
Container
  Gzip-Metadata [object]
  Compressed [boolean]
  Offset [int]

As an example in Python, if we parse the JSON of such a record into a data object, we can easily pull out interesting information from the BBC article…

Full WAT extract

>>> data['Envelope']['WARC-Header-Metadata']['WARC-Type']
"response"
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['Headers']['Server']
"Apache"
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Head']['Title']
" BBC NEWS | Africa | Namibia braces for Nujoma exit "
>>> len(data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'])
42
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'][28]
{"path": "A@/href", "title": "Home of BBC Sport on the internet", "url": "http://news.bbc.co.uk/sport1/hi/default.stm"}

WET Response Format

As many tasks only require textual information, the Common Crawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

Full WET extract

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: 
WARC-Refers-To: 
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724

BBC NEWS | Africa | Namibia braces for Nujoma exit
...
President Sam Nujoma works in very pleasant surroundings in the small but beautiful old State House...
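
WET records can be read the same way. As the extract shows, the extracted plaintext is stored in WARC records of type “conversion”; the following minimal sketch (assuming the third-party warcio library and a locally downloaded WET file with a hypothetical name) runs a simple word count over them:

from collections import Counter
from warcio.archiveiterator import ArchiveIterator

# Hypothetical local copy of a WET file taken from wet.paths.gz.
wet_path = "CC-MAIN-20230403000000-00000.warc.wet.gz"

word_counts = Counter()
with open(wet_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        # Extracted plaintext lives in 'conversion' records (see WARC-Type above).
        if record.rec_type != "conversion":
            continue
        text = record.content_stream().read().decode("utf-8", errors="replace")
        word_counts.update(text.lower().split())

print(word_counts.most_common(10))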

Processing the file format

We maintain introductory examples on GitHub in Java and in Python, built on the Hadoop and Spark big data processing frameworks.

For each of these platforms, the examples describe how to:

  • Count the number of times various tags are used across HTML on the internet using the WARC files
  • Count the number of different server types found in the HTTP headers using the WAT files
  • Execute a word count over the extracted plaintext found in the WET files

If you’re using a different programming language or prefer to work with another processing framework, there are a number of open source libraries that handle processing of WARC files and the content therein.

A good overview of such tools and libraries is the list of Awesome Web Archiving utilities maintained by the IIPC.

URL and metadata indexes

Using the Common Crawl URL Index of WARC and ARC files (2008 – present), you can look up URLs crawled in a given dataset, locate an archived page or pages within the dataset, search for URL prefixes to learn about the coverage of hosts or domains in the Common Crawl archives, and more. To a limited extent, the index server may also be used as a “wayback machine” to manually “browse” a crawl archive.
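
As a minimal sketch of such a lookup (assuming the public index server at https://index.commoncrawl.org/ with pywb-style query parameters and response fields; the crawl name and field names are assumptions to be checked against the index documentation):

import json
import requests

# Hypothetical example: list captures of commoncrawl.org in one crawl.
api = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"
resp = requests.get(api, params={"url": "commoncrawl.org", "output": "json"})
resp.raise_for_status()

# The server returns one JSON object per line; each capture points at a
# WARC file plus an offset and length within it.
for line in resp.text.splitlines():
    capture = json.loads(line)
    print(capture["url"], capture["filename"], capture["offset"], capture["length"])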

The Parquet Index, hosted on AWS S3, is an index of WARC files and URLs in a columnar format; it is most useful for running analytics queries. The columnar format, Apache Parquet, enables highly efficient querying and processing of the index, which saves time and computing resources: because only the columns a query actually accesses need to be read, recent big data tools run very fast when just a few columns are used.

The columnar index is free to access or download for anybody. All files are on AWS S3:
s3://commoncrawl/cc-index/table/cc-main/warc/

To date, we have tested the following data tools on the Parquet Index: Apache Spark, Apache Hive and AWS Athena. The latter makes it possible to run SQL queries on the columnar data without launching a server. For detailed examples and instructions on querying the data with Athena, please see this blog post.
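
As an illustration, a minimal PySpark sketch against the columnar index might look like the following; the column names (crawl, subset, url_host_tld) are assumptions based on common usage and should be verified against the actual table schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-index-demo").getOrCreate()

# Read the columnar index directly from S3 (requires an S3 connector to be configured).
df = spark.read.parquet("s3://commoncrawl/cc-index/table/cc-main/warc/")

# Restrict to one crawl and the 'warc' subset, then count pages per top-level domain.
(df.filter((df.crawl == "CC-MAIN-2023-14") & (df.subset == "warc"))
   .groupBy("url_host_tld")
   .count()
   .orderBy("count", ascending=False)
   .show(20))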

Statistics and metrics

We also publish statistics and basic metrics for each crawl, including:

  • The size of the crawl: numbers of fetched pages, unique URLs, unique documents (by content digest), and of distinct hosts, domains, and top-level domains
  • The distribution of pages and URLs across hosts, domains, and top-level domains
  • Content languages, MIME types, and character sets

Check out the statistics page on GitHub.

Want to calculate your own metrics? Then have a look at our list of example SQL queries to get numbers from the columnar index and our collection of Jupyter notebooks.

Other data sets

In addition to our monthly comprehensive web crawls, we host