December 18, 2024

December 2024 Crawl Archive Now Available

Note: this post has been marked as obsolete.
Sebastian Nagel
Sebastian is a Distinguished Engineer with Common Crawl.

The crawl archive for December 2024 is now available.

The data was crawled between December 1st and December 15th, and contains 2.64 billion web pages (or 394 TiB of uncompressed content). Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1.05 billion new URLs, not visited in any of our prior crawls.

Data Type            File List                   #Files   Total Size Compressed (TiB)
Segments             segment.paths.gz               100
WARC                 warc.paths.gz                90000   80.92
WAT                  wat.paths.gz                 90000   18.70
WET                  wet.paths.gz                 90000    7.37
Robots.txt           robotstxt.paths.gz           90000    0.14
Non-200 responses    non200responses.paths.gz     90000    2.75
URL index            cc-index.paths.gz              302    0.20
Columnar URL index   cc-index-table.paths.gz        900    0.23

Archive Location & Download

The December 2024 crawl archive is located in the commoncrawl bucket with the prefix: crawl-data/CC-MAIN-2024-51/.

To assist with exploring and using the dataset, we provide gzip-compressed files which list all segments, WARC, WAT and WET files.

By prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you obtain the S3 and HTTP paths, respectively. Please see Get Started for detailed instructions.
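
As a minimal sketch of that prefixing step, the Python below turns the lines of a path listing into S3 or HTTP locations. The function name to_urls and the sample path are hypothetical and only serve as illustration; the two prefixes are the ones given above.

```python
from typing import Iterable, List

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def to_urls(paths: Iterable[str], prefix: str = HTTP_PREFIX) -> List[str]:
    """Prepend a prefix to every non-empty line of a path listing."""
    return [prefix + line.strip() for line in paths if line.strip()]


# Hypothetical listing line; real lines come from the *.paths.gz files.
sample = ["crawl-data/CC-MAIN-2024-51/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n"]

http_urls = to_urls(sample)             # HTTP locations
s3_urls = to_urls(sample, S3_PREFIX)    # S3 locations
```

A downloaded listing can be opened with gzip.open(path, "rt") and passed in directly, since the resulting file object iterates over lines.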

Changes to the WAT Metadata Format

Multi-valued headers

Previously, repeated HTTP and WARC headers were not fully represented in the JSON data in WAT files: when a header occurred more than once, only the last value was stored and the other values were lost. This long-standing issue (ia-web-commons#18) is now fixed:

  • Single value headers are represented as before by a header name and a string value.
  • Headers with multiple values are represented by a header name and an associated list of values.

Users are advised to adapt any code consuming WAT files to this change. The examples in the projects cc-pyspark and cc-warc-examples were updated accordingly; see cc-pyspark#46 and cc-warc-examples#5, respectively.
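
One possible way to make consuming code robust against both representations is to normalize every header to a list before use. The helper below is a sketch, not the approach taken in cc-pyspark or cc-warc-examples; the name header_values is hypothetical, and the lookup is an exact, case-sensitive match on the header name as it appears in the JSON.

```python
from typing import Any, List, Mapping


def header_values(headers: Mapping[str, Any], name: str) -> List[str]:
    """Return all values of a header as a list.

    Works for a plain string (single value) as well as for
    a list of strings (multiple values); a missing header
    yields an empty list.
    """
    value = headers.get(name)
    if value is None:
        return []
    if isinstance(value, list):
        return [str(v) for v in value]
    return [str(value)]


# Illustrative header dictionaries, not taken from a real WAT record.
single = {"date": "Sat, 30 Nov 2024 11:13:30 GMT"}
multi = {"set-cookie": ["a=1; Path=/", "b=2; Path=/"]}

dates = header_values(single, "date")
cookies = header_values(multi, "set-cookie")
missing = header_values(single, "set-cookie")
```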

Below are two JSON snippets of multi-valued headers:

  • The WARC header "WARC-Protocol":

{
  "Container": { "...": "..." },
  "Envelope": {
    "WARC-Header-Metadata": {
      "...": "...",
      "WARC-Target-URI": "https://en.wikipedia.org/wiki/Saturn",
      "WARC-Protocol": [
        "h2",
        "tls/1.3"
      ],

  • Many HTTP headers, most commonly the "Set-Cookie" header:

{
  "Container": { "...": "..." },
  "Envelope": {
    "Payload-Metadata": {
      "Actual-Content-Type": "application/http; msgtype=response",
      "HTTP-Response-Metadata": {
        "...": "...",
        "Headers": {
          "date": "Sat, 30 Nov 2024 11:13:30 GMT",
          "...": "...",
          "set-cookie": [
            "WMF-Last-Access=30-Nov-2024;Path=/;HttpOnly;secure;Expires=Wed, 01 Jan 2025 12:00:00 GMT",
            "WMF-Last-Access-Global=30-Nov-2024;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 01 Jan 2025 12:00:00 GMT",
            "WMF-DP=5b0;Path=/;HttpOnly;secure;Expires=Sun, 01 Dec 2024 00:00:00 GMT",
            "GeoIP=US:VA:Ashburn:39.05:-77.49:v4; Path=/; secure; Domain=.wikipedia.org",
            "NetworkProbeLimit=0.001;Path=/;Secure;SameSite=Lax;Max-Age=3600"
          ],

Add language attributes of the <html> root element as metadata

The WAT metadata now includes the language attributes of the <html> element. For example, the root element <html lang="es-MX"> is stored in the WAT file as:

"HTML-Metadata": {
  "Head": {
    "Metas": [
      {
        "name": "HTML@/lang",
        "content": "es-MX"
      },

Details on this change are tracked in ia-web-commons#35.
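
To illustrate how this entry might be read back, the sketch below looks up the "HTML@/lang" item in the "Metas" list. It assumes the caller has already navigated to the "HTML-Metadata" object of a WAT record; the function name html_lang is hypothetical, and only the "HTML@/lang" name shown above is handled.

```python
from typing import Any, Mapping, Optional


def html_lang(html_metadata: Mapping[str, Any]) -> Optional[str]:
    """Return the lang attribute of the <html> element, if recorded."""
    metas = html_metadata.get("Head", {}).get("Metas", [])
    for meta in metas:
        if meta.get("name") == "HTML@/lang":
            return meta.get("content")
    return None


# Illustrative "HTML-Metadata" object for <html lang="es-MX">.
example = {
    "Head": {
        "Metas": [
            {"name": "HTML@/lang", "content": "es-MX"},
            {"name": "viewport", "content": "width=device-width"},
        ]
    }
}

lang = html_lang(example)
no_lang = html_lang({"Head": {"Metas": []}})
```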

Do not include <meta itemprop="..."> as metadata

Schema.org annotations in <meta itemprop="..."> elements in the HTML body are no longer included in the WAT metadata; cf. ia-web-commons#40.

Crawling with IPv6

The crawler is now ready to crawl IPv6-only websites. While IPv4 is still preferred, sites which are only available via IPv6 are now visited by our crawler. As a consequence, IPv6 addresses now appear in the crawl data, for example in the "WARC-IP-Address" header or in URLs in the URL indexes.
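
Code that assumed IPv4-only values may need to handle both address families. As a sketch using only the Python standard library, the helpers below report the IP version of an address string and of a URL whose host is an IP literal. The function names are hypothetical, and the addresses are reserved documentation addresses rather than values from the crawl.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def ip_version(address: str) -> int:
    """Return 4 or 6 for a textual IP address (raises ValueError otherwise)."""
    return ipaddress.ip_address(address).version


def url_host_ip_version(url: str) -> Optional[int]:
    """Return 4 or 6 if the URL host is an IP literal, else None."""
    host = urlsplit(url).hostname  # brackets around IPv6 literals are removed
    if host is None:
        return None
    try:
        return ipaddress.ip_address(host).version
    except ValueError:
        return None  # host is a regular domain name


v4 = ip_version("192.0.2.1")
v6 = ip_version("2001:db8::1")
url_v6 = url_host_ip_version("https://[2001:db8::1]/index.html")
url_name = url_host_ip_version("https://example.org/")
```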

Crawler Verification

Our crawler "CCBot" is now run on dedicated IP address ranges with reverse DNS. This allows webmasters to verify whether a logged request stems from CCBot. Please read our FAQ for more information.
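
The FAQ describes the actual verification procedure. Purely as an illustration, a generic forward-confirmed reverse DNS check could look like the sketch below: resolve the logged IP address to a host name, check the host name against an expected domain suffix, then resolve that host name forward and confirm it maps back to the same IP address. The function name is_verified_crawler and the allowed_suffix parameter are hypothetical, and the suffix used in the example is a placeholder, not CCBot's real reverse DNS domain.

```python
import ipaddress
import socket
from typing import Callable, List


def _reverse(ip: str) -> str:
    """Reverse DNS lookup: IP address to host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward DNS lookup: host name to its IP addresses."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_verified_crawler(
    ip: str,
    allowed_suffix: str,  # placeholder: take the real domain from the FAQ
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a logged client IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith("." + allowed_suffix.lower()):
            return False
        target = ipaddress.ip_address(ip)
        return any(ipaddress.ip_address(a) == target for a in forward(host))
    except (OSError, ValueError):
        return False  # lookup failed or an address could not be parsed


# Offline demonstration with stand-in resolvers (no DNS queries are made).
fake_reverse = lambda ip: "host-1.crawler.example."
fake_forward = lambda host: ["192.0.2.10"]

ok = is_verified_crawler("192.0.2.10", "crawler.example", fake_reverse, fake_forward)
spoofed = is_verified_crawler("192.0.2.11", "crawler.example", fake_reverse, fake_forward)
```

The resolvers are passed in as parameters so the logic can be tried without network access; in real use the defaults perform the DNS lookups.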

Feedback Welcome

We look forward to hearing your thoughts and comments. As ever, please feel free to join the discussions in our Google Group or in our Discord server.

This release was authored by:
Sebastian Nagel
Sebastian is a Distinguished Engineer with Common Crawl.
Thom Vaughan
Thom is Principal Technologist at the Common Crawl Foundation.