The crawl archive for October 2024 is now available.
The data was crawled between October 3rd and October 16th, and contains 2.49 billion web pages (or 365 TiB of uncompressed content). Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1.03 billion new URLs, not visited in any of our prior crawls.
Archive Location & Download
The October 2024 crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2024-42/.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://data.commoncrawl.org/ yields the S3 or HTTP path, respectively. Please see Get Started for detailed instructions.
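As a minimal sketch of that step, the snippet below decompresses a gzipped path listing held in memory and turns each line into a full URL. The function name `expand_paths` and the sample path are hypothetical placeholders for illustration; real listings contain actual segment and file names.

```python
import gzip

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def expand_paths(listing_gz: bytes, prefix: str = HTTP_PREFIX) -> list[str]:
    """Decompress a gzipped path listing and prefix every non-empty line."""
    text = gzip.decompress(listing_gz).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical listing content (placeholder segment and file names).
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2024-42/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n"
)

for url in expand_paths(sample_listing):
    print(url)
```

Passing `prefix=S3_PREFIX` produces the S3 form of the same paths.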
This release was authored by:
Sebastian Nagel
Sebastian is a Distinguished Engineer at the Common Crawl Foundation.
Thom Vaughan
Thom is Principal Engineer at the Common Crawl Foundation.
Erratum:
WARC Content-Type header in revisit records
Originally reported by:
Sebastian Nagel
Common Crawl's WARC revisit records use Content-Type: message/http (following the WARC 1.1 spec's example), but per iipc/warc-specifications#55 it should be application/http;msgtype=response for consistency with other HTTP response records.
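Code that filters WARC records by Content-Type may want to accept both values. The following is a minimal sketch of such a check; the helper name `is_http_response_type` is hypothetical and not part of any WARC library.

```python
def is_http_response_type(content_type: str) -> bool:
    """Return True for either Content-Type form used for HTTP response payloads.

    Accepts "message/http" (as found in the revisit records described above)
    and "application/http;msgtype=response", ignoring case and whitespace.
    """
    parts = [part.strip().lower() for part in content_type.split(";")]
    media_type, params = parts[0], parts[1:]
    if media_type == "message/http":
        return True
    return media_type == "application/http" and "msgtype=response" in params


print(is_http_response_type("message/http"))
print(is_http_response_type("application/http; msgtype=response"))
print(is_http_response_type("application/http; msgtype=request"))
```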
Erratum:
Redirect target URL in URL indexes may be a relative URL
Originally reported by:
Sebastian Nagel
When the HTTP “Location” header includes a relative URL, the corresponding “redirect” field in the CDX index and “fetch_redirect” field in the columnar index will also store a relative URL. In all other cases, redirect targets in the URL indexes should be recorded as absolute URLs.
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
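To make the two thresholds concrete, here is a small sketch that maps a crawl identifier to the limit described above. The function name `truncation_limit` is hypothetical, and the parser only handles identifiers of the form CC-MAIN-YYYY-WW.

```python
MIB = 1024 * 1024


def truncation_limit(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for a CC-MAIN-YYYY-WW crawl ID.

    Crawls before CC-MAIN-2025-13 used 1 MiB; that crawl and later ones use 5 MiB.
    """
    prefix = "CC-MAIN-"
    if not crawl_id.startswith(prefix):
        raise ValueError(f"unsupported crawl ID: {crawl_id!r}")
    year_text, week_text = crawl_id[len(prefix):].split("-")
    year, week = int(year_text), int(week_text)
    # Tuple comparison orders crawls by year first, then by week.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


print(truncation_limit("CC-MAIN-2024-42"))
print(truncation_limit("CC-MAIN-2025-13"))
```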
Erratum:
SURT URLs do not properly encode non-UTF-8 percent-encoded characters
Originally reported by:
Tom Morris
When constructing SURT (Sort-friendly URI Reordering Transform) URLs, percent-encoded characters that are not valid UTF-8 sequences were not being correctly handled. This could lead to inconsistencies in URL normalization and sorting, potentially causing incorrect deduplication or retrieval issues in datasets that rely on SURT-based indexing.
Erratum:
WAT data: repeated WARC and HTTP headers are not preserved
Repeated HTTP and WARC headers were not represented in the JSON data in WAT files.
Erratum:
WARC revisit metadata records
The revisit records in the Common Crawl WARC archives (since Aug 2018) lack the metadata record which is attached to all response records.

