December 15, 2023

November/December 2023 Crawl Archive Now Available

Thom Vaughan
Thom is Principal Technologist at the Common Crawl Foundation.

The crawl archive for November/December 2023 is now available. The data was crawled between November 28th and December 12th, and contains 3.35 billion web pages (or 454 TiB of uncompressed content). Page captures are from 47.5 million hosts or 37.7 million registered domains and include 1.4 billion new URLs, not visited in any of our prior crawls.

| Data Type | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | segment.paths.gz | 100 | |
| WARC | warc.paths.gz | 90000 | 99.25 |
| WAT | wat.paths.gz | 90000 | 22.99 |
| WET | wet.paths.gz | 90000 | 9.30 |
| Robots.txt | robotstxt.paths.gz | 90000 | 0.18 |
| Non-200 responses | non200responses.paths.gz | 90000 | 3.43 |
| URL index | cc-index.paths.gz | 302 | 0.25 |
| Columnar URL index | cc-index-table.paths.gz | 900 | 0.28 |

Archive Location & Download

The November/December 2023 crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2023-50/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.

Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line gives you the S3 or HTTP path, respectively. Please see Accessing the Data for detailed instructions.
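As a rough illustration of the prefixing step, here is a minimal Python sketch. The helper names `listing_url` and `expand_paths` are hypothetical, and the sketch assumes that each file list sits directly under the crawl directory and contains one relative path per line. The sample line in the demo is made up for illustration and is not a real file path.

```python
import gzip

CRAWL_ID = "CC-MAIN-2023-50"
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def listing_url(file_list: str, prefix: str = HTTP_PREFIX) -> str:
    """Build the location of a gzipped file list.

    Assumption: the list lives directly under crawl-data/<crawl id>/.
    """
    return f"{prefix}crawl-data/{CRAWL_ID}/{file_list}"


def expand_paths(gz_bytes: bytes, prefix: str = HTTP_PREFIX) -> list[str]:
    """Decompress a *.paths.gz payload and prepend the prefix to each non-empty line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical sample content; a real listing holds the actual file paths.
    sample = gzip.compress(
        b"crawl-data/CC-MAIN-2023-50/segments/EXAMPLE/warc/example.warc.gz\n"
    )
    print(listing_url("warc.paths.gz"))
    for url in expand_paths(sample, S3_PREFIX):
        print(url)
```

In practice you would download one of the file lists first and pass its raw bytes to `expand_paths`. The download step is left out here so the sketch runs without network access.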

This release was authored by:
Julien Nioche
Julien is a member of the Apache Software Foundation, an Emeritus member of the Common Crawl Foundation, and the creator of StormCrawler.
Sebastian Nagel
Sebastian is a Distinguished Engineer with Common Crawl.