The crawl archive for June 2024 is now available.
The data was crawled between June 12th and June 26th, and contains 2.7 billion web pages (or 382 TiB of uncompressed content). Page captures are from 52.7 million hosts or 41.4 million registered domains and include 945 million new URLs, not visited in any of our prior crawls.
Archive Location & Download
The June 2024 crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2024-26/.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you obtain the S3 and HTTP paths respectively. Please see Get Started for detailed instructions.
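As a rough illustration of the prefixing step, the sketch below decompresses a gzipped path listing and prepends each prefix to every line. The helper names `read_listing` and `expand_paths` and the sample path are hypothetical placeholders, not real file names from the archive. The listing is built in memory, so nothing is downloaded.

```python
import gzip
import io

# Prefixes as given in the announcement.
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def read_listing(gz_bytes):
    """Decompress a gzipped path listing into a list of text lines."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        return fh.read().splitlines()


def expand_paths(lines, prefix):
    """Prepend `prefix` to every non-empty line of a path listing."""
    return [prefix + line.strip() for line in lines if line.strip()]


# Hypothetical one-line listing standing in for a downloaded paths file;
# the segment and file names are placeholders, not real archive entries.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2024-26/segments/EXAMPLE-SEGMENT/warc/EXAMPLE-00000.warc.gz\n"
)

paths = read_listing(sample)
for url in expand_paths(paths, S3_PREFIX) + expand_paths(paths, HTTP_PREFIX):
    print(url)
```

In practice the bytes passed to `read_listing` would come from one of the downloaded listing files rather than from an in-memory sample.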
This release was authored by:
![Sebastian is a Distinguished Engineer with Common Crawl.](https://cdn.prod.website-files.com/647b1c7a9990bad2048d3711/64eb7e4cff03d4e50b472ec2_sebastian.webp)
Sebastian Nagel
![Thom is Principal Technologist at the Common Crawl Foundation.](https://cdn.prod.website-files.com/647b1c7a9990bad2048d3711/6525482ffaaa06f071baaf93_thom_vaughan.webp)
Thom Vaughan
![Pedro is a French-Colombian mathematician, computer scientist and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université.](https://cdn.prod.website-files.com/647b1c7a9990bad2048d3711/6542644396dbd2726117afca_pedro.webp)
Pedro Ortiz Suarez