Common Crawl maintains a free,open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non–profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.