Blog

The Columnar Index Is Now the URL Index









Read our submission to the UK government's Copyright and AI consultation, supporting a legal exception for text and data mining (TDM) while respecting creators’ rights.
Announcing our February 2025 Web Graph release based on the crawls of December 2024 and January/February 2025, consisting of 267.4 million nodes and 2.7 billion edges at the host level, and 106.5 million nodes and 1.9 billion edges at the domain level.
The crawl archive for February 2025 is now available. The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages (or 402 TiB of uncompressed content). Page captures are from 47.6 million hosts or 38.5 million registered domains and include 1 billion new URLs, not visited in any of our prior crawls.
Last week in Paris, at the AI Action Summit, a coalition of major technology companies and foundations announced the launch of ROOST: Robust Online Open Safety Tools.
We’re happy to share our January/February 2025 newsletter with updates and insights from the world of open data and web archiving.
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November 2024, December 2024, and January 2025. The host-level graph consists of 277.7 million nodes and 2.7 billion edges, and the domain-level graph has 100.8 million nodes and 1.9 billion edges.
We're pleased to announce our first crawl of 2025, containing 3.0 billion pages and 460 TiB of uncompressed content.
Introducing a command-line tool written in Rust for downloading data from Common Crawl.
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of October, November, and December 2024. The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51.
The crawl archive for December 2024 is now available. The data was crawled between December 1st and December 15th, and contains 2.64 billion web pages (or 394 TiB of uncompressed content). Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1.05 billion new URLs, not visited in any of our prior crawls.
The Common Crawl Foundation attended NeurIPS 2024, connecting with organisations, hosting a social event on tech and social impact, and showcasing contributions to AI research and data access.
We aim to enhance linguistic diversity in our dataset by inviting community contributions of non-English URLs and collaborating with MLCommons on a Language Identification campaign.
We’re pleased to announce this month's newsletter, featuring key updates, upcoming events, and community highlights.
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of September, October, and November 2024. The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38.
The crawl archive for November 2024 is now available. The data was crawled between November 1st and November 15th, and contains 2.68 billion web pages (or 405 TiB of uncompressed content). Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1 billion new URLs, not visited in any of our prior crawls.
Thom Vaughan and Pedro Ortiz Suarez discussed the power of Common Crawl’s open web data in driving research and innovation during two notable presentations last week.
As part of our commitment to accuracy and transparency, we are pleased to introduce a new Errata page on our website.
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of August, September, and October 2024. The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42.
The crawl archive for October 2024 is now available. The data was crawled between October 3rd and October 16th, and contains 2.49 billion web pages (or 365 TiB of uncompressed content). Page captures are from 47.5 million hosts or 38.3 million registered domains and include 1.03 billion new URLs, not visited in any of our prior crawls.
We recently had the honor of briefing the White House Office of Science and Technology Policy (OSTP) on the role of The Common Crawl Foundation as critical infrastructure in the artificial intelligence ecosystem and how we can support U.S. federal efforts in advancing responsible AI use and research.
Earlier this month, the Common Crawl Foundation had the privilege of participating in a groundbreaking workshop hosted by the Internet Architecture Board (IAB) in Washington DC.
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2024. The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38.
The crawl archive for September 2024 is now available. The data was crawled between September 7th and September 21st, 2024, and contains 2.8 billion web pages (or 410 TiB of uncompressed content).
We're pleased to announce our newsletter for August and September 2024.
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2024. The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26.