Search results
November 15, 2023. Oct/Nov 2023 Performance Issues. Our datasets have become very popular over time, with downloads doubling every 6 months for several years in a row. This post details some steps to take if you are impacted by performance issues.…
October 12, 2023. September/October 2023 crawl archive now available. The crawl archive for September/October 2023 is now available! The data was crawled Sept 21 – October 5 and contains 3.4 billion web pages or 456 TiB of uncompressed content.…
April 6, 2023. March/April 2023 crawl archive now available. The crawl archive for March/April 2023 is now available! The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content.…
June 21, 2023. May/June 2023 crawl archive now available. The crawl archive for May/June 2023 is now available! The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content.…
February 16, 2023. January/February 2023 crawl archive now available. The crawl archive for January/February 2023 is now available! The data was crawled January 26 – February 9 and contains 3.15 billion web pages or 400 TiB of uncompressed content.…
December 15, 2023. November/December 2023 Crawl Archive Now Available. The crawl archive for November/December 2023 is now available.…
May 5, 2024. Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November 2023, February 2024, and April 2024.…
October 18, 2023. Host- and Domain-Level Web Graphs Mar/May/Oct 2023. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, May, and October 2023.…
December 22, 2023. Host- and Domain-Level Web Graphs May/Sep/Nov 2023. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, September, and November of 2023. Thom Vaughan.…
March 14, 2024. Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024.…
March 15, 2023. Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023.…
A bad configuration was checked into our exclusion list on Sep 22, 2022 and was fixed on Oct 27, 2023. The configuration blocked a number of 2-level domains, meaning they were not included in certain crawls.…
March 26, 2024. March/April 2024 Newsletter. We're excited to share an update on some of our recent projects and initiatives in this newsletter! Common Crawl Foundation.…
The first iteration is the pre-crawl seed WARC files for October (Week 40 of 2023, ~134.0 TiB) and the second iteration is for December (Week 50 of 2023, ~1008 GiB).…
August 6, 2024. The Increase of Common Crawl Citations in Academic Research. Common Crawl's impact on research has grown substantially since its beginning.…
Before joining Common Crawl full-time in 2023, Greg was a member of the Event Horizon Telescope Collaboration, working at the Center for Astrophysics - Harvard & Smithsonian. He has also contributed to the Wayback Machine at the Internet Archive.…
January 30, 2026. Host- and Domain-Level Web Graphs November/December 2025 and January 2026.…
February 24, 2026. Host- and Domain-Level Web Graphs December 2025 and January/February 2026.…
February 1, 2025. Host- and Domain-Level Web Graphs November/December 2024 and January 2025. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2024 and January 2025.…
February 25, 2025. Host- and Domain-Level Web Graphs December 2024 and January/February 2025.…
February 10, 2021. Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021.…
October 10, 2023. Bridging Digital Exploration and Scientific Frontiers. This month Common Crawl Foundation members had the privilege of attending the 5th International Open Search Symposium at CERN in Geneva, Switzerland. Thom Vaughan.…
March 16, 2022. Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2021 and January 2022.…
April 23, 2025. Introducing the Host Index. Introducing the Host Index: a new dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defence data. Queryable via AWS tools or downloadable. Greg Lindahl.…
September 10, 2024. August/September 2024 Newsletter. We're pleased to announce our newsletter for August and September 2024. Jen English.…
November 3, 2025. October/November 2025 Newsletter. Check out our newsletter for October/November 2025, with updates on what we've been up to. Common Crawl Foundation.…
June 25, 2024. May/June 2024 Newsletter. We’re pleased to share our newsletter for May/June 2024, featuring the latest updates, events, and highlights from our community. Greg Lindahl. Greg is Chief Technology Officer at the Common Crawl Foundation.…
February 3, 2025. January/February 2025 Newsletter. We’re happy to share our January/February 2025 newsletter with updates and insights from the world of open data and web archiving. Jen English.…
August 26, 2025. July/August 2025 Newsletter. We are pleased to release our newsletter for July and August 2025, with updates on our team's activities. Jen English.…
June 24, 2025. May/June 2025 Newsletter. We're happy to share our newsletter for May/June 2025 with updates from our team. Jen English.…
November 25, 2024. October/November 2024 Newsletter. We’re pleased to announce this month's newsletter, featuring key updates, upcoming events, and community highlights. Jen English.…
April 14, 2025. March/April 2025 Newsletter. We're excited to share our newsletter for March/April 2025 with updates from our team. Jen English.…
October 20, 2025. Common Crawl Foundation at COLM 2025. The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community. Malte Ostendorff.…
May 28, 2025. May 2025 Crawl Archive Now Available. We are pleased to announce that the crawl archive for May 2025 is now available.…
June 27, 2025. June 2025 Crawl Archive Now Available. We are pleased to announce that the crawl archive for June 2025 is now available. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
November 23, 2025. November 2025 Crawl Archive Now Available. We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content. Malte Ostendorff.…
January 28, 2026. January 2026 Crawl Archive Now Available. We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content. Thom Vaughan.…
February 22, 2026. February 2026 Crawl Archive Now Available. We are pleased to announce the release of the February 2026 crawl, consisting of 2.1 billion web pages (or 363 TiB of uncompressed content).…
August 18, 2024. August 2024 Crawl Archive Now Available. The crawl archive for August 2024 is now available. The data was crawled between August 3rd and August 16th, and contains 2.3 billion web pages (or 327.4 TiB of uncompressed content). Thom Vaughan.…
February 23, 2025. February 2025 Crawl Archive Now Available. The crawl archive for February 2025 is now available. The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages (or 402 TiB of uncompressed content).…
February 2, 2022. January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.…
August 19, 2020. August 2020 crawl archive now available. The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th.…
August 22, 2022. August 2022 crawl archive now available. The crawl archive for August 2022 is now available! The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content.…
November 7, 2020. October 2020 crawl archive now available. The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.…
October 4, 2021. September 2021 crawl archive now available. The crawl archive for September 2021 is now available! The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content.…
July 28, 2024. July 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for July 2024 is now available, containing 2.5 billion web pages, or 360 TiB of uncompressed content. Thom Vaughan.…
May 1, 2024. April 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for April 2024 is now available.…
July 23, 2025. July 2025 Crawl Archive Now Available. The crawl archive for July 2025 is now available. Crawled between July 7th and July 21st, the data contains 2.42 billion web pages, or 419 TiB of uncompressed content. Thom Vaughan.…
September 22, 2025. September 2025 Crawl Archive Now Available. We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content. Thom Vaughan.…
September 24, 2024. September 2024 Crawl Archive Now Available. The crawl archive for September 2024 is now available.…
December 20, 2025. December 2025 Crawl Archive Now Available. The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content). Thom Vaughan.…
June 3, 2024. May 2024 Crawl Archive Now Available. The crawl archive for May 2024 is now available. The data was crawled between May 18th and May 31st, and contains 2.7 billion web pages (or 377 TiB of uncompressed content). This is our 100th crawl!…
June 28, 2024. Dialog and Discovery at AI_dev 2024. This month members from the Common Crawl Foundation attended the AI_dev: Open Source GenAI & ML Summit in Paris, where discussions focused on AI advancements, ethics, and Open Source solutions.…
February 2, 2021. January 2021 crawl archive now available. The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content.…
April 27, 2021. April 2021 crawl archive now available. The crawl archive for April 2021 is now available! The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content.…
March 4, 2020. February 2020 crawl archive now available. The crawl archive for February 2020 is now available! It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th.…
June 28, 2021. June 2021 crawl archive now available. The crawl archive for June 2021 is now available! The data was crawled June 12 – 25 and contains 2.45 billion web pages or 260 TiB of uncompressed content.…
July 20, 2020. July 2020 crawl archive now available. The crawl archive for July 2020 is now available! It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th.…
May 4, 2025. April 2025 Crawl Archive Now Available. Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content).…
February 11, 2026. CC-Citations: A Visualization of Research Papers Referencing Common Crawl. We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data. Malte Ostendorff.…