Search results

Common Crawl - Blog - Introducing the Common Crawl Errata Page for Data Transparency

October 30, 2024. Introducing the Common Crawl Errata Page for Data Transparency. As part of our commitment to accuracy and transparency, we are pleased to introduce a new Errata page on our website. Thom Vaughan.…

Common Crawl - Blog - 3.25 Billion Pages Crawled in July 2018

July 28, 2018. 3.25 Billion Pages Crawled in July 2018. The crawl archive for July 2018 is now available! The archive contains 3.25 billion web pages and 255 TiB of uncompressed content, crawled between July 15th and 23rd. Sebastian Nagel.…

Common Crawl - Blog

Blog. The latest news, interviews, technologies, and resources.…

Common Crawl - Blog - March/April 2023 crawl archive now available

April 6, 2023. March/April 2023 crawl archive now available. The crawl archive for March/April 2023 is now available! The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content.…

Common Crawl - Blog - July 2024 Crawl Archive Now Available

July 28, 2024. July 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for July 2024 is now available, containing 2.5 billion web pages, or 360 TiB of uncompressed content. Thom Vaughan.…

Common Crawl - Blog - August Crawl Archive Introduces Language Annotations

August 26, 2018. August Crawl Archive Introduces Language Annotations. The crawl archive for August 2018 is now available! It contains 2.65 billion web pages and 220 TiB of uncompressed content, crawled between August 14th and 22nd.…

Common Crawl - Blog - Twelve steps to running your Ruby code across five billion web pages

March 26, 2012. Twelve steps to running your Ruby code across five billion web pages. The following is a guest blog post by Pete Warden, a member of the Common Crawl Advisory Board.…

Common Crawl - Blog - January 2022 crawl archive now available

February 2, 2022. January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.…

Common Crawl - Blog - January 2021 crawl archive now available

February 2, 2021. January 2021 crawl archive now available. The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content.…

Common Crawl - Blog - May 2022 crawl archive now available

June 2, 2022. May 2022 crawl archive now available. The crawl archive for May 2022 is now available! The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content.…

Common Crawl - Blog - Common Crawl Statistics Now Available on Hugging Face

July 22, 2024. Common Crawl Statistics Now Available on Hugging Face. We're excited to announce that Common Crawl’s statistics are now available on Hugging Face! Ford Heilizer. Ford is an emeritus member of the Common Crawl Foundation.…

Common Crawl - Blog - July 2019 crawl archive now available

July 30, 2019. July 2019 crawl archive now available. The crawl archive for July 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th. Sebastian Nagel.…

Common Crawl - Blog - May/June 2020 crawl archive now available

June 10, 2020. May/June 2020 crawl archive now available. The crawl archive for May/June 2020 is now available! It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

November 27, 2017. Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017.…

Common Crawl - Blog - June 2019 crawl archive now available

July 2, 2019. June 2019 crawl archive now available. The crawl archive for June 2019 is now available!…

Common Crawl - Blog - Answers to Recent Community Questions

It was wonderful to see our first blog post and the great piece by Marshall Kirkpatrick on ReadWriteWeb generate so much interest in Common Crawl last week! There were many questions raised on Twitter and in the comment sections of our blog, RWW and…

Common Crawl - Blog - Welcome, Sebastian!

May 13, 2016. Welcome, Sebastian! It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April.…

Common Crawl - Blog - August 2019 crawl archive now available

August 30, 2019. August 2019 crawl archive now available. The crawl archive for August 2019 is now available! It contains 2.95 billion web pages or 260 TiB of uncompressed content, crawled between August 17th and 26th. Sebastian Nagel.…

Common Crawl - Blog - OSCON 2012

June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon.…

Common Crawl - Blog - October 2016 Crawl Archive Now Available

November 7, 2016. October 2016 Crawl Archive Now Available. The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages. Sebastian Nagel.…

Common Crawl - FAQ

How does CCBot fetch a web page? CCBot is an automated crawler, checking first the robots.txt, and if crawling a page is allowed, fetches pages using HTTP GET requests. It supports both HTTP/1.1 and…

Common Crawl - Blog - September 2019 crawl archive now available

September 28, 2019. September 2019 crawl archive now available. The crawl archive for September 2019 is now available! It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th.…

Common Crawl - Blog - June 2024 Crawl Archive Now Available

June 28, 2024. June 2024 Crawl Archive Now Available. The crawl archive for June 2024 is now available. The data was crawled between June 12th and June 26th, and contains 2.7 billion web pages (or 382 TiB of uncompressed content).…

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

January 16, 2013. Analysis of the NCSU Library URLs in the Common Crawl Index. Last week we announced the Common Crawl URL Index.…

Common Crawl - Blog - April 2025 Crawl Archive Now Available

May 4, 2025. April 2025 Crawl Archive Now Available. Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content).…

Common Crawl - Erratum - Some 2–Level CCTLDs Excluded

Some 2–Level CCTLDs Excluded. A bad configuration was checked into our exclusion list on Sep 22, 2022 and was fixed on Oct 27, 2023. The configuration blocked a number of 2–level domains, meaning they were not included in certain crawls.…

Common Crawl - Blog - November 2024 Crawl Archive Now Available

November 18, 2024. November 2024 Crawl Archive Now Available. The crawl archive for November 2024 is now available.…

Common Crawl - Blog - October 2024 Crawl Archive Now Available

October 20, 2024. October 2024 Crawl Archive Now Available. The data was crawled between October 3rd and October 16th, and contains 2.49 billion web pages (or 365 TiB of uncompressed content).…

Common Crawl - Blog - December 2019 crawl archive now available

December 19, 2019. December 2019 crawl archive now available. The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th.…

Common Crawl - Blog - November/December 2020 crawl archive now available

December 10, 2020. November/December 2020 crawl archive now available. The crawl archive for November/December 2020 is now available!…

Common Crawl - Blog - October 2021 crawl archive now available

November 1, 2021. October 2021 crawl archive now available. The crawl archive for October 2021 is now available! The data was crawled Oct 15 – 28 and contains 3.3 billion web pages or 360 TiB of uncompressed content.…

Common Crawl - Blog - March/April 2020 crawl archive now available

April 14, 2020. March/April 2020 crawl archive now available. The crawl archive for March/April 2020 is now available! It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th.…

Common Crawl - Blog - September 2020 crawl archive now available

October 7, 2020. September 2020 crawl archive now available. The crawl archive for September 2020 is now available!…

Common Crawl - Blog - September 2021 crawl archive now available

October 4, 2021. September 2021 crawl archive now available. The crawl archive for September 2021 is now available! The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content.…

Common Crawl - Blog - October 2019 crawl archive now available

October 29, 2019. October 2019 crawl archive now available. The crawl archive for October 2019 is now available! It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th.…

Common Crawl - Blog - May/June 2023 crawl archive now available

June 21, 2023. May/June 2023 crawl archive now available. The crawl archive for May/June 2023 is now available! The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content.…

Common Crawl - Blog - September 2018 crawl archive now available

October 3, 2018. September 2018 crawl archive now available. The crawl archive for September 2018 is now available! It contains 2.8 billion web pages and 220 TiB of uncompressed content, crawled between September 17th and 26th.…

Common Crawl - Blog - You can now build directly on Common Crawl from the browser

May 6, 2026. You can now build directly on Common Crawl from the browser. Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages. Thom Vaughan.…

Common Crawl - Blog - February 2025 Crawl Archive Now Available

February 23, 2025. February 2025 Crawl Archive Now Available. The crawl archive for February 2025 is now available.…

Common Crawl - Blog - May 2021 crawl archive now available

May 23, 2021. May 2021 crawl archive now available. The crawl archive for May 2021 is now available! The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content.…

Common Crawl - Blog - June/July 2022 crawl archive now available

July 13, 2022. June/July 2022 crawl archive now available. The crawl archive for June/July 2022 is now available! The data was crawled June 24 – July 7 and contains 3.1 billion web pages or 370 TiB of uncompressed content.…

Common Crawl - Blog - April 2024 Crawl Archive Now Available

May 1, 2024. April 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for April 2024 is now available.…

Common Crawl - Blog - IETF 123 Report

August 4, 2025. IETF 123 Report. A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Announcing GneissWeb Annotations

October 6, 2025. Announcing GneissWeb Annotations.…

Common Crawl - Blog - GneissWeb Annotations Examples

January 13, 2026. GneissWeb Annotations Examples. A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket. Thijs Dalhuijsen. Thijs Dalhuijsen is Engineering Manager at Common Crawl.…

Common Crawl - Blog - Data 2.0 Summit

March 28, 2012. Data 2.0 Summit. Next week a few members of the Common Crawl team are going to the Data 2.0 Summit in San Francisco. Common Crawl Foundation.…

Common Crawl - Blog - Web Data Commons

March 22, 2012. Web Data Commons.…

Common Crawl - Blog - News Dataset Available

October 4, 2016. News Dataset Available. We are pleased to announce the release of a new dataset containing news articles from news sites all over the world. Sebastian Nagel.…

Common Crawl - Blog - New Crawl Data Available!

November 27, 2013. New Crawl Data Available! We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).…

Common Crawl - Blog - URL Search Tool!

March 5, 2013. URL Search Tool! A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.…

Common Crawl - Blog - How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals

January 19, 2026. How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals.…

Common Crawl - Blog - August 2020 crawl archive now available

August 19, 2020. August 2020 crawl archive now available. The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th.…

Common Crawl - Blog - October 2020 crawl archive now available

November 7, 2020. October 2020 crawl archive now available. The crawl archive for October 2020 is now available!…

Common Crawl - Blog - Introducing cc-downloader

January 21, 2025. Introducing cc-downloader. Introducing a command-line tool written in Rust for downloading data from Common Crawl. Pedro Ortiz Suarez. Pedro is a Principal Research Scientist at the Common Crawl Foundation.…

Common Crawl - Blog - Analyzing a Web graph with 129 billion edges using FlashGraph

February 25, 2015. Analyzing a Web graph with 129 billion edges using FlashGraph. This is a guest blog post by Da Zheng, the architect and main developer of the FlashGraph project.…

Common Crawl - Blog - June 2021 crawl archive now available

June 28, 2021. June 2021 crawl archive now available. The crawl archive for June 2021 is now available! The data was crawled June 12 – 25 and contains 2.45 billion web pages or 260 TiB of uncompressed content.…

Common Crawl - Blog - April 2021 crawl archive now available

April 27, 2021. April 2021 crawl archive now available. The crawl archive for April 2021 is now available! The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content.…

Common Crawl - Blog - Announcing the Common Crawl Index!

April 8, 2015. Announcing the Common Crawl Index! This is a guest post by Ilya Kreymer, a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl.…

Common Crawl - Blog - January/February 2023 crawl archive now available

February 16, 2023. January/February 2023 crawl archive now available. The crawl archive for January/February 2023 is now available!…

Common Crawl - Blog - January 2020 crawl archive now available

February 3, 2020. January 2020 crawl archive now available. The crawl archive for January 2020 is now available! It contains 3.1 billion web pages or 300 TiB of uncompressed content, crawled between January 17th and 29th.…
