Search results

Common Crawl - Blog - January 2021 crawl archive now available

February 2, 2021. January 2021 crawl archive now available. The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content.

Common Crawl - Blog - March/April 2023 crawl archive now available

April 6, 2023. March/April 2023 crawl archive now available. The crawl archive for March/April 2023 is now available! The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - May 2022 crawl archive now available

June 2, 2022. May 2022 crawl archive now available. The crawl archive for May 2022 is now available! The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - January 2022 crawl archive now available

February 2, 2022. January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - February 2020 crawl archive now available

March 4, 2020. February 2020 crawl archive now available. The crawl archive for February 2020 is now available! It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th.

Common Crawl - Blog - August 2020 crawl archive now available

August 19, 2020. August 2020 crawl archive now available. The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th.

Common Crawl - Blog - July 2020 crawl archive now available

July 20, 2020. July 2020 crawl archive now available. The crawl archive for July 2020 is now available! It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th.

Common Crawl - Blog - January 2020 crawl archive now available

February 3, 2020. January 2020 crawl archive now available. The crawl archive for January 2020 is now available! It contains 3.1 billion web pages or 300 TiB of uncompressed content, crawled between January 17th and 29th.

Common Crawl - Blog - July 2024 Crawl Archive Now Available

July 28, 2024. July 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for July 2024 is now available, containing 2.5 billion web pages, or 360 TiB of uncompressed content. Thom Vaughan.

Common Crawl - Blog - May/June 2020 crawl archive now available

June 10, 2020. May/June 2020 crawl archive now available. The crawl archive for May/June 2020 is now available! It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th.

Common Crawl - Blog - March/April 2020 crawl archive now available

April 14, 2020. March/April 2020 crawl archive now available. The crawl archive for March/April 2020 is now available! It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th.

Common Crawl - Blog - October 2020 crawl archive now available

November 7, 2020. October 2020 crawl archive now available. The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - April 2021 crawl archive now available

April 27, 2021. April 2021 crawl archive now available. The crawl archive for April 2021 is now available! The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - June 2021 crawl archive now available

June 28, 2021. June 2021 crawl archive now available. The crawl archive for June 2021 is now available! The data was crawled June 12 – 25 and contains 2.45 billion web pages or 260 TiB of uncompressed content.

Common Crawl - Blog - November/December 2020 crawl archive now available

December 10, 2020. November/December 2020 crawl archive now available. The crawl archive for November/December 2020 is now available!

Common Crawl - Blog - September 2020 crawl archive now available

October 7, 2020. September 2020 crawl archive now available. The crawl archive for September 2020 is now available! The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content.

Common Crawl - Blog - September 2021 crawl archive now available

October 4, 2021. September 2021 crawl archive now available. The crawl archive for September 2021 is now available! The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content.

Common Crawl - Blog - July/August 2021 crawl archive now available

August 9, 2021. July/August 2021 crawl archive now available. The crawl archive for July/August 2021 is now available! The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content.

Common Crawl - Blog - October 2021 crawl archive now available

November 1, 2021. October 2021 crawl archive now available. The crawl archive for October 2021 is now available! The data was crawled Oct 15 – 28 and contains 3.3 billion web pages or 360 TiB of uncompressed content.

Common Crawl - Blog - May 2021 crawl archive now available

May 23, 2021. May 2021 crawl archive now available. The crawl archive for May 2021 is now available! The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

May 5, 2024. Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November/December 2023, February/March 2024, and April 2024.

Common Crawl - Blog - November/December 2021 crawl archive now available

December 14, 2021. November/December 2021 crawl archive now available. The crawl archive for November/December 2021 is now available! The data was crawled Nov 26 – Dec 9 and contains 2.5 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - February/March 2021 crawl archive now available

March 14, 2021. February/March 2021 crawl archive now available. The crawl archive for February/March 2021 is now available! The data was crawled between February 24th and March 9th and contains 2.7 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - May/June 2023 crawl archive now available

June 21, 2023. May/June 2023 crawl archive now available. The crawl archive for May/June 2023 is now available! The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content.

Common Crawl - Blog - August 2022 crawl archive now available

August 22, 2022. August 2022 crawl archive now available. The crawl archive for August 2022 is now available! The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

February 10, 2021. Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021.

Common Crawl - Erratum - Some 2–Level CCTLDs Excluded

Some 2–Level CCTLDs Excluded. A bad configuration was checked into our exclusion list on Sep 22, 2022 and was fixed on Oct 27, 2023. The configuration blocked a number of 2–level domains, meaning they were not included in certain crawls.

Common Crawl - Blog - June 2024 Crawl Archive Now Available

June 28, 2024. June 2024 Crawl Archive Now Available. The crawl archive for June 2024 is now available. The data was crawled between June 12th and June 26th, and contains 2.7 billion web pages (or 382 TiB of uncompressed content).

Common Crawl - Blog - September/October 2023 crawl archive now available

October 12, 2023. September/October 2023 crawl archive now available. The crawl archive for September/October 2023 is now available! The data was crawled Sept 21 – October 5 and contains 3.4 billion web pages or 456 TiB of uncompressed content.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022

March 16, 2022. Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2021 and January 2022.

Common Crawl - Blog - August Crawl Archive Introduces Language Annotations

It contains 2.65 billion web pages and 220 TiB of uncompressed content, crawled between August 14th and 22nd.

Common Crawl - Blog - November 2024 Crawl Archive Now Available

November 18, 2024. November 2024 Crawl Archive Now Available. The crawl archive for November 2024 is now available. The data was crawled between November 1st and November 15th, and contains 2.68 billion web pages (or 405 TiB of uncompressed content).

Common Crawl - Blog - April 2024 Crawl Archive Now Available

May 1, 2024. April 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for April 2024 is now available.

Common Crawl - Blog - June/July 2022 crawl archive now available

July 13, 2022. June/July 2022 crawl archive now available. The crawl archive for June/July 2022 is now available! The data was crawled June 24 – July 7 and contains 3.1 billion web pages or 370 TiB of uncompressed content.

Common Crawl - Blog - February 2025 Crawl Archive Now Available

February 23, 2025. February 2025 Crawl Archive Now Available. The crawl archive for February 2025 is now available. The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages (or 402 TiB of uncompressed content).

Common Crawl - Blog - April 2025 Crawl Archive Now Available

May 4, 2025. April 2025 Crawl Archive Now Available. Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content).

Common Crawl - Blog - November/December 2022 crawl archive now available

December 14, 2022. November/December 2022 crawl archive now available. The crawl archive for November/December 2022 is now available! The data was crawled November 26 – December 10 and contains 3.35 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - December 2024 Crawl Archive Now Available

December 18, 2024. December 2024 Crawl Archive Now Available. The crawl archive for December 2024 is now available. The data was crawled between December 1st and December 15th, and contains 2.64 billion web pages (or 394 TiB of uncompressed content).

Common Crawl - Blog - January/February 2023 crawl archive now available

February 16, 2023. January/February 2023 crawl archive now available. The crawl archive for January/February 2023 is now available! The data was crawled January 26 – February 9 and contains 3.15 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - Common Crawl Statistics Now Available on Hugging Face

July 22, 2024. Common Crawl Statistics Now Available on Hugging Face. We're excited to announce that Common Crawl’s statistics are now available on Hugging Face! Ford Heilizer.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

February 1, 2025. Host- and Domain-Level Web Graphs November/December 2024 and January 2025. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2024 and January 2025.

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

February 25, 2025. Host- and Domain-Level Web Graphs December 2024 and January/February 2025.

Common Crawl - Blog - September 2024 Crawl Archive Now Available

September 24, 2024. September 2024 Crawl Archive Now Available. The crawl archive for September 2024 is now available.

Common Crawl - Blog - July 2019 crawl archive now available

The crawl archive for July 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - January 2025 Crawl Archive Now Available

January 31, 2025. January 2025 Crawl Archive Now Available. We're pleased to announce our first crawl of 2025, containing 3.0 billion pages, and 460 TiB uncompressed content. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - November/December 2023 Crawl Archive Now Available

December 15, 2023. November/December 2023 Crawl Archive Now Available. The crawl archive for November/December 2023 is now available.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

As compared to prior web graphs, two changes are caused by the large size of this host-level graph (5.1 billion hosts): The text dump of the graph is split into multiple files; there is no page rank calculation at this time.

Common Crawl - Blog - February/March 2024 Crawl Archive Now Available

March 11, 2024. February/March 2024 Crawl Archive Now Available. The crawl archive for February/March 2024 is now available.

Common Crawl - Blog - Introducing the Common Crawl Errata Page for Data Transparency

October 30, 2024. Introducing the Common Crawl Errata Page for Data Transparency. As part of our commitment to accuracy and transparency, we are pleased to introduce a new Errata page on our website. Thom Vaughan.

Common Crawl - Blog - September 2019 crawl archive now available

It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th. It includes page captures of 1.0 billion URLs not contained in any crawl archive before.

Common Crawl - Blog - New Crawl Data Available!

The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed). Common Crawl Foundation. Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Common Crawl - Blog - October 2019 crawl archive now available

It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before. Sebastian Nagel.

Common Crawl - Blog - December 2019 crawl archive now available

It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th. It includes page captures of 850 million URLs not contained in any crawl archive before. Sebastian Nagel.

Common Crawl - Blog - June 2019 crawl archive now available

July 2, 2019. June 2019 crawl archive now available. The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th with an operational break from 21st to 24th.

Common Crawl - Blog - August 2019 crawl archive now available

The crawl archive for August 2019 is now available! It contains 2.95 billion web pages or 260 TiB of uncompressed content, crawled between August 17th and 26th. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

March 14, 2024. Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

March 15, 2023. Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023.

Common Crawl - Blog - September/October 2022 crawl archive now available

October 11, 2022. September/October 2022 crawl archive now available. The crawl archive for September/October 2022 is now available! The data was crawled September 24 – October 8 and contains 3.15 billion web pages or 380 TiB of uncompressed content.

Common Crawl - Blog - The Environmental Impact of the Cloud - the Common Crawl Case Study

Originally posted on LinkedIn by Julien Nioche on 26th March 2024. Generated with AI by https://designer.microsoft.com/.

Common Crawl - Blog - Announcing the Common Crawl Index!

This query returns: This indicates that there are 989 total pages, at 5 compressed index blocks per page!
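
The relationship between pages and index blocks in the snippet above can be sketched as simple arithmetic. This is only an illustration, assuming each page holds a fixed number of compressed index blocks and the last page may be partly filled. The function names `num_pages` and `block_range` are hypothetical, and the query's actual output is elided in the snippet, so the real block count is not reproduced here.

```python
def num_pages(total_blocks, blocks_per_page):
    """Pages needed to cover all blocks (assumes a fixed page size)."""
    if blocks_per_page <= 0:
        raise ValueError("blocks_per_page must be positive")
    # Integer ceiling division; avoids floating-point rounding.
    return -(-total_blocks // blocks_per_page)


def block_range(pages, blocks_per_page):
    """Smallest and largest block counts consistent with a page count."""
    if pages <= 0:
        return (0, 0)
    # The last page holds between 1 and blocks_per_page blocks.
    smallest = (pages - 1) * blocks_per_page + 1
    largest = pages * blocks_per_page
    return (smallest, largest)


# 989 pages at 5 blocks per page, as stated in the snippet.
low, high = block_range(989, 5)
print(low, high)
```

Under these assumptions, 989 pages at 5 blocks per page implies that the index has somewhere between `low` and `high` compressed blocks in total.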