Search results

Common Crawl - Blog - April 2024 Crawl Archive Now Available

April 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for April 2024 is now available. The data was crawled between April 12th and April 25th, and contains 2.7 billion web pages (or 386 TiB of uncompressed content).

Common Crawl - Erratum - Incorrect fetch_time metadata

In crawls CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47, the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.
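As a rough illustration of the erratum above, the following sketch checks whether a crawl ID falls inside one of the two affected ranges. It assumes that the two numeric fields of a crawl ID (for example `2016` and `36` in `CC-MAIN-2016-36`) order crawls chronologically; the function names `parse_crawl_id` and `robots_fetch_time_suspect` are hypothetical and are not part of any Common Crawl tooling.

```python
import re

# Affected crawl ranges as listed in the erratum, stored as
# (first field, second field) pairs with inclusive bounds.
AFFECTED_RANGES = [
    ((2016, 36), (2016, 50)),
    ((2018, 34), (2019, 47)),
]

# Assumed ID shape: CC-MAIN-<4 digits>-<2 digits>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def parse_crawl_id(crawl_id):
    """Split a crawl ID into a sortable (int, int) tuple."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    return int(match.group(1)), int(match.group(2))


def robots_fetch_time_suspect(crawl_id):
    """Return True if the crawl lies in a range named by the erratum."""
    key = parse_crawl_id(crawl_id)
    # Tuple comparison orders by the first field, then the second.
    return any(low <= key <= high for low, high in AFFECTED_RANGES)
```

A call such as `robots_fetch_time_suspect("CC-MAIN-2018-51")` would flag that crawl, since it sits between the second range's bounds; crawls outside both ranges are not flagged.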

Common Crawl - Blog - March/April 2024 Newsletter

Web Graph releases (which allow visualisation of the crawl metrics) are now done following each crawl release, giving better ranking information than before (graphs were previously released after every third crawl).

Common Crawl - Erratum - Missing content_truncated flag in URL indexes

This indicator is missing in our indexes for all previous crawl releases. In the CDX index this is referred to as "truncated", and the columnar index refers to this as "content_truncated".
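Because the two indexes use different names for the same indicator, code that reads records from either one may want a single accessor. The sketch below is a minimal illustration, assuming each index record has already been loaded into a Python dict; the helper name `truncation_flag` is hypothetical, and the snippet above does not specify what values the field holds, so the value is returned unchanged.

```python
def truncation_flag(record):
    """Return the truncation indicator from an index record.

    Looks for the CDX-style key "truncated" first, then the
    columnar-style key "content_truncated". Returns None when
    neither key is present, which means "unknown" rather than
    "not truncated", since the erratum says the indicator is
    missing from some indexes.
    """
    for key in ("truncated", "content_truncated"):
        if key in record:
            return record[key]
    return None
```

Treating an absent key as unknown avoids silently classifying records from affected releases as complete.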

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019.

Common Crawl - Blog - April 2025 Crawl Archive Now Available

April 2025 Crawl Archive Now Available. Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content).

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2018.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/May 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March/April and May/June 2020.

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of January, February, and March 2025. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2018.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of February, March, and April 2025.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of October, November, and December 2024. The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2024 and January 2025.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

Announcing our February 2025 Web Graph release based on the crawls of December 2024 and January/February 2025, consisting of 267.4 million nodes and 2.7 billion edges at the host level, and 106.5 million nodes and 1.9 billion edges at the domain level.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, August 2024. The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan.

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2024

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of August, September, and October 2024. The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August and September 2020.

Common Crawl - Erratum - ARC Format (Legacy) Crawls

ARC Format (Legacy) Crawls. Our early crawls were archived using the ARC (Archive) format, not the WARC (Web ARChive) format. The ARC format, which predates WARC, was the initial format used for storing web crawl data.

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2024. The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan.

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of April, May, June 2024. The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, November 2024

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of September, October, and November 2024. The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of September, November, February 2023-24. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, February, April 2024. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2018.

Common Crawl - Blog - Common Crawl Statistics Now Available on Hugging Face

Common Crawl has long been an incredibly valuable resource, offering a vast archive of web crawl data that is accessible to the public.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2024

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of May, June, and July 2024. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, September, and November of 2023. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April, and May 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, April, and May 2024. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April and May 2021

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February/March, April and May 2021.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2020 and January 2021.

Common Crawl - News Crawl

News Crawl. News is a text genre that is often discussed on our user and developer mailing list. Yet our monthly crawl and release schedule is not well-adapted to this type of content which is based on developing and current events.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, May, and October 2023.

Common Crawl - Blog - August/September 2024 Newsletter

Common Crawl Citations in Academic Research. Common Crawl Statistics on Hugging Face. Monthly Crawl Updates. Updates on our Policy Efforts. Roadmap and Future Plans.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

We are pleased to announce a new release of host-level and domain-level web graphs based on the September/October, November/December 2022 and January/February 2023 crawls.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2021 and January 2022.

Common Crawl - Blog - February 2025 Crawl Archive Now Available

February 2025 Crawl Archive Now Available. The crawl archive for February 2025 is now available. The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages (or 402 TiB of uncompressed content).

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July/August and September 2021

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July/August and September 2021.

Common Crawl - Blog - June 2024 Crawl Archive Now Available

June 2024 Crawl Archive Now Available. The crawl archive for June 2024 is now available. The data was crawled between June 12th and June 26th, and contains 2.7 billion web pages (or 382 TiB of uncompressed content).

Common Crawl - Blog - November 2024 Crawl Archive Now Available

November 2024 Crawl Archive Now Available. The crawl archive for November 2024 is now available. The data was crawled between November 1st and November 15th, and contains 2.68 billion web pages (or 405 TiB of uncompressed content).

Common Crawl - Blog - October 2024 Crawl Archive Now Available

October 2024 Crawl Archive Now Available. The data was crawled between October 3rd and October 16th, and contains 2.49 billion web pages (or 365 TiB of uncompressed content).

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June/July and August 2022

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, June/July and August 2022.

Common Crawl - Blog - Common Crawl's First In-House Web Graph

We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.
