Search results
April 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for April 2024 is now available. The data was crawled between April 12th and April 25th, and contains 2.7 billion web pages (or 386 TiB of uncompressed content).…
In crawls CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47, the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.…
Web Graph releases (which allow visualisation of the crawl metrics) are now done following each crawl release, giving better ranking information than before (graphs were previously released after every third crawl).…
This indicator is missing in our indexes for all previous crawl releases. In the CDX index this is referred to as "truncated", and the columnar index refers to this as "content_truncated".…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019.…
April 2025 Crawl Archive Now Available. Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content).…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2018.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March/April and May/June 2020.…
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of January, February, and March 2025. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2018.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019.…
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of February, March, and April 2025.…
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of October, November, and December 2024. The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2024 and January 2025.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.…
Announcing our February 2025 Web Graph release based on the crawls of December 2024 and January/February 2025, consisting of 267.4 million nodes and 2.7 billion edges at the host level, and 106.5 million nodes and 1.9 billion edges at the domain level.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, August 2024. The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan.…
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of August, September, and October 2024. The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August and September 2020.…
ARC Format (Legacy) Crawls. Our early crawls were archived using the ARC (Archive) format, not the WARC (Web ARChive) format. The ARC format, which predates WARC, was the initial format used for storing web crawl data.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2024. The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of April, May, June 2024. The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan.…
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of September, October, and November 2024. The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of September, November, February 2023-24. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, February, April 2024. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2018.…
Common Crawl has long been an incredibly valuable resource, offering a vast archive of web crawl data that is accessible to the public.…
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of May, June, and July 2024. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, September, and November of 2023. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, April, and May 2024. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February/March, April and May 2021.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2020 and January 2021.…
News Crawl. News is a text genre that is often discussed on our user and developer mailing list. Yet our monthly crawl and release schedule is not well-adapted to this type of content, which is based on developing and current events.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, May, and October 2023.…
Common Crawl Citations in Academic Research. Common Crawl Statistics on Hugging Face. Monthly Crawl Updates. Updates on our Policy Efforts. Roadmap and Future Plans. Common Crawl Citations in Academic Research.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the September/October, November/December 2022 and January/February 2023 crawls.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2021 and January 2022.…
February 2025 Crawl Archive Now Available. The crawl archive for February 2025 is now available. The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages (or 402 TiB of uncompressed content).…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July/August and September 2021.…
June 2024 Crawl Archive Now Available. The crawl archive for June 2024 is now available. The data was crawled between June 12th and June 26th, and contains 2.7 billion web pages (or 382 TiB of uncompressed content).…
November 2024 Crawl Archive Now Available. The crawl archive for November 2024 is now available. The data was crawled between November 1st and November 15th, and contains 2.68 billion web pages (or 405 TiB of uncompressed content).…
October 2024 Crawl Archive Now Available. The data was crawled between October 3rd and October 16th, and contains 2.49 billion web pages (or 365 TiB of uncompressed content).…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, June/July and August 2022.…
We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.…