Search results
News Crawl. News is a text genre that is often discussed on our. user and developer mailing list. Yet our monthly crawl and release schedule is not well-adapted to this type of content which is based on developing and current events.…
December 2024 Crawl Archive Now Available. The crawl archive for December 2024 is now available. The data was crawled between December 1st and December 15th, and contains 2.64 billion web pages (or 394 TiB of uncompressed content).…
February 2025 Crawl Archive Now Available. The crawl archive for February 2025 is now available. The data was crawled between February 6th and February 20th, and contains 2.6 billion web pages (or 402 TiB of uncompressed content).…
Common Crawl Statistics Now Available on Hugging Face. We're excited to announce that Common Crawl’s statistics are now available on Hugging Face! Ford Heilizer.…
February 2019 crawl archive now available. The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th. Sebastian Nagel.…
July 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for July 2024 is now available, containing 2.5 billion web pages, or 360 TiB of uncompressed content. Thom Vaughan.…
April 2018 Crawl Archive Now Available. The crawl archive for April 2018 is now available! The archive contains 3.1 billion web pages and 230 TiB of uncompressed content, crawled between April 19th and 27th. Sebastian Nagel.…
November/December 2021 crawl archive now available. The crawl archive for November/December 2021 is now available! The data was crawled Nov 26 – Dec 9 and contains 2.5 billion web pages or 280 TiB of uncompressed content.…
July/August 2021 crawl archive now available. The crawl archive for July/August 2021 is now available! The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content.…
Common Crawl. General Questions. What is Common Crawl?…
Monthly Crawl Updates. Updates on our Policy Efforts. Roadmap and Future Plans. Common Crawl Citations in Academic Research. Common Crawl's impact on research has grown substantially since its beginning.…
September 2019 crawl archive now available. The crawl archive for September 2019 is now available! It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th.…
May 2021 crawl archive now available. The crawl archive for May 2021 is now available! The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content.…
May 2022 crawl archive now available. The crawl archive for May 2022 is now available! The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content.…
ARC Format (Legacy) Crawls. Our early crawls were archived using the ARC (Archive) format, not the WARC (Web ARChive) format. The ARC format, which predates WARC, was the initial format used for storing web crawl data.…
October 2020 crawl archive now available. The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.…
January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.…
August 2022 crawl archive now available. The crawl archive for August 2022 is now available! The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content.…
September 2021 crawl archive now available. The crawl archive for September 2021 is now available! The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content.…