Search results

Common Crawl - Blog - Common Crawl URL Index

Common Crawl URL Index. We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool. Scott Robertson.…

Common Crawl - Blog - URL Search Tool!

URL Search Tool! A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.…

Common Crawl - Blog - May 2017 Crawl Archive Now Available

160 million URLs are a random sample extracted from sitemaps.…

Common Crawl - Blog - June 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-26/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.…

Common Crawl - Blog - February 2016 Crawl Archive Now Available

This crawl archive holds more than 1.73 billion urls. Julien Nioche. Julien is a member of the Apache Software Foundation, emeritus member of the Common Crawl Foundation.…

Common Crawl - Blog - March 2017 Crawl Archive Now Available

Harmonic Centrality, and added 600 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 8 million hosts; used sitemaps.…

Common Crawl - Blog - May 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-22/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.…

Common Crawl - Blog - April 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-18/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.…

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

Analysis of the NCSU Library URLs in the Common Crawl Index. Last week we announced the Common Crawl URL Index.…

Common Crawl - Blog - April 2017 Crawl Archive Now Available

Harmonic Centrality, and added 390 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 16 million hosts; used sitemaps.…

Common Crawl - Blog - December 2017 Crawl Archive Now Available

To improve coverage and freshness we added 650 million new URLs (not contained in any crawl archive before) sampled from sitemaps if provided by any of the top 80 million hosts taken from the…

Common Crawl - Blog - July 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-30/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.…

Common Crawl - Blog - September 2015 Crawl Archive Now Available

This crawl archive is over 106TB in size and holds more than 1.32 billion urls. Ilya Kreymer. Ilya Kreymer is Lead Software Engineer at Webrecorder Software.…

Common Crawl - Blog - November 2015 Crawl Archive Now Available

This crawl archive is over 151TB in size and holds more than 1.82 billion urls. Ilya Kreymer. Ilya Kreymer is Lead Software Engineer at Webrecorder Software.…

Common Crawl - Blog - February 2018 Crawl Archive Now Available

The February crawl contains more than one billion new URLs, not contained in any crawl archive before. New URLs are “mined” by extracting and sampling URLs from sitemaps if provided by any of the highest-ranking 100 million hosts taken from the…

Common Crawl - Blog - August 2017 Crawl Archive Now Available

May/June/July 2017 webgraph data set and added over 800 million new URLs (not contained in any crawl archive before), of which 300 million URLs were found by a side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 40 million…

Common Crawl - Blog - October 2017 Crawl Archive Now Available

To improve coverage and freshness we added over 900 million new URLs (not contained in any crawl archive before): 350 million URLs are a random sample extracted from sitemaps if provided by any of the top 80 million hosts taken from the…

Common Crawl - Blog - February 2017 Crawl Archive Now Available

To extend the coverage of the crawl we continued to use sitemaps to find fresh URLs for known hosts; added 250 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 5 million hosts. We also ranked these hosts by…

Common Crawl - Blog - August 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-36/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.…

Common Crawl - Blog - September 2017 Crawl Archive Now Available

To improve coverage and freshness we added one billion new URLs (not contained in any crawl archive before): 300 million URLs are a random sample extracted from sitemaps if provided by any of the top 60 million hosts taken from the…

Common Crawl - Blog - January 2017 Crawl Archive Now Available

To extend the coverage of the crawl we continued to use sitemaps to find fresh URLs for already known hosts; added all accessible URLs from the top-million domains from Alexa.…

Common Crawl - Blog - June 2017 Crawl Archive Now Available

Feb/Mar/Apr 2017 webgraph data set and added almost 800 million new URLs (not contained in any crawl archive before), of which: 500 million URLs were found by a side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 40 million…

Common Crawl - Blog - July 2017 Crawl Archive Now Available

Feb/Mar/Apr 2017 webgraph data set and added over 550 million new URLs (not contained in any crawl archive before), of which: 300 million URLs were found by a side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 50 million hosts…

Common Crawl - Erratum - Missing content_truncated flag in URL indexes

Missing content_truncated flag in URL indexes. The flag in our URL indexes (CDX and columnar) that indicates whether or not a WARC record payload was truncated was added in CC-MAIN-2019-47.…

Common Crawl - Blog - September 2016 Crawl Archive Now Available

For the majority of sitemaps, a maximum of 5,000 potential new URLs per-sitemap were allowed. For the top 5,000 hosts/sitemaps, up to 200,000 potential new URLs were allowed.…
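
The per-sitemap limits quoted in this result can be pictured with a small Python sketch. The function name, the `is_top_host` flag, and the choice to keep the first URLs of each sitemap are assumptions made for illustration; the snippet does not say how the crawler actually selected URLs within the limit.

```python
# Limits as stated in the snippet above.
DEFAULT_CAP = 5_000      # majority of sitemaps
TOP_HOST_CAP = 200_000   # top 5,000 hosts/sitemaps


def cap_sitemap_urls(urls, is_top_host=False):
    """Return at most the allowed number of candidate URLs for one sitemap.

    Assumption: simple truncation; the real selection strategy is not
    described in the source.
    """
    limit = TOP_HOST_CAP if is_top_host else DEFAULT_CAP
    return urls[:limit]


if __name__ == "__main__":
    candidates = [f"https://example.com/page/{i}" for i in range(6_000)]
    print(len(cap_sitemap_urls(candidates)))                    # ordinary sitemap
    print(len(cap_sitemap_urls(candidates, is_top_host=True)))  # top host
```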

Common Crawl - Blog - January 2018 Crawl Archive Now Available

The January crawl contains 1.1 billion new URLs, not contained in any crawl archive before. New URLs are “mined” by extracting and sampling URLs from sitemaps if provided by any of the highest-ranking 100 million hosts taken from the…

Common Crawl - Blog - December 2016 Crawl Archive Now Available

October crawls, we used sitemaps to find new URLs for already known hosts shortly before the crawl was launched. In addition to the top-million domains from Alexa, sitemaps were mined for a list of multi-lingual sites. Thanks to the…

Common Crawl - Blog - October 2016 Crawl Archive Now Available

The resulting crawl included 2 billion new URLs, not contained in previous crawls. We are grateful to webxtrakt for donating a list of 14 million verified, DNS-resolvable domain names of European country-code TLDs (.eu, .fr, .be, .de, .ch, .nl, .pl).…

Common Crawl - Blog - November 2017 Crawl Archive Now Available

To improve coverage and freshness we added 750 million new URLs (not contained in any crawl archive before) sampled from sitemaps if provided by any of the top 80 million hosts taken from the…

Common Crawl - Blog - July 2020 crawl archive now available

It includes page captures of 1.1 billion URLs unknown in any of our prior crawl archives. Bug Fixes and Improvements.…

Common Crawl - Erratum - WARC-Target-URI May Include Non-ASCII Characters

WARC-Target-URI header in WARC record, but also corresponding WAT, WET and URL index records may include non-ASCII characters, not encoded using percent-encoding or Punycode. The issue has been fixed for June 2024 (CC-MAIN-2024-26).…
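
For readers who want to normalize affected URLs themselves, the following is a minimal Python sketch using only the standard library. The function name `to_ascii_url` is hypothetical, and the normalization shown (Python's built-in IDNA codec for the host, UTF-8 percent-encoding for path and query) is an assumption; it is not necessarily the normalization Common Crawl applied in its fix. The sketch drops any userinfo and does not handle IPv6 literals.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Return an ASCII-only form of `url` (illustrative, not Common Crawl's code)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Punycode-encode the host via Python's built-in IDNA (2003) codec.
    netloc = host.encode("idna").decode("ascii")
    if parts.port:
        netloc += f":{parts.port}"
    # Percent-encode non-ASCII characters; "%" stays safe so that
    # existing escapes are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(to_ascii_url("http://bücher.example/straße?q=é"))
```

URLs that are already plain ASCII pass through unchanged, so the function can be applied to every record without first checking which crawls are affected.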

Common Crawl - Erratum - Missing WARC File

The corresponding WAT file is present, as well as the URL index entries contained in the missing WARC file. For more details, see the release announcement in the Common Crawl Google Group.…

Common Crawl - Overview

Common Crawl URL Index. Check out the Example Projects, view Use Cases, or Statistics for our crawls.…

Common Crawl - Blog - September 2018 crawl archive now available

(HTTP status 304) in the URL indexes do not include a field for the payload "digest" anymore.…

Common Crawl - Erratum - Redirect target URL in URL indexes may be a relative URL

Redirect target URL in URL indexes may be a relative URL. Originally reported by Sebastian Nagel.…
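
A relative redirect target can be resolved against the URL of the captured page. The Python sketch below uses `urllib.parse.urljoin` from the standard library; the function and argument names are hypothetical and do not correspond to field names in the index.

```python
from urllib.parse import urljoin


def resolve_redirect(page_url: str, redirect_target: str) -> str:
    """Resolve a possibly relative redirect target against the page URL.

    Absolute targets are returned unchanged by urljoin.
    """
    return urljoin(page_url, redirect_target)


if __name__ == "__main__":
    base = "https://example.com/a/b.html"
    for target in ("/login", "c.html", "https://other.example/x"):
        print(target, "->", resolve_redirect(base, target))
```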

Common Crawl - Blog - January 2021 crawl archive now available

It includes page captures of 1.15 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for January 2021 is now available!…

Common Crawl - Blog - April 2021 crawl archive now available

It includes page captures of 1.35 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for April 2021 is now available!…

Common Crawl - Blog - June 2021 crawl archive now available

It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for June 2021 is now available!…

Common Crawl - Blog - September 2021 crawl archive now available

Common Crawl - Blog - October 2019 crawl archive now available

It includes page captures of 1.1 billion URLs not contained in any crawl archive before. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for October 2019 is now available!…

Common Crawl - Blog - January 2022 crawl archive now available

Common Crawl - Blog - October 2020 crawl archive now available

It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for October 2020 is now available!…

Common Crawl - Blog - March/April 2023 crawl archive now available

Page captures are from 43 million hosts or 34 million registered domains and include 1.2 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - August 2020 crawl archive now available

It includes page captures of 940 million URLs unknown in any of our prior crawl archives. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for August 2020 is now available!…

Common Crawl - Blog - May 2021 crawl archive now available

It includes page captures of 1.28 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for May 2021 is now available!…

Common Crawl - Blog - November/December 2022 crawl archive now available

Page captures are from 44 million hosts or 34 million registered domains and include 1.2 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - August 2022 crawl archive now available

Page captures are from 46 million hosts or 37 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - November/December 2020 crawl archive now available

It includes page captures of 1.4 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for November/December 2020 is now available!…

Common Crawl - Blog - October 2021 crawl archive now available

Common Crawl - Blog - December 2019 crawl archive now available

It includes page captures of 850 million URLs not contained in any crawl archive before. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for December 2019 is now available!…

Common Crawl - Blog - March/April 2020 crawl archive now available

It includes page captures of 1 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for March/April 2020 is now available!…

Common Crawl - Blog - September 2020 crawl archive now available

Common Crawl - Blog - August Crawl Archive Introduces Language Annotations

ISO-639-3 code are shown in the URL index as a new field, e.g. "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage: On GitHub you'll find the…
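
The comma-separated `languages` value shown in this snippet can be split into individual codes as sketched below in Python. The sample record is made up for illustration (only the `languages` field and its format come from the snippet), and the helper name is hypothetical. Records from crawls that predate the annotation would lack the field, so the sketch returns an empty list in that case.

```python
import json


def languages_of(record: dict) -> list:
    """Return the ISO-639-3 codes from a record's "languages" field.

    Returns an empty list when the field is missing or empty.
    """
    value = record.get("languages")
    return value.split(",") if value else []


if __name__ == "__main__":
    # Made-up record; only the "languages" field mirrors the snippet.
    record = json.loads('{"url": "https://example.com/", "languages": "zho,eng"}')
    print(languages_of(record))
```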

Common Crawl - Blog - May 2022 crawl archive now available

Page captures are from 45 million hosts or 36 million registered domains and include 1.4 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - January/February 2023 crawl archive now available

Page captures are from 40 million hosts or 33 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - September 2019 crawl archive now available

It includes page captures of 1.0 billion URLs not contained in any crawl archive before. The other 1.5 billion pages have been already captured in prior crawls and are now revisited. Sebastian Nagel.…

Common Crawl - Blog - 3.25 Billion Pages Crawled in July 2018

The July crawl contains 625 million new URLs, not contained in any crawl archive before. New URLs are “mined” by extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls.…

Common Crawl - Blog - February 2020 crawl archive now available

Common Crawl - Blog - September/October 2022 crawl archive now available

Page captures are from 44 million hosts or 34 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls. This crawl includes improvements made in extracting clean text in WET files and WAT anchor texts.…

Common Crawl - Blog - March 2018 Crawl Archive Now Available

The March crawl contains 800 million new URLs, not contained in any crawl archive before. New URLs are “mined” by extracting and sampling URLs from sitemaps if provided by any of the highest-ranking 100 million hosts taken from the…
