Search results

Common Crawl - Blog - Announcing the Common Crawl Index!

‘ZipNum’ CDX format, and it is the same format that is used by the Wayback Machine at the Internet Archive. Index Query API.

Common Crawl - Erratum - Missing content_truncated flag in URL indexes

The flag in our URL indexes (CDX and columnar) that indicates whether or not a WARC record payload was truncated was added in CC-MAIN-2019-47. This indicator is missing in our indexes for all previous crawl releases.
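A minimal sketch of how a client might check whether a crawl's indexes can be expected to carry the truncation flag. It assumes crawl IDs of the form `CC-MAIN-YYYY-WW` and compares them against CC-MAIN-2019-47; the helper names `crawl_key` and `has_truncated_flag` are illustrative and not part of any Common Crawl tooling.

```python
import re

# First crawl whose indexes carry the truncation flag, per the erratum.
FIRST_CRAWL_WITH_FLAG = "CC-MAIN-2019-47"

# Assumption: crawl IDs look like CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawl_key(crawl_id):
    """Return (year, week) so that crawl IDs can be compared in order."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    return int(match.group(1)), int(match.group(2))


def has_truncated_flag(crawl_id):
    """True if the crawl is CC-MAIN-2019-47 or a later one."""
    return crawl_key(crawl_id) >= crawl_key(FIRST_CRAWL_WITH_FLAG)
```

IDs that do not match the assumed pattern raise `ValueError` rather than being silently classified.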

Common Crawl - Erratum - Missing fetch_status fields

In the cdx index (columnar: `content_mime_type`), fields `mime` and `status` are missing.

Common Crawl - Erratum - Redirect target URL in URL indexes may be a relative URL

When the HTTP “Location” header includes a relative URL, the corresponding “redirect” field in the CDX index and “fetch_redirect” field in the columnar index will also store a relative URL.
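One way to handle this on the consumer side is to resolve the stored value against the URL of the record itself. The sketch below assumes a CDX JSON record already parsed into a dict with a `url` key alongside the `redirect` field named above; `absolute_redirect` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_redirect(record):
    """Resolve a possibly relative redirect target against the record URL.

    `record` is assumed to be a parsed CDX JSON record (a dict) with a
    "url" key and an optional "redirect" key.
    """
    target = record.get("redirect")
    if not target:
        return None  # no redirect stored for this record
    # urljoin leaves absolute targets unchanged and resolves relative ones.
    return urljoin(record["url"], target)
```

For the columnar index the same resolution would apply to the `fetch_redirect` column.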

Common Crawl - Blog - November 2019 crawl archive now available

We've added two new fields to the URL indexes (CDX and columnar): the redirect target location is stored in the CDX JSON field "redirect" and, in the columnar index, in the column "fetch_redirect".

Common Crawl - Blog - Introducing Common Crawl AI Agent by ReadyAI

Questions about Common Crawl’s indexes, both cdx and columnar. Questions about example uses of Common Crawl data. Generic questions about web archiving. The end of most answers contains a link to a specific webpage with more information about the answer.

Common Crawl - FAQ

Our CDX API endpoint is frequently abused and therefore heavily rate limited. If your client sends too many requests in a short period of time, your IP address may be temporarily blocked. To avoid connection issues, always use.

Common Crawl - Blog - May/June 2024 Newsletter

If you use the cdx index, https://index.commoncrawl.org/collinfo.json now has 2 new fields, “from” and “to”, giving the exact dates when the crawling started and ended. Volunteer for Common Crawl!
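As a rough illustration of how the “from” and “to” fields could be used, the sketch below selects the crawls whose date range covers a given moment. The sample entries are invented for the example, not real contents of collinfo.json, and both the `id` key and the ISO 8601 timestamp format are assumptions to check against the actual file.

```python
import json
from datetime import datetime

# Illustrative stand-in for a downloaded collinfo.json (values are made up).
SAMPLE_COLLINFO = """
[
  {"id": "EXAMPLE-CRAWL-B", "from": "2024-06-12T00:00:00", "to": "2024-06-26T00:00:00"},
  {"id": "EXAMPLE-CRAWL-A", "from": "2024-05-18T00:00:00", "to": "2024-05-31T00:00:00"}
]
"""


def crawls_covering(entries, when):
    """Return the IDs of crawls whose [from, to] range includes `when`."""
    ids = []
    for entry in entries:
        start = datetime.fromisoformat(entry["from"])
        end = datetime.fromisoformat(entry["to"])
        if start <= when <= end:
            ids.append(entry["id"])
    return ids


entries = json.loads(SAMPLE_COLLINFO)
```

Calling `crawls_covering(entries, datetime(2024, 5, 20))` on the sample data selects only the entry whose range spans that date.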

Common Crawl - Blog - September 2018 crawl archive now available

The columnar index has been updated to contain two new fields added to WARC and CDX files starting with the August crawl: content_charset: the character encoding used by the HTML page. content_languages: a comma-separated list of.