Search results

Common Crawl - Blog - Announcing the Common Crawl Index!

‘ZipNum’ CDX format, and it is the same format used by the Wayback Machine at the Internet Archive. Index Query API.

Common Crawl - Blog - November 2019 crawl archive now available

We've added two new fields to the URL indexes (CDX and columnar): the redirect target location is stored in the CDX JSON field "redirect" and in the column "fetch_redirect" of the columnar index.
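
As a rough illustration of the "redirect" field mentioned in this snippet, the sketch below reads one CDX index line and pulls the field out of its JSON part. It assumes each line has the layout "urlkey timestamp JSON object"; the helper names `parse_cdx_line` and `redirect_target` and the sample record are hypothetical, not taken from the blog post.

```python
import json


def parse_cdx_line(line):
    """Split one CDX index line into (urlkey, timestamp, fields)."""
    # Assumed layout: "<urlkey> <timestamp> <JSON object>".
    # maxsplit=2 keeps spaces inside the JSON part intact.
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def redirect_target(line):
    """Return the "redirect" field of a CDX line, or None if it is absent."""
    _, _, fields = parse_cdx_line(line)
    return fields.get("redirect")


# Hypothetical record, made up for illustration only.
sample = (
    'org,example)/old 20191112000000 '
    '{"url": "http://example.org/old", "status": "301", '
    '"redirect": "http://example.org/new"}'
)

print(redirect_target(sample))
```

Records written before the field was introduced, or records for pages that did not redirect, would simply return `None` here, which is why the lookup uses `dict.get` rather than indexing.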

Common Crawl - Blog - September 2018 crawl archive now available

The columnar index has been updated to contain two new fields added to WARC and CDX files starting with the August crawl: content_charset: the character encoding used by the HTML page; content_languages: a comma-separated list of ...