Search results

Common Crawl - Erratum - WARC revisit metadata records

WARC revisit metadata records. The revisit records in the Common Crawl WARC archives in all crawls from CC-MAIN-2018-34 to CC-MAIN-2024-46 (since Aug 2018) lack the metadata record which is attached to all response records. Fixed with…

Common Crawl - Erratum - Missing WARC File

Missing WARC File. One WARC file and the corresponding WET file are missing in the June 2017 crawl (CC-MAIN-2017-26). The corresponding WAT file is present, as well as the URL index entries contained in the missing WARC file.…

Common Crawl - Blog - Navigating the WARC file format

Navigating the WARC file format. Wait, what's WAT, WET and WARC? Recently CommonCrawl has switched to the Web ARChive (WARC) format.…

Common Crawl - Erratum - No Truncation Indicator in WARC Records

No Truncation Indicator in WARC Records. Originally reported by Henry Thompson. Due to an issue with our crawler, not all truncations were indicated correctly.…

Common Crawl - Get Started

WARC files of a specific segment of the April 2018 crawl:
> aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/
2018-04-20 10:27:49 931210633 CC-MAIN-20180420081400-20180420101400-00000.warc.gz
2018-04-20 10:28:32 935833042…

Common Crawl - Blog - New Crawl Data Available!

We have switched from ARC files to WARC files to better match what the industry has standardized on.…

CDXJ Index

"crawl-data/CC-MAIN-2025-43/segments/1759648664410.92/warc/CC-MAIN-20251016175923-20251016205923-00549.warc.gz"
curl -s -r$OFFSET-$(($OFFSET+$LENGTH-1)) "https://data.commoncrawl.org/$FILENAME" > get-started.warc.gz…

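The byte range in the curl command above is inclusive on both ends, which is why 1 is subtracted from offset plus length. As a minimal sketch (the function name and the sample numbers are hypothetical, not taken from the index), the same arithmetic could be written in Python as:

```python
def byte_range(offset: int, length: int) -> str:
    """Build an inclusive HTTP Range header value for a single record.

    offset and length are assumed to come from an index entry for the
    record; the values used below are made up for illustration.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Both ends of an HTTP byte range are inclusive, hence the -1.
    last_byte = offset + length - 1
    return f"bytes={offset}-{last_byte}"


# Hypothetical offset and length, for illustration only.
print(byte_range(1000, 250))
```

The returned string would be sent as the value of a `Range` request header, which mirrors what the `-r` option of curl does.
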
Common Crawl - Erratum - Redundant extra line in response records

The WARC files of the August 2018 crawl contain a redundant empty line between the HTTP headers and the payload of WARC response records.…

Common Crawl - Erratum - WARC-Target-URI May Include Non-ASCII Characters

WARC-Target-URI May Include Non-ASCII Characters. The WARC-Target-URI header in WARC records, but also corresponding WAT, WET and URL index records, may include non-ASCII characters, not encoded using percent-encoding or Punycode.…

Common Crawl - Blog - August Crawl Archive Introduces Language Annotations

Please note that the WARC files of August 2018 (CC-MAIN-2018-34) are affected by a WARC format error and contain an extra \r\n between HTTP header and payload content. Also the given "Content-Length" is off by 2 bytes.…

Common Crawl - Blog - Web Archiving File Formats Explained

WARC (Web ARChive) Format. WARC was developed as a successor to the ARC format (detailed below), and is now the industry standard for web archiving. It can store multiple resources, similar to ARC, but with more capabilities.…

Common Crawl - Blog - December 2024 Crawl Archive Now Available

…-compressed files which list all segments, WARC, WAT and WET files. By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively. Please see…
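
The prefixing step described in this snippet can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `expand_paths` is hypothetical, and the sample path is borrowed from the CDXJ Index result above rather than read from a real path listing.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def expand_paths(lines, prefix):
    """Prepend prefix to every non-empty line of a path listing."""
    return [prefix + line.strip() for line in lines if line.strip()]


# One sample line; a real listing would be read from a decompressed paths file.
listing = [
    "crawl-data/CC-MAIN-2025-43/segments/1759648664410.92/warc/"
    "CC-MAIN-20251016175923-20251016205923-00549.warc.gz\n",
]

for url in expand_paths(listing, HTTP_PREFIX):
    print(url)
for uri in expand_paths(listing, S3_PREFIX):
    print(uri)
```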

Common Crawl - Blog - Index to WARC Files and URLs in Columnar Format

Index to WARC Files and URLs in Columnar Format. We're happy to announce the release of an index to WARC files and URLs in a columnar format.…

Common Crawl - Blog - July 2024 Crawl Archive Now Available

WARC headers were introduced to hold information related to the HTTP protocol. - WARC-Protocol shows the HTTP protocol version used to retrieve a web page. For HTTPS URLs a repeated header contains the SSL/TLS version. -…

Common Crawl - Blog - May 2017 Crawl Archive Now Available

The following changes have been made to WARC (also WAT and WET) files: the timestamp in WARC filenames now indicates the capture time (fetch time) of the WARC content (see details).…

Common Crawl - Erratum - Missing content_truncated flag in URL indexes

The flag in our URL indexes (CDX and columnar) that indicates whether or not a WARC record payload was truncated was added in CC-MAIN-2019-47. This indicator is missing in our indexes for all previous crawl releases.…

Common Crawl - Erratum - ARC Format (Legacy) Crawls

Our early crawls were archived using the ARC (Archive) format, not the WARC (Web ARChive) format. The ARC format, which predates WARC, was the initial format used for storing web crawl data.…

Common Crawl - Erratum - Columnar Index Subsets with Fewer than 900 Partitions per Crawl

The columnar index is partitioned using Hive partitioning (column=value) leading to the following structure:
s3://commoncrawl/cc-index/table/cc-main/warc/
|-- crawl=CC-MAIN-2025-46
|   |-- subset=crawldiagnostics
|   |   `--
|   |-- subset=robotstxt
|   |   `-…
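
To make the column=value convention concrete, here is a rough Python sketch that recovers the partition columns from such a path. The helper name `parse_hive_partitions` is hypothetical, and the example path simply joins the directory names shown in the structure above.

```python
def parse_hive_partitions(path: str) -> dict:
    """Collect column=value pairs from the segments of a partitioned path."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Only segments of the form column=value describe a partition.
        if sep and key:
            partitions[key] = value
    return partitions


example = (
    "s3://commoncrawl/cc-index/table/cc-main/warc/"
    "crawl=CC-MAIN-2025-46/subset=robotstxt/"
)
print(parse_hive_partitions(example))
```

Query engines that understand Hive partitioning typically expose these path components as ordinary columns, so the sketch only shows where the values come from.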

Common Crawl - Blog - November 2019 crawl archive now available

The key is absent (resp. the field value is null) in case the "Location" value is missing, not a valid URL or not a valid relative URL path. truncation of the WARC record payload is indicated by the key "truncated" resp. the column "content_truncated".…

Common Crawl - Blog - May/June 2020 crawl archive now available

August 2018 only in WARC and WAT files and URL indexes. It is now also provided in WET files in the WARC header "WARC-Identified-Content-Language". Up to three language(s) are detected per document and given as comma-separated list of…

Common Crawl - Blog - News Dataset Available

WARC files are released on a daily basis, identifiable by file name prefix which includes year and month. We provide lists of the published WARC files, organized by year and month from 2016 to-date.…

Common Crawl - Erratum - WAT data: repeated WARC and HTTP headers are not preserved

WAT data: repeated WARC and HTTP headers are not preserved. Repeated HTTP and WARC headers were not represented in the JSON data in WAT files.…

Common Crawl - Blog - September 2018 crawl archive now available

WARC revisit records (HTTP status 304) in the URL indexes do not include a field for the payload "digest" anymore.…

Common Crawl - Blog - Data Sets Containing Robots.txt Files and Non-200 Responses

The new data sets are available as WARC files in subdirectories of the August 2016 crawl archives: s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/robotstxt/ for the robots.txt responses, and s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/crawldiagnostics…

Common Crawl - Blog - IIPC General Assembly & Web Archiving Conference 2025

Pedro Ortiz Suarez presenting at the IIPC WAC 2025 for Common Crawl on ARC and WARC formats. Among our lightning talks, posters, and workshops, our team gave presentations during the General Assembly and…

Common Crawl - Blog - September 2014 Crawl Archive Available

(CC-MAIN-2014-41/segment.paths.gz), all WARC files (CC-MAIN-2014-41/warc.paths.gz), all WAT files (CC-MAIN-2014-41/wat.paths.gz), all WET files (CC-MAIN-2014-41/wet.paths.gz)…

Common Crawl - Blog - October 2014 Crawl Archive Available

(CC-MAIN-2014-42/segment.paths.gz), all WARC files (CC-MAIN-2014-42/warc.paths.gz), all WAT files (CC-MAIN-2014-42/wat.paths.gz), all WET files (CC-MAIN-2014-42/wet.paths.gz)…

Common Crawl - Blog - August 2019 crawl archive now available

Starting with this crawl the following fixes and improvements are applied to the provided data formats: reliable marking of WARC records with truncated payload, see issue report “WARC-Truncated header”; improved decoding of XML/HTML character entities in…

Columnar Index

As the name suggests, it is an index to the WARC files and URLs in the Common Crawl corpus in a columnar format (Apache Parquet™).…

Common Crawl - Blog - May/June 2024 Newsletter

Our summer intern, Ford Heilizer, has been hard at work making a tool that transforms our usual WARC/WAT/WET data into a table.…

Common Crawl - Blog - July 2014 Crawl Data Available

(CC-MAIN-2014-23/segment.paths.gz), all WARC files (CC-MAIN-2014-23/warc.paths.gz), all WAT files (CC-MAIN-2014-23/wat.paths.gz), all WET files (CC-MAIN-2014-23/wet.paths.gz)…

Common Crawl - Blog - A Further Look Into the Prevalence of Various ML Opt–Out Protocols

WARC files, and finding which proportions of domains are using which opt–out protocols. First, we need data to look for usage of the various opt–out protocols.…

Common Crawl - Blog - July 2016 Crawl Archive Now Available

(CC-MAIN-2016-30/segment.paths.gz), all WARC files (CC-MAIN-2016-30/warc.paths.gz), all WAT files (CC-MAIN-2016-30/wat.paths.gz), all WET files (CC-MAIN-2016-30/wet.paths.gz)…

Common Crawl - Blog - February 2020 crawl archive now available

The HTTP headers in WARC response records have been fixed: the HTTP response status line now has a white space following the status code if the reason-phrase is empty.…

Common Crawl - Blog - February 2015 Crawl Archive Available

(CC-MAIN-2015-11/segment.paths.gz), all WARC files (CC-MAIN-2015-11/warc.paths.gz), all WAT files (CC-MAIN-2015-11/wat.paths.gz), all WET files (CC-MAIN-2015-11/wet.paths.gz)…

Common Crawl - Blog - January 2020 crawl archive now available

WARC request records now show the HTTP protocol version sent with the HTTP request which can be different from the version received in the HTTP response message, cf. NUTCH-2760. Archive Location and Download.…

Common Crawl - Blog - March 2025 Crawl Archive Now Available

To assist with exploring and using the dataset, we provide gzip compressed files which list all segments, WARC, WAT and WET files. By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the…

Common Crawl - Blog - September 2016 Crawl Archive Now Available

(CC-MAIN-2016-40/segment.paths.gz), all WARC files (CC-MAIN-2016-40/warc.paths.gz), all WAT files (CC-MAIN-2016-40/wat.paths.gz), all WET files (CC-MAIN-2016-40/wet.paths.gz)…

Common Crawl - Erratum - Missing Language Classification

Starting with crawl CC-MAIN-2018-39 we added a language classification field (‘content-languages’) to the columnar indexes, WAT files, and WARC metadata for all subsequent crawls.…

Common Crawl - Blog - April 2018 Crawl Archive Now Available

We accept these – it's a part of the web and these WARC records are useful to gain insights, e.g. to test PDF or Office document parsers at scale.…

Common Crawl - Blog - December 2016 Crawl Archive Now Available

(CC-MAIN-2016-50/segment.paths.gz), all WARC files (CC-MAIN-2016-50/warc.paths.gz), all WAT files (CC-MAIN-2016-50/wat.paths.gz), all WET files (CC-MAIN-2016-50/wet.paths.gz), robots.txt files…

Common Crawl - Blog - October 2016 Crawl Archive Now Available

(CC-MAIN-2016-44/segment.paths.gz), all WARC files (CC-MAIN-2016-44/warc.paths.gz), all WAT files (CC-MAIN-2016-44/wat.paths.gz), all WET files (CC-MAIN-2016-44/wet.paths.gz), robots.txt files…

Common Crawl - Erratum - Charset Detection Bug in WET Records

The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records didn't work before November 2016 due to a bug in IIPC Web Archive Commons (see the related issue in the CC fork of Apache Nutch).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

The following improvements have been made for this webgraph release: the graphs now also included edges stemming from HTTP 303 "See Other" redirects (in addition to other HTTP redirect status codes); the Common Crawl robots.txt WARC files are used to get…

Common Crawl - Blog - Measuring Web Accessibility from Crawl Archives

I built a pipeline that takes the 500 most-crawled registered domains from Common Crawl's crawl archives, in this instance that’s the February 2026 crawl (CC-MAIN-2026-08), and then retrieves the archived homepage captures directly from WARC files.…

Common Crawl - Blog - August 2025 Crawl Archive Now Available

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and…

Common Crawl - Blog - October 2025 Crawl Archive Now Available

Common Crawl - Blog - November 2014 Crawl Archive Available

(CC-MAIN-2014-49/segment.paths.gz), all WARC files (CC-MAIN-2014-49/warc.paths.gz), all WAT files (CC-MAIN-2014-49/wat.paths.gz), all WET files (CC-MAIN-2014-49/wet.paths.gz)…

Common Crawl - Blog - April 2014 Crawl Data Available

(CC-MAIN-2014-15/segment.paths.gz), all WARC files (CC-MAIN-2014-15/warc.paths.gz), all WAT files (CC-MAIN-2014-15/wat.paths.gz), all WET files (CC-MAIN-2014-15/wet.paths.gz)…

Common Crawl - Blog - June 2025 Crawl Archive Now Available

Common Crawl - Blog - January 2026 Crawl Archive Now Available

Common Crawl - Blog - December 2014 Crawl Archive Available

(CC-MAIN-2014-52/segment.paths.gz), all WARC files (CC-MAIN-2014-52/warc.paths.gz), all WAT files (CC-MAIN-2014-52/wat.paths.gz), all WET files (CC-MAIN-2014-52/wet.paths.gz)…

Common Crawl - Blog - August 2014 Crawl Data Available

(CC-MAIN-2014-35/segment.paths.gz), all WARC files (CC-MAIN-2014-35/warc.paths.gz), all WAT files (CC-MAIN-2014-35/wat.paths.gz), all WET files (CC-MAIN-2014-35/wet.paths.gz)…

Common Crawl - Blog - January 2025 Crawl Archive Now Available

Common Crawl - Blog - December 2025 Crawl Archive Now Available

Common Crawl - Blog - November 2025 Crawl Archive Now Available

Common Crawl - Blog - May 2025 Crawl Archive Now Available

Common Crawl - Blog - August 2024 Crawl Archive Now Available

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.…

Common Crawl - Blog - February 2026 Crawl Archive Now Available

Common Crawl - Blog - September/October 2023 crawl archive now available

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.…