Erratum
Columnar Index Subsets with Fewer than 900 Partitions per Crawl
The columnar index is partitioned using Hive partitioning (column=value), which leads to the following directory structure:
s3://commoncrawl/cc-index/table/cc-main/warc/
...
|-- crawl=CC-MAIN-2025-46
|   |-- subset=crawldiagnostics
|   |   `-- ...
|   |-- subset=robotstxt
|   |   `-- ...
|   `-- subset=warc
|       `-- ...
`-- crawl=CC-MAIN-2025-51
    |-- subset=crawldiagnostics
    |   |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |   |-- ... (298 Parquet files)
    |   `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |-- subset=robotstxt
    |   |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |   |-- ... (298 Parquet files)
    |   `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    `-- subset=warc
        |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
        |-- ... (298 Parquet files)
        `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
In total, the columnar index of one crawl consists of 900 Parquet files: 300 in each of the three subset partitions.
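For example, the number of Parquet files per subset partition can be verified by listing the corresponding S3 prefixes. Below is a minimal sketch using boto3, assuming AWS credentials are configured; the crawl name is only an example:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

crawl = "CC-MAIN-2025-51"
for subset in ("crawldiagnostics", "robotstxt", "warc"):
    # one Hive partition = one S3 prefix crawl=.../subset=.../
    prefix = f"cc-index/table/cc-main/warc/crawl={crawl}/subset={subset}/"
    n_files = 0
    for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
        n_files += len(page.get("Contents", []))
    print(f"{crawl}\t{subset}\t{n_files} Parquet files")

For recent crawls this should report 300 files for each of the three subsets. The two exceptions described below lead to lower counts.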
However, older crawls, from CC-MAIN-2013-20 up to and including CC-MAIN-2016-30, do not have the robotstxt and crawldiagnostics subsets. For these crawls, only successfully fetched WARC captures were archived, cf. the announcement of the August 2016 crawl. Their columnar index therefore consists of only 300 Parquet files.
In addition, the URL indexes are globally sorted and split into 300 chunks. Occasionally, a sort chunk does not include any captures of one of the three types (successfully fetched, 404/redirect/etc., robots.txt). In that case, no Parquet file is written for this chunk, and the corresponding subset partition contains fewer than 300 files.
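Consequently, tools reading the columnar index should discover the Parquet files that are actually present instead of assuming three subsets with exactly 300 files each. The following sketch uses pyarrow's Hive-aware dataset discovery (again assuming AWS credentials and a recent pyarrow version; the crawl name is only an example). A missing subset or sort chunk simply contributes zero rows:

import pyarrow.dataset as ds

# discover whatever subset=... directories and part-* files exist below one crawl
dataset = ds.dataset(
    "s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2016-30/",
    format="parquet",
    partitioning="hive",
)

# count captures in the robotstxt subset; for crawls up to CC-MAIN-2016-30 this
# subset does not exist, so no files match the partition filter and the result is 0
print(dataset.count_rows(filter=ds.field("subset") == "robotstxt"))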