
Erratum

Columnar Index Subsets with Fewer than 900 Partitions per Crawl

Originally reported by Sebastian Nagel.

The columnar index is partitioned using Hive-style partitioning (column=value path components), leading to the following structure:

s3://commoncrawl/cc-index/table/cc-main/warc/
...
|-- crawl=CC-MAIN-2025-46
|   |-- subset=crawldiagnostics
|   |   `-- ...
|   |-- subset=robotstxt
|   |   `-- ...
|   `-- subset=warc
|       `-- ...
`-- crawl=CC-MAIN-2025-51
    |-- subset=crawldiagnostics
    |   |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |   |-- ... (298 Parquet files)
    |   `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |-- subset=robotstxt
    |   |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |   |-- ... (298 Parquet files)
    |   `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    `-- subset=warc
        |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
        |-- ... (298 Parquet files)
        `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet

Every Hive partition (one crawl and subset combination) holds 300 Parquet files, so a complete crawl comprises 900 Parquet files across its three subsets.
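As a minimal sketch of how this layout is consumed, the snippet below reads the index with Python and pyarrow: partitioning="hive" maps the crawl= and subset= directory names to virtual columns, so filters on them prune whole partitions. The column names url and fetch_status are assumed from the table schema, as is anonymous read access to the public commoncrawl bucket.

import pyarrow.dataset as ds
from pyarrow import fs

# Anonymous access to the public commoncrawl bucket (region us-east-1).
s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")

# partitioning="hive" turns the crawl=... and subset=... directory
# names into virtual columns that can be used in filters.
dataset = ds.dataset(
    "commoncrawl/cc-index/table/cc-main/warc/",
    format="parquet",
    partitioning="hive",
    filesystem=s3,
)

# Partition pruning: only Parquet files below
# crawl=CC-MAIN-2025-51/subset=warc are scanned.
scanner = dataset.scanner(
    columns=["url", "fetch_status"],  # assumed column names
    filter=(ds.field("crawl") == "CC-MAIN-2025-51")
    & (ds.field("subset") == "warc"),
)
print(scanner.head(5))

Note that dataset discovery lists every file under the base path; pointing the path at a single crawl=... prefix avoids enumerating the whole table.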

However, older crawls, from CC-MAIN-2013-20 up to and including CC-MAIN-2016-30, do not have the robotstxt and crawldiagnostics subsets. In these crawls, only successfully fetched WARC captures were archived, cf. the announcement of the August 2016 crawl. These crawls therefore contain only 300 Parquet files each.
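To check which subsets a given crawl actually provides, one can list the subset= prefixes directly. A sketch using boto3, again assuming anonymous access to the public bucket:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) requests against the public commoncrawl bucket.
s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

def list_subsets(crawl):
    # With Delimiter="/" the subset=... "directories" come back as
    # common prefixes instead of individual object keys.
    resp = s3.list_objects_v2(
        Bucket="commoncrawl",
        Prefix=f"cc-index/table/cc-main/warc/crawl={crawl}/",
        Delimiter="/",
    )
    return [p["Prefix"].rsplit("subset=", 1)[1].rstrip("/")
            for p in resp.get("CommonPrefixes", [])]

print(list_subsets("CC-MAIN-2016-30"))  # expected: ['warc'] only
print(list_subsets("CC-MAIN-2025-51"))  # expected: all three subsets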

The URL indexes are globally sorted and split into 300 chunks. Occasionally, a sort chunk contains no captures of one of the three types (successfully fetched; 404, redirect, etc.; robots.txt). In that case, no Parquet file is written for that chunk, and the partition of the corresponding subset holds fewer than 300 files.
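Such gaps can be detected by counting the Parquet files per crawl and subset: anything below 300 points to empty sort chunks, and 0 to a missing subset. A minimal sketch, under the same anonymous-access assumption:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

def count_parquet_files(crawl, subset):
    # Paginate the listing; a complete partition holds 300 files.
    prefix = f"cc-index/table/cc-main/warc/crawl={crawl}/subset={subset}/"
    paginator = s3.get_paginator("list_objects_v2")
    n = 0
    for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
        n += sum(1 for obj in page.get("Contents", ())
                 if obj["Key"].endswith(".parquet"))
    return n

for subset in ("warc", "crawldiagnostics", "robotstxt"):
    print(subset, count_parquet_files("CC-MAIN-2025-51", subset))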

Affected Crawls

CC-MAIN-2013-20 up to and including CC-MAIN-2016-30 (see above). In addition, individual partitions of any crawl may lack files for sort chunks without captures of the corresponding subset.

Affected Web Graphs

No items found.