
Erratum

Columnar Index Subsets with Fewer than 900 Partitions per Crawl

Originally reported by Sebastian Nagel.

The columnar index is partitioned using Hive-style partitioning (column=value path components), leading to the following structure:

s3://commoncrawl/cc-index/table/cc-main/warc/
...
|-- crawl=CC-MAIN-2025-46
|   |-- subset=crawldiagnostics
|   |   `-- ...
|   |-- subset=robotstxt
|   |   `-- ...
|   `-- subset=warc
|       `-- ...
`-- crawl=CC-MAIN-2025-51
    |-- subset=crawldiagnostics
    |   |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |   |-- ... (298 Parquet files)
    |   `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |-- subset=robotstxt
    |   |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    |   |-- ... (298 Parquet files)
    |   `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
    `-- subset=warc
        |-- part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
        |-- ... (298 Parquet files)
        `-- part-00299-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet

Every Hive partition (one crawl and subset combination) holds 300 Parquet files, so a complete crawl comprises 900 Parquet files across its three subsets.
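As a minimal sketch of how this layout is consumed, the snippet below reads the index with Python and pyarrow: partitioning="hive" maps the crawl= and subset= directory names to virtual columns, so filters on them prune whole partitions. The column names url and fetch_status are assumed from the table schema, as is anonymous read access to the public commoncrawl bucket.

import pyarrow.dataset as ds
from pyarrow import fs

# Anonymous access to the public commoncrawl bucket (region us-east-1).
s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")

# partitioning="hive" turns the crawl=... and subset=... directory
# names into virtual columns that can be used in filters.
dataset = ds.dataset(
    "commoncrawl/cc-index/table/cc-main/warc/",
    format="parquet",
    partitioning="hive",
    filesystem=s3,
)

# Partition pruning: only Parquet files below
# crawl=CC-MAIN-2025-51/subset=warc are scanned.
scanner = dataset.scanner(
    columns=["url", "fetch_status"],  # assumed column names
    filter=(ds.field("crawl") == "CC-MAIN-2025-51")
    & (ds.field("subset") == "warc"),
)
print(scanner.head(5))

Note that dataset discovery lists every file under the base path; pointing the path at a single crawl=... prefix avoids enumerating the whole table.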

However, older crawls, from CC-MAIN-2013-20 up to and including CC-MAIN-2016-30, do not have the robotstxt and crawldiagnostics subsets. In these crawls, only successfully fetched WARC captures were archived, cf. the announcement of the August 2016 crawl. These crawls therefore contain only 300 Parquet files each.
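To check which subsets a given crawl actually provides, one can list the subset= prefixes directly. A sketch using boto3, again assuming anonymous access to the public bucket:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) requests against the public commoncrawl bucket.
s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

def list_subsets(crawl):
    # With Delimiter="/" the subset=... "directories" come back as
    # common prefixes instead of individual object keys.
    resp = s3.list_objects_v2(
        Bucket="commoncrawl",
        Prefix=f"cc-index/table/cc-main/warc/crawl={crawl}/",
        Delimiter="/",
    )
    return [p["Prefix"].rsplit("subset=", 1)[1].rstrip("/")
            for p in resp.get("CommonPrefixes", [])]

print(list_subsets("CC-MAIN-2016-30"))  # expected: ['warc'] only
print(list_subsets("CC-MAIN-2025-51"))  # expected: all three subsets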

The URL indexes are globally sorted and split into 300 chunks. Occasionally, a sort chunk contains no captures of one of the three types (successfully fetched; 404, redirect, etc.; robots.txt). In that case, no Parquet file is written for that chunk, and the partition of the corresponding subset holds fewer than 300 files.
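Such gaps can be detected by counting the Parquet files per crawl and subset: anything below 300 points to empty sort chunks, and 0 to a missing subset. A minimal sketch, under the same anonymous-access assumption:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

def count_parquet_files(crawl, subset):
    # Paginate the listing; a complete partition holds 300 files.
    prefix = f"cc-index/table/cc-main/warc/crawl={crawl}/subset={subset}/"
    paginator = s3.get_paginator("list_objects_v2")
    n = 0
    for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
        n += sum(1 for obj in page.get("Contents", ())
                 if obj["Key"].endswith(".parquet"))
    return n

for subset in ("warc", "crawldiagnostics", "robotstxt"):
    print(subset, count_parquet_files("CC-MAIN-2025-51", subset))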

Affected Crawls

CC-MAIN-2013-20 up to and including CC-MAIN-2016-30 (see above). In addition, individual partitions of any crawl may lack files for sort chunks without captures of the corresponding subset.

Affected Web Graphs

No items found.