June 3, 2026

The Columnar Index Is Now the URL Index

We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format.

Today we have renamed the Columnar Index to the URL Index.

The URL Index (formerly the Columnar Index) is one of the indexes we provide for querying the Common Crawl corpus, alongside the CDXJ Index.  As its new name makes clear, it is an index to the URLs and WARC files in the corpus, stored in a columnar format (Apache Parquet™).  That format is well suited to efficient analytical and bulk queries, saving both time and computing resources, and it works with a wide range of tools including Amazon Athena, Apache Spark™, pandas, Polars, Apache Arrow™, and DuckDB.

Why the change?  The old name described how the index was stored rather than what it contained. "Columnar" refers to the file format, but it told you nothing about the actual purpose of the dataset, which is indexing URLs.  As we aim to publish more of our datasets in columnar formats, naming a single dataset after the format it happens to use would only become more confusing.  A future where several different datasets are all "columnar" needs names that distinguish them by what they are for.  Calling this one the URL Index does exactly that.

Nothing else has changed.  The data, the schema, the S3 location, and the way you query it all remain the same.  You will still find the files at s3://commoncrawl/cc-index/table/cc-main/warc/, and your existing queries will continue to work without modification.  This is purely a renaming to make the index clearer in purpose and to leave room for the datasets to come.
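
As a rough illustration of what querying the index can look like, the sketch below builds the Parquet path for a single crawl and the text of an SQL query that uses DuckDB's read_parquet function.  Only the base S3 location comes from this post.  The crawl=.../subset=... partition layout, the column names (url, warc_filename, warc_record_offset, warc_record_length, url_host_registered_domain), and the helper names partition_glob and build_query are assumptions made for illustration, so check them against the published schema before relying on them.

```python
# Base location quoted in the post; everything appended to it is an assumption.
URL_INDEX_BASE = "s3://commoncrawl/cc-index/table/cc-main/warc/"


def partition_glob(crawl_id: str, subset: str = "warc") -> str:
    """Return a glob for one crawl's Parquet files (assumed hive-style layout)."""
    return f"{URL_INDEX_BASE}crawl={crawl_id}/subset={subset}/*.parquet"


def build_query(crawl_id: str, domain: str, limit: int = 10) -> str:
    """Build SQL text that looks up WARC record locations for one domain.

    The column names used here are assumptions, not taken from the post.
    """
    if "'" in domain:
        # Refuse quotes rather than attempt escaping in this small sketch.
        raise ValueError("domain must not contain single quotes")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        f"FROM read_parquet('{partition_glob(crawl_id)}')\n"
        f"WHERE url_host_registered_domain = '{domain}'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Print the query text only; nothing is fetched from S3 here.
    print(build_query("CC-MAIN-2025-13", "example.com"))
```

The resulting string could be handed to an engine such as DuckDB once S3 access is configured.  The sketch deliberately stops at building the text, so it runs without network access.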


Erratum: Content is truncated

Some archived content is truncated because of fetch size limits imposed during crawling. These limits are needed to handle infinite or exceptionally large data streams (e.g., radio streams). Before the March 2025 crawl (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, the threshold is 5 MiB.
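
To make the two thresholds concrete, here is a minimal sketch that maps a crawl ID to its truncation limit in bytes.  It assumes crawl IDs of the form CC-MAIN-YYYY-WW and that comparing the (year, week) pair against CC-MAIN-2025-13 is enough to tell the two periods apart.  The function name truncation_limit_bytes is hypothetical.

```python
import re

MIB = 1024 * 1024  # one mebibyte in bytes

# Assumed crawl ID shape: CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit for a crawl, per the erratum above."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # CC-MAIN-2025-13 is the first crawl with the larger limit.
    if (year, week) >= (2025, 13):
        return 5 * MIB
    return 1 * MIB


if __name__ == "__main__":
    for cid in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(cid, truncation_limit_bytes(cid))
```

Under these assumptions, a record from an earlier crawl is cut off at 1,048,576 bytes, and a record from CC-MAIN-2025-13 or later at 5,242,880 bytes.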