June 3, 2026

The Columnar Index Is Now the URL Index

We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Today we have renamed the Columnar Index to the URL Index.

The URL Index (formerly the Columnar Index) is one of the indexes we provide for querying the Common Crawl corpus, alongside the CDXJ Index. As its new name makes clear, it is an index to the URLs and WARC files in the corpus, stored in a columnar format (Apache Parquet™). That format is well suited to efficient analytical and bulk queries, saving both time and computing resources, and it works with a wide range of tools including Amazon Athena, Apache Spark™, pandas, Polars, Apache Arrow™, and DuckDB.

Why the change? The old name described how the index was stored rather than what it contained. "Columnar" refers to the file format, but it told you nothing about the actual purpose of the dataset, which is indexing URLs. As we aim to publish more of our datasets in columnar formats, naming a single dataset after the format it happens to use would only become more confusing. A future where several different datasets are all "columnar" needs names that distinguish them by what they are for. Calling this one the URL Index does exactly that.

Nothing else has changed. The data, the schema, the S3 location, and the way you query it all remain the same. You will still find the files at s3://commoncrawl/cc-index/table/cc-main/warc/, and your existing queries will continue to work without modification. This is purely a renaming to make the index clearer in purpose and to leave room for the datasets to come.
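As a rough illustration of what "the same location, the same queries" means in practice, here is a minimal Python sketch that builds a query string against the path above. It only assembles text and does not touch the network. Several details are assumptions rather than anything stated in this post: the Hive-style `crawl=…/subset=…` partition layout, the column names (`url`, `url_host_registered_domain`, `warc_filename`, `warc_record_offset`, `warc_record_length`), the use of a DuckDB-style `read_parquet(...)` call, and the helper names `partition_glob` and `build_query`, which are hypothetical.

```python
# Base location quoted in the post; unchanged by the rename.
URL_INDEX_BASE = "s3://commoncrawl/cc-index/table/cc-main/warc/"


def partition_glob(crawl: str, subset: str = "warc") -> str:
    """Return a glob matching the Parquet files of one crawl.

    Assumption: files sit under Hive-style `crawl=<id>/subset=<name>/`
    partition directories below the base location.
    """
    return f"{URL_INDEX_BASE}crawl={crawl}/subset={subset}/*.parquet"


def build_query(crawl: str, domain: str, limit: int = 10) -> str:
    """Build a SQL string that looks up WARC record locations for a domain.

    Assumption: the column names below match the index schema, and the
    engine understands `read_parquet(...)` (as DuckDB does).
    """
    # Double any single quotes so the domain is a valid SQL string literal.
    safe_domain = domain.replace("'", "''")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        f"FROM read_parquet('{partition_glob(crawl)}')\n"
        f"WHERE url_host_registered_domain = '{safe_domain}'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Print the query; running it is left to the engine of your choice.
    print(build_query("CC-MAIN-2025-13", "example.com"))
```

Restricting the glob to a single crawl partition keeps the scan small. Actually executing the query requires an engine with S3 access configured, which is outside the scope of this sketch.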

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
