Today we have renamed the Columnar Index to the URL Index.
The URL Index (formerly the Columnar Index) is one of the indexes we provide for querying the Common Crawl corpus, alongside the CDXJ Index. As its new name makes clear, it is an index to the URLs and WARC files in the corpus, stored in a columnar format (Apache Parquet™). That format is well suited to efficient analytical and bulk queries, saving both time and computing resources, and it works with a wide range of tools including Amazon Athena, Apache Spark™, Pandas, Polars, Apache Arrow™, and DuckDB.
Why the change? The old name described how the index was stored rather than what it contains. "Columnar" refers to the file format and says nothing about the purpose of the dataset, which is indexing URLs. As we aim to publish more of our datasets in columnar formats, naming a single dataset after the format it happens to use would only become more confusing. Once several different datasets are all "columnar", their names need to distinguish them by what they are for. Calling this one the URL Index does exactly that.
Nothing else has changed. The data, the schema, the S3 location, and the way you query it all remain the same. You will still find the files at s3://commoncrawl/cc-index/table/cc-main/warc/, and your existing queries will continue to work without modification. This is purely a renaming to make the index clearer in purpose and to leave room for the datasets to come.
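To make "the way you query it" concrete, here is a minimal sketch of how a query against that same S3 location might be assembled for an engine such as DuckDB. The helper names (`partition_glob`, `build_query`), the crawl ID `CC-MAIN-2024-10`, and the domain `example.com` are illustrative assumptions, not part of this announcement. The sketch also assumes the table's `crawl=.../subset=...` partition layout and its `url`, `warc_filename`, `warc_record_offset`, `warc_record_length`, and `url_host_registered_domain` columns.

```python
# Base location of the URL Index, as given in the announcement.
BASE = "s3://commoncrawl/cc-index/table/cc-main/warc/"


def partition_glob(crawl: str, subset: str = "warc") -> str:
    """Build a glob for one crawl partition.

    Assumes the table is laid out as crawl=<id>/subset=<name>/ directories.
    """
    return f"{BASE}crawl={crawl}/subset={subset}/*.parquet"


def build_query(crawl: str, domain: str, limit: int = 10) -> str:
    """Return a SQL string that looks up WARC record locations for a domain.

    The column names are assumptions about the table schema; check them
    against the schema you actually see before relying on this.
    """
    safe_domain = domain.replace("'", "''")  # escape single quotes for SQL
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        f"FROM read_parquet('{partition_glob(crawl)}')\n"
        f"WHERE url_host_registered_domain = '{safe_domain}'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Hypothetical crawl ID and domain, used only for illustration.
    print(build_query("CC-MAIN-2024-10", "example.com"))
```

The resulting string could then be handed to a SQL engine that can read Parquet from S3, for example DuckDB with its `httpfs` extension loaded and S3 access configured. Restricting the glob to a single crawl partition keeps the scan small, which is where the columnar format pays off.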
