Search results

Common Crawl - Blog - Web Archiving File Formats Explained

Apache Parquet™ for efficient indexing and data analysis, offering insights into how these technologies can refine the process of web data management.

Common Crawl - Blog - Index to WARC Files and URLs in Columnar Format

The columnar format (we use Apache Parquet) makes it possible to query or process the index efficiently, saving time and computing resources. Especially if only a few columns are accessed, recent big data tools run impressively fast. Sebastian Nagel.

Common Crawl - Blog - November/December 2021 crawl archive now available

The columnar index is now built using Spark version 3.2.0 and Parquet MR 1.12.1 – these upgrades allow us to go for further improvements next year.

Common Crawl - Blog - Oct/Nov 2023 Performance Issues

Because we make a lot of queries, we downloaded the Parquet columnar index files for the past few crawls and used DuckDB to run SQL queries against them. In Python, starting DuckDB and running a query looks like this:
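The code from the blog post is not included in this snippet. As a stand-in, here is a minimal sketch of what querying locally downloaded columnar index files with DuckDB might look like. It is not the post's own code. The function names and the example path and host are hypothetical, and the column names (`url`, `url_host_name`, `fetch_status`, `warc_filename`, `warc_record_offset`, `warc_record_length`) are assumed to match the columnar index schema. It needs the third-party `duckdb` package.

```python
def build_index_query(parquet_glob: str, limit: int = 10) -> str:
    """Build SQL that finds WARC record locations for one host.

    parquet_glob is a hypothetical local path pattern for downloaded
    index files; the column names are assumptions about the schema.
    """
    # Double any single quotes so the path is a valid SQL string literal.
    safe_glob = parquet_glob.replace("'", "''")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length "
        f"FROM read_parquet('{safe_glob}') "
        "WHERE url_host_name = ? AND fetch_status = 200 "
        f"LIMIT {int(limit)}"
    )


def query_index(parquet_glob: str, host: str, limit: int = 10) -> list:
    """Run the query in an in-memory DuckDB database and return the rows."""
    # Imported here so the query builder works without the package installed.
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()  # no file argument: in-memory database
    try:
        # The host is passed as a bound parameter rather than pasted into SQL.
        return con.execute(build_index_query(parquet_glob, limit), [host]).fetchall()
    finally:
        con.close()
```

A call such as `query_index("cc-index/*.parquet", "commoncrawl.org")` would return a list of tuples, one per matching capture. Selecting only four columns keeps the read small, which is where a columnar format pays off.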