The Capabilities of ARC, WARC, WET, and WAT Formats
In the ever-evolving landscape of digital archiving and data analysis, it is helpful to understand the various file formats used for web crawling. From the early ARC format to the more advanced WARC, and the specialised WET and WAT files, each plays an important role in the field of web archiving. In this post, we explain these formats, exploring their unique features, applications, and the enhancements they offer.
We also highlight the integration of Apache Parquet™ for efficient indexing and data analysis, offering insights into how these technologies can refine the process of web data management.
Whether you're a data scientist, a digital archivist, or just interested in the complexities of web archiving, hopefully you will find this post helpful and informative.
Which Format Does What?
WARC (Web ARChive) Format
WARC was developed as a successor to the ARC format (detailed below), and is now the industry standard for web archiving. It can store multiple resources, similar to ARC, but with more capabilities. This includes the ability to store HTTP request and response headers, additional metadata, and new record types such as resource, revisit, metadata, and conversion.
Enhanced capability for metadata storage, better scalability, and standardization are among its advantages. It's more suitable for large-scale web archiving and supports the archiving of complex digital resources. Common Crawl has used WARC since the crawl of Summer 2013 (CC-MAIN-2013-20).
To process this data there are various packages available, such as warcio for Python (https://github.com/webrecorder/warcio), slyrz/warc for Go (https://pkg.go.dev/github.com/slyrz/warc), and the warc crate for Rust (https://docs.rs/warc/latest/warc/).
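As a quick illustration, here's a minimal sketch using warcio (mentioned above) to iterate over the response records in a WARC file; the file name is a placeholder for any WARC file you have downloaded:

```python
# Minimal sketch: iterate over WARC response records with warcio.
# 'example.warc.gz' is a placeholder for a locally downloaded WARC file.
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            status = record.http_headers.get_statuscode()
            print(status, url)
```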
Example Data
Here’s a Gist that shows an example of a WARC record.
WAT (Web Archive Transformation) Files
These files are also part of the Common Crawl dataset, but they focus on the metadata associated with the crawled web pages.
WAT files contain parsed data from the HTTP response headers, links extracted from HTML pages, and other metadata. This can include information like server response codes, content types, languages, and more.
They are useful for analysis that requires understanding the structure of the web, such as link analysis, studying the evolution of websites, and metadata-based research.
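Under the hood, a WAT record is a WARC metadata record whose payload is JSON. Here's a hedged sketch of pulling the extracted links out of one, assuming the usual Common Crawl WAT layout (Envelope → Payload-Metadata → HTTP-Response-Metadata → HTML-Metadata); the file name is a placeholder:

```python
# Sketch: extract outgoing links from a WAT file. The JSON key layout below
# follows the Common Crawl WAT structure; verify it against your own data.
import json
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.wat.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'metadata':
            continue
        data = json.loads(record.content_stream().read())
        html_meta = (data.get('Envelope', {})
                         .get('Payload-Metadata', {})
                         .get('HTTP-Response-Metadata', {})
                         .get('HTML-Metadata', {}))
        for link in html_meta.get('Links', []):
            print(link.get('url'))
```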
Example Data
Here’s a Gist showing an example of a WAT file.
WET (Web Extracted Text) Files
These files are part of the Common Crawl dataset and contain extracted plain text from web content. WET files only contain the body text of web pages, extracted from the HTML and excluding any HTML code, images, or other media. This makes them useful for text analysis and natural language processing (NLP) tasks.
WET files are ideal for applications where only the text of web pages is needed, such as linguistic analysis, content categorisation, and other text-focused activities.
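Because each WET record is simply a WARC conversion record whose payload is plain text, reading one is straightforward. A minimal sketch (the file name is a placeholder):

```python
# Sketch: read extracted page text from a WET file.
# WET records are WARC 'conversion' records whose payload is plain text.
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8', errors='replace')
            print(url, '-', len(text.split()), 'words')
```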
Example Data
Here’s an example of a WET file from the same Gist as the previous two examples.
Columnar (Parquet) Indexes
In addition to the above, we provide an index for WARC files and URLs in a columnar format using Apache Parquet™. This enables more efficient querying and data analysis.
Parquet files are binary files whose columnar storage is optimised for fast retrieval of specific columns of data, which can significantly speed up searches and analysis. A query can read just the columns it needs from the index, with no need to process entire WARC segments to answer a question about, say, URLs or content types.
Other advantages include splittability and schema evolution: files can be split into smaller, manageable chunks and processed in parallel across multiple nodes or machines in a distributed computing environment, and new columns can be added (or the data types of existing columns changed) without having to rewrite or modify the entire dataset.
You can find plenty of Use Cases and Example Projects on our website and on our GitHub to help you get started.
Commonly used software for working with the columnar index includes DataFusion, Pandas, PyArrow, Polars, and parquet-go.
Example Data
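As an illustration, here's a minimal sketch using PyArrow to read a few columns from a single Parquet file of the columnar index. The file name is a placeholder for a locally downloaded index file, and the column names (url, content_mime_type, warc_filename) follow the index schema but should be checked against the file you are working with:

```python
# Sketch: read selected columns from one Parquet file of the columnar index.
import pyarrow.parquet as pq

table = pq.read_table(
    'cc-index-sample.parquet',  # placeholder for a downloaded index file
    columns=['url', 'content_mime_type', 'warc_filename'],
)
# Only the requested columns are read, which is where columnar storage pays off.
print(table.num_rows)
print(table.slice(0, 5).to_pandas())
```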
ARC (Archive) Format
Originally developed by the Internet Archive, ARC was one of the first file formats used for storing web crawls. Common Crawl's initial crawls (up to Summer 2013) are in this format. It encapsulates multiple resources (like HTML documents) into a single file. Each item is preceded by a header containing metadata such as URL, timestamp, content type, and so on.
The ARC format has limitations in handling metadata and scalability. It lacks flexibility in documenting additional metadata, and doesn't support the storage of data beyond the actual content, such as HTTP request and response headers. Because of these limitations, we moved to the more capable WARC format in Summer 2013.
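For completeness, warcio can also read legacy ARC files through the same iterator API; its arc2warc option converts each ARC record to a WARC-style record on the fly. A hedged sketch (the file name is a placeholder):

```python
# Sketch: read a legacy ARC file with warcio, converting records to WARC
# form on the fly so the same header API applies.
from warcio.archiveiterator import ArchiveIterator

with open('example.arc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, arc2warc=True):
        print(record.rec_headers.get_header('WARC-Target-URI'))
```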
Example Data
Here’s a Gist containing an example of an ARC record taken from our CC-MAIN-2009-2010 crawl (which was the last of our crawls to be in this format).
Summary
ARC and WARC are general formats for web archiving, with WARC being a more advanced and versatile evolution of ARC. WET and WAT are specialised formats, focusing respectively on the extracted text and the metadata of web content.
The combination of archive formats and Parquet for indexing exemplifies how we use different technologies to enhance the efficiency and effectiveness of web data processing and analysis.
If you have any comments or would like to submit your project to our examples and use cases please Contact Us, or join in the discussion in our Google Group.
Apache Parquet™ is a trademark of the Apache Software Foundation.