March 3, 2025

Submission to the UK’s Copyright and AI Consultation

Note: this post has been marked as obsolete.
Read our submission to the UK government's Copyright and AI consultation, supporting a legal exception for text and data mining (TDM) while respecting creators’ rights.
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Common Crawl started long before generative AI was front page news. Researchers, developers, and students around the world rely on our archive, analyzing open data in order to advance translation tools, monitor trends in public information on social media, track public health information to support disaster response, and much, much more.

In recent years, our mission to democratize access to high-quality web data has positioned Common Crawl as a linchpin in the AI ecosystem, especially as demand for extensive, diverse datasets has surged with the rise of generative AI. Today, Common Crawl is the source of an estimated 70–90% of the tokens used in training data for nearly all of the world’s large language models (LLMs), making us perhaps the most universally relied-upon resource for LLMs in production. Common Crawl is particularly useful because it encompasses a wide range of topics, writing styles, and perspectives.

The frontispiece of Ted Nelson's Computer Lib/Dream Machines (1974). Original image by John R. Neill for L. Frank Baum's Tik-tok of Oz (1914).

In the face of that growth, policymakers around the world are examining how copyright laws can facilitate text and data mining in general, and AI training in particular, in order to serve the public interest. Changes to the law could have a huge impact not only on Common Crawl and our community, but also on everyone who relies on large-scale, public datasets for computer analysis.

Last week, Common Crawl provided comments to the United Kingdom’s consultation on Copyright and AI. While countries around the world already allow text and data mining using copyrighted materials, the UK does not generally do so. The consultation proposes changing that – it rejects the idea that this is a binary debate that pits “tech v. creators” and instead proposes to create limited exceptions to copyright in a way that serves society as a whole. We strongly support that direction.

“The ‘right to read’ should encompass the ‘right to mine,’ and people’s rights in the analog world should not go away simply because we have shifted to a digital one.”

The Common Crawl Foundation supports the rights of creators and content owners to receive proper attribution and fair compensation. Where the output of text and data mining communicates someone’s protected work to the public – for example, a generative AI tool regurgitating an entire copyrighted book or movie from its training data – copyright regulates that use.

At the same time, copyright has never regulated (and should not regulate) the mere act of reading a text, watching a video, listening to audio, and so on – which is the essence of text and data mining. The “right to read” should encompass the “right to mine,” and people’s rights in the analog world should not go away simply because we have shifted to a digital one, where copies are an incidental process of accessing and using any material.

While we believe that the UK should create a clear, broad exception for text and data mining, we also support existing processes for building functional standards and norms around how rightsholders can express their preferences with respect to AI training. The Common Crawl Foundation is a responsible participant in the web ecosystem – we respect robots.txt, do not bypass paywalls, never log in to private sites, and crawl at a measured rate to prevent server overload. These standards and norms have developed over the course of the web’s history, and they continue to work well. Where they need improvement, efforts have already begun: Common Crawl has been active in the Internet Engineering Task Force (IETF) AI Preferences Working Group and other fora to develop standards by which websites (and other creators) can express preferences with respect to AI training.
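To make the robots.txt convention concrete, here is a minimal sketch of how a polite crawler can check a site's stated preferences before fetching a page, using Python's standard-library parser. The user agent name "ExampleBot", the URLs, and the rules shown are illustrative placeholders, not Common Crawl's actual user agent or implementation.

```python
# Illustrative only: "ExampleBot" and these rules are hypothetical,
# not Common Crawl's real configuration.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a given URL may be crawled by this user agent.
print(parser.can_fetch("ExampleBot", "https://example.com/public/page.html"))   # True
print(parser.can_fetch("ExampleBot", "https://example.com/private/data.html"))  # False

# Honor the site's requested delay between requests, if it declares one.
print(parser.crawl_delay("ExampleBot"))  # 2
```

The same mechanism is the starting point for the preference signals discussed in the IETF work: a machine-readable file the site controls, which well-behaved crawlers consult before acting.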

We hope the UK engages not only with these web fora, but also with Common Crawl and our community. Headlines across the UK last week highlighted musicians’ and newspapers’ objections to tech companies’ generative AI services that were trained on their works. But generative AI is only a tiny slice of the myriad ways text and data mining is used, and neither AI nor text and data mining is just about tech companies. It’s also about all the researchers, developers, and students who rely on archives like ours.
