Common Crawl's Submission to the UK's Copyright and AI Consultation

Published 17th December 2024

The Common Crawl Foundation welcomes the opportunity to respond to the UK Government’s open consultation on “Copyright and Artificial Intelligence.” In our comments below, we provide further background on our leadership in building an open repository of web crawl data and on its utility to researchers, developers, and students, including in the context of AI. We then offer guidance on how the UK can create a supportive legal environment both for us and for the people who depend on our archive for text and data mining (TDM).

Specifically, we advocate for clear, fair exceptions to copyright that facilitate TDM and allow organizations like us to continue supporting research and innovation. We then offer recommendations on how any AI training “opt-outs” should interact with different parts of the ecosystem; in particular, any opt-outs should be applied by the entity using the data for AI training or other TDM, rather than by organizations like ours that crawl sites to create archives.

About Common Crawl

The Common Crawl Foundation is a nonprofit organization that has been operating since 2007 with the mission of preserving and freely sharing samples of the public Internet. Common Crawl has amassed a 10-petabyte archive containing over 250 billion web pages, and this repository grows by an additional 3–5 billion pages each month. Researchers, developers, and students around the world rely on this open data, and the concept of freely sharing a vast corpus for diverse uses is now a mature idea that has proven its value over many years.

We strive to provide a comprehensive crawl of public websites, while also operating as a responsible participant in the online ecosystem. Common Crawl’s crawling agent, CCBot, strictly honors the robots.txt standard to avoid disallowed pages, identifies itself clearly, and respects requests to cease crawling. CCBot also does not bypass paywalls, never logs in to private sites, and crawls at a measured rate to prevent server overload. Additionally, Common Crawl provides cryptographically verifiable provenance for its data, enabling users to confirm the authenticity and integrity of the data.
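
For illustration, a site owner who wished to exclude our crawler from all or part of a site could publish robots.txt directives along the following lines (the example is hypothetical, but “CCBot” is the user-agent token our crawler announces):

    # Hypothetical robots.txt entry: exclude CCBot from the entire site
    User-agent: CCBot
    Disallow: /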

Our open repository of web crawl data lowers the barriers to innovation and technological advancement, because anyone can access vast amounts of web information without the need for costly web crawling or data gathering. This democratization of data allows smaller entities to compete with larger organizations.

While the focus of this consultation is AI, it is important to underscore that our data has been essential to driving progress in a wide range of areas. For instance:

  • Our dataset has enabled significant progress in fields such as language processing, search engine optimization, and web analytics.
  • It has allowed experts to develop and refine algorithms, understand web trends, and enhance user experiences.
  • We have also supported education and learning by providing a valuable resource for students and educators. With access to real-world web data, learners can engage in hands-on projects, gaining practical experience in data analysis and web development. This exposure helps build a robust foundation for the next generation of technologists and data scientists.
  • Common Crawl’s data has been instrumental in various social initiatives, from monitoring misinformation and tracking public health trends to supporting disaster response efforts.
  • Researchers and activists also use this data to analyse social media, news sites, and other web sources, providing insights that can drive social change and inform policy decisions.
  • Common Crawl is a crucial dataset for machine translation research and development. While content in English currently makes up around 43% of the crawled content, there has been demand to include more underrepresented languages. In response, Common Crawl launched its Web Languages project in 2024. This initiative aims to expand the dataset's diversity by incorporating content from 40 additional languages that each have over 50 million speakers, while also collecting data from as many of the world's 7,000+ living languages as possible.

Because of our broad utility to a wide range of research, citations of Common Crawl have grown 60-fold over the course of a decade. This graph shows our citation count in Google Scholar from 2012 to 2025. Overall, our archive has been cited in over 10,000 academic papers, spanning natural language processing, machine learning, digital humanities, and many other fields.

In recent years, our mission to democratize access to high-quality web data has positioned Common Crawl as a linchpin in the AI ecosystem, especially as demand for extensive, diverse datasets has surged with the rise of generative AI. Today, Common Crawl is the source of an estimated 70–90% of the tokens used in training data for nearly all of the world’s large language models (LLMs), making us perhaps the most universally relied-upon resource for LLMs in production. Common Crawl is particularly useful because it encompasses a wide range of topics, writing styles, and perspectives.

Copyright should facilitate TDM

The Common Crawl Foundation supports the rights of creators and content owners to receive proper attribution and fair compensation. At the same time, copyright has never regulated (and should not regulate) the mere act of reading a text, watching a video, listening to audio, and so on. People have always been free to both enjoy and learn from past works in order to create new ones. This is wholly consistent with the purpose of copyright: providing sufficient incentives to create, for the public’s benefit.

Today, people are using machines to assist with reading, watching, and listening to works in order to derive insights and create new works. That is the essence of TDM. Copyright should not prohibit this process simply because a copy is made as an intermediate step. Of course, where the output of that process communicates copyrightable expression from the training data to the public, copyright may still regulate the use. But where the output is non-infringing (deriving facts, ideas, concepts, or other uncopyrightable elements in order to create a new, non-infringing work), the fact that a copy was made as an intermediate step should be irrelevant to the analysis. The “right to read” should encompass the “right to mine,” and people’s rights in the analog world should not disappear simply because we have shifted to a digital one, in which copying is an incidental part of accessing and using any material.

This is true for TDM generally, and for AI training specifically. Large Language Models and other AI models are not databases or copies of the works they were trained on, but rather new tools based on analyzing vast amounts of data in order to derive uncopyrightable elements, like the syntax of language and basic facts about the world.

Accordingly, we strongly support the UK government’s intention to create an exception that allows TDM. If the UK fails to create a clear, fair exception, it will impede the ability of entities like us, and of the people who depend on us, to conduct research and development in the UK. In turn, research, investment, and innovation will flow to other countries with more hospitable legal environments.

Consideration for Opt-Outs

As noted above, the Common Crawl Foundation is a responsible participant in the web ecosystem. We respect robots.txt, do not bypass paywalls, never log in to private sites, and crawl at a measured rate to prevent server overload. These standards and norms have developed over the course of the Web’s history. They continue to work well, and we believe ongoing collaboration among all stakeholders will keep the ecosystem fair and innovative.

As a result, we recommend that the UK work to complement the existing processes for building functional standards and norms. These processes are more flexible than a legally binding “rights reservation” mechanism for opting out. We would have concerns about requirements that disrupted our existing reliance on robots.txt; for instance, organizations like the Common Crawl Foundation should not have to comb through terms of service for arbitrary declarations of opt-outs when the common standard of robots.txt already serves this purpose.
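
As a minimal sketch of why robots.txt works at scale, the opt-out check takes only a few lines with Python’s standard urllib.robotparser; the domain, page URL, and user-agent string below are illustrative rather than taken from our production crawler:

    # Sketch: check whether a site's robots.txt allows a given user agent
    # to fetch a page. "example.com" is a placeholder domain.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's published robots.txt

    # True only if the site's directives permit CCBot to fetch this URL
    allowed = rp.can_fetch("CCBot", "https://example.com/articles/page.html")
    print("CCBot may fetch:", allowed)

No comparable machine-readable convention exists for free-form terms of service, which is why we caution against opt-out requirements that would oblige crawlers to interpret arbitrary legal text.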

Any “opt-out” mechanism must also carefully distinguish between different actors in the ecosystem. Specifically, it should distinguish between entities like Common Crawl, which crawl and ingest data into a dataset, and the AI developers who actually use that data for AI training.

Despite our careful approach to crawling, conflicts sometimes arise when publishers initially offer content for free but later move it behind paywalls. Those publishers may object if people (or AI systems) access the once-public version without paying, in part because of financial motivations tied to search visibility.

In the Common Crawl Foundation’s view, the point of ingestion (the right to crawl and archive publicly accessible data) must be protected to ensure a comprehensive archive of online information. Meanwhile, the point of application (where the data is ultimately used) bears responsibility for compliance with standards, norms, and laws. The point of application is also where measures such as an opt-out registry should be enforced, giving content owners control over how their data is used in AI applications while still allowing the free flow of information at the ingestion stage. Removing data from the archive is not the most effective way to ensure responsible behavior; rather, establishing clear guidelines and accountability at the usage level creates a more balanced framework and is more sustainable in the long term.

Diverse initiatives, from major technology firms to emerging startups, are already exploring mechanisms to facilitate such opt-outs, as well as licensing agreements. From Common Crawl’s perspective, promoting broad distribution of content while implementing a “fee” at the point of monetizable AI application offers a sustainable path forward: one that respects both the need for open data and the legitimate interests of content owners.