Common Crawl - Blog - Reflections on Recent Talks at the Turing Institute and UCL

Turing Institute NLP Special Interest Group

Thom Vaughan, Pedro Ortiz Suarez, Common Crawl Foundation, beside an Enigma machine on loan to the Turing Institute from GCHQ. Photo credit: Robert Blackwell

The first presentation took place at the Turing Institute, as part of the NLP Special Interest Group. Addressing a knowledgeable audience of NLP researchers and practitioners, Thom Vaughan and Pedro Ortiz Suarez discussed how Common Crawl's web-scale data has become a crucial resource for applications in many different areas. They outlined the dataset's role in training language models and enabling diverse linguistic research, and addressed the key challenges of curating large-scale web data and the ethical considerations inherent in its use. The session concluded with a constructive discussion that reflected a growing interest in using open data responsibly.

Co-hosted Talk at UCL with Valyu

Thom Vaughan, Pedro Ortiz Suarez, Common Crawl Foundation. Photo credit: Valyu

The second event was held at University College London, co-hosted with Valyu. The talk focused on the transformative potential of open datasets for research and innovation. Thom and Pedro showcased how Common Crawl is used in various academic and industrial projects, highlighting the dataset's contribution to advancements in data science and machine learning. The discussion also covered strategies to enhance data accessibility and the crucial role of collaboration in promoting a healthy open-data ecosystem. Hirsh Pithadia and Harvey Yorke of Valyu spoke about the implications of the measured rise in data restrictions across web archives.

Hirsh Pithadia, Valyu.

Summary

Both events underscored the relevance and importance of accessible web data for driving forward scientific and technological progress. We are grateful to the Turing Institute and UCL, along with Valyu, for facilitating these discussions and for their commitment to advancing the open data landscape. As we continue our work, we look forward to further engaging with the community and supporting new and impactful applications of our datasets.

We’d like to thank Robert Blackwell and Anthony Rhys Hills at the Turing Institute for the opportunity to present at the NLP Special Interest Group, our friends at Valyu for their insightful talk, and Professor Philip Treleaven from UCL for the warm introduction.

Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
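As a rough illustration of the thresholds described above, the following Python sketch maps a crawl ID to its fetch size limit and flags payloads that reach it. The function names and the "length at or above the limit" heuristic are assumptions made for this example, not part of any Common Crawl tooling, and the sketch only handles crawl IDs of the form CC-MAIN-YYYY-WW.

```python
import re

MIB = 1024 * 1024

# First crawl with the larger limit, as stated in the erratum (March 2025).
LIMIT_CHANGE = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for a crawl ID like 'CC-MAIN-2025-13'.

    Hypothetical helper: assumes the CC-MAIN-YYYY-WW naming scheme.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # 5 MiB from CC-MAIN-2025-13 onwards, 1 MiB for earlier crawls.
    return 5 * MIB if (year, week) >= LIMIT_CHANGE else 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic (an assumption): a payload that reaches the limit was likely cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)
```

For example, a 1 MiB payload from CC-MAIN-2024-51 would be flagged as possibly truncated, while the same payload from CC-MAIN-2025-13 would not.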
