November 4, 2024

Reflections on Recent Talks at the Turing Institute and UCL

Note: this post has been marked as obsolete.
Thom Vaughan and Pedro Ortiz Suarez discussed the power of Common Crawl’s open web data in driving research and innovation during two notable presentations last week.
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

At the turn of October into November, our colleagues Thom Vaughan and Pedro Ortiz Suarez had the opportunity to present at two events, sharing insights into the work of the Common Crawl Foundation and the impact of open web data on research and industry.

Turing Institute NLP Special Interest Group

Thom Vaughan, Pedro Ortiz Suarez, Common Crawl Foundation, beside an Enigma machine on loan to the Turing Institute from GCHQ. Photo credit: Robert Blackwell

The first presentation took place at the Turing Institute, as part of the NLP Special Interest Group. Addressing a knowledgeable audience of NLP researchers and practitioners, Thom and Pedro discussed how Common Crawl's web-scale data has become a crucial resource for applications in many different areas. They outlined the dataset's role in training language models and enabling diverse linguistic research, and addressed key challenges associated with curating large-scale web data and the ethical considerations inherent in its use. The session concluded with a constructive discussion that reflected a growing interest in using open data responsibly.

Co-hosted Talk at UCL with Valyu

Thom Vaughan, Pedro Ortiz Suarez, Common Crawl Foundation. Photo credit: Valyu

The second event was held at University College London, co-hosted with Valyu. The talk focused on the transformative potential of open datasets for research and innovation. Thom and Pedro showcased how Common Crawl is used in various academic and industrial projects, highlighting the dataset's contribution to advances in data science and machine learning. The discussion also covered strategies to enhance data accessibility and the crucial role of collaboration in promoting a healthy open-data ecosystem. Hirsh Pithadia and Harvey Yorke from Valyu's team talked about the implications of the measured rise in restrictions on data in web archives.

Hirsh Pithadia, Valyu.

Summary

Both events underscored the relevance and importance of accessible web data for driving forward scientific and technological progress. We are grateful to the Turing Institute and UCL, along with Valyu, for facilitating these discussions and for their commitment to advancing the open data landscape. As we continue our work, we look forward to further engaging with the community and supporting new and impactful applications of our datasets.

We’d like to thank Robert Blackwell and Anthony Rhys Hills at the Turing Institute for the opportunity to present at the NLP Special Interest Group, our friends at Valyu for their insightful talk, and Professor Philip Treleaven from UCL for the warm introduction.
