At the turn of October into November, our colleagues Thom Vaughan and Pedro Ortiz Suarez had the opportunity to present at two events, sharing insights into the work of the Common Crawl Foundation and the impact of open web data on research and industry.
Turing Institute NLP Special Interest Group
The first presentation took place at the Turing Institute, as part of the NLP Special Interest Group. Addressing a knowledgeable audience of NLP researchers and practitioners, they discussed how Common Crawl's web-scale data has become a crucial resource for applications in many different areas. Thom and Pedro outlined the dataset's role in training language models and enabling diverse linguistic research, and addressed key challenges associated with curating large-scale web data and the ethical considerations inherent in its use. The session concluded with a constructive discussion that reflected a growing interest in using open data responsibly.
Co-hosted Talk at UCL with Valyu
The second event was held at University College London, co-hosted with Valyu. The talk focused on the transformative potential of open datasets for research and innovation. Thom and Pedro showcased how Common Crawl is used in various academic and industrial projects, highlighting the dataset's contribution to advancements in data science and machine learning. The discussion also covered strategies to enhance data accessibility and the crucial role of collaboration in promoting a healthy open-data ecosystem. Hirsh Pithadia and Harvey Yorke from Valyu's team talked about the implications of a measured rise in restrictions on data in web archives.
Summary
Both events underscored the importance of accessible web data for driving scientific and technological progress. We are grateful to the Turing Institute and UCL, along with Valyu, for facilitating these discussions and for their commitment to advancing the open-data ecosystem. As we continue our work, we look forward to engaging further with the community and supporting new and impactful applications of our datasets.
We’d like to thank Robert Blackwell and Anthony Rhys Hills at the Turing Institute for the opportunity to present at the NLP Special Interest Group, our friends at Valyu for their insightful talk, and Professor Philip Treleaven from UCL for the warm introduction.