
From the 16th to the 20th of June, the Common Crawl Foundation team was in New York City for the United Nations Open Source Week and select industry side-events. Over the course of the week we engaged with developers, researchers, and policymakers on all things related to Open Source and AI. We presented at IBM’s Thomas J. Watson Research Center and co-hosted the “AI Unconference” at IBM One Madison, a gathering designed for open discussion of what we see as some of the most important issues facing the industry today: transparency, safety, diversity, and the importance of ethical data pipelines.
UN Open Source Maintain-a-thon

Our team attended the Open Source Maintain-a-thon at the United Nations on Tuesday. Attendees from numerous global organisations split into groups and produced “Today I Learned” takeaways, “Tomorrow I Will” actions, and “Gee, I Wish” ideas for application in various areas of the (AI) industry. This culminated in a collective playbook for maintainability, which will be released at a later date via the United Nations website.
IBM Thomas J. Watson Research Center

The team travelled to IBM’s Thomas J. Watson Research Center in Yorktown Heights, where our distinguished engineer Sebastian Nagel gave a series of presentations on Common Crawl’s activities, goals, and partnerships. Our team then met with dozens of representatives from departments across IBM to discuss mutual goals and identify areas where collaboration might benefit the industry at large.

Side-events at LinkedIn, Meta, and PwC
Common Crawl Foundation team members attended LinkedIn’s event “AI and the Future of Work: The ICT Sector in Transition” at their Empire State Building offices in Midtown. This was a chance for the team to meet more industry professionals and policymakers.
Pedro Ortiz Suarez (Senior Research Scientist, Common Crawl) and Laurie Burchell (Senior Research Engineer, Common Crawl) also attended two further side-events. The first took place at Meta’s NYC offices on Friday, where Mary Williamson of Meta presented on the Open Language Data Initiative, which Laurie is co-organising. The second was held at PwC, where Pedro gave a brief general presentation about Common Crawl and data-driven open source software. Pedro and Laurie also spoke with additional industry experts and policymakers. These engagements contributed to ongoing discussions around language data, openness, and cross-sector collaboration.
AI Unconference, IBM One Madison
Our main event was the AI Unconference, part of the official UN Open Source Week side-events, which Common Crawl co-hosted with our friends at IBM, the AI Alliance and BrightQuery. One attendee described it as ‘the most impactful AI event of the year’.
The event brought together over 100 attendees from around the world, including leading technologists and industry pioneers. Highlights included talks from Rich Skrenta (Executive Director, Common Crawl), Jose Plehn-Dujowich (CEO, BrightQuery), Andrea Greco (Research Business Partnerships, IBM), Dean Wampler (Chief Technical Representative to the AI Alliance, IBM), and Thom Vaughan (Principal Technologist, Common Crawl).
Rich Skrenta opened the event with a welcome from Common Crawl, followed by an introduction to the AI Alliance by Andrea Greco, an introduction to BrightQuery from Jose Plehn-Dujowich, an introduction to the AI Alliance’s Open Trusted Data Initiative by Dean Wampler, and a detailed presentation by Thom Vaughan on Common Crawl’s mission. Roberto di Cosmo (Director, Software Heritage) also gave a presentation on their efforts in ethical data collection operations.



This was followed by a dynamic and well-received panel discussion, featuring Jose Plehn-Dujowich, Dean Wampler, Lilith Bat-Leah (DMLR Working Group Co-chair, MLCommons), Dave Buckley (Senior Policy Manager, OpenMined), Greg Lindahl (CTO, Common Crawl), and Roberto di Cosmo. Thom Vaughan served as moderator.
The issues around transparency and accountability in AI discussed by the panel are critical to the industry. A repeated term was “full chain of transparency” (thanks, Dave Buckley!), which the panel broadly agreed is desperately needed across the industry. Another key theme was that attribution and provenance in training data should be systematic: embedded into data practices by design rather than added as an afterthought. Several panellists also highlighted developers’ responsibility to uphold ethical standards, with repeated reference to Croissant, the community-developed metadata standard from MLCommons, as a promising tool for responsible data documentation.

As Dean Wampler noted during the discussions, “Constraints liberate, liberties constrain”, a saying at IBM that resonated with the panel. In the context of AI, the idea points to how well-designed boundaries like clear data documentation standards, transparent governance structures, and ethical constraints can enable greater innovation, trust, and collaboration, rather than limit progress.
Breakout sessions followed the panel, discussing the ethics of large-scale data collection, preserving authenticity in user preference signals, governance and transparency in AI training data usage, collaborative standards across the AI ecosystem, and building trust in public data pipelines.
We would like to thank our friends at the AI Alliance, BrightQuery, and IBM for co-hosting this special event with us. Thanks in particular to Tim Bonnemann, Community Lead at IBM, for his tireless efforts and thoughtful coordination throughout the event.
It was a full and productive week in New York. We had meaningful conversations, made valued connections, and saw real interest in the work we’re doing at Common Crawl.
Our thanks to everyone who took part. We’re looking forward to what comes next.
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
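As a rough illustration of the limits described above, the following Python sketch maps a crawl ID to the truncation threshold that applied to it. The function names are hypothetical, and the assumption that crawl IDs follow the `CC-MAIN-YYYY-WW` pattern is ours. Comparing a payload's length to the limit is only a heuristic; the `WARC-Truncated` header of a record, where present, is the more direct signal.

```python
import re

MIB = 1024 * 1024

# First crawl with the larger limit, as stated in the erratum above.
FIRST_5_MIB_CRAWL = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit (in bytes) for a crawl ID.

    Hypothetical helper: assumes IDs of the form CC-MAIN-YYYY-WW.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year_week = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so (2025, 13) >= (2025, 13) holds
    # and (2024, 51) sorts before it.
    return 5 * MIB if year_week >= FIRST_5_MIB_CRAWL else 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic: a payload that reaches the limit may have been cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)
```

For example, `may_be_truncated("CC-MAIN-2024-51", 1024 * 1024)` flags a 1 MiB payload from a crawl before the change, while the same payload in `CC-MAIN-2025-13` sits below the 5 MiB limit and is not flagged.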