September 15, 2025

Trip Report: AI_dev (Linux Foundation) August 2025

On the 28th and 29th of August 2025, Thom Vaughan, Pedro Ortiz Suarez, and Thijs Dalhuijsen attended the Linux Foundation’s AI_dev event in Amsterdam.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

We caught up with many familiar faces and made friends with plenty of new ones.

We talked with the folks from OpenSearch, Neo4j, and LFAI & Data (whose booth we staffed for a few hours on Friday). The Neo4j developer relations team are very eager to help us with our initiative to use Neo4j with our Web Graphs.

Photo: Thom and Pedro with folks from the Eventual.Inc team. Left to right: Thom Vaughan, Colin Ho, Sammy Sidhu, and Pedro Ortiz Suarez. “Oh wow, you’re from Common Crawl?!”

We heard from people working in the AI world who use our data regularly. Among them were Sammy Sidhu and Colin Ho from Eventual.Inc (the authors of Daft), who reached out to tell us that our dataset was among the most popular in their user base.

Photo: Lunch with Brewster and Stefano on day one of AI_dev. Left to right: Thijs Dalhuijsen, Pedro Ortiz Suarez, Brewster Kahle, Stefano Maffulli, Thom Vaughan.

We met once again with Stefano Maffulli of the Open Source Initiative and Brewster Kahle of the Internet Archive. On day two we were invited to the opening celebration of the Internet Archive’s new Amsterdam address, where we had a chance to hang out with Brewster, his wife Mary Austin, Stefano, Beatrice Murch, Ben Cerveny, and several others.

Photo: Opening of the Internet Archive’s European HQ in Amsterdam.

We are grateful to our collaborators and friends across the open source and AI communities for their ongoing support and encouragement. Over the course of two days we shared ideas, compared notes on challenges, and discovered new ways our work is being used in research and industry.

A big thank you to the Linux Foundation for hosting an excellent event, and to everyone who stopped to talk with us, share feedback, or suggest ways to work together. We are already looking forward to next year’s edition.

This release was authored by:

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Thijs Dalhuijsen

Thijs Dalhuijsen is Engineering Manager at Common Crawl.
