June 28, 2024

Dialog and Discovery at AI_dev 2024

Note: this post has been marked as obsolete.

This month members from the Common Crawl Foundation attended the AI_dev: Open Source GenAI & ML Summit in Paris, where discussions focused on AI advancements, ethics, and Open Source solutions.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

The audience at the keynote speech by **Ibrahim Haddad**, Executive Director of LF AI & Data (Linux Foundation).

This month (19th and 20th of June 2024) Common Crawl Foundation members Thom Vaughan and Pedro Ortiz Suarez attended the conference AI_dev: Open Source GenAI & ML Summit in Paris, France, organized by LF AI & Data (Linux Foundation).

We had the great privilege of meeting some of the brightest minds in the fields of Artificial Intelligence, Machine Learning, and Open Source Software.

The conference speakers discussed the enormous potential of Artificial Intelligence across numerous industries, with topics ranging from advances in Machine Learning algorithms and their practical applications to the ethics of AI deployment.

The conference featured workshops and technical sessions covering a range of topics, all with a focus on Open Source solutions.

Talk: “Navigating the Ethical Landscape: Responsible AI in Practice”

This panel included:

Adrián González Sánchez (Professor at HEC Montréal and Instituto de Empresa Madrid)
Mirko Boehm (Community Development, Linux Foundation Europe)
Oita Coleman (Senior Advisor at Open Voice TrustMark Initiative)
Pedro Ortiz Suarez (Senior Research Scientist at Common Crawl)

The panel moderator and presenter was Anni Lai (Head of Open Source Operations at LF AI & Data Foundation).

As a panelist, Pedro highlighted how Common Crawl’s vast repository of web data can be used responsibly to train large language models. He stressed the need for transparency, fairness, and accountability in data collection and usage, and noted some potential difficulties.

The panel's comments emphasized the importance of diverse and unbiased training data to ensure AI systems are fair and unprejudiced.

Left to right: Common Crawl Foundation members **Thom Vaughan** (Principal Technologist) and **Pedro Ortiz Suarez** (Senior Research Scientist), at the Linux Foundation’s conference **“AI_dev: Open Source GenAI & ML Summit”** in Paris, 2024.

We look forward to implementing the insights we gained at the conference and continuing our work to make data more accessible and equitable for everyone.

If you have any questions or want to discuss any of these topics further, please feel free to join our discussions on Google Groups and Discord.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.
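As a rough illustration of the thresholds described in this erratum, the following sketch maps a crawl identifier to the fetch size limit that applied to it. The function names are hypothetical, and the sketch assumes crawl identifiers of the form `CC-MAIN-YYYY-WW`; comparing a payload length against the limit is only a heuristic for spotting possibly truncated content, not an official detection method.

```python
import re

MIB = 1024 * 1024

# Thresholds as stated in the erratum text.
OLD_LIMIT_BYTES = 1 * MIB   # crawls before CC-MAIN-2025-13
NEW_LIMIT_BYTES = 5 * MIB   # CC-MAIN-2025-13 and later
FIRST_CRAWL_WITH_NEW_LIMIT = (2025, 13)

# Assumption: crawl IDs look like "CC-MAIN-2025-13" (year, then week number).
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit (in bytes) that applied to a crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"Unrecognized crawl ID format: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    if (year, week) >= FIRST_CRAWL_WITH_NEW_LIMIT:
        return NEW_LIMIT_BYTES
    return OLD_LIMIT_BYTES


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic: a payload at or above the limit may have been cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)
```

For example, `truncation_limit_bytes("CC-MAIN-2024-26")` yields the 1 MiB limit, while `truncation_limit_bytes("CC-MAIN-2025-13")` yields the 5 MiB limit.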
