The Stanford Human-Centered AI Institute (HAI), co-founded by Dr. Fei-Fei Li, is a prominent institution that aims to improve the human condition through AI. Li's foundational work with ImageNet, a massive visual database, revolutionized computer vision by demonstrating the importance of large-scale, high-quality data. By providing a common benchmark for researchers, ImageNet helped catalyze the deep learning revolution, proving that a vast amount of curated data could make algorithms more accurate.
This very principle (the power of open, accessible data to drive innovation) is also at the core of the Common Crawl Foundation, a non-profit founded by Gil Elbaz. Both Li and Elbaz share a connection to the California Institute of Technology (Caltech); Li earned her PhD there, while Elbaz is an alumnus with a double major in Engineering & Applied Science and Economics.
Elbaz co-founded Applied Semantics, which Google acquired in 2003, before its IPO. During his tenure at Google, Elbaz continued to work on the Applied Semantics technology and AdSense. AdSense helped establish Google’s position as a leader in online advertising and has been responsible for a substantial amount of revenue since its launch in 2003. In 2007, Gil Elbaz founded Common Crawl with the mission to democratize access to web information, providing a petabyte-scale web crawl that is free for public use.
The academic foundation and commitment to open data that these two figures share highlight a central theme: a collective, open-source approach to data is essential for meaningful progress in artificial intelligence and machine learning, as well as for strengthening the ties between language, community, and culture.
This philosophy brings the Common Crawl Foundation to Stanford HAI for their seminar, "Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data". The seminar, taking place on October 22, 2025, will focus on crucial topics such as privacy, safety, and security. By presenting insights from a new data product, Common Crawl will advocate for greater transparency and informed solutions for the future of public web data, continuing to build on the legacy of open data that pioneers like Dr. Fei-Fei Li and Gil Elbaz have championed.
Please join us in person or virtually at this seminal event.
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
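As a rough illustration of the thresholds described above, the following sketch maps a crawl identifier to the fetch size limit that applied to it. It assumes crawl IDs follow the `CC-MAIN-YYYY-WW` pattern seen in `CC-MAIN-2025-13`; the function names `truncation_threshold` and `may_be_truncated` are hypothetical helpers written for this example, not part of any Common Crawl tooling, and the length comparison is only a heuristic.

```python
import re

MIB = 1024 * 1024

# First crawl with the larger limit, per the erratum text (March 2025).
LIMIT_CHANGE = (2025, 13)

# Assumption: crawl IDs look like "CC-MAIN-YYYY-WW".
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_threshold(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for the given crawl ID."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so (2025, 13) and later get 5 MiB.
    return 5 * MIB if (year, week) >= LIMIT_CHANGE else 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic (an assumption of this sketch): a payload at or above
    the limit for its crawl was possibly cut off during fetching."""
    return payload_length >= truncation_threshold(crawl_id)
```

For example, `truncation_threshold("CC-MAIN-2025-13")` returns the 5 MiB limit in bytes, while any earlier crawl ID returns the 1 MiB limit. A payload shorter than the limit is not flagged by this check.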