Table of Contents:
- Common Crawl Celebrates Our 100th Crawl Since 2008!
- AI and the Right to Learn on an Open Internet
- Recent Research Using Common Crawl Data
- Updates to Our Data Products – Help Wanted!
- Volunteer for Common Crawl!
Common Crawl Celebrates Our 100th Crawl Since 2008!
Our latest crawl, May 2024, marks a milestone for Common Crawl – our 100th crawl since we began crawling in 2008! Many people have been involved in making this happen over the years, and we’d like to thank all of the emeritus members of our team: Ahad Rana, Lisa Green, Allison Domicone, Jordan Mendelson, Stephen Merity, Julien Nioche, Sara Crouse, and Alex Xue. Thank you from all of the current members of our team!
AI and the Right to Learn on an Open Internet
On April 30th, the Common Crawl Foundation hosted an event in New York for a select group of leaders in AI, technology, media, and content. The conference, co-hosted with Professor Jeff Jarvis, was intended to foster an open dialogue between stakeholders about how to achieve a common goal of supporting a right to learn on an open Internet. The one-day event, held at the Craig Newmark Graduate School of Journalism at CUNY, featured opening remarks, firestarter mini-sessions, panel discussions, demo time, and networking opportunities. Topics of discussion ranged from risks to the open Internet, fair use, and large language model training to smart uses of AI in journalism, business models, and solutions. The conference was sponsored by Kearney, Tola Capital, and CCIA.
Recent Research Using Common Crawl Data
- Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains - WWW ’24 Conference
- When Online Content Disappears – Pew Research
- Misinformation Resilient Search Rankings with Webgraph-based Interventions - CMU
- Harmony in the Australian Domain Space - UT Sydney
- WebGraph: The Next Generation (Is in Rust) - Inria
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
- Common Crawl's BibTeX index (recently updated)
- Google Scholar search for Common Crawl since 2024
Updates to Our Data Products – Help Wanted!
Our summer intern, Ford Heilizer, has been hard at work making a tool that transforms our usual WARC/WAT/WET data into a table. If the first thing that you do when you download our data is to stick everything in a table, please contact us at info@commoncrawl.org. We'd love any advice you have to offer!
We are also thinking about a project to create a 1:1 round-trippable format that converts a WARC to files in a ZIP, with the WARC metadata saved in spreadsheets. We hope this new format will be useful for users who want to process just a couple of WARCs' worth of data on a laptop. If this interests you, please contact us!
We made a couple of small updates to our existing interfaces.
If you use the Web Graph, https://index.commoncrawl.org/graphinfo.json now contains the list of crawls in each Web Graph. If you use the CDX index, https://index.commoncrawl.org/collinfo.json now has two new fields, “from” and “to”, giving the exact dates when crawling started and ended.
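As a rough illustration of how the new “from” and “to” fields might be used, the Python sketch below picks out the crawls whose date range covers a given day. It rests on several assumptions not stated above: that collinfo.json is a JSON list of objects, that each object has an "id" key, and that the “from” and “to” values begin with an ISO date (YYYY-MM-DD). The function names and the sample records are hypothetical.

```python
import json
from datetime import date
from urllib.request import urlopen

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def fetch_collinfo(url=COLLINFO_URL):
    """Download and decode the collection list (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def crawls_covering(collections, day):
    """Return the ids of crawls whose from/to range includes `day`."""
    matches = []
    for entry in collections:
        # Skip entries that lack either field rather than guessing.
        if "from" not in entry or "to" not in entry:
            continue
        # Assumption: both values begin with an ISO date (YYYY-MM-DD).
        start = date.fromisoformat(entry["from"][:10])
        end = date.fromisoformat(entry["to"][:10])
        if start <= day <= end:
            matches.append(entry["id"])
    return matches


# Hypothetical sample records, shaped like the assumed response.
sample = [
    {"id": "EXAMPLE-A", "from": "2024-05-17T00:00:00", "to": "2024-05-31T00:00:00"},
    {"id": "EXAMPLE-B", "from": "2024-04-12T00:00:00", "to": "2024-04-25T00:00:00"},
    {"id": "EXAMPLE-C"},  # no date fields, so it is skipped
]
print(crawls_covering(sample, date(2024, 5, 20)))  # prints ['EXAMPLE-A']
```

To run this against live data, pass the result of fetch_collinfo() in place of the sample list, and check the actual field format first.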
Volunteer for Common Crawl!
Common Crawl has had some significant contributions made by volunteers over the years, whether they’ve been technologists who love the data, people who have used the data and want to contribute some code as a result, or researchers who have written a paper and open sourced some code.
We also have a list of relatively simple tasks to get you started. Please contact us at info@commoncrawl.org if you are interested.