We are very excited to announce that the winners of the Norvig Web Data Science Award Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente!
The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science and named in honor of distinguished computer scientist Peter Norvig.
There were many excellent submissions that demonstrated how you can extract valuable insight and knowledge from web crawl data. Be sure to check out the work of the winning team, Traitor – Associating Concepts Using The World Wide Web, and the other finalists on the award website. You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus. All code is open source and we are looking forward to seeing it reused and adapted for other projects.
A huge thank you to our distinguished panel of judges: Peter Norvig, Ricardo Baeza-Yates, Hilary Mason, Jimmy Lin, and Evert Lammerts!
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
Erratum:
Nodes in Domain-Level Webgraphs Not Sorted and May Include Duplicates
The nodes in domain-level Web Graphs may not be properly sorted lexicographically by node label (reversed domain name). It's also possible that few nodes are duplicated, that is two nodes share the same label. For more details, see the Issue Report in the cc-webgraph repository.
The issue affects all domain-level Web Graphs until the issue has been fixed for the May, June/July, August 2022 Web Graph (cc-main-2022-may-jun-aug-domain) and the following Web Graph releases.
