Earlier this year, IBM Research published the GneissWeb dataset, built on a state-of-the-art, well-tested recipe for evaluating web document quality and assigning document categories for AI model training. We at Common Crawl were excited about the possibility of making these document annotations accessible to users of our web dataset, whether they are large language model (LLM) or machine learning (ML) trainers, or humanities scholars.
Using the GneissWeb Bloom filter made publicly available by IBM, along with IBM’s Data Prep Kit (now a Linux Foundation AI & Data project) and the GneissWeb group’s category classifiers, we were able to create an annotation for every document (URL) in our crawls. If you think of the index as a table with one row for every URL in the dataset, these annotations allow a user of our crawl to pick out the subset of documents that passes GneissWeb’s quality standard. GneissWeb also assigns category labels (including medical, education, technology, and science), which can be combined with existing document annotations such as language or top-level domain (*.uk = United Kingdom).
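To give a feel for the mechanism, here is a minimal sketch of a membership check against IBM’s published filter using the rbloom library. The key (shown here as a WARC record ID) and the sha256-based hash function follow rbloom’s documented pattern and are assumptions on our part; consult the model card of ibm-granite/GneissWeb.bloom for the exact key and hash used.

```python
from hashlib import sha256

from rbloom import Bloom  # pip install rbloom

def hash_func(key: str) -> int:
    # rbloom expects a signed 128-bit integer from a custom hash function
    h = sha256(key.encode("utf-8")).digest()
    return int.from_bytes(h[:16], "big") - 2**127

# Hypothetical local path to the downloaded filter file
bf = Bloom.load("GneissWeb.bloom", hash_func)

# Membership means the document passed GneissWeb's quality recipe
doc_id = "<urn:uuid:...>"  # e.g. a WARC-Record-ID from a Common Crawl capture
print(doc_id in bf)
```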
In addition to our URL index, Common Crawl also publishes a host index, which allows our dataset users to select web hosts that are, say, “mostly in English” or “have an above-average search-engine-style rank”. We have created a GneissWeb-based host annotation as well, so that a dataset user could, for example, look at all web hosts where more than half of the GneissWeb high-quality pages are categorized as medical.
GneissWeb signals could also be used to examine hosts with a high search-engine-style rank but a low GneissWeb score, and vice versa. This capability opens up many opportunities to improve the quality of labels and ranks.
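As a sketch of what both of these selections could look like in practice, here is a DuckDB query against a hypothetical Parquet layout of the host index. The column names (gneissweb_quality_pages, gneissweb_medical_pages, host_rank, gneissweb_score) and the S3 path are illustrative assumptions, not the published schema:

```python
import duckdb  # pip install duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")  # enables read_parquet over s3://
# (S3 region/credential setup omitted for brevity)

# Hosts where more than half of the GneissWeb high-quality pages are medical
medical_hosts = con.execute("""
    SELECT host,
           gneissweb_medical_pages * 1.0 / gneissweb_quality_pages AS medical_share
    FROM read_parquet('s3://commoncrawl/.../host-index/*.parquet')
    WHERE gneissweb_quality_pages > 0
      AND gneissweb_medical_pages * 2 > gneissweb_quality_pages
    ORDER BY medical_share DESC
""").fetchall()

# The same pattern covers the rank-versus-score comparison, e.g.
#   WHERE host_rank > <high threshold> AND gneissweb_score < <low threshold>
for host, share in medical_hosts[:10]:
    print(f"{host}\t{share:.2f}")
```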
These annotations, at the URL and host level, are available both on the Hugging Face website and in Common Crawl’s S3 bucket.
We look forward to the community at large using these quality annotations and categories to advance AI/ML training and humanities research in an open and responsible way.
Links to Hugging Face and our data bucket are forthcoming.
References
https://arxiv.org/abs/2502.14907
https://research.ibm.com/blog/gneissweb-for-granite-training
https://huggingface.co/ibm-granite/GneissWeb.bloom
https://github.com/data-prep-kit/data-prep-kit
https://github.com/data-prep-kit/data-prep-kit/blob/dev/recipes/GneissWeb/GneissWeb.ipynb
Erratum: Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
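A quick way to check whether a given capture was cut off is the standard WARC-Truncated header, which is set when a payload is truncated. A minimal sketch using the warcio library (the WARC file path is a placeholder):

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Scan a WARC file and report response records truncated at fetch time
with open("CC-MAIN-....warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        truncated = record.rec_headers.get_header("WARC-Truncated")
        if truncated:  # e.g. "length" when the fetch-size limit was hit
            print(record.rec_headers.get_header("WARC-Target-URI"), truncated)
```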
For more details, see our truncation analysis notebook.