January 13, 2026

GneissWeb Annotations Examples

Note: this post has been marked as obsolete.
A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.
Thijs Dalhuijsen
Thijs Dalhuijsen is a Senior Software Engineer at Common Crawl. He works on backend systems, automation, and data infrastructure to power large-scale web access and analysis.

Last fall we announced a collection of category and quality annotations, computed with the GneissWeb classifiers from the IBM Data Prep Kit.

Our methodology was to take all pages of the FineWeb dataset, check whether they were included in GneissWeb, and compute the four classification scores (medical, science, technology, educational) for the matching records. See our previous post for more details.
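For background, the category annotators are fastText classifiers. A minimal sketch of scoring a single document locally might look like the following; note that the model filename and label here are placeholders, not real artifact names — see the Data Prep Kit and the GneissWeb model cards for the actual models:

# Sketch: scoring one document with a GneissWeb-style category classifier,
# assuming the fastText-based models used by IBM Data Prep Kit.
# 'gneissweb_medical.bin' is a placeholder filename.
import fasttext

model = fasttext.load_model("gneissweb_medical.bin")

text = "Randomized trials of statin therapy show reduced cardiovascular risk."
# fastText predicts on a single line, so strip embedded newlines first.
labels, probs = model.predict(text.replace("\n", " "))
print(labels[0], float(probs[0]))  # e.g. ('__label__medical', 0.87)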

Thanks to these annotations, it is now possible to, for instance, select a subset of one of our crawls containing only hosts with a high probability of hosting medical content.

This is illustrated by the following practical DuckDB SQL example:

INSTALL httpfs;
LOAD httpfs;
CALL load_aws_credentials();
SET s3_region='us-east-1';

-- Top ten hosts by centrality rank (hcrank) among hosts whose GneissWeb
-- medical score exceeds 0.5, joining the host index with the GneissWeb
-- host-level annotations on the SURT host name.
SELECT
  h.surt_host_name,
  h.hcrank,
  h.hcrank10,
  g.gneissweb_education  AS education,
  g.gneissweb_medical    AS medical,
  g.gneissweb_science    AS science,
  g.gneissweb_technology AS technology
FROM read_parquet('s3://commoncrawl/projects/host-index-testing/v2/crawl=CC-MAIN-2021-49/*.parquet') AS h
JOIN read_parquet('s3://commoncrawl/projects/gneissweb-annotation-testing-v1/hosts/crawl=CC-MAIN-2021-49/*.parquet') AS g
  ON h.surt_host_name = g.surt_host_name
WHERE g.gneissweb_medical > 0.5
ORDER BY h.hcrank DESC
LIMIT 10;
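We run this query from a small Python wrapper. The actual script isn't shown here, but a minimal sketch of what gneissweb_medical.py might look like, using the duckdb Python package, is:

# gneissweb_medical.py -- minimal sketch of a DuckDB wrapper (an assumption,
# not the actual script); pip install duckdb pandas tabulate
import duckdb

QUERY = """
SELECT
  h.surt_host_name, h.hcrank, h.hcrank10,
  g.gneissweb_education  AS education,
  g.gneissweb_medical    AS medical,
  g.gneissweb_science    AS science,
  g.gneissweb_technology AS technology
FROM read_parquet('s3://commoncrawl/projects/host-index-testing/v2/crawl=CC-MAIN-2021-49/*.parquet') AS h
JOIN read_parquet('s3://commoncrawl/projects/gneissweb-annotation-testing-v1/hosts/crawl=CC-MAIN-2021-49/*.parquet') AS g
  ON h.surt_host_name = g.surt_host_name
WHERE g.gneissweb_medical > 0.5
ORDER BY h.hcrank DESC
LIMIT 10
"""

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("CALL load_aws_credentials()")  # picks up your AWS credentials
con.execute("SET s3_region='us-east-1'")
# to_markdown() renders the pipe table shown below (needs the tabulate package)
print(con.sql(QUERY).to_df().to_markdown(index=False))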

Running this example, we get:

$ python gneissweb_medical.py
 
| surt_host_name     |      hcrank |   hcrank10 |   education |   medical |     science |   technology |
|--------------------|-------------|------------|-------------|-----------|-------------|--------------|
| gov,cdc            | 2.13851e+07 |      5.816 |   0.147358  |  0.690235 | 0.217257    |   0.0845532  |
| gov,nih,nlm,ncbi   | 2.11571e+07 |      5.754 |   0.126709  |  0.803236 | 0.919274    |   0.367261   |
| com,nature         | 2.04899e+07 |      5.572 |   0.0674915 |  0.789117 | 0.864001    |   0.294394   |
| com,walmart        | 2.03215e+07 |      5.526 |   0.0229006 |  0.880103 | 0.000637725 |   0.0612721  |
| com,springer,link  | 2.01602e+07 |      5.483 |   0.148069  |  0.672549 | 0.634984    |   0.179368   |
| gov,fda            | 2.0146e+07  |      5.479 |   0.0649566 |  0.652312 | 0.182145    |   0.117804   |
| com,healthline     | 2.00746e+07 |      5.459 |   0.0383815 |  0.787791 | 0.0785336   |   0.0353599  |
| us,mn,state,health | 2.00538e+07 |      5.454 |   0.187174  |  0.750825 | 0.100403    |   0.0726731  |
| com,webmd          | 1.99372e+07 |      5.422 |   0.0492467 |  0.792948 | 0.0899973   |   0.0407723  |
| org,healthaffairs  | 1.98987e+07 |      5.412 |   0.111606  |  0.714271 | 0.0158997   |   0.00757655 |

We have made this data available not only at the host level, but also with increased granularity at the URL level, making it possible to select only the relevant pages from within a crawl.
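For example, pulling high-scoring medical pages directly from the URL-level dataset could look like the sketch below. Note that the urls/ prefix, the url column, and the 0.9 threshold are assumptions modelled on the host-level layout; check the dataset documentation for the actual schema:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='us-east-1'")

# 'urls/' and the 'url' column are assumed by analogy with the host-level
# dataset; adjust to the published schema.
con.sql("""
    SELECT url, gneissweb_medical
    FROM read_parquet('s3://commoncrawl/projects/gneissweb-annotation-testing-v1/urls/crawl=CC-MAIN-2021-49/*.parquet')
    WHERE gneissweb_medical > 0.9
    LIMIT 20
""").show()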

The schema of this new dataset is compatible with our annotation system, so you can also use our cc-index-annotations project on GitHub to query it:

git clone https://github.com/commoncrawl/cc-index-annotations 
cd cc-index-annotations
pip install -r requirements.txt
make gneissweb
cd examples/gneissweb
python annotate.py left_web_host_index.yaml join_s3_gneissweb_host.yaml action_gneissweb_medical.yaml

All crawls from CC-MAIN-2013-20 to CC-MAIN-2024-18 (inclusive) are covered by these new datasets, which are available both on Common Crawl’s S3 bucket and on Hugging Face:

Host Level Annotations

URL Level Annotations
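Because the datasets are laid out with Hive-style crawl=... partitions, you can also enumerate the available crawls directly from DuckDB. A sketch (this scans file metadata over S3, so expect it to take a while):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='us-east-1'")

# With hive_partitioning enabled, the crawl=... directories are exposed
# as a 'crawl' column that can be queried like any other.
con.sql("""
    SELECT DISTINCT crawl
    FROM read_parquet(
        's3://commoncrawl/projects/gneissweb-annotation-testing-v1/hosts/crawl=*/*.parquet',
        hive_partitioning = true)
    ORDER BY crawl
""").show()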

We’d love to hear from you if you use these annotations in any of your projects; don’t hesitate to reach out with questions via our Discord or Google Group.


Erratum: Content is truncated


Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
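If you need to check how much of your data is affected, truncated fetches are flagged with a WARC-Truncated header. A minimal sketch using the warcio package (an assumption; our analysis notebook may use different tooling):

# Count truncated response records in a WARC file.
import sys
from warcio.archiveiterator import ArchiveIterator

truncated = total = 0
with open(sys.argv[1], "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        total += 1
        # Truncated fetches carry a WARC-Truncated header (e.g. 'length').
        if record.rec_headers.get_header("WARC-Truncated"):
            truncated += 1
print(f"{truncated}/{total} response records truncated")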

For more details, see our truncation analysis notebook.