July 22, 2024

Common Crawl Statistics Now Available on Hugging Face

Note: this post has been marked as obsolete.
Ford Heilizer
Ford is a Software Engineering Intern at the Common Crawl Foundation, pursuing a Bachelor of Science degree in Computer Science from the University of Southern California.

We're excited to announce that Common Crawl’s statistics are now available on Hugging Face!

Common Crawl has long been an incredibly valuable resource, offering a vast archive of web crawl data that is accessible to the public. Our mission has always been to democratize access to web crawl data, and by providing detailed statistics about our crawls, we aim to empower users with even deeper insights into the data collected.

The Common Crawl Statistics dataset includes metrics such as the number of URLs, domains, bytes, and content types crawled over specific periods. This dataset is important for users who need comprehensive, structured insights into the composition and trends within the web data collected by Common Crawl.

The following statistics files are currently available on Hugging Face:

  • Charsets
  • Crawl Metrics
  • Crawl Overlaps
  • Crawl Size
  • Top 500 Domains
  • Languages
  • MIME Types
  • Top-Level Domains
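Since these files are plain tabular data, they can be explored with standard tooling. Below is a minimal sketch using a few made-up rows shaped roughly like the crawl-size statistics; the column names are illustrative assumptions, not the dataset's exact schema:

```python
import csv
import io

# Hypothetical rows in the rough shape of the crawl-size statistics file;
# the real data comes from the statistics dataset on Hugging Face.
sample = io.StringIO(
    "crawl,page,url,domain\n"
    "CC-MAIN-2024-26,2500000000,2300000000,33000000\n"
    "CC-MAIN-2024-30,2700000000,2400000000,34000000\n"
)

ratios = {}
for row in csv.DictReader(sample):
    # Share of page captures that are unique URLs in this crawl.
    ratios[row["crawl"]] = int(row["url"]) / int(row["page"])

print(ratios)
```

The same pattern applies to any of the files listed above once downloaded from the dataset repository.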

Charsets

The table shows the percentage of character sets used to encode HTML pages in the latest monthly crawls. The character set or encoding is identified (for HTML pages only) by Apache Tika™'s AutoDetectReader.

Crawl Metrics

Crawler-related metrics are extracted from the crawler log files and include:

  • The size of the URL database (CrawlDb)
  • The fetch list size (number of URLs scheduled for fetching)
  • The response status of the fetch:
    • Success
    • Redirect
    • Denied (forbidden by HTTP 403 or robots.txt)
    • Failed (404, host not found, etc.)
  • Usage of HTTP/HTTPS URL protocols (schemes)
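As a sketch, tallying response statuses boils down to a simple counter over log records. The log lines below are invented for illustration and do not reflect the crawler's actual log format:

```python
from collections import Counter

# Hypothetical log lines; the real crawler logs use a different format.
log_lines = [
    "fetch https://example.com/ status=success",
    "fetch https://example.com/old status=redirect",
    "fetch https://example.com/private status=denied",
    "fetch https://example.com/gone status=failed",
    "fetch https://example.org/ status=success",
]

# Count occurrences of each response status.
counts = Counter(line.rsplit("status=", 1)[1] for line in log_lines)
print(counts)
```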

Crawl Overlaps

Overlaps between monthly crawl archives are calculated and listed as the Jaccard similarity of unique URLs or content digests. The cardinalities of the monthly crawls and of the union of two crawls are HyperLogLog estimates. Note that the content overlaps are small and of the same order of magnitude as the 1% error rate of the HyperLogLog cardinality estimates.
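The computation can be sketched as follows: given cardinality estimates for |A|, |B|, and |A ∪ B|, inclusion-exclusion yields the intersection, and the Jaccard similarity is the ratio of intersection to union (the input numbers below are made up):

```python
def jaccard_from_estimates(card_a, card_b, card_union):
    """Jaccard similarity from (HyperLogLog-style) cardinality estimates.

    Inclusion-exclusion gives |A ∩ B| = |A| + |B| - |A ∪ B|; clamp at zero
    because estimation error can push the derived intersection negative.
    """
    intersection = max(card_a + card_b - card_union, 0)
    return intersection / card_union

# Two crawls with ~3.0B and ~3.1B unique URLs, ~5.5B URLs in their union:
print(jaccard_from_estimates(3.0e9, 3.1e9, 5.5e9))
```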

Crawl Size

The number of released pages per month fluctuates over time due to changes in the number of available seeds, the scheduling policy for page revisits, and crawler operating issues. Because of duplicates, the numbers of unique URLs and unique content digests (here HyperLogLog estimates) are lower than the number of page captures. The table also shows the size of various aggregation levels (host, domain, top-level domain / public suffix).
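The distinction between page captures, unique URLs, and unique content digests can be illustrated with exact set-based counting; at Common Crawl's scale this is done with HyperLogLog estimates instead. The captures below are invented for illustration:

```python
import hashlib

# Hypothetical page captures as (url, content) pairs: a URL can be captured
# twice, and distinct URLs can serve identical content.
captures = [
    ("https://example.com/", "<html>home</html>"),
    ("https://example.com/", "<html>home</html>"),            # same URL re-fetched
    ("https://example.com/index.html", "<html>home</html>"),  # duplicate content
    ("https://example.com/about", "<html>about</html>"),
]

unique_urls = {url for url, _ in captures}
unique_digests = {hashlib.sha1(body.encode()).hexdigest() for _, body in captures}

print(len(captures), len(unique_urls), len(unique_digests))  # 4 3 2
```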

Top 500 Domains

This table shows the top 500 registered domains (in terms of page captures) of the last main/monthly crawl. Note that the ranking by page captures only partially corresponds to the importance of domains, as the crawler respects robots.txt and tries hard not to overload web servers; highly ranked domains tend to be underrepresented. If you're looking for a list of domains or hostnames ranked by PageRank or Harmonic Centrality, consider using one of the Web Graph datasets instead.

Languages

The language of a document is identified by Compact Language Detector 2 (CLD2). It is able to identify 160 different languages and up to three languages per document. The table lists the percentage covered by the primary language of a document (returned first by CLD2). So far, only HTML pages are passed to the language detector.
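For illustration, the percentage covered by each primary language can be computed from per-document labels like so; the labels below are made up, not actual CLD2 output:

```python
from collections import Counter

# Hypothetical primary-language labels (the first language returned by CLD2)
# for a small batch of HTML pages.
primary_langs = ["en", "en", "de", "en", "fr", "de", "en", "en"]

counts = Counter(primary_langs)
# Percentage of documents whose primary language is each label.
percentages = {lang: 100 * n / len(primary_langs) for lang, n in counts.items()}
print(percentages)  # {'en': 62.5, 'de': 25.0, 'fr': 12.5}
```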

MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show the percentage of the top 100 media or MIME types in the latest monthly crawls. While the first table is based on the Content-Type HTTP header, the second uses the MIME type detected by Apache Tika™ based on the actual content.
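The difference between the two tables comes down to where the MIME type comes from. A minimal sketch of the header-based variant, which strips parameters such as charset from the Content-Type value (Tika's content-based detection inspects the actual bytes and is not shown here):

```python
def mime_from_content_type(header):
    """Extract the bare media type from a Content-Type header value,
    dropping parameters such as '; charset=UTF-8'."""
    return header.split(";", 1)[0].strip().lower()

print(mime_from_content_type("text/html; charset=UTF-8"))  # text/html
print(mime_from_content_type("application/pdf"))           # application/pdf
```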

Top-Level Domains

Top-Level Domains (abbrev. "TLD"/"TLDs") are a significant indicator of the representativeness of the data, i.e. whether the dataset or a particular crawl is biased towards certain countries, regions, or languages. Note that top-level domain is defined here as the right-most element of a hostname (com in www.example.com). Country-code second-level domains ("ccSLD") and public suffixes are not covered by these metrics.
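Under this definition (right-most element of a hostname), TLD extraction is a one-liner; note that it deliberately does not handle public suffixes:

```python
def tld(hostname):
    """Right-most label of a hostname, matching the definition used
    by these metrics (not the public suffix)."""
    return hostname.rstrip(".").rsplit(".", 1)[-1]

print(tld("www.example.com"))  # com
print(tld("example.co.uk"))    # uk  (not the public suffix "co.uk")
```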

Explore it now!

For more detailed statistics, please visit our official statistics page.

Please reach out to info@commoncrawl.org with any feedback or questions. Our Discord is also a great way to connect with us.

This release was authored by:

Ford Heilizer

Pedro Ortiz Suarez
Pedro is a French-Colombian mathematician, computer scientist, and researcher. He holds a PhD in Computer Science and Natural Language Processing from Sorbonne Université.