November 29, 2011

Common Crawl Discussion List

We have started a Common Crawl discussion list to enable discussions and encourage collaboration between the community of coders, hackers, data scientists, developers and organizations interested in working with open web crawl data.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Discuss challenges
Share ideas for projects and products
Look for collaborators and partners
Offer advice and share methods
Ask questions and get advice from others
Show off cool stuff you build
Keep up to date on the latest news from Common Crawl

The Common Crawl discussion list uses Google Groups and you can sign up here.

This release was authored by:

No items found.

Erratum:

Content is truncated

Originally reported by:

More details

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

Common Crawl Discussion List

Erratum:

Content is truncated

The Data

Overview

CDXJ Index

URL Index

Web Graphs

Latest Crawl

Crawl Stats

Graph Stats

Errata

Resources

Get Started

AI Agent

Blog

Examples

CCBot

Infra Status

Opt-Out Registry

FAQ

Community

Research Papers

Mailing List Archive

Hugging Face

Discord

Collaborators

About

About

Team

Jobs

Privacy Policy

Terms of Use