March 31, 2025

Introducing Common Crawl AI Agent by ReadyAI

We are pleased to announce the launch of an experimental AI Agent, developed by our friends at ReadyAI. The agent offers a conversational interface designed to help users explore Common Crawl’s data, use cases, and community initiatives.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

We are pleased to introduce an experimental Common Crawl AI Agent, developed by our friends at ReadyAI. The AI Agent combines an LLM with RAG (Retrieval-Augmented Generation): it answers questions by searching content on our website, pages one hop away on the web, and our public mailing list archive.
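For readers new to RAG, here is a toy sketch of the general idea: retrieve the passages most relevant to a question, then hand them to an LLM as context. This is not ReadyAI's implementation. The word-overlap scoring, the three-passage corpus, and the names `tokenize`, `score`, `retrieve`, and `build_prompt` are all assumptions made for illustration, and the LLM call itself is left out.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, passage: str) -> int:
    """Count how many query tokens also appear in the passage."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return up to k passages with a non-zero overlap score, best first."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return [p for p in ranked[:k] if score(query, p) > 0]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to an LLM (the call is omitted)."""
    bullets = "\n".join(f"- {p}" for p in context)
    return f"Answer using this context:\n{bullets}\n\nQuestion: {query}"


# Hypothetical mini-corpus standing in for website and mailing list content.
CORPUS = [
    "Common Crawl maintains an open repository of web crawl data.",
    "Common Crawl offers two indexes, cdx and columnar.",
    "Share feedback on Discord or in our Google Group.",
]

if __name__ == "__main__":
    question = "Which indexes does Common Crawl have?"
    print(build_prompt(question, retrieve(question, CORPUS, k=1)))
```

A real system would typically use embeddings and a vector store rather than word overlap. When nothing in the corpus matches, `retrieve` returns an empty list, and the LLM is left to answer from what it already knows.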

Try it out here: https://commoncrawl.org/ai-agent

Examples of things it’s pretty good at answering:

Questions about Common Crawl’s data formats
Questions about Common Crawl’s indexes, both CDX and columnar
Questions about example uses of Common Crawl data
Generic questions about web archiving

Most answers end with a link to a specific webpage with more information.

Like all LLM+RAG systems, it has a few limitations:

One of the example queries asks how many harvard.edu pages Common Crawl has crawled. The AI Agent gives an answer from a few months ago, but this number changes every month. Why did the AI Agent say that? That figure is one number from our mailing list archive, and the nuance that it changes every month is difficult to teach to the AI Agent.
If you ask a question that’s totally out of scope, like “What is the Frumious Bandersnatch?”, the AI Agent will answer based on what the LLM knows, even though the RAG system (which searches our website, pages one hop away, and our mailing list) doesn’t know anything about Lewis Carroll’s poetry.
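Because the harvard.edu figure differs from crawl to crawl, one way to get a current number is to query the index for a specific crawl rather than rely on a remembered answer. The sketch below only builds the query URL for Common Crawl's public CDX index server and counts records in a response body; it makes no network request. The helper names `build_domain_query` and `count_records` are hypothetical, the crawl ID `CC-MAIN-2025-13` is just an example, and the query parameters (`url`, `matchType`, `output`, `page`) are assumed to follow the pywb-style CDX server API.

```python
from urllib.parse import urlencode

# Public CDX index server; each crawl is assumed to have an endpoint
# named "<crawl-id>-index".
CDX_SERVER = "https://index.commoncrawl.org"


def build_domain_query(crawl_id: str, domain: str, page: int = 0) -> str:
    """Return a CDX index URL listing captures under `domain` for one crawl."""
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains such as www.harvard.edu
        "output": "json",       # one JSON record per line
        "page": page,           # results are paginated
    }
    return f"{CDX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def count_records(ndjson_body: str) -> int:
    """Count capture records in one page of newline-delimited JSON output."""
    return sum(1 for line in ndjson_body.splitlines() if line.strip())


if __name__ == "__main__":
    print(build_domain_query("CC-MAIN-2025-13", "harvard.edu"))
```

Fetching the URL and walking through every result page is left out here. A total for one crawl would be the sum of `count_records` over all pages, and repeating the query with a different crawl ID would be expected to give a different number.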

ReadyAI has been updating the RAG data in real time, and we’re looking forward to future improvements.

We’ve had fun experimenting with this AI Agent, and we’d love to hear what you think about it.

Please feel free to join our Discord server or Google Group to let us know how you get on.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.
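As a small illustration of the thresholds described above, the sketch below maps a crawl ID to the truncation limit that applied to it. The function name `truncation_limit_bytes` is hypothetical, and the code assumes crawl IDs of the form `CC-MAIN-YYYY-WW`, with CC-MAIN-2025-13 as the first crawl using the 5 MiB limit; IDs in any other form are rejected rather than guessed at.

```python
import re

MIB = 1024 * 1024

# Assumed crawl ID shape: CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit, in bytes, for the given crawl."""
    match = _CRAWL_ID.fullmatch(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # 5 MiB from CC-MAIN-2025-13 onwards, 1 MiB for earlier crawls.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


if __name__ == "__main__":
    for cid in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(cid, truncation_limit_bytes(cid))
```

A record whose payload length equals the limit for its crawl is a likely candidate for truncation, though the record's own metadata is the more reliable signal.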