March 31, 2025

Introducing Common Crawl AI Agent by ReadyAI

Note: this post has been marked as obsolete.
We are pleased to announce the launch of an experimental AI Agent, developed by our friends at ReadyAI. The agent offers a conversational interface designed to help users explore Common Crawl’s data, use cases, and community initiatives.
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

We are pleased to introduce an experimental Common Crawl AI Agent, developed by our friends at ReadyAI. This AI Agent combines an LLM with RAG (Retrieval-Augmented Generation) to answer questions by searching three sources: the content of our website, pages one hop away from it on the web, and our public mailing list archive.
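For readers new to RAG, here is a minimal sketch of the retrieval step in Python. It is not ReadyAI's implementation: the three snippets in `DOCUMENTS`, the word-overlap scoring, and the function names `retrieve` and `build_prompt` are all assumptions made for illustration, and a production system would more likely use learned embeddings and a real LLM call.

```python
import math
import re
from collections import Counter

# Hypothetical snippets standing in for website pages and mailing list posts.
DOCUMENTS = {
    "formats": "Common Crawl publishes crawl data as WARC, WAT and WET files.",
    "indexes": "Common Crawl provides a CDX index and a columnar index for looking up captures.",
    "archiving": "Web archiving preserves copies of web pages so they can be studied later.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, k=1):
    """Return the ids of up to k documents that share words with the question."""
    query = Counter(tokenize(question))
    scored = sorted(
        ((cosine(query, Counter(tokenize(text))), doc_id)
         for doc_id, text in DOCUMENTS.items()),
        reverse=True,
    )
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(question, k=1):
    """Place the retrieved snippets ahead of the question for the LLM."""
    context = "\n".join(DOCUMENTS[doc_id] for doc_id in retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("What is the columnar index?"))
```

In this toy setup, a question that shares no words with any snippet retrieves nothing, so the prompt carries an empty context and the model has only its own training data to draw on.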

Try it out here: https://commoncrawl.org/ai-agent

Examples of things it’s pretty good at answering:

  • Questions about Common Crawl’s data formats
  • Questions about Common Crawl’s indexes, both CDX and columnar
  • Questions about example uses of Common Crawl data
  • Generic questions about web archiving

The end of most answers contains a link to a specific webpage with more information about the answer.

Like all LLM+RAG systems, it has a few limitations:

  • One of the example queries asks how many harvard.edu pages Common Crawl has crawled. The AI Agent gives an answer from a few months ago, but this number changes every month. Why does the AI Agent say that? The figure comes from a single post in our mailing list archive, and the nuance that it changes every month is difficult to teach to the AI Agent.
  • If you ask a question that’s totally out of scope, like “What is the Frumious Bandersnatch?”, the AI Agent will answer based on what the LLM knows, even though the RAG system (which searches our website, pages one hop away, and our mailing list) doesn’t know anything about Lewis Carroll’s poetry.
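To see why a single harvard.edu figure goes stale, it helps to remember that each crawl has its own index, so any count is tied to one specific crawl. The sketch below builds a lookup URL per crawl. It assumes the URL pattern of Common Crawl's public CDX index server at index.commoncrawl.org; `cdx_query_url` is a hypothetical helper, and the two crawl IDs are examples only, so substitute IDs from the current crawl list.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base address of the public CDX index server.
INDEX_HOST = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id, domain):
    """Build a CDX lookup URL for every host under `domain` in one crawl."""
    params = {"url": f"*.{domain}", "output": "json"}
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative crawl IDs: each one has a separate index, and therefore
    # a separate answer to "how many harvard.edu pages were crawled?".
    for crawl_id in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(cdx_query_url(crawl_id, "harvard.edu"))
```

The code only constructs the URLs and makes no network request. Fetching each URL and counting the returned records would give one number per crawl, which is why an answer quoted from an old mailing list post can lag behind the latest crawl.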

ReadyAI has been updating the RAG data in real time, and we’re looking forward to future improvements.

We’ve had fun experimenting with this AI Agent, and we’d love to hear what you think about it.

Please feel free to join our Discord server or Google Group to let us know how you get on.
