
We are pleased to introduce an experimental Common Crawl AI Agent, developed by our friends at ReadyAI. The AI Agent combines an LLM with RAG (Retrieval-Augmented Generation) to answer questions by searching three sources: the content of our website, pages one hop away from it on the web, and our public mailing list archive.
Try it out here: https://commoncrawl.org/ai-agent
Examples of things it’s pretty good at answering:
- Questions about Common Crawl’s data formats
- Questions about Common Crawl’s indexes, both CDX and columnar
- Questions about example uses of Common Crawl data
- Generic questions about web archiving
Most answers end with a link to a specific webpage with more information.
Like all LLM+RAG systems, it has a few limitations:
- One of the example queries asks how many harvard.edu pages Common Crawl has crawled. The AI Agent gives an answer from a few months ago, but this number changes every month. Why did the AI Agent say that? It found that one number in our mailing list archive, and the nuance that the number changes every month is difficult to teach to the AI Agent.
- If you ask a question that’s totally out of scope, like “What is the Frumious Bandersnatch?”, the AI Agent will answer based on what the LLM knows, even though the RAG system (which searches our website, pages one hop away from it, and our mailing list) doesn’t know anything about Lewis Carroll’s poetry.
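To make the out-of-scope behaviour a little more concrete, here is a toy sketch of the retrieval step in an LLM+RAG setup. It is not ReadyAI's implementation: the `DOCUMENTS` dictionary, the example.org URLs, the word-overlap scoring, and the `retrieve` and `build_prompt` helpers are all assumptions made up for illustration, and a real system would most likely use embeddings rather than word overlap. The point is only that when nothing relevant is retrieved, the prompt carries no context, so the LLM can only answer from what it already knows.

```python
import re

# Hypothetical stand-in for the indexed sources (website pages, mailing list posts).
DOCUMENTS = {
    "https://example.org/data-formats": "Crawl data is stored in WARC, WAT and WET files.",
    "https://example.org/indexes": "Two indexes are available: a CDX index and a columnar index.",
}

# Very common words that should not count as evidence of relevance.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "or", "the", "what", "which"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question, documents, min_overlap=1):
    """Return (url, text) pairs that share at least min_overlap tokens with the question, best match first."""
    question_tokens = tokenize(question)
    scored = []
    for url, text in documents.items():
        overlap = len(question_tokens & tokenize(text))
        if overlap >= min_overlap:
            scored.append((overlap, url, text))
    scored.sort(reverse=True)
    return [(url, text) for _, url, text in scored]


def build_prompt(question, hits):
    """Attach retrieved passages to the question, or send the bare question if there are none."""
    if not hits:
        # Nothing retrieved: the model has only its own training knowledge to draw on.
        return f"Question: {question}"
    context = "\n".join(f"[{url}] {text}" for url, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


in_scope = "Which indexes are available, CDX or columnar?"
out_of_scope = "What is the Frumious Bandersnatch"
print(build_prompt(in_scope, retrieve(in_scope, DOCUMENTS)))
print(build_prompt(out_of_scope, retrieve(out_of_scope, DOCUMENTS)))
```

In this sketch the in-scope question picks up the passage about indexes along with its source URL, while the Bandersnatch question reaches the model with no context at all.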
ReadyAI has been updating the RAG data in real time, and we’re looking forward to future improvements.
We’ve had fun experimenting with this AI Agent, and we’d love to hear what you think about it.
Please feel free to join our Discord server or Google Group to let us know how you get on.