Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007.
Overview
We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
Over 240 billion pages spanning 15 years.
A primary training corpus for many LLMs.
82% of raw tokens used to train GPT-3.
Free and open corpus since 2007.
Cited in over 10,000 research papers.
3–5 billion new pages added each month.