CommonCrawl

Homepage

Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.

 

Recent Data

The latest dataset from August 2014, containing approximately 2.8 billion webpages, is located in Amazon Public Data Sets at /common-crawl/crawl-data/CC-MAIN-2014-35.

See our blog post for more details.
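
As a rough illustration of the dataset location given above, the following sketch assembles the path for a given crawl identifier. The bucket name `aws-publicdatasets` and the helper names `crawl_prefix` and `crawl_s3_uri` are assumptions made for this example and are not stated on this page; only the `common-crawl/crawl-data/CC-MAIN-2014-35` path comes from the text.

```python
# Minimal sketch: build the storage path for a Common Crawl dataset.
# ASSUMPTION: the bucket name below is not given on this page; adjust it
# to whatever bucket hosts the Amazon Public Data Sets copy you are using.
ASSUMED_BUCKET = "aws-publicdatasets"


def crawl_prefix(crawl_id: str) -> str:
    """Return the key prefix for a crawl such as 'CC-MAIN-2014-35'."""
    return f"common-crawl/crawl-data/{crawl_id}"


def crawl_s3_uri(crawl_id: str, bucket: str = ASSUMED_BUCKET) -> str:
    """Return a hypothetical s3:// URI for the crawl under the given bucket."""
    return f"s3://{bucket}/{crawl_prefix(crawl_id)}/"


if __name__ == "__main__":
    # Prints the assumed location of the August 2014 crawl.
    print(crawl_s3_uri("CC-MAIN-2014-35"))
```

Keeping the bucket as a parameter means only the path taken from this page is fixed; the rest can be swapped if the hosting location differs.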