CommonCrawl


Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.


Recent Data

The latest dataset is from July 2014, contains approximately 4 billion webpages and is located in Amazon Public Data Sets at /common-crawl/crawl-data/CC-MAIN-2014-23.

See our blog post for the new file path lists that are now included.
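
As a rough illustration of how a file path list might be used, here is a minimal Python sketch. It assumes a path list is a gzip-compressed text file with one path per line, which this page does not confirm; the blog post has the actual format. The sample paths, the helper names read_path_list and paths_in_crawl, and the segment and file names are all hypothetical. Only the CC-MAIN-2014-23 prefix comes from the text above.

```python
import gzip
import io

# Prefix quoted in the text above; the bucket that holds it is not named here.
CRAWL_PREFIX = "common-crawl/crawl-data/CC-MAIN-2014-23"


def read_path_list(stream):
    """Return the non-empty lines of a gzip-compressed path list.

    Assumes one path per line (an assumption, not a documented format).
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def paths_in_crawl(paths, prefix=CRAWL_PREFIX):
    """Keep only the paths that fall under the given crawl prefix."""
    return [p for p in paths if p.lstrip("/").startswith(prefix + "/")]


# Hypothetical sample standing in for a downloaded path list.
sample_lines = [
    CRAWL_PREFIX + "/segments/0000000000000.00/warc/example-00000.warc.gz",
    CRAWL_PREFIX + "/segments/0000000000000.00/warc/example-00001.warc.gz",
    "common-crawl/crawl-data/OTHER-CRAWL/segments/0/warc/example.warc.gz",
]
sample_blob = gzip.compress("\n".join(sample_lines).encode("utf-8"))

all_paths = read_path_list(io.BytesIO(sample_blob))
crawl_paths = paths_in_crawl(all_paths)
print(len(all_paths), len(crawl_paths))
```

With a real path list, the in-memory sample would be replaced by the downloaded file opened in binary mode; the filtering step stays the same.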