CommonCrawl

Homepage

Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.

 

Recent Data

The latest dataset, from April 2014, contains approximately 2.6 billion web pages and is located in Amazon Public Data Sets at /common-crawl/crawl-data/CC-MAIN-2014-15.
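
As a minimal sketch of how the path above might be used in a script, the following Python turns it into an S3-style URI and extracts the crawl identifier. The bucket name `aws-publicdatasets` and the helper names `crawl_s3_uri` and `crawl_id` are assumptions made for illustration. Only the path itself comes from the text, so check the bucket name before relying on it.

```python
from posixpath import join

# Path exactly as given in the text above.
CRAWL_PATH = "/common-crawl/crawl-data/CC-MAIN-2014-15"

# Assumption: the S3 bucket hosting Amazon Public Data Sets.
# The text does not name the bucket; substitute the real one if it differs.
ASSUMED_BUCKET = "aws-publicdatasets"


def crawl_s3_uri(bucket: str, crawl_path: str) -> str:
    """Join a bucket name and a crawl path into an s3:// URI (hypothetical helper)."""
    return "s3://" + join(bucket, crawl_path.lstrip("/"))


def crawl_id(crawl_path: str) -> str:
    """Return the last path component, which identifies the crawl (hypothetical helper)."""
    return crawl_path.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    print(crawl_s3_uri(ASSUMED_BUCKET, CRAWL_PATH))
    print(crawl_id(CRAWL_PATH))
```

The resulting URI could then be passed to whichever S3 client you prefer. The sketch only does string handling and makes no network requests.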

See our blog post about the new file path lists that are now included.