Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation, and analysis of open web data accessible to researchers.
The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.