The Common Crawl corpus contains petabytes of data collected over 7 years of web crawling. It includes raw web page data, metadata extracts, and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms around the world.
Access to the Common Crawl corpus hosted by Amazon is free. You can run analysis jobs directly against it on Amazon’s cloud platform, or you can download part or all of it.
You can search for pages in our corpus using the Common Crawl URL Index.
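As a rough illustration of an index lookup, the Python sketch below builds a query URL and parses a response. It assumes the index is served at `https://index.commoncrawl.org` as a CDX-style endpoint that returns one JSON record per line. The crawl ID `CC-MAIN-2024-10`, the helper names, and the sample record are placeholders chosen for this example, not details taken from the text above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base address of the URL Index service (not stated in the text above).
INDEX_BASE = "https://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern, limit=5):
    """Build a query URL for one crawl's index (hypothetical helper)."""
    params = {"url": url_pattern, "output": "json", "limit": limit}
    return f"{INDEX_BASE}/{crawl_id}-index?{urlencode(params)}"


def parse_index_response(body):
    """Parse a newline-delimited JSON response into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def search_index(crawl_id, url_pattern, limit=5):
    """Fetch and parse matching index records (performs a network request)."""
    with urlopen(build_index_query(crawl_id, url_pattern, limit)) as resp:
        return parse_index_response(resp.read().decode("utf-8"))
```

Calling `search_index("CC-MAIN-2024-10", "example.com/*")` would, under these assumptions, return a list of records describing where each captured page is stored. Keeping the URL builder and the parser separate from the network call makes both easy to check offline.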