The Common Crawl corpus contains petabytes of data, regularly collected since 2008.
The corpus contains raw web page data, metadata extracts, and text extracts.
Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
Learn how to Get Started.
Access to the corpus hosted by Amazon is free.
You may run analysis jobs directly against it on Amazon’s cloud platform, or you can download it, in whole or in part.
You can search for pages in our corpus using the Common Crawl URL Index.
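As a rough illustration of what a URL Index lookup can look like, the sketch below builds a query URL and parses a response. It assumes the index is served at index.commoncrawl.org with one endpoint per crawl, that it accepts `url` and `output=json` query parameters, and that it returns one JSON record per line. The crawl ID used in the usage note and the helper names `build_index_query` and `parse_index_response` are illustrative, not part of any official client.

```python
import json
from urllib.parse import urlencode

# Assumption: base address of the URL Index service.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build a query URL for one crawl's index (hypothetical helper).

    crawl_id    -- identifier of a crawl, e.g. "CC-MAIN-2024-10" (illustrative)
    url_pattern -- the page URL or URL pattern to look up
    """
    params = {"url": url_pattern, "output": "json"}
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


def parse_index_response(body):
    """Parse a response body made of one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

To try it, pass the built URL to `urllib.request.urlopen`, decode the body as UTF-8, and hand the text to `parse_index_response`. Substitute a crawl ID taken from the published list of crawls for the illustrative one.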
Check out the Example Projects, view the Use Cases, or browse the Statistics for our crawls.