January 23, 2012

Common Crawl on AWS Public Data Sets

Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets.

Common Crawl Foundation

Common Crawl - Open Source Web Crawling data‍

Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets. This is great news because it means that the Common Crawl data corpus is now much more readily accessible and visible to the public. The greater accessibility and visibility is a significant help in our mission of enabling a new wave of innovation, education, and research.

Amazon Web Services (AWS) provides a centralized repository of public data sets that can be integrated in AWS cloud-based applications. AWS makes available such estimable large data sets as the mapping of the Human Genome and the US Census. Previously, such data was often prohibitively difficult to access and use. With the Amazon Elastic Compute Cloud, it takes a matter of minutes to begin computing on the data. Demonstrating their commitment to an open web, AWS hosts public data sets at no charge for the community, so users pay only for the compute and storage they use for their own applications. What this means for you is that our data - all 5 billion web pages of it - just got a whole lot slicker and easier to use. We greatly appreciate Amazon’s support for the open web in general, and we’re especially appreciative of their support for Common Crawl. Placing our data in the public data sets not only benefits the larger community, but it also saves us money. As a nonprofit in the early phases of existence, this is crucial. A huge thanks to Amazon for seeing the importance in the work we do and for so generously supporting our shared goal of enabling increased open innovation!

This release was authored by:

No items found.

Common Crawl on AWS Public Data Sets

The Data

Overview

Web Graphs

Latest Crawl

Resources

Get Started

Blog

Examples

Use Cases

CCBot

Infra Status

FAQ

Community

Research Papers

Mailing List Archive

Discord Server

Collaborators

About

Team

Mission

Impact

Privacy Policy

Terms of Use