Search results

Common Crawl - Blog - Common Crawl on AWS Public Data Sets

Common Crawl on AWS Public Data Sets. Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets. Common Crawl Foundation.…

Common Crawl - Get Started

(Northern Virginia) AWS Region. You may process the data in the AWS cloud or download it for free over HTTP(S) with a good Internet connection. Choose a crawl.…

Common Crawl - Blog - Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data

Common Crawl joined AWS’s Open Data Sponsorships program, hosted on S3, with free access to everyone. Since then, the dataset has expanded (by petabytes!) and our community of users has seen extraordinary growth.…

Common Crawl - Blog - Amazon Web Services sponsoring $50 in credit to all contest entrants!

Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits?…

Common Crawl - Use Cases

AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS, AWS re:Invent 2018. Jed Sundwall, Sebastian Nagel, Dave Rocamora. Mining Public Datasets using Apache Zeppelin (incubating), Apache Spark and Juju. Alexander Bezzubov.…

Common Crawl - Blog - The Environmental Impact of the Cloud - the Common Crawl Case Study

AWS Carbon Footprint Tool, which can be found in the Billing and Cost Management section of the AWS Console. The screenshot below is from the Common Crawl account used to run the crawls and other processes, such as the Web Graph generation.…

Common Crawl - Blog - News Dataset Available

The data is available on AWS S3 in the commoncrawl bucket at crawl-data/CC-NEWS/. WARC files are released on a daily basis, identifiable by file name prefix which includes year and month.…
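
As a rough illustration of the layout this snippet describes, the sketch below builds a key prefix for one month of CC-NEWS files. The bucket name and the crawl-data/CC-NEWS/ path come from the snippet. The year/month subdirectory layout and the helper names news_prefix and s3_uri are assumptions made for illustration, so check the blog post for the actual naming.

```python
BUCKET = "commoncrawl"             # bucket name taken from the snippet
NEWS_ROOT = "crawl-data/CC-NEWS/"  # path taken from the snippet


def news_prefix(year: int, month: int) -> str:
    """Return a hypothetical key prefix for one month of CC-NEWS WARC files.

    Assumption: files are grouped as <root>/<YYYY>/<MM>/. The snippet only
    states that the file name prefix includes year and month.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{NEWS_ROOT}{year:04d}/{month:02d}/"


def s3_uri(key: str) -> str:
    """Join the bucket name and a key into an s3:// URI."""
    return f"s3://{BUCKET}/{key}"
```

A prefix built this way could then be passed to whichever S3 client or listing tool is in use.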

Common Crawl - Blog - March/April 2024 Newsletter

AWS Performance Improvements. New Collaborators. New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2024

You can download the graph and the ranks of all 362.2 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-may-jun-jul/host/ (this requires an account on AWS).…
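
S3 clients generally take a bucket name and a key prefix rather than a full s3:// path. The following is a minimal sketch of that split using only the Python standard library. The helper name split_s3_uri is hypothetical, and the example path is the one quoted in the snippet above.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key_prefix)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The URL path starts with "/", while S3 keys do not.
    return parts.netloc, parts.path.lstrip("/")


bucket, prefix = split_s3_uri(
    "s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-may-jun-jul/host/"
)
```

The same helper applies to the other web graph paths listed in these results, since they share the bucket and differ only in the release directory.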

Common Crawl - Blog - Common Crawl Code Contest Extended Through the Holiday Weekend

$500 AWS credit. O'Reilly Data Science Kit. Nexus 7 tablet. GitHub pro account. Box full of awesome swag from: GitHub, Kaggle, EFF, Creative Commons, Hortonworks, and more. A 1/3 chance to win an all access pass to Strata + Hadoop World.…

Common Crawl - Blog - Index to WARC Files and URLs in Columnar Format

AWS Athena. The latter makes it possible to run SQL queries on the columnar data even without launching a server. Below you'll find examples of how to query the data with Athena. Examples and instructions for SparkSQL are in preparation.…

Common Crawl - Blog - Still time to participate in the Common Crawl code contest

$500 in AWS credit. O'Reilly Data Science Starter Kit. TCHO Chocolates. A box full of awesome swag including: a Kaggle hoodie, a Github coffee mug and stickers, a Hortonworks elephant, and several great t-shirts.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

You can download the graph and the ranks of all 277.7 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-25-nov-dec-jan/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

You can download the graph and the ranks of all 335.3 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-nov-feb-apr/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

You can download the graph and the ranks of all 348.4 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2023-24-sep-nov-feb/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April, and May 2024

You can download the graph and the ranks of all 336.6 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-feb-apr-may/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

You can download the graph and the ranks of all 319.1 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2023-may-sep-nov/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

You can download the graph and the ranks of all 293.3 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2024

You can download the graph and the ranks of all 361.6 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-jun-jul-aug/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2024

You can download the graph and the ranks of all 299.9 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-aug-sep-oct/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024

You can download the graph and the ranks of all 283.7 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-oct-nov-dec/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2024

You can download the graph and the ranks of all 306.5 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-jul-aug-sep/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025

You can download the graph and the ranks of all 309.2 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2025-feb-mar-apr/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs March, April, and May 2025

You can download the graph and the ranks of all 326.8 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

You can download the graph and the ranks of all 378.7 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2023-mar-may-oct/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, November 2024

You can download the graph and the ranks of all 298.2 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-sep-oct-nov/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2024

You can download the graph and the ranks of all 371.7 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-apr-may-jun/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

You can download the graph and the ranks of all 325 million hosts from AWS S3 at s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

You can download the graph and the ranks of all 267.4 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022

You can download the graph and the ranks of all 384 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/host/ (this requires an account on AWS).…

Common Crawl - Blog - TalentBin Adds Prizes To The Code Contest

$500 in AWS credit. O'Reilly Data Science Starter Kit. Nexus 7 tablet. Bag of awesome swag. A 1 in 3 chance of winning an all access pass to Strata + Hadoop World.…

Common Crawl - Blog - URL Search Tool!

Would you like to win $100 in AWS credit for sharing how URL Search makes your life easier? The first five people who share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!…

Common Crawl - Blog - MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

If you don't already have an account with Amazon Web Services, you can sign up for one at the following URL: https://aws-portal.amazon.com/gp/aws/developer/registration/index.html.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018

You can download the graph and the ranks of all 903 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-aug-sep-oct/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2018

You can download the graph and the ranks of all 886 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/host/.…

Common Crawl - Blog - Introducing the Host Index

Queryable via AWS tools or downloadable. Greg Lindahl. Greg is the Chief Technology Officer at the Common Crawl Foundation. We are pleased to announce a public test of a new web dataset, the Host Index.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

You can download the graph and the ranks of all 539 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

You can download the graph and the ranks of all 820 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2019

You can download the graph and the ranks of all 445 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

You can download the graph and the ranks of all 407 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

You can download the graph and the ranks of all 1.24 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018

You can download the graph and the ranks of all 2 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-feb-mar-apr/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June/July and August 2022

You can download the graph and the ranks of all 449 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/host/ (this requires an account on AWS).…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019

You can download the graph and the ranks of all 492 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

You can download the graph and the ranks of all 490 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/May 2020

You can download the graph and the ranks of all 927 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/host/.…

Common Crawl - Blog - Oct/Nov 2023 Performance Issues

Common Crawl is a part of the AWS Open Data Sponsorship program, and our data is available freely in an S3 bucket named “commoncrawl”. Our datasets have become very popular over time, with downloads doubling every 6 months for several years in a row.…

Common Crawl - Team - Jason Grey

In 1998, he developed an early internet and CD-ROM search engine for 3M using Java Applets, and in 2008, he designed a large-scale web crawling and search solution for highly localized news using early versions of Hadoop, Nutch, SOLR, and AWS.…

Common Crawl - Blog - Now Available: Host- and Domain-Level Web Graphs

You can download the graph and the ranks of all 1.3 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/hostgraph/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July/August and September 2021

You can download the graph and the ranks of all 766 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-jun-jul-sep/host/.…

Common Crawl - Blog - Twelve steps to running your Ruby code across five billion web pages

If you don't already have an Amazon account, go to this page and sign up: https://aws-portal.amazon.com/gp/aws/developer/registration/index.html. Your keys should be accessible here: https://aws-portal.amazon.com/gp/aws/securityCredentials.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April and May 2021

You can download the graph and the ranks of all 515 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/host/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

You can download the graph and the ranks of all 5.1 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/hostgraph/.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

You can download the graph and the ranks of all 2.75 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/.…

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

While setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web, it still is rather expensive for most.…

Common Crawl - Blog - The Open Cloud Consortium’s Open Science Data Cloud

The OSDC has carved out a space between small public infrastructures like AWS, and the very large, dedicated infrastructures needed for projects like the large hadron collider.…

Common Crawl - Blog - Web Data Commons Extraction Framework for the Distributed Processing of CC Data

AWS. ). The basic architectural idea of the extraction tool is to have a queue taking care of the proper handling of all files which should be processed.…

Common Crawl - Blog - Common Crawl's First In-House Web Graph

The host-level graph as well as the rankings are placed on AWS S3 on the path: Alternatively, you can use: as prefix to access the files from everywhere. Download files of the Common Crawl Feb/Mar/Apr 2017 host-level webgraph.…

Common Crawl - Blog - May/June 2025 Newsletter

It is queryable via AWS tools or downloadable. For more details about the public test of this dataset and how to give feedback, see our blog post. Refreshed Version of Our Whirlwind Tour.…

Common Crawl - Blog - Analyzing a Web graph with 129 billion edges using FlashGraph

Da Zheng is a senior applied scientist in AWS AI, interested in building frameworks for data analysis and deep learning. FlashGraph is an SSD-based graph processing framework for analyzing massive graphs.…
