Search results
Amazon Web Services sponsoring $50 in credit to all contest entrants! Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits?…
Common Crawl Code Contest Extended Through the Holiday Weekend. Do you have a project that you are working on for the Common Crawl Code Contest that is not quite ready? If so, you are not the only one.…
TalentBin Adds Prizes To The Code Contest. The prize package for the Common Crawl Code Contest now includes three Nexus 7 tablets thanks to TalentBin! Common Crawl Foundation.…
Still time to participate in the Common Crawl code contest. There is still plenty of time left to participate in the Common Crawl code contest! …
Winners of the Code Contest! We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries.…
Common Crawl's Brand Spanking New Video and First Ever Code Contest! At Common Crawl we've been busy recently!…
First Ever Code Contest. If you’ve been thinking about submitting an entry, you couldn’t ask for a better reason to do so: you’ll have the chance to win an all-access pass to Strata Conference + Hadoop World 2012! The Data. Overview. Web Graphs.…
Content is truncated. Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g. radio streams).…
Contact Us. To communicate with the Common Crawl team and the larger community, please see the Common Crawl Discussion Group and Mailing List. For physical mail correspondence: Common Crawl Foundation, 9663 Santa Monica Blvd. #425, Beverly Hills, CA 90210.…
We hope to have greater coverage of multi-lingual content in this and future crawls.…
Together with the crawl archive for August 2016 we release two data sets containing robots.txt files and server responses with HTTP status codes other than 200 (404s, redirects, etc.)…
The data was crawled between February 24th and March 9th and contains 2.7 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.2 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.…
Towards Social Discovery - New Content Models; New Data; New Toolsets. This is a guest blog post by Matthew Berk, Founder of Lucky Oyster. Matthew has been on the front lines of search technology for the past decade. Matthew Berk.…
It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel.…
In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling.…
It contains 2.65 billion web pages and 220 TiB of uncompressed content, crawled between August 14th and 22nd.…
This can include information like server response codes, content types, languages, and more.…
This is how we think about it (and this is just one opinion of many): Web-scraping, also known as data-scraping or content-scraping, occurs when a bot downloads content without authorization, frequently in order to use it maliciously.…
The archive contains 3.08 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for February 2017 is now available!…
The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for January 2017 is now available!…
From SEO to AIO: Why Your Content Needs to Exist in AI Training Data.…
Is the code open source? Where can people obtain access to the Hadoop classes and other code? Where can people learn more about the stack and the processing architecture? How do you deal with spam and deduping?…
On April 30th, Common Crawl Foundation hosted an event in New York for a select group of leaders in AI, technology, media, and content.…
The archive contains 3.28 billion+ web pages and over 280 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for August 2017 is now available!…
Yet our monthly crawl and release schedule is not well-adapted to this type of content which is based on developing and current events.…
We plan to extend this approach in depth (allowing more URLs per sitemap) and breadth (adding sitemaps from more hosts), provided that it does not impact the quality of crawled content in terms of duplicates and/or spam.…
The Common Crawl Statistics dataset includes metrics such as the number of URLs, domains, bytes, and content types crawled over specific periods.…
Starting with crawl CC-MAIN-2018-39 we added a language classification field (‘content-languages’) to the columnar indexes, WAT files, and WARC metadata for all subsequent crawls.…
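As a minimal illustration (not official code), a PySpark sketch that uses this field to select pages by language from the columnar index; the S3 path and the underscore column name `content_languages` are assumptions based on the public index layout and should be checked against the current documentation:

```python
# A minimal sketch: filter the columnar URL index by detected language.
# Assumptions: the Parquet index lives at the path below and exposes the
# language field as `content_languages` (ISO-639-3 codes) -- verify both.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-language-filter").getOrCreate()

# The columnar index is Parquet, partitioned by crawl and subset.
df = spark.read.load("s3://commoncrawl/cc-index/table/cc-main/warc/")

icelandic = (df.filter(df.crawl == "CC-MAIN-2018-39")
               .filter(df.subset == "warc")
               .filter(df.content_languages == "isl"))
print(icelandic.count())
```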
By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.…
It contains 2.55 billion web pages or 250 TiB of uncompressed content, crawled between November 11th and 23rd with a short operational break on November 16th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before.…
The data was crawled between December 1st and December 15th, and contains 2.64 billion web pages (or 394 TiB of uncompressed content).…
It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th. It includes page captures of 1.2 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel.…
Partial justification of this belief: (a) there already exist blueprints of universal problem solvers developed in my lab, in the new millennium, which are theoretically optimal in some abstract sense although they consist of just a few formulas.…
Pursuant to Title 17, United States Code, Section 512(c)(3), a notification of claimed infringement must be a written communication addressed to the designated agent as set forth below (the "Notice"), and must include substantially all of the following: (a) a…
Example Code. If you’re more interested in diving into code, we’ve provided introductory examples that use the Hadoop or Spark frameworks to process the data, and many more examples can be found in our Tutorials Section and on our GitHub.…
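As a minimal taste of that kind of processing (a sketch, not one of the official examples), here is Python code that iterates the plain-text records of a single WET file with the warcio library; the local filename is a placeholder, and real paths come from each crawl's wet.paths.gz listing:

```python
# A minimal sketch, assuming a locally downloaded WET file. warcio
# auto-detects gzip, and WET text records have record type "conversion".
from warcio.archiveiterator import ArchiveIterator

with open("sample.warc.wet.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace")
            print(url, len(text))
```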
Introducing the Host Index: a new dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defence data. Queryable via AWS tools or downloadable. Greg Lindahl.…
In contrast to other major AI or NLP conferences, COLM is still rather small with approximately 1,500 participants (doubled compared to the first edition) and features only a single track of talks and poster sessions.…
If you're more interested in diving into code, we've provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC files. WARC Format.…
A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.…
We're pleased to announce our first crawl of 2025, containing 3.0 billion pages and 460 TiB of uncompressed content. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation. The crawl archive for January 2025 is now available.…
It contains 2.8 billion web pages and 220 TiB of uncompressed content, crawled between September 17th and 26th. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for September 2018 is now available!…
Q: How can I identify whether my code is using unauthenticated S3 access?…
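One way to answer, sketched under the assumption that the code uses boto3: unauthenticated access is typically configured through botocore's UNSIGNED signature, so searching your codebase for that pattern (or for clients built without credentials) is a quick first check:

```python
# A minimal sketch, assuming boto3 is in use. An unauthenticated
# (anonymous) S3 client is usually created with botocore's UNSIGNED
# signature; grepping for "UNSIGNED" is a quick way to spot it.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unauthenticated: requests are sent unsigned, with no credentials.
anon = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Authenticated: credentials are resolved from the environment,
# ~/.aws/credentials, or an attached IAM role.
auth = boto3.client("s3")
```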
The archive contains 3.2 billion web pages and 260 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for November 2017 is now available!…
The archive contains 3.16 billion+ web pages and over 260 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for June 2017 is now available!…
The archive contains 2.89 billion+ web pages and over 240 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for July 2017 is now available!…
You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus.…
The archive contains 2.9 billion web pages and over 240 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for December 2017 is now available!…
We used the code in the cc-pyspark repository to process our data. First, we wrote a…
The archive contains 2.96 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for May 2017 is now available!…
Jen English is a seasoned professional with a core competency in web content curation, web crawling, taxonomies, and ontology creation. Table of Contents. Common Crawl’s New Host Index. Refreshed Version of Our Whirlwind Tour.…
The archive contains 3.01 billion web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for September 2017 is now available!…
We are pleased to announce the release of our August 2025 crawl, containing 2.44 billion web pages (or 424 TiB of uncompressed content). Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.…
We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages or 468 TiB of uncompressed content. Hande Çelikkanat. Hande is a Senior ML Engineer with the Common Crawl Foundation.…
The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for April 2017 is now available!…
The archive contains 3.65 billion web pages and over 300 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for October 2017 is now available!…
The archive contains 3.07 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for March 2017 is now available!…
We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.…
Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified.…
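As a rough sketch of that workflow, under one assumption (that the downloaded JSON file holds a list of WARC file paths, which depends on the tool that produced it), the subset can be expanded into fetchable URLs like this:

```python
# A minimal sketch. Assumption: the downloaded JSON file contains a list
# of WARC file paths; adjust the parsing if your file uses another schema.
import json

with open("subset.json") as f:  # "subset.json" is a placeholder name
    warc_paths = json.load(f)

# Prefix each path with the public data endpoint to get fetchable URLs,
# so the job runs only against the specified slice of the corpus.
for path in warc_paths:
    print(f"https://data.commoncrawl.org/{path}")
```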