October 4, 2016

News Dataset Available

We are pleased to announce the release of a new dataset containing news articles from news sites all over the world.

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

The data is available on AWS S3 in the commoncrawl bucket at crawl-data/CC-NEWS/. WARC files are released daily and are identifiable by a file name prefix that includes the year and month. We provide lists of the published WARC files, organized by year and month from 2016 to date. Alternatively, authenticated AWS users can list the files using the AWS Command Line Interface:

aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/

The listed WARC files (e.g., crawl-data/CC-NEWS/2016/09/CC-NEWS-20160926211809-00000.warc.gz) may be accessed in the same way as the WARC files from the main dataset; see how to access and process Common Crawl data.
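As a minimal sketch of working with these paths, the Python helpers below build the key prefix for a given month and parse a WARC file name. The directory layout and the file name pattern (a 14-digit timestamp followed by a 5-digit serial number) are assumptions inferred from the single example path above, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the one example path in this post.
BUCKET = "commoncrawl"
NEWS_ROOT = "crawl-data/CC-NEWS/"

# Assumed name pattern: CC-NEWS-<YYYYMMDDhhmmss>-<5-digit serial>.warc.gz
_NAME_RE = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")


def news_prefix(year: int, month: int) -> str:
    """Return the key prefix assumed to hold one month of news WARC files."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{NEWS_ROOT}{year:04d}/{month:02d}/"


def parse_warc_name(key: str) -> tuple[datetime, int]:
    """Extract the timestamp and serial number from a news WARC key."""
    match = _NAME_RE.search(key)
    if match is None:
        raise ValueError(f"not a CC-NEWS WARC file name: {key}")
    timestamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return timestamp, int(match.group(2))


def s3_uri(key: str) -> str:
    """Turn a bucket-relative key into an s3:// URI."""
    return f"s3://{BUCKET}/{key}"
```

For instance, `s3_uri(news_prefix(2016, 9))` yields a prefix that can be passed to `aws s3 ls` to list only the files for September 2016.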

Why a new dataset?

News is a text genre that is often discussed on our user and developer mailing list. Yet our monthly crawl and release schedule is not well adapted to this type of content, which is driven by current and developing events. By decoupling the news from the main dataset into a smaller sub-dataset, it becomes feasible to publish the WARC files shortly after they are written.

While the main dataset is produced using Apache Nutch, the news crawler is based on StormCrawler, an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. Using StormCrawler allows us to test and evaluate a different crawler architecture towards the following long-term objectives:

continuously release freshly crawled data
incorporate new seeds quickly and efficiently
reduce computing costs with constant/ongoing use of hardware.

The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us. Note that the news dataset is released at an early stage in its development: with further iteration, we intend to improve both its coverage and its quality in the upcoming months.

We are grateful to Julien Nioche (DigitalPebble Ltd), who, as lead developer of StormCrawler, had the initial idea to start the news crawl project. Julien provided the first news crawler version for free, and volunteered to support initial crawler setup and testing.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
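The thresholds above can be turned into a rough check for payloads that may have been cut off. This is only a heuristic sketch: the function names are hypothetical, and the exact cut-over date of 1 March 2025 is an assumption, since the erratum only says "prior to March 2025" and names a main-dataset crawl.

```python
from datetime import date

MIB = 1024 * 1024

# Assumption: the text only says "March 2025"; the exact day is not given.
LIMIT_CHANGE = date(2025, 3, 1)


def fetch_limit_bytes(crawl_date: date) -> int:
    """Return the assumed truncation threshold in bytes for a crawl date."""
    return 5 * MIB if crawl_date >= LIMIT_CHANGE else 1 * MIB


def maybe_truncated(payload_length: int, crawl_date: date) -> bool:
    """Heuristic: a payload at or above the limit may have been truncated."""
    return payload_length >= fetch_limit_bytes(crawl_date)
```

A payload whose length happens to reach the limit is not necessarily truncated, so treat a positive result as a hint to inspect the record rather than as proof.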
