News Crawl

News is a text genre that is often discussed on our user and developer mailing list.

Yet our monthly crawl and release schedule is not well adapted to this type of content, which is driven by developing and current events. Decoupling the news from the main dataset as a smaller sub-dataset makes it feasible to publish the WARC files shortly after they are written.

Using StormCrawler

While the main dataset is produced using Apache Nutch, the news crawler is based on StormCrawler, an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. Using StormCrawler allows us to test and evaluate a different crawler architecture with the following long-term objectives:

Continuously release freshly crawled data

Incorporate new seeds quickly and efficiently

Reduce computing costs through constant, ongoing use of hardware

How to report bugs?

The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us.

We are grateful to Julien Nioche (DigitalPebble Ltd), who, as lead developer of StormCrawler, had the initial idea to start the news crawl project. Julien provided the first news crawler version for free, and volunteered to support initial crawler setup and testing.
