We are pleased to announce the release of a new dataset containing news articles from news sites all over the world.
The data is available on AWS S3 in the commoncrawl bucket at /crawl-data/CC-NEWS/. WARC files are released on a daily basis and are identifiable by a file name prefix that contains the year, month and day. A full list of the WARC files published to date can be obtained with the AWS Command Line Interface and the command:
aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/
The listed WARC files (e.g., s3://commoncrawl/crawl-data/CC-NEWS/2016/09/CC-NEWS-20160926211809-00000.warc.gz) may be accessed in the same way as the WARC files from the main dataset; see how to access and process Common Crawl data. You can access the data even without an AWS account by adding the command-line option
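For readers who want to select files by date, here is a minimal Python sketch of how the file name and path in the example above could be interpreted. It assumes that the 14-digit number in the file name is a timestamp of the form YYYYMMDDhhmmss, that the trailing 5-digit number is a serial counter, and that files are grouped under year/month directories. All of this is inferred from the single example path, not from a published specification. The names parse_news_warc and month_prefix are hypothetical helpers introduced for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Pattern inferred from the one example file name in the text; other files
# in the bucket are assumed (not confirmed) to follow the same layout.
_NAME_RE = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")


class NewsWarc(NamedTuple):
    captured: datetime  # timestamp embedded in the file name (time zone unknown)
    serial: int         # trailing counter, e.g. 00000


def parse_news_warc(path: str) -> NewsWarc:
    """Extract the timestamp and serial number from a CC-NEWS WARC path."""
    match = _NAME_RE.search(path)
    if match is None:
        raise ValueError(f"not a CC-NEWS WARC file name: {path!r}")
    stamp, serial = match.groups()
    return NewsWarc(datetime.strptime(stamp, "%Y%m%d%H%M%S"), int(serial))


def month_prefix(year: int, month: int) -> str:
    """Build the S3 prefix for one month of files (assumed directory layout)."""
    return f"s3://commoncrawl/crawl-data/CC-NEWS/{year:04d}/{month:02d}/"


example = "s3://commoncrawl/crawl-data/CC-NEWS/2016/09/CC-NEWS-20160926211809-00000.warc.gz"
info = parse_news_warc(example)
print(info.captured.date(), info.serial)
print(month_prefix(info.captured.year, info.captured.month))
```

Passing the result of month_prefix to the aws s3 ls command instead of the full CC-NEWS path would restrict the listing to a single month, provided the directory layout assumption holds.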
Why a new dataset?
News is a text genre that is often discussed on our user and developer mailing list. Yet our monthly crawl and release schedule is not well adapted to this type of content, which is driven by current and developing events. Decoupling the news from the main dataset, as a smaller sub-dataset, makes it feasible to publish the WARC files shortly after they are written.
While the main dataset is produced using Apache Nutch, the news crawler is based on StormCrawler, an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. Using StormCrawler allows us to test and evaluate a different crawler architecture with the following long-term objectives in mind:
- continuously release freshly crawled data
- incorporate new seeds quickly and efficiently
- reduce computing costs with constant/ongoing use of hardware.
The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us. Note that the news dataset is released at an early stage in its development: with further iteration, we intend to improve both its coverage and its quality in the coming months.
We are grateful to Julien Nioche (DigitalPebble Ltd), who, as lead developer of StormCrawler, had the initial idea to start the news crawl project. Julien provided the first news crawler version for free, and volunteered to support initial crawler setup and testing.