The crawl archive for September 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-40/, contains more than 1.72 billion web pages.
To extend the seed list, we mined sitemaps from the robots.txt dataset and sorted the list of sitemap URLs by host-level page rank from Common Search. The 150,000 highest-ranked sitemaps were added to the crawl seed list. For the majority of sitemaps, a maximum of 5,000 potential new URLs per sitemap was allowed. For the top 5,000 hosts/sitemaps, up to 200,000 potential new URLs were allowed. As a result, the September crawl archive contains 150 million previously unknown URLs. We plan to extend this approach in depth (allowing more URLs per sitemap) and breadth (adding sitemaps from more hosts), provided that it does not impact the quality of crawled content in terms of duplicates and/or spam.
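The per-sitemap limits described above can be sketched as follows. This is only an illustration, not the crawler's actual code: the function and constant names are hypothetical, and it assumes that "top 5,000 hosts/sitemaps" means the first 5,000 entries of the ranked list.

```python
# Limits taken from the text above; all names here are hypothetical.
MAX_SITEMAPS = 150_000     # sitemaps added to the seed list
TOP_SITEMAPS = 5_000       # highest-ranked entries with the larger allowance
URL_CAP_TOP = 200_000      # potential new URLs allowed for a top entry
URL_CAP_DEFAULT = 5_000    # potential new URLs allowed for any other entry


def url_caps(sitemaps_by_rank):
    """Map each selected sitemap URL to its cap on potential new URLs.

    `sitemaps_by_rank` must already be sorted from the highest-ranked
    host to the lowest; entries beyond MAX_SITEMAPS are dropped.
    """
    caps = {}
    for position, sitemap in enumerate(sitemaps_by_rank[:MAX_SITEMAPS]):
        caps[sitemap] = URL_CAP_TOP if position < TOP_SITEMAPS else URL_CAP_DEFAULT
    return caps
```

Extending the approach "in depth" would then correspond to raising the two caps, and "in breadth" to raising `MAX_SITEMAPS`.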
To assist with exploring and using the dataset, we provide gzipped files that list:
By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.
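As a minimal sketch of that step in Python, assuming a listing file has already been downloaded: the helper name `full_paths` is hypothetical, and the only facts taken from the text are the two prefixes and the one-path-per-line gzipped format.

```python
import gzip

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(listing_file, prefix=HTTP_PREFIX):
    """Yield one full location per non-empty line of a gzipped path listing.

    `listing_file` may be a file name or a binary file object.
    """
    with gzip.open(listing_file, "rt", encoding="utf-8") as lines:
        for line in lines:
            path = line.strip()
            if path:  # skip blank lines
                yield prefix + path
```

Passing `S3_PREFIX` instead of the default yields S3 locations for use with S3 tooling.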
The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-40/.
For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.
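For readers who prefer to script index lookups, here is a hedged sketch of building a query URL. It assumes the index server accepts CDX-style queries at an endpoint named after the crawl with an `-index` suffix, and that it understands the `url`, `output` and `limit` parameters; please verify these against the Index Server API documentation. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: CDX-style query endpoint for this crawl (not stated in the post).
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2016-40-index"


def index_query_url(url_pattern, output="json", limit=None):
    """Build a query URL for captures matching `url_pattern`.

    `url_pattern` may be a plain URL or a wildcard such as "example.com/*".
    """
    params = {"url": url_pattern, "output": output}
    if limit is not None:
        params["limit"] = limit
    # urlencode percent-escapes characters such as "/" and "*".
    return INDEX_ENDPOINT + "?" + urlencode(params)
```

The resulting URL can be fetched with any HTTP client; with `output="json"` the server is expected to return one JSON record per line.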
WARC archives containing robots.txt files and responses without content (404s, redirects, etc.) are also provided:
Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.
We are pleased to announce the release of a new dataset containing news articles from news sites all over the world.
The data is available on AWS S3 in the commoncrawl bucket at /crawl-data/CC-NEWS/. WARC files are released on a daily basis and are identifiable by a file name prefix that contains the year, month and day. A full list of the WARC files published to date can be obtained with the AWS Command Line Interface and the command:
aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/
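To group or filter the listed files by day, the date can be read from the file name. The sketch below assumes names of the form CC-NEWS-20160926211809-00000.warc.gz, where the 14-digit field is taken to be a yyyyMMddHHmmss timestamp; the function name is hypothetical.

```python
import re
from datetime import datetime

# Assumption: "CC-NEWS-<14-digit timestamp>-<serial>.warc.gz"
NAME_PATTERN = re.compile(r"CC-NEWS-(\d{14})-\d+\.warc\.gz$")


def warc_timestamp(key):
    """Return the timestamp encoded in a CC-NEWS file name, or None."""
    match = NAME_PATTERN.search(key)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
```

Calling `.date()` on the result gives the day, which is enough to select all files released on a given date.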
The listed WARC files (e.g., s3://commoncrawl/crawl-data/CC-NEWS/2016/09/CC-NEWS-20160926211809-00000.warc.gz) may be accessed in the same way as the WARC files from the main dataset; see how to access and process Common Crawl data. You can access the data even without an AWS account by adding the command-line option
Why a new dataset?
News is a text genre that is often discussed on our user and developer mailing list. Yet our monthly crawl and release schedule is not well adapted to this type of content, which is based on developing and current events. By decoupling the news from the main dataset, as a smaller sub-dataset, it is feasible to publish the WARC files shortly after they are written.
While the main dataset is produced using Apache Nutch, the news crawler is based on StormCrawler, an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. Using StormCrawler allows us to test and evaluate a different crawler architecture towards the following long-term objectives:
- continuously release freshly crawled data
- incorporate new seeds quickly and efficiently
- reduce computing costs with constant/ongoing use of hardware.
The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us. Note that the news dataset is released at an early stage in its development: with further iteration, we intend to improve it in both coverage and quality in the upcoming months.
We are grateful to Julien Nioche (DigitalPebble Ltd), who, as lead developer of StormCrawler, had the initial idea to start the news crawl project. Julien provided the first news crawler version for free, and volunteered to support initial crawler setup and testing.