October 2016 Crawl Archive Now Available

The crawl archive for October 2016 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages.

Similar to the September crawl, we used sitemaps to improve the crawl seed list, including sitemaps named in the robots.txt files of the Alexa top one million domains, and sitemaps from the top 150,000 hosts in Common Search’s host-level page ranks. At most 200,000 URLs were extracted per domain. The resulting crawl includes 2 billion new URLs not contained in previous crawls.

We are grateful to webxtrakt for donating a list of 14 million verified, DNS-resolvable domain names of European country-code TLDs (.eu, .fr, .be, .de, .ch, .nl, .pl). We included these domains in the October crawl and hope for an ongoing partnership with webxtrakt to improve the coverage of the crawls.

To assist with exploring and using the dataset, we provide gzipped files that list:

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths, respectively.
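As a minimal sketch, the prefixing can be done in a couple of lines of Python (the relative path used below is illustrative, not taken from an actual listing file):

```python
# Turn one line of a gzipped listing file into full S3 and HTTPS paths.
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def full_paths(line):
    """Return the (S3, HTTPS) path pair for one listing-file line."""
    rel = line.strip()
    return S3_PREFIX + rel, HTTP_PREFIX + rel

# Illustrative example line:
s3_path, http_path = full_paths("crawl-data/CC-MAIN-2016-44/wat.paths.gz")
```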

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-44/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.
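For illustration, a lookup against this crawl’s index can be assembled as follows. The parameter names follow the CDX-style Index Server API; treat the exact options as assumptions and consult the API documentation:

```python
from urllib.parse import urlencode

# CDX-style endpoint for this crawl's URL index.
INDEX = "http://index.commoncrawl.org/CC-MAIN-2016-44-index"

def build_query(url, output="json"):
    """Build a lookup URL for all captures of `url` in this crawl."""
    return INDEX + "?" + urlencode({"url": url, "output": output})

query = build_query("commoncrawl.org")
# Fetching `query` returns one JSON record per capture; each record
# includes the WARC filename, offset, and length needed to retrieve
# the raw record from the commoncrawl bucket.
```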

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

September 2016 Crawl Archive Now Available

The crawl archive for September 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-40/, contains more than 1.72 billion web pages.

To extend the seed list, we mined sitemaps from the robots.txt dataset and sorted the list of sitemap URLs by host-level page ranks from Common Search. The highest-ranked 150,000 sitemaps were added to the crawl seed list. For the majority of sitemaps, a maximum of 5,000 potential new URLs per sitemap was allowed; for the top 5,000 hosts/sitemaps, up to 200,000 potential new URLs were allowed. As a result, the September crawl archive contains 150 million previously unknown URLs. We plan to extend this approach in depth (allowing more URLs per sitemap) and breadth (adding sitemaps from more hosts), provided that it does not impact the quality of the crawled content in terms of duplicates and/or spam.
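The per-sitemap caps described above amount to a simple budget rule. A toy sketch (the function names are ours, and ranks are assumed to be 1-based):

```python
# Candidate-URL budget per sitemap: 5,000 by default, raised to
# 200,000 for the 5,000 highest-ranked hosts/sitemaps.
def cap_for(rank):
    """Budget for the sitemap ranked `rank` (1-based, 1 = best)."""
    return 200_000 if rank <= 5_000 else 5_000

def take_candidates(urls, rank):
    """Keep at most the budgeted number of potential new URLs."""
    return urls[:cap_for(rank)]
```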

To assist with exploring and using the dataset, we provide gzipped files that list:

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-40/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

WARC archives containing robots.txt files and responses without content (404s, redirects, etc.) are also provided:

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

News Dataset Available

We are pleased to announce the release of a new dataset containing news articles from news sites all over the world.

The data is available on AWS S3 in the commoncrawl bucket at /crawl-data/CC-NEWS/. WARC files are released on a daily basis and are identifiable by a file-name prefix that contains the year, month, and day. A full list of the WARC files published to date can be obtained with the AWS Command Line Interface and the command:

aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/

The listed WARC files (e.g., s3://commoncrawl/crawl-data/CC-NEWS/2016/09/CC-NEWS-20160926211809-00000.warc.gz) may be accessed in the same way as the WARC files from the main dataset; see how to access and process Common Crawl data.
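WARC files are concatenated records with textual headers. As a rough illustration of the record structure (the inline sample below is made up; real processing should use a proper WARC library), the target URL can be pulled out of a record header like this:

```python
# A made-up WARC record header, for illustration only.
SAMPLE = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/news/article\r\n"
          b"Content-Length: 0\r\n"
          b"\r\n")

def warc_header_fields(raw):
    """Parse the WARC header block (everything before the first blank line)."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("utf-8")
    fields = {}
    for line in head.split("\r\n")[1:]:   # skip the WARC/1.0 version line
        name, _, value = line.partition(": ")
        fields[name] = value
    return fields

fields = warc_header_fields(SAMPLE)
```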

Why a new dataset?

News is a text genre that is often discussed on our user and developer mailing list. Yet our monthly crawl and release schedule is not well adapted to this type of content, which is driven by developing and current events. By decoupling news from the main dataset as a smaller sub-dataset, we can publish the WARC files shortly after they are written.

While the main dataset is produced using Apache Nutch, the news crawler is based on StormCrawler, an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. Using StormCrawler allows us to test and evaluate a different crawler architecture towards the following long-term objectives:

  • continuously release freshly crawled data
  • incorporate new seeds quickly and efficiently
  • reduce computing costs through constant/ongoing use of hardware

The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us. Note that the news dataset is released at an early stage of its development: with further iteration, we intend to improve both its coverage and quality in upcoming months.

We are grateful to Julien Nioche (DigitalPebble Ltd), who, as lead developer of StormCrawler, had the initial idea to start the news crawl project. Julien provided the first news crawler version for free, and volunteered to support initial crawler setup and testing.

August 2016 Crawl Archive Now Available

The crawl archive for August 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-36/, contains more than 1.61 billion web pages.

To extend the seed list, we’ve added 50 million hosts from the Common Search host-level page rank dataset. While many of these hosts may already be known, and some may not provide crawlable content, the number of crawled hosts has grown by 18 million (up 50%) and there are 8 million more unique domains (up 35%).

Together with the August 2016 crawl archive we also release two data sets containing robots.txt files and responses without content (404s, redirects, etc.). More information can be found in a separate blog post.

To assist with exploring and using the dataset, we provide gzipped files that list:

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-36/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

Data Sets Containing Robots.txt Files and Non-200 Responses

Together with the crawl archive for August 2016 we release two data sets containing

  • robots.txt files (or whatever servers return in response to a GET request for /robots.txt)
  • server responses with HTTP status code other than 200 (404s, redirects, etc.)

The data may be useful to anyone interested in web science, with various applications in the field. For instance, redirects are substantial elements of web graphs where they are equivalent to ordinary links. The data may also be useful to people developing crawlers, as it enables testing of robots.txt parsers against a huge data set.

This data is provided separately from the crawl archive because it is of little use for natural-language content analysis: robots.txt files are directives read by crawlers, and the content served with 404s, redirects, etc. is usually auto-generated, containing only standardized phrases such as “page not found” or “document has moved”.

The new data sets are available as WARC files in subdirectories of the August 2016 crawl archives:

  • s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/robotstxt/ for the robots.txt responses, and
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/crawldiagnostics/ for 404s, redirects, etc.

Replace the star (*) with each segment name to get the full list of folders. Alternatively, we provide lists of all robots.txt WARC files and of all WARC files containing non-200 HTTP status code responses.
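A sketch of expanding the star for a list of segment IDs (the segment ID below is illustrative, not a real one):

```python
BASE = "s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/"

def subdir_prefixes(segment_ids, kind="robotstxt"):
    """Per-segment folder prefixes for `kind`
    ("robotstxt" or "crawldiagnostics")."""
    return [BASE + seg + "/" + kind + "/" for seg in segment_ids]

# Illustrative segment ID:
prefixes = subdir_prefixes(["1471982290497.47"])
```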

Please share your feedback on these new data sets and let us know whether we should continue to update them regularly every month.

July 2016 Crawl Archive Now Available

The crawl archive for July 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/, contains more than 1.73 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-30/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/, contains more than 1.23 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-26/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/.

To assist with exploring and using the dataset, we provide gzipped files that list:

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-22/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

We are grateful to our friends at Moz for donating a seed list of 400 million URLs to enhance the Common Crawl. The seeds from Moz were used for the May crawl in addition to the seeds from the preceding crawl, and Moz URL data will be incorporated into future crawls as well.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/.

To assist with exploring and using the dataset, we provide gzipped files that list:

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-18/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Note that the April crawl is based on the same URL seed list as the preceding crawl of February 2016. However, the manner in which the crawler follows redirects has changed: redirects are no longer followed immediately; instead, redirect targets from the current crawl are recorded and followed by the subsequent crawl. This approach avoids duplicates in which exactly the same URL is contained in multiple segments (e.g., one of the commoncrawl.org pages). The February crawl contains almost 10% such duplicates.
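The deferred-redirect policy can be sketched as follows (the names and the seed-set mechanism here are ours for illustration, not the crawler’s actual implementation):

```python
# Instead of following a 3xx response immediately, record its target
# as a seed for the next crawl; nothing extra is fetched this crawl.
def handle_response(url, status, location, next_crawl_seeds):
    if 300 <= status < 400 and location:
        next_crawl_seeds.add(location)   # follow in the subsequent crawl
        return None                      # no page fetched now
    return url                           # page fetched in this crawl

seeds = set()
result = handle_response("http://commoncrawl.org/old", 301,
                         "http://commoncrawl.org/new", seeds)
```

Because a redirect target can only enter the frontier in the following crawl, the same URL is no longer fetched from multiple segments within one crawl.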

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!) to complement his role and the organization.

Sebastian has a PhD in Computational Linguistics and several years of experience as a programmer working in search and data. In addition to hands-on experience maintaining and improving a Nutch-based crawler like that of Common Crawl, Sebastian is a core committer to and current chair of the open-source Apache Nutch project. Sebastian’s knowledge of machine learning techniques and natural language processing components of web crawling will help Common Crawl continually improve on and optimize the crawl process and its results.

With Sebastian on board, we have both the competence and momentum to take Common Crawl to the next level.