August 2014 Crawl Data Available

The August crawl of 2014 is now available! The new dataset is over 200TB in size, containing approximately 2.8 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-35/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list all segments and all WARC, WAT and WET files.

Prepending either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line gives the S3 or HTTP path, respectively.
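
As a minimal Python sketch of that prepending step (the file name warc.paths.gz is a placeholder for whichever gzipped listing you downloaded):

import gzip

PATHS_FILE = "warc.paths.gz"  # placeholder: any of the gzipped path listings
S3_PREFIX = "s3://aws-publicdatasets/"
HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"

with gzip.open(PATHS_FILE, "rt") as listing:
    for line in listing:
        path = line.strip()
        if not path:
            continue
        print(S3_PREFIX + path)    # S3 path
        print(HTTP_PREFIX + path)  # HTTP path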

Thanks again to blekko for their ongoing donation of URLs for our crawl!


Web Data Commons Extraction Framework for the Distributed Processing of CC Data

Robert Meusel


This is a guest blog post by Robert Meusel

Robert Meusel is a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project. The post below describes a new tool produced by Web Data Commons for extracting data from the Common Crawl corpora.

The Web Data Commons project extracts structured data from the Common Crawl corpora and offers the extracted data for public download. We have extracted one of the largest hyperlink graphs currently available to the public. We also extract and offer large corpora of Microdata, Microformats and RDFa annotations as well as relational HTML tables. Why do we do this? Because we share the opinion that data should be available to everybody and because we want to make it easier to exploit the wealth of information that is available on the Web.

For performing the extractions, we need to go through the hundreds of terabytes of crawl data offered by the Common Crawl Foundation. As a project without any direct funding or salaried staff, we needed a time-, resource- and cost-efficient way to process the Common Crawl corpora. We thus developed a data extraction tool which allows us to process the corpora in a distributed fashion using Amazon Web Services (AWS).

The basic architectural idea of the extraction tool is to have a queue that takes care of the proper handling of all files to be processed. Each worker fetches a new file from the queue whenever it is ready and reports the status (success or failure) of the processing back to the queue. Successfully processed files are removed from the queue; failed files are reassigned to another worker, or dropped once a fixed number of workers have been unable to process them.
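
As an illustration only, and not the framework's actual (Java) implementation, a worker loop of this kind could be sketched in Python against Amazon SQS roughly as follows; the queue URL and the process_file function are hypothetical placeholders:

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cc-files"  # hypothetical queue of file paths
MAX_ATTEMPTS = 3  # drop a file after this many failed processing attempts

sqs = boto3.client("sqs")

def process_file(path):
    # Hypothetical placeholder: plug the per-file extraction logic in here.
    pass

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
        AttributeNames=["ApproximateReceiveCount"],
    )
    for msg in resp.get("Messages", []):
        attempts = int(msg["Attributes"]["ApproximateReceiveCount"])
        try:
            process_file(msg["Body"])
        except Exception:
            if attempts >= MAX_ATTEMPTS:
                # Give up on this file after repeated failures and remove it.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            # Otherwise do nothing: the message becomes visible again for another worker.
            continue
        # Success: remove the file from the queue.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])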

We used the extraction tool, for example, to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$500. The extracted graph was less than 100GB in compressed form.

With each new extraction, we improved the tool and gradually turned it into a flexible framework: we now simply plug in the processor needed for a single file, and the framework takes care of everything else.

The framework has now been officially released under the terms of the Apache license. It takes care of everything related to file handling, distribution, and scalability, and leaves the user with only the task of writing the code that extracts the desired information from a single CC file.
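
To give a feel for how small such a per-file processor can be, here is a rough Python sketch that pulls hyperlinks out of a single WARC file. It uses the third-party warcio library rather than the framework's Java API, so treat the names and approach as an illustration, not as the framework itself:

from html.parser import HTMLParser
from warcio.archiveiterator import ArchiveIterator

class LinkExtractor(HTMLParser):
    # Collects href targets of anchor tags from one HTML document.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(warc_path):
    # Yield (page_url, link_target) pairs from a single WARC file.
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", "replace")
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                yield url, link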

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize it for your extraction tasks can be found at

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework. We will continue to improve it and are happy to hear from everybody about their experiences with it.


July 2014 Crawl Data Available

The July crawl of 2014 is now available! The new dataset is over 266TB in size, containing approximately 3.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-23/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list all segments and all WARC, WAT and WET files.

Prepending either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line gives the S3 or HTTP path, respectively.

We’ve also released a Python library, gzipstream, that should enable easier access and processing of the Common Crawl dataset. We’d love for you to try it out!
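
One possible way to combine gzipstream with the Python warc library to stream records out of a multi-member gzipped WARC file is sketched below; the class and file names here are assumptions, so check the project README for the definitive usage:

import warc
from gzipstream import GzipStreamFile  # streaming reader for multi-member gzip files

# Placeholder path: any locally downloaded Common Crawl WARC file.
with open("example.warc.gz", "rb") as raw:
    for record in warc.WARCFile(fileobj=GzipStreamFile(raw)):
        print(record.header.get("WARC-Target-URI"))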

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Note: the original estimate for this crawl was 4 billion pages, but after full analytics were run, the estimate was revised to the 3.6 billion reported above.


April 2014 Crawl Data Available

The April crawl of 2014 is now available! The new dataset is over 183TB in size, containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list all segments and all WARC, WAT and WET files.

Prepending either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line gives the S3 or HTTP path, respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!


Navigating the WARC file format

Wait, what’s WAT, WET and WARC?

CommonCrawl has recently switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion-page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files, which store the raw crawl data
  • WAT files, which store computed metadata for the data stored in the WARC
  • WET files, which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This includes not only the response body, what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.

In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received an HTML page in response. We can also see the page was served from the nginx web server and that a special header, X-hacker, has been added purely for the purpose of advertising to a very specific audience of programmers who might look at the HTTP headers!

WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
WARC-Record-ID: 
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 23.0.160.82
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

HTTP/1.0 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Content-Encoding: gzip
Date: Wed, 04 Dec 2013 16:47:32 GMT
Content-Length: 18953
Connection: close


...HTML Content...
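
A record like the one above can also be read programmatically. As a rough sketch in Python, using the third-party warcio library rather than the Java examples discussed later (the file name is a placeholder):

from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:  # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        server = record.http_headers.get_header("Server")
        print(uri, status, server)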

WAT Response Format

WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.

The HTTP response metadata is most likely to be of interest to CommonCrawl users. The skeleton of the JSON format is outlined below.

  • Envelope
    • WARC-Header-Metadata
    • Payload-Metadata
      • HTTP-Response-Metadata
        • Headers
          • HTML-Metadata
            • Head
              • Title
              • Scripts
              • Metas
              • Links
            • Links
    • Container
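
WAT files are themselves WARC files whose metadata records carry this JSON as their payload, so they can be walked with the same kind of loop. A rough Python sketch, again using warcio purely as an illustration (the file name is a placeholder), that pulls out the Server header recorded for each response:

import json
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wat.gz", "rb") as stream:  # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        data = json.loads(record.content_stream().read())
        http_meta = (data.get("Envelope", {})
                         .get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata"))
        if not http_meta:
            continue  # this WAT record describes a request or metadata record
        print(http_meta.get("Headers", {}).get("Server"))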

WET Response Format

As many tasks only require textual information, the CommonCrawl dataset provides WET files that contain only the extracted plaintext. The way in which this textual data is stored in the WET format is quite simple: the WARC metadata contains various details, including the URL and the length of the plaintext data, and the plaintext itself follows immediately afterwards.

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://advocatehealth.com/condell/emergencyservices3
WARC-Date: 2013-12-04T15:30:35Z
WARC-Record-ID: 
WARC-Refers-To: 
WARC-Block-Digest: sha1:3SJBHMFPOCUJEHJ7OMGVCRSHQTWLJUUS
Content-Type: text/plain
Content-Length: 5765


...Text Content...
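
The plaintext in these conversion records can be consumed with the same kind of loop. A rough word-count sketch in Python, once more using warcio as an illustration (the file name is a placeholder):

from collections import Counter
from warcio.archiveiterator import ArchiveIterator

counts = Counter()
with open("example.warc.wet.gz", "rb") as stream:  # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":
            continue
        text = record.content_stream().read().decode("utf-8", "replace")
        counts.update(text.lower().split())

print(counts.most_common(10))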

Processing the file format

We’ve provided three introductory examples in Java for the Hadoop framework. The code also contains wrapper tools that make it easier to work with the Web Archive Commons library in Hadoop.

These introductory examples include:

  • Counting the number of times various HTML tags are used across the web, using the WARC files
  • Counting the number of different server types found in the HTTP headers, using the WAT files
  • Running a word count over the extracted plaintext found in the WET files

If you’re using a different language, there are a number of open source libraries that handle processing these WARC files and the content they contain.

If in doubt, the tools provided as part of the IIPC’s Web Archive Commons library are the preferred implementation.

Stephen Merity

This is a guest blog post by Stephen Merity

Stephen Merity is a Computational Science and Engineering master’s candidate at Harvard University. His graduate work centers around machine learning and data analysis on large data sets. Prior to Harvard, Stephen worked as a software engineer for Freelancer.com and as a software engineer for online education start-up Grok Learning. Stephen has a Bachelor of Information Technology (Honours First Class with University Medal) from the University of Sydney in Australia.


March 2014 Crawl Data Now Available

The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-10/.

We went a little deeper on this crawl than during our 2013 crawls, so you’ll see more pages per domain. We’re working hard to get a few machines always crawling domains with large numbers of pages, to go even deeper while still maintaining our politeness policy.

Thanks again to Blekko for their ongoing donation of URLs for our crawl.
