Web Data Commons Extraction Framework for the Distributed Processing of CC Data

Robert MeuselThis is a guest blog post by Robert Meusel.
Robert Meusel is a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project. The post below describes a new tool produced by Web Data Commons for extracting data from the Common Crawl data.


The Web Data Commons project extracts structured data from the Common Crawl corpora and offers the extracted data for public download. We have extracted one of the largest hyperlink graphs that is currently available to the public. We also extract and offer large corpora of Microdata, Microformats and RDFa annotations as well as relational HTML tables. If you ask us, why we do this? Because we share the opinion that data should be available to everybody and because we want to make it easier to exploit the wealth of information that is available on the Web.

For performing the extractions, we need to go through all the hundreds of tera-bytes of crawl data offered by the Common Crawl Foundation. As a project without any direct funding or salaried persons, we needed a time-, resource- and cost-efficient way to process the CommonCrawl corpora. We thus developed a data extraction tool which allows us to process the Common Crawl corpora in a distributed fashion using Amazon cloud services (AWS).

The basic architectural idea of the extraction tool is to have a queue taking care of the proper handling of all files which should be processed. Each worker receives a new file from the queue whenever it is ready and informs the queue about the status (success of failure) of the processing. Successfully processed files are removed from the queue, failures are assigned to another worker or eliminated when a fixed number of workers could not process it.

We used the extraction tool for example to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed).  Using our framework and 100 EC2 instances, the extraction took less than 12 hours and did costs less than US$ 500. The extracted graph had a size of less than 100GB zipped.

With each new extraction, we improved the extraction tool and turned it more and more into a flexible framework into which we now simply plug the needed file processors (for one single file) and which takes care of everything else.

This framework was now officially released under the terms of the Apache license. The framework takes care of everything that is related to file handling, distribution, and scalability and leaves to the user only the task of writing the code needed for extracting the desired information from a single out of the all CC files.

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize the framework for your extraction tasks is found at

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework. We will continuously improve the framework and are happy about everybody who gives us feedback about her experiences with the framework.

Web Data Commons

For the last few months, we have been talking with Chris Bizer and Hannes Mühleisen at the  Freie Universität Berlin about their work and we have been greatly looking forward the announcement of the Web Data Commons.  This morning they and their collaborators Andreas Harth and Steffen Stadtmüller released the announcement below.

Please read the announcement and check out the detailed information on the website. I am sure you will agree that this is important work and that you will find their results interesting.

 

Hi all,

we are happy to announce WebDataCommons.org, a joined project of Freie
Universität Berlin and the Karlsruhe Institute of Technology to extract all
Microformat, Microdata and RDFa data from the Common Crawl web corpus, the
largest and most up-to-data web corpus that is currently available to the
public.

WebDataCommons.org provides the extracted data for download in the form of
RDF-quads. In addition, we produce basic statistics about the extracted
data.

Up till now, we have extracted data from two Common Crawl web corpora: One
corpus consisting of 2.5 billion HTML pages dating from 2009/2010 and a
second corpus consisting of 1.4 billion HTML pages dating from February
2012.

The 2009/2010 extraction resulted in 5.1 billion RDF quads which describe
1.5 billion entities and originate from 19.1 million websites.
The February 2012 extraction resulted in 3.2 billion RDF quads which
describe 1.2 billion entities and originate from 65.4 million websites.

More detailed statistics about the distribution of formats, entities and
websites serving structured data, as well as growth between 2009/2010 and
2012 is provided on the project website:

http://webdatacommons.org/

It is interesting to see form the statistics that the RDFa and Microdata
deployment has grown a lot over the last years, but that Microformat data
still makes up the majority of the structured data that is embedded into
HTML pages (when looking at the amount of quads as well as the amount of
websites).

We hope that will be useful to the community by:
+ easing the access to Mircodata, Mircoformat and RDFa data, as you do not
need to crawl the Web yourself anymore in order to get access to a fair
portion of the structured data that is currently available on the Web.
+ laying the foundation for the more detailed analysis of the deployment of
the different technologies.
+ providing seed URLs for focused Web crawls that dig deeper into the
websites that offer a specific type of data.

Web Data Commons is a joint effort of Christian Bizer and Hannes Mühleisen
(Web-based Systems Group at Freie Universität Berlin) and Andreas Harth and
Steffen Stadtmüller (Institute AIFB at the Karlsruhe Institute of
Technology).

Lots of thanks to:
+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data
parsers.
+ the PlanetData and the LOD2 EU research projects which supported the
extraction.

For the future, we plan to update the extracted datasets on a regular basis
as new Common Crawl corpora are becoming available. We also plan to provide
the extracted data in the in the form of CSV-tables for common entity types
(e.g. product, organization, location, …) in order to make it easier to
mine the data.

Cheers,

Christian Bizer, Hannes Mühleisen, Andreas Harth and Steffen Stadtmüller