Web Data Commons - RDFa, Microdata, and Microformat Data Sets

More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from several billion web pages. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.

2014-08-27: We have released an easy-to-customize version of the WDC Extraction Framework, including a tutorial which explains its usage and customization in detail. See also our guest post on the Common Crawl Blog.

Contents

1. About Web Data Commons

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF quads. In addition, we calculate and publish statistics about the deployment of the different formats as well as the vocabularies that are used together with each format.

Up to now, we have extracted all RDFa, Microdata and Microformats data from the following releases of the Common Crawl web corpora:

In the future, we plan to rerun our extraction on a regular basis as new Common Crawl corpora become available.

Below, you will find information about the extracted data formats and detailed statistics about the extraction results. In addition, we have analyzed trends in the deployment of the most widely spread formats as well as in the deployment of selected RDFa and Microdata classes. This analysis can be found here.

2. Extracted Data Formats

The table below provides an overview of the different structured data formats that we extract from the Common Crawl. The table contains references to the specifications of the formats as well as short descriptions of the formats. Web Data Commons packages the extracted data for each format separately for download. The table also defines the format identifiers that are used in the following.

RDFa is a specification for attributes to express structured data in any markup language, e.g. HTML. The underlying abstract representation is RDF, which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time.
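For illustration, a minimal RDFa Lite snippet (not taken from the data sets; all names and values are made up) could look as follows, with the vocab, typeof and property attributes carrying the structured data:

    <!-- Illustrative RDFa Lite markup with made-up values -->
    <div vocab="http://schema.org/" typeof="Person">
      <span property="name">Jane Doe</span> works as a
      <span property="jobTitle">data engineer</span>.
    </div>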

Geo is a 1:1 representation of the "geo" property from the vCard standard, reusing the geo property and its sub-properties as-is from the hCard microformat. It can be used to mark up latitude/longitude coordinates in HTML.
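A minimal example of such geo markup in HTML (with made-up coordinates) might look like this:

    <!-- Illustrative geo microformat markup; coordinates are hypothetical -->
    <span class="geo">
      <span class="latitude">49.4875</span>,
      <span class="longitude">8.4660</span>
    </span>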

3.2. Extraction Results from the November 2015 Common Crawl Corpus

The November 2015 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/crawl-data/CC-MAIN-2015-48/ .
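Assuming the AWS command line interface is installed, the corpus files could, for example, be listed with a command along the following lines (the exact access options depend on your AWS setup):

    aws s3 ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-48/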

Extraction Statistics

Crawl Date              November 2015
Total Data              151 Terabytes (compressed)
Parsed HTML URLs        1,770,525,212
URLs with Triples       541,514,775
Domains in Crawl        14,409,425
Domains with Triples    2,724,591
Typed Entities          6,107,584,968
Triples                 24,377,132,352

Format Breakdown

As the charts show, a large fraction of websites already makes use of embedded JSON-LD. In most cases (>90%), the websites use the syntax to enable Google to create a search box within the search results, as announced by Google in September 2014. An interesting discussion of the topic can be found in the Google+ posting by Aaron Bradley.


Extraction Costs

The costs for parsing the 40.1 Terabytes of compressed input data of the August 2012 Common Crawl corpus, extracting the RDF data and storing the extracted data on S3 totaled 398 USD in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction which altogether required 5,636 machine hours.

3.3b. Extraction Results from the February 2012 Common Crawl Corpus

Common Crawl published a pre-release version of its 2012 corpus in February. The pages contained in the pre-release are a subset of the pages contained in the August 2012 Common Crawl corpus. We also extracted the structured data from this pre-release. The resulting statistics can be found here, but they are superseded by the August 2012 statistics.

3.4. Extraction Results from the 2009/2010 Common Crawl Corpus

The 2009/2010 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/crawl-002/ .

Format Breakdown

Extraction Costs

The costs for parsing the 28.9 Terabytes of compressed input data of the 2009/2010 Common Crawl corpus, extracting the RDF data and storing the extracted data on S3 totaled 576 EUR (excluding VAT) in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction which altogether required 3,537 machine hours.

3.5. Trends

In the following, we analyze trends in the deployment of the most widely spread formats as well as in the deployment of selected RDFa and Microdata classes, based on the 2012, 2013, 2014, 2015, 2016 and 2017 data sets.
It is important to mention that the corresponding Common Crawl web corpora have different sizes (2 billion to 3 billion HTML pages), cover different numbers of websites (12 million to 40 million PLDs, selected by importance of the PLD) and also only partly overlap in the covered HTML pages. Thus, the following trends must be interpreted with caution.

Adoption by Format

The diagram below shows the total number of pay-level domains (PLDs) making use of one of the four most widely spread markup formats (RDFa, Microdata, Microformat hCard and Embedded JSON-LD) within the crawls. Although the total number of domains using Microformats hCard appears to have decreased until 2015, one has to keep in mind that the first crawl contains 50% more HTML pages than the two following ones, while the last crawl is almost double their size. For Microdata and especially schema.org, we find an increase in deployment since 2012. The second diagram shows how the number of triples that we extracted from the crawls has developed between 2012 and 2017.

Adoption of Selected Schema.org Classes

Below, we analyze how the adoption of schema.org classes embedded using the Microdata syntax has developed. The two diagrams below again show the deployment of those classes by the number of deploying PLDs and by the number of entities extracted from the crawls. We can see a continuous increase in the number of PLDs adopting the schema.org classes. This is also reflected in the number of entities within the datasets, where the slight decrease in 2015 for the two classes PostalAddress and LocalBusiness might originate from the characteristics of the third crawl, which contains a similar number of pages as the second crawl but covers a larger number of PLDs. This has likely resulted in a shallower coverage of websites that contain schema.org data and thus in a smaller number of extracted entities.

Adoption of Selected RDFa Classes

In the following, we report trends in the adoption of selected RDFa classes. The first diagram shows the number of PLDs using each class. The second diagram shows the total number of entities of each class contained in the WDC RDFa data sets. We see that the number of websites deploying the Facebook Open Graph Protocol classes og:article and og:product as well as foaf:Document stays approximately constant. The deployment of og:website and gd:breadcrumb is increasing.

4. Example Data

For each data format, we provide a small subset of the extracted data below for testing purposes. The data is encoded as N-Quads, with the fourth element used to represent the provenance of each triple (the URL of the page the triple was extracted from). Be advised to use a parser that is able to skip invalid lines, since such lines may be present in the data files.
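As a rough sketch of such defensive reading (plain Java, with a deliberately coarse validity check and a hypothetical file name; a real N-Quads parser should be used for actual processing):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Counts quads in a WDC sample file while skipping lines that would
    // make a strict parser fail. The check below is only a rough heuristic.
    public class QuadScan {
        public static void main(String[] args) throws IOException {
            int valid = 0, skipped = 0;
            try (BufferedReader in = new BufferedReader(new FileReader("sample.nq"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("#")) continue;
                    // A quad should end with '.' and contain at least four terms
                    // (subject, predicate, object, page URL).
                    if (!line.endsWith(".") || line.split("\\s+").length < 4) {
                        skipped++;   // skip lines a strict parser would reject
                        continue;
                    }
                    valid++;
                }
            }
            System.out.println("valid-looking quads: " + valid + ", skipped: " + skipped);
        }
    }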

5. Note about the N-Quads Download Files

It is important to note that the N-Quads download files do not conform completely with the N-Quads specification concerning blank node identifiers. The specification
requires labels of distinct blank nodes to be unique with respect to the complete N-Quads document. In our N-Quads files, the blank node labels are unique only with
respect to the HTML page from which the data was extracted. This means that different blank nodes in a download file may have the same label. For distinguishing
between these nodes, the blank node label needs to be considered together with the URL of the page from which the data was extracted (the fourth element of the quad).
This issue is due to 100 machines working in parallel on the extraction of the data from the web corpus without communicating with each other. We may fix this issue
in upcoming WDC releases by renaming the blank nodes.
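A simple way to work around this is sketched below: derive a globally unique identifier by combining the per-page blank node label with the page URL from the fourth quad element (class and method names are illustrative):

    // Minimal sketch: scope blank node labels by the page they were extracted from.
    public class BlankNodeScope {
        static String globalId(String label, String pageUrl) {
            // Hashing keeps the identifier short; using the full URL would also work.
            return label + "_" + Integer.toHexString(pageUrl.hashCode());
        }

        public static void main(String[] args) {
            // The same label from two different pages yields two distinct identifiers.
            System.out.println(globalId("_:node1", "http://example.com/a.html"));
            System.out.println(globalId("_:node1", "http://example.com/b.html"));
        }
    }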

6. Conversion to Other Formats

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into the CSV and JSON formats, which are supported by a wider range of spreadsheet applications, relational databases, and data mining tools.
The conversion tool takes the following parameters:
The conversion tool takes the following parameters:

Parameter Name       Description
out                  Folder to which the output file(s) are written.
in                   Folder containing WDC download files in N-Quads format.
threads              Number of threads used for the conversion.
convert              Output format. Supported formats: JSON, CSV.
density              Minimum density a property must have in order to be included in the output file. Range: 0 - 1. A density of 0.2 means that only properties with more than 20% non-null values are included in the output.
multiplePropValues   Indicates whether the converted result should contain all values of a property for a given subject or whether one value per property is enough. Range: [true/false]

Below you can find an example command which transforms the files found in the input directory to JSON files using 5 threads and density as well as property value filtering.
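The exact invocation depends on how the tool is packaged; assuming a runnable JAR with the hypothetical name wdc-conversion.jar and single-dash flags matching the parameter names above, it could look roughly like this:

    java -jar wdc-conversion.jar -in ./input -out ./output -threads 5 -convert JSON -density 0.2 -multiplePropValues false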

File Structure

CSV file format

Each file starts with three fixed headers [graph, subject, type], followed by the set of property headers. Every line after the header represents one entity.
You can find a sample CSV file with the structure of the conversion output here.
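Purely for illustration (all values are made up), the first lines of such a CSV file could look roughly like this:

    graph,subject,type,http://schema.org/name,http://schema.org/price
    http://example.com/page.html,_:node0,http://schema.org/Product,Example Product,9.99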

JSON file format

Each file contains a list of JSON objects with three fixed properties [graph, subject, type], followed by the set of properties describing the concrete entity. Every JSON object in the file represents one entity.
You can find a sample JSON file with the structure of the conversion output here.
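Again purely for illustration (all values are made up), a single converted entity could look roughly like this:

    {
      "graph": "http://example.com/page.html",
      "subject": "_:node0",
      "type": "http://schema.org/Product",
      "http://schema.org/name": "Example Product",
      "http://schema.org/price": "9.99"
    }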

Conversion Process

In the following, we document the conversion process performed by the tool. The first step is to sort the input .nq file by subject and URL. For this purpose, a temporary file containing the sorted entities is created and deleted again at the end of the conversion. The size of the temporary file is equal to the size of the input .nq file.
Next, the retrieved entities are written to the output file. In the case of the CSV format, all distinct predicates are collected during parsing in order to fill the header row. In the case of the JSON format, the entities are transformed from Java objects to JSON objects with the help of the Gson library. The provided tool supports parallel execution on the directory level, meaning that multiple files can be converted simultaneously. In addition, the conversion tool provides density and property value filtering. With density filtering, the user can set a density threshold in order to filter out uncommon properties. Please note that in the average case the maximum property density is calculated to be 35%, so a relatively high threshold could lead to empty results. With property value filtering, the user can choose whether the converted file should keep all values of a certain property belonging to a certain subject or whether one value per property is enough.
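The density filtering step can be pictured as follows (a minimal sketch with illustrative names, not the tool's actual code):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Keep only properties whose share of non-null values across all entities
    // reaches the given threshold; only those become columns/keys in the output.
    public class DensityFilter {
        static Set<String> denseProperties(Map<String, Integer> nonNullCounts,
                                           int totalEntities, double threshold) {
            Set<String> keep = new HashSet<>();
            for (Map.Entry<String, Integer> e : nonNullCounts.entrySet()) {
                double density = (double) e.getValue() / totalEntities;
                if (density >= threshold) {
                    keep.add(e.getKey());
                }
            }
            return keep;
        }

        public static void main(String[] args) {
            Map<String, Integer> counts = new HashMap<>();
            counts.put("http://schema.org/name", 90);   // 90 of 100 entities have a name
            counts.put("http://schema.org/price", 15);  // only 15 of 100 have a price
            System.out.println(denseProperties(counts, 100, 0.2)); // keeps only the name property
        }
    }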

7. Extraction Process

Since the Common Crawl data sets are stored in the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criterion here is the cost of achieving a certain task. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) for our extraction process was more efficient. SQS provides a message queue implementation, which we use to coordinate the extraction nodes. The Common Crawl dataset is readily partitioned into compressed files of around 100MB each. We add the identifiers of these files as messages to the queue. A number of EC2 nodes monitor this queue and take file identifiers from it. The corresponding file is then downloaded from S3. Using the ARC file parser from the Common Crawl codebase, the file is split into individual web pages. On each page, we run our RDF extractor based on the Anything To Triples (Any23) library. The resulting RDF triples are then written back to S3 together with the extraction statistics, which are later collected. The advantage of this queue is that messages have to be explicitly marked as processed, which is done only after the entire file has been extracted. Should any error occur, the message is re-queued after some time and processed again.
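A schematic worker loop (not the project's actual code; it assumes the AWS SDK for Java v1 and a queue whose messages carry the keys of individual crawl files) could look like this:

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

    // Each message holds the key of one ~100MB crawl file. The message is deleted
    // only after the file has been fully processed, so a failed file becomes
    // visible again after the visibility timeout and is processed by another node.
    public class ExtractionWorker {
        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            String queueUrl = args[0];   // queue URL passed on the command line
            while (true) {
                for (Message msg : sqs.receiveMessage(
                        new ReceiveMessageRequest(queueUrl)
                                .withMaxNumberOfMessages(1)
                                .withWaitTimeSeconds(20)).getMessages()) {
                    String fileKey = msg.getBody();
                    processFile(fileKey);   // download from S3, split, run extractor (omitted)
                    sqs.deleteMessage(queueUrl, msg.getReceiptHandle()); // mark as processed
                }
            }
        }

        static void processFile(String fileKey) {
            // download and extraction logic omitted in this sketch
        }
    }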

Any23 parses web pages for structured data by building a DOM tree and then evaluating XPath expressions to find structured data. While profiling, we found this tree generation to account for much of the parsing cost, and we have thus looked for a way to reduce the number of times this tree is built. Our solution is to run (Java) regular expressions against each web page prior to extraction, which detect the presence of a microformat in an HTML page, and to run the Any23 extractor only when the regular expressions find potential matches. The formats html-mf-hcard, html-mf-hcalendar, html-mf-hlisting, html-mf-hresume, html-mf-hreview and html-mf-recipe define sufficiently unique class names, so that the presence of such a class name in the HTML document is ample indication of the Microformat being present. For the remaining formats, the following table shows the regular expressions used.
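The idea of the pre-filter can be illustrated with a small sketch (the regular expression below is an illustrative example for hCard, whose markup always uses the class name "vcard"; it is not the project's exact expression):

    import java.util.regex.Pattern;

    // Cheap textual check that decides whether the expensive DOM-based
    // Any23 extraction should be run on a page at all.
    public class FormatPreFilter {
        private static final Pattern HCARD =
                Pattern.compile("class\\s*=\\s*[\"'][^\"']*vcard");

        static boolean mightContainHCard(String html) {
            return HCARD.matcher(html).find();
        }

        public static void main(String[] args) {
            String page = "<div class=\"vcard\"><span class=\"fn\">Example Name</span></div>";
            if (mightContainHCard(page)) {
                System.out.println("candidate page - run the full extractor");
            }
        }
    }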

8. Source Code

The source code can be checked out from our Subversion repository. Afterwards, create your own configuration by copying src/main/resources/ccrdf.properties.dist to src/main/resources/ccrdf.properties, then fill in your AWS authentication information and bucket names. Compilation is performed using Maven, thus changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. More information about running the extractor is provided in the file readme.txt .