Announcing the Common Crawl Index!

This is a guest post by Ilya Kreymer.
Ilya is a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl. He previously worked at the Internet Archive and led the Wayback Machine development, which included building large indexes of WARC files. Ilya is currently developing a new set of archive replay and access tools and an impressive new on-demand archiving service, webrecorder.io, that allows anyone to create a high-fidelity web archive of their own. Check out his exciting projects, including our new index and query api in the post below.


We are pleased to announce a new index and query api system for Common Crawl.

The raw index data is available, per crawl, at:
s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

To make working with the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org. The index is fully functional, but we are looking for feedback to improve its usefulness and usability in future updates.

Index Format
The index format is relatively simple: it consists of a plaintext index (one line per entry) compressed into gzipped chunks, and a secondary index of those compressed chunks. This format is often called the ‘ZipNum’ CDX format, and it is the same format used by the Wayback Machine at the Internet Archive.
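As a rough illustration of why a secondary index helps, the sketch below uses binary search over the first key of each chunk to decide which compressed chunk could contain a given entry. The in-memory list of keys and the function name find_chunk are assumptions made for this example; the actual on-disk layout of the secondary index is not described here, and a real lookup may also need to check neighbouring chunks.

```python
import bisect

def find_chunk(first_keys, key):
    """Return the position of the chunk that may contain `key`.

    `first_keys` is a sorted list holding the first index key of each
    compressed chunk (a hypothetical in-memory form of the secondary index).
    """
    # bisect_right gives the insertion point after any equal keys, so the
    # chunk just before that point is the last one starting at or before key.
    position = bisect.bisect_right(first_keys, key) - 1
    # Keys sorting before the very first chunk can only be in chunk 0.
    return max(position, 0)

# Hypothetical secondary index with three chunks.
first_keys = ["com,example)/", "org,archive)/", "org,wikipedia)/"]
chunk = find_chunk(first_keys, "org,wikipedia)/ 20150227035757")
```

Only the selected chunk then needs to be fetched and decompressed, which is what makes queries against a very large index practical.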

Index Query API
To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the Wayback Machine.

The site is built using pywb (https://github.com/ikreymer/pywb), a new collection of web archive replay and query tools, including the index query component.

The index can be queried by making a request to a specific collection.

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

This can be done by using wildcard queries:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org/*
or
https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/
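The query urls above can also be assembled in code. The following is a minimal sketch using only the Python standard library; the helper name build_query_url is an assumption for this example, and only parameters shown in this post are used.

```python
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"

def build_query_url(collection, url_pattern, **params):
    """Build an index query url for one collection, e.g. CC-MAIN-2015-11."""
    query = {"url": url_pattern}
    query.update(params)
    # Keep '*', '/' and ':' unescaped so wildcard patterns stay readable.
    encoded = urlencode(query, safe="*/:")
    return "{}/{}-index?{}".format(INDEX_HOST, collection, encoded)

exact = build_query_url("CC-MAIN-2015-11", "wikipedia.org")
by_path = build_query_url("CC-MAIN-2015-11", "wikipedia.org/*")
by_domain = build_query_url("CC-MAIN-2015-11", "*.wikipedia.org/")
```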

Pagination
For most prefix or domain-prefix queries such as these, it is not feasible to retrieve all the results at once, so only the first page of results (by default, up to 15000) is returned.

The total number of pages can be retrieved with the showNumPages query:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true

This query returns:

{"blocks": 4942, "pages": 989, "pageSize": 5}

This indicates that there are 989 total pages, at 5 compressed index blocks per page!

Thus, to get all of *.wikipedia.org, one would need to perform the query for each page:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=0

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=988

This allows for the query process to be performed in parallel. For example, it should be possible to run a MapReduce job which computes the number of pages, creates a list of urls, and then runs the query in parallel.
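For a single machine, the same idea can be sketched with a thread pool instead of MapReduce. This is only an illustration: the function names are made up for the example, the number of workers is arbitrary, and error handling and retries are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def page_urls(base_query, num_pages):
    """Return one query url per page, numbered from 0 to num_pages - 1."""
    return ["{}&page={}".format(base_query, n) for n in range(num_pages)]

def fetch(url):
    """Download one page of index results as raw bytes."""
    with urlopen(url) as response:
        return response.read()

def fetch_all(urls, fetcher=fetch, workers=4):
    """Run `fetcher` over all urls in parallel, keeping the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetcher, urls))

base = "https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/"
urls = page_urls(base, 989)
```

Calling fetch_all(urls) would then download every page; passing a different fetcher makes it easy to try the logic without touching the network.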

Command-Line Client
For smaller use cases, a simple client-side library is available to simplify this process: https://github.com/ikreymer/cdx-index-client. This is a simple python script which uses the pagination api to perform a parallel query on a local machine.

First, a good idea is to verify the number of pages:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org --show-num-pages
989

To perform the query, simply run:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org -z -d ./wikipedia-index

This query will fetch all pages of the *.wikipedia.org index into a ./wikipedia-index directory and keep the data compressed (-z flag). For a full set of options, you may run
./cdx-index-client.py --help

The script will print out an update of the progress:

2015-04-07 08:35:18,686: [INFO]: Fetching 989 pages of *.wikipedia.org
2015-04-07 08:35:45,734: [INFO]: 1 page(s) of 989 finished
2015-04-07 08:35:46,577: [INFO]: 2 page(s) of 989 finished
2015-04-07 08:35:46,579: [INFO]: 3 page(s) of 989 finished

Adjusting Page Size:
It is also possible to adjust the page size to increase or decrease the number of “blocks” in the page. (Each block is a compressed chunk and cannot be split further.)
The pageSize query param can be used to set the page size in blocks (the default is 5 blocks per page). For example:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true
{"blocks": 4942, "pages": 989, "pageSize": 5}

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true&pageSize=1
{"blocks": 4942, "pages": 4942, "pageSize": 1}

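In general, pages = ceil(blocks / pageSize), i.e. the block count divided by the page size, rounded up (4942 / 5 rounds up to 989, and 4942 / 1 is exactly 4942). Adjusting the page size can help adjust the parallelization and load of the query as needed.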
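Both responses above are consistent with the page count being the block count divided by the page size, rounded up. A small sketch of that arithmetic (the function name is chosen for this example):

```python
def num_pages(blocks, page_size):
    """Number of pages needed to cover `blocks` at `page_size` blocks per page."""
    # Integer ceiling division: adds a page for any partial remainder.
    return (blocks + page_size - 1) // page_size

default_pages = num_pages(4942, 5)
single_block_pages = num_pages(4942, 1)
```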

Capture Index JSON (CDXJ) Line Format
The raw index format (stored and returned from the query api) is as follows:

org,wikipedia)/ 20150227035757 {"url": "http://www.wikipedia.org/", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "length": "11996", "offset": "853671193", "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz"}

This format consists of a ‘url<space>timestamp<space>’ header followed by a json dictionary. The header is used to ensure the lines are sorted by url key and timestamp.
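A line in this format can be split into its header and json parts with a few lines of Python. This is a minimal sketch; the function name parse_cdxj_line is an assumption, and the sample line is the one shown above with a shortened set of fields.

```python
import json

def parse_cdxj_line(line):
    """Split a 'urlkey timestamp {json}' line into a single dictionary."""
    # Split on the first two spaces only; the json part may contain spaces.
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record

sample = ('org,wikipedia)/ 20150227035757 {"url": "http://www.wikipedia.org/", '
          '"length": "11996", "offset": "853671193"}')
record = parse_cdxj_line(sample)
```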

Adding the output=json option to the query will ensure the full line is json. Example:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org&output=json&limit=1
{"urlkey": "org,wikipedia)/", "timestamp": "20150227035757", "url": "http://www.wikipedia.org/", "length": "11996", "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "offset": "853671193"}

Currently, the index contains the urlkey (a canonicalized, reverse-order form of the url), the timestamp, the url, and the WARC filename, offset and length, as well as a checksum (digest) of the content. The digest can be used to identify duplicate captures, but it also adds significantly to the index size and may be removed in future versions. Other fields may also be added to the json dictionary as needed.
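The filename, offset and length fields are what make it possible to read a single capture without downloading the whole WARC file. The sketch below turns an index record into a url and an HTTP Range header. It assumes that offset and length are byte positions within the compressed WARC file and that the files are reachable under the https://data.commoncrawl.org/ prefix; the function name is made up for this example.

```python
def warc_range_request(record, prefix="https://data.commoncrawl.org/"):
    """Return (url, headers) for fetching just one record from a WARC file."""
    offset = int(record["offset"])
    length = int(record["length"])
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
    return prefix + record["filename"], headers

record = {
    "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/"
                "CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz",
    "offset": "853671193",
    "length": "11996",
}
url, headers = warc_range_request(record)
```

The returned url and headers can be passed to any HTTP client; the response body would be one compressed record that can be decompressed on its own.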

It is possible to only select certain fields from the query with the fl field. For example, the following query will return only the url:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://wikipedia.org/&fl=url
http://wikipedia.org/

or via command-line tool:

./cdx-index-client.py -c CC-MAIN-2015-11 http://wikipedia.org --fl url

Multiple fields can also be specified, e.g. fl=url,length to return only the url and the WARC record length.

For a full reference of available query params, consult the latest CDX Server API reference.

Additional Java Tools
For Java users wishing to access the raw index, the IIPC webarchive-commons has support for reading the ZipNum format. Additionally, the openwayback-cdx-server provides the Java implementation of the original cdx server api. However, some modifications would be required to that codebase to support the cdx json format and it has not been tested with this index.

Building the Index / Running CDX Index Server
All the tools for building the index are also available at: https://github.com/ikreymer/webarchive-indexing

The index was built using EMR and the mrjob python library, and the indexing tools from the pywb project. This should enable others to build the index in the future, or create customized versions of the index as needed. Please refer to the project for additional reference and do not hesitate to contact us with any specific questions.

The code for the service running at http://index.commoncrawl.org is also available at:

https://github.com/ikreymer/cc-index-server

To run locally, the secondary index (for binary search) for each collection will need to be fetched locally, while most of the index will be read from S3.

Request for Feedback and Future Plans
Although the index format is pretty well-tested, there is lots of room for customization, especially of the index query api, as well as of which fields to include in the index. Feedback in the form of bug reports, feature requests, questions or suggestions about any aspect of the index is very welcome and will help make the index even easier to use. Please do not hesitate to get back to us with any feedback about the index.

After some additional testing of the newly released indexes, we plan to build an index for older crawls as well. A cumulative index of all data ever crawled by CommonCrawl is also under consideration if there is enough interest. We look forward to hearing about any use cases or other feedback that you may have about the indexing project.

Please help us continue our support of great efforts like this by making a donation to the Common Crawl Foundation and follow us @CommonCrawl on Twitter for the latest in Big Open Data.

February 2015 Crawl Archive Available

The crawl archive for February 2015 is now available! This crawl archive is over 145TB in size and contains over 1.9 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-11/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.
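Turning the listed relative paths into full paths is a one-line transformation. A minimal sketch, assuming each non-empty line of a listing is a path relative to the bucket root (the function name and the sample line are only illustrative):

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"

def to_full_paths(lines, prefix=HTTP_PREFIX):
    """Prepend `prefix` to every non-empty line of a path listing."""
    return [prefix + line.strip() for line in lines if line.strip()]

# Sample listing with one path and one blank line.
listing = ["crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/example.warc.gz\n", "\n"]
http_paths = to_full_paths(listing)
s3_paths = to_full_paths(listing, prefix=S3_PREFIX)
```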

We’re also happy to introduce the new Common Crawl Index by Ilya Kreymer, creator of https://webrecorder.io/. The February 2015 and January 2015 indexes are already featured and the aim will be for indexes to be released alongside crawl archives, offering a new way to explore the dataset. Whilst full details will be released in an upcoming blog post, we’re telling you about it now as we’re interested in hearing feedback from the community!

Please donate to Common Crawl if you appreciate our free datasets! We’re seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

January 2015 Crawl Archive Available

The crawl archive for January 2015 is now available! This crawl archive is over 139TB in size and contains 1.82 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-06/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Please donate to Common Crawl if you appreciate our free datasets! We’re seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Please contact [email protected] for sponsorship information and packages.

December 2014 Crawl Archive Available

The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-52/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

November 2014 Crawl Archive Available

The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-49/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

October 2014 Crawl Archive Available

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

September 2014 Crawl Archive Available

The crawl archive for September 2014 is now available! This crawl archive is over 220TB in size and contains 2.98 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-41/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

August 2014 Crawl Data Available

The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. The new data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-35/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Web Data Commons Extraction Framework for the Distributed Processing of CC Data

This is a guest blog post by Robert Meusel.
Robert Meusel is a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project. The post below describes a new tool produced by Web Data Commons for extracting data from the Common Crawl data.


The Web Data Commons project extracts structured data from the Common Crawl corpora and offers the extracted data for public download. We have extracted one of the largest hyperlink graphs that is currently available to the public. We also extract and offer large corpora of Microdata, Microformats and RDFa annotations as well as relational HTML tables. Why do we do this? Because we share the opinion that data should be available to everybody and because we want to make it easier to exploit the wealth of information that is available on the Web.

To perform the extractions, we need to go through all the hundreds of terabytes of crawl data offered by the Common Crawl Foundation. As a project without any direct funding or salaried staff, we needed a time-, resource- and cost-efficient way to process the Common Crawl corpora. We thus developed a data extraction tool which allows us to process the Common Crawl corpora in a distributed fashion using Amazon cloud services (AWS).

The basic architectural idea of the extraction tool is to have a queue that takes care of the proper handling of all files which should be processed. Each worker receives a new file from the queue whenever it is ready and informs the queue about the status (success or failure) of the processing. Successfully processed files are removed from the queue; failed files are assigned to another worker, or eliminated once a fixed number of workers have failed to process them.
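The queue behaviour described above can be illustrated with a small single-process simulation. This is not the framework's code: the real tool runs on AWS with many workers, while the sketch below uses an in-memory queue, and the names run_queue and max_attempts are assumptions made for the example.

```python
from collections import deque

def run_queue(files, process, max_attempts=3):
    """Process every file, retrying failures up to `max_attempts` times."""
    queue = deque((name, 0) for name in files)
    done, dropped = [], []
    while queue:
        name, attempts = queue.popleft()
        try:
            process(name)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                # Too many failed attempts: eliminate the file from the queue.
                dropped.append(name)
            else:
                # Put the file back so it can be handed out again.
                queue.append((name, attempts))
        else:
            # Success: the file is not re-queued.
            done.append(name)
    return done, dropped

# Toy processor: 'bad' always fails, 'flaky' fails only on its first attempt.
seen = set()
def process(name):
    if name == "bad" or (name == "flaky" and name not in seen):
        seen.add(name)
        raise RuntimeError("could not process " + name)

done, dropped = run_queue(["a", "flaky", "bad"], process)
```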

We used the extraction tool, for example, to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$ 500. The extracted graph had a size of less than 100GB zipped.

With each new extraction, we improved the extraction tool and turned it more and more into a flexible framework into which we now simply plug the needed file processors (for one single file) and which takes care of everything else.

This framework has now been officially released under the terms of the Apache license. The framework takes care of everything that is related to file handling, distribution, and scalability, and leaves to the user only the task of writing the code needed to extract the desired information from a single CC file.

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize the framework for your extraction tasks can be found at:

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework. We will continuously improve the framework and are happy to hear from anybody who gives us feedback about their experiences with it.

July 2014 Crawl Data Available

The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. The new data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-23/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.

We’ve also released a Python library, gzipstream, that should enable easier access and processing of the Common Crawl dataset. We’d love for you to try it out!

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Note: the original estimate for this crawl was 4 billion, but after full analytics were run, this estimate was revised.