ilyaThis is a guest post by Ilya Kreymer
Ilya is a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl. He previously worked at the Internet Archive and led the Wayback Machine development, which included building large indexes of WARC files. Ilya is currently developing a new set of archive replay and access tools and an impressive new on-demand archiving service, webrecorder.io, that allows anyone to create a high-fidelity web archive of their own. Check out his exciting projects, including our new index and query api in the post below.


We are pleased to announce a new index and query api system for Common Crawl.

The raw index data is available, per crawl, at:
s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

To make working the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org. The index is fully functional but we are looking for feedback to improve the usefulness and usability of the index for future updates.

Index Format
The index format is relatively simple: It consists of a compressed plaintext index (with one line for each entry) compressed into gzipped chunks, and a secondary index of the compressed chunks. This index is often called the ‘ZipNum’ CDX format and it is the same format that is used by the Wayback Machine at the Internet Archive.

Index Query API
To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the wayback machine.

The site is built using pywb (https://github.com/ikreymer/pywb), a new collection of web archive replay and query tools, including the index query component.

The index can be queried by making a request to a specific collection.

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

This can be done by using a wildcard queries:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org/*
or
http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/

Pagination
For most prefix or domain prefix queries such as these, it is not feasible to retrieve all the results at once, and only the first page of results (by default, up to 15000) are returned.

The total number of pages can be retrieved with the showNumPages query:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true

This query returns:

{“blocks”: 4942, “pages”: 989, “pageSize”: 5}

This indicates that there are 989 total pages, at 5 compressed index blocks per page!

Thus, to get all of *.wikipedia.org, one would need to perform the query for each page:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=0

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=988

This allows for the query process to be performed in parallel. For example, it should be possible to run a MapReduce job which computes the number of pages, creates a list of urls, and then runs the query in parallel.

Command-Line Client
For smaller use cases, a simple client side library is available to simplify this process, https://github.com/ikreymer/cdx-index-client This is a simple python script which uses the pagination api to perform a parallel query on a local machine.

First, a good idea is to verify the number of pages:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org –show-num-pages
809

To perform the query, simply run and
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org -z -d ./wikipedia-index

This query will fetch all pages of the *.wikipedia.org index into a ./wikipedia-index directory and keep the data compressed (-z flag). For a full set of options, you may run
./cdx-index-client.py –help

The script will print out an update of the progress:

2015-04-07 08:35:18,686: [INFO]: Fetching 989 pages of *.wikipedia.org
2015-04-07 08:35:45,734: [INFO]: 1 page(s) of 989 finished
2015-04-07 08:35:46,577: [INFO]: 2 page(s) of 989 finished
2015-04-07 08:35:46,579: [INFO]: 3 page(s) of 989 finished

Adjusting Page Size:
It is also possible to adjust the page size to increase or decrease the number of “blocks” in the page. (Each block is a compressed chunk and can not be split further)
The pageSize query param can be used to set the page size in blocks (the default is 5 blocks per page). For example:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true
{“blocks”: 4942, “pages”: 989, “pageSize”: 5}

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true&pageSize=1
{“blocks”: 4942, “pages”: 4942, “pageSize”: 1}

In general, blocks / pageSize + 1 = pages. Adjusting the page size can help adjust the parallelization and load of the query as needed.

Capture Index JSON (CDXJ) Line Format
The raw index format (stored and returned from the query api) is as follows:

org,wikipedia)/ 20150227035757 {“url”: “http://www.wikipedia.org/”, “digest”: “PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ”, “length”: “11996”, “offset”: “853671193”, “filename”: “crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz”}

This format consists of a ‘url<space>timestamp<space>’ header followed by a json dictionary. The header is used to ensure the lines are sorted by url key and timestamp.

Adding the output=json option to the query will ensure the full line is json. Example:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org&output=json&limit=1
{“urlkey”: “org,wikipedia)/”, “timestamp”: “20150227035757”, “url”: “http://www.wikipedia.org/”, “length”: “11996”, “filename”: “crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz”, “digest”: “PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ”, “offset”: “853671193”}

Currently, the index contains the urlkey (a canonicalized, reverse-order form of the url), the timestamp, the url, and the WARC filename, offset and length, as well as a checksum (digest) of the content. The digest can be used to identify duplicate captures, but also adds significantly to the index size and may be removed in future versions. Other fields may be added to the json dictionary as needed also.

It is possible to only select certain fields from the query with the fl field. For example, the following query will return only the url:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://wikipedia.org/&fl=url
http://wikipedia.org/

or via command-line tool:

./cdx-index-client -c CC-MAIN-2015-11 http://wikipedia.org –fl url

Multiple fields can be also specified, eg. fl=url,length to return only url and warc record length.

For a full reference of available query params, consult the latest CDX Server API reference

Additional Java Tools
For Java users wishing to access the raw index, the IIPC webarchive-commons has support for reading the ZipNum format. Additionally, the openwayback-cdx-server provides the Java implementation of the original cdx server api. However, some modifications would be required to that codebase to support the cdx json format and it has not been tested with this index.

Building the Index / Running CDX Index Server
All the tools for building the index are also available at: https://github.com/ikreymer/webarchive-indexing

The index was built using EMR and the mrjob python library, and the indexing tools from pywb project. This should enable others to build the index in the future, or create customized versions of the index as needed. Please refer to the project for additional reference and do not hesitate to contact with any specific questions.

The service running at http://index.commoncrawl.org is also available at:

https://github.com/ikreymer/cc-index-server

To run locally, the secondary index (for binary search) for each collection will need to be fetched locally, while most of the index will be read from S3.

Request for Feedback and Future Plans
Although the index format is pretty well-tested, there is lots of room for customization, especially of the index query api, as well as what fields to include in the index. Feedback in the form of bug reports/feature requests/questions/suggestions about any aspect of the index is definitely welcome to make the index even more easy to use. Please do not hesitate to get back with any feedback about the index.

After some additional testing of the newly released indexes, we plan to build an index for older crawls as well. A cumulative index of all data ever crawled by CommonCrawl is also under consideration if there is enough interest. We look forward to hearing about any use cases or other feedback that you may have about the indexing project.

Please help us continue our support of great efforts like this by making a donation to the Common Crawl Foundation and follow us @CommonCrawl on Twitter for the latest in Big Open Data.