Index to WARC Files and URLs in Columnar Format

We’re happy to announce the release of an index to WARC files and URLs in a columnar format. The columnar format (we use Apache Parquet) allows to efficiently query or process the index and saves time and computing resources. Especially, if only few columns are accessed, recent big data tools will run impressively fast. So far, we’ve tested two of them: Apache Spark and AWS Athena. The latter makes it possible to run SQL queries on the columnar data even without launching a server. Below you’ll find examples how to query the data with Athena. Examples and instructions for SparkSQL are in preparation. But you are free to use any other tool: the columnar index is free to access or download for anybody. You’ll find all files on:
s3://commoncrawl/cc-index/table/cc-main/warc/

Running SQL Queries with Athena

AWS Athena is a serverless service to analyze data on S3 using SQL. With Presto under the hood you even get a long list of extra functions including lambda expressions. Usage of Athena is not free but it has an attractive price model, you pay only for the scanned data (currently $5.0 per TiB). The index table of a single monthly crawl has about 300 GB. That defines the upper bound, but most queries require only part of the data to be scanned.

Let’s start and register the Common Crawl index as database table in Athena:

1. open the Athena query editor. Make sure you’re in the us-east-1 region where all the Common Crawl data is located. You need an AWS account to access Athena, please follow the AWS Athena user guide how to register and set up Athena.

2. to create a database (here called “ccindex”) enter the command

CREATE DATABASE ccindex


and press “Run query”

3. make sure that the database “ccindex” is selected and proceed with “New Query”

4. create the table by executing the following SQL statement:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  content_charset               STRING,
  content_languages             STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';


It will create a table “ccindex” with a schema that fits the data on S3. The two “PARTITIONED BY” columns are actually subdirectories, one for every monthly crawl and the WARC subset. Partitions allow us to update the table every month and also help to limit the costs to query the data. Please note that the table schema may evolve over time, the most recent schema version is available on github.

5. to make Athena recognize the data partitions on S3, you have to execute the SQL statement:

MSCK REPAIR TABLE ccindex


Note that this command is also necessary to make newer crawls appear in the table. Every month we’ll add a new partition (a “directory”, e.g., crawl=CC-MAIN-2018-09/). The new partition is not visible and searchable unless it has been discovered by the repair table command. If you run the command you’ll see which partitions have been newly discovered, e.g.:

Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=crawldiagnostics
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=robotstxt
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=warc

Now you’re ready to run the first query. We’ll count the number of pages per domain within a single top-level domain. As before press “Run query” after you’ve entered the query into the query editor frame:

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC


The result appears seconds later and only 2.12 MB of data have been scanned! Pretty fine, the query has cost less than one cent. We’ve filtered the data by a partition (a monthly crawl) and selected a small (.no) top-level domain. It’s a good practice to start developing more complex queries with such filters applied to keep the costs for trials low.

But let’s continue with a second example which demonstrates the power of Presto functions – we try to find domains which provide multi-lingual content. On possible way is to look for ISO-639-1 language codes in the URL, e.g., en in https://example.com/about/en/page.html. You can find the full SQL expression on github. For demonstration purposes we restrict the search to a single and small TLD (.va for Vatican State). The magic is done by

UNNEST(regexp_extract_all(url_path, '(?<=/)(?:[a-z][a-z])(?=/)')) AS t (url_path_lang)


which first extracts all two-letter path elements (e.g., /en/) and unrolls the elements into a new column “url_path_lang” (if two or more path elements are found, you get multiple rows). Now we count pages and unique languages and let Presto/Athena also create a histogram of language codes:

You can find more SQL examples and resources on the cc-index-table project page on github. We’ll also working to provide examples to process the table using SparkSQL. First experiments are also promising: you get results within minutes even on a small Spark cluster. That’s not seconds as for Athena but you’re more flexible, esp. regarding the output format – Athena supports only CSV. Please also check the Athena release notes and the current list of limitations to find out which Presto version is used and which functions are supported.

We hope the new data format will help you to get value from the Common Crawl archives, in addition to the existing services.

Announcing the Common Crawl Index!

ilyaThis is a guest post by Ilya Kreymer
Ilya is a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl. He previously worked at the Internet Archive and led the Wayback Machine development, which included building large indexes of WARC files. Ilya is currently developing a new set of archive replay and access tools and an impressive new on-demand archiving service, webrecorder.io, that allows anyone to create a high-fidelity web archive of their own. Check out his exciting projects, including our new index and query api in the post below.


We are pleased to announce a new index and query api system for Common Crawl.

The raw index data is available, per crawl, at:
s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

To make working the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org. The index is fully functional but we are looking for feedback to improve the usefulness and usability of the index for future updates.

Index Format
The index format is relatively simple: It consists of a compressed plaintext index (with one line for each entry) compressed into gzipped chunks, and a secondary index of the compressed chunks. This index is often called the ‘ZipNum’ CDX format and it is the same format that is used by the Wayback Machine at the Internet Archive.

Index Query API
To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the wayback machine.

The site is built using pywb (https://github.com/ikreymer/pywb), a new collection of web archive replay and query tools, including the index query component.

The index can be queried by making a request to a specific collection.

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

This can be done by using a wildcard queries:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org/*
or
https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/

Pagination
For most prefix or domain prefix queries such as these, it is not feasible to retrieve all the results at once, and only the first page of results (by default, up to 15000) are returned.

The total number of pages can be retrieved with the showNumPages query:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true

This query returns:

{“blocks”: 4942, “pages”: 989, “pageSize”: 5}

This indicates that there are 989 total pages, at 5 compressed index blocks per page!

Thus, to get all of *.wikipedia.org, one would need to perform the query for each page:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=0

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=988

This allows for the query process to be performed in parallel. For example, it should be possible to run a MapReduce job which computes the number of pages, creates a list of urls, and then runs the query in parallel.

Command-Line Client
For smaller use cases, a simple client side library is available to simplify this process, https://github.com/ikreymer/cdx-index-client This is a simple python script which uses the pagination api to perform a parallel query on a local machine.

First, a good idea is to verify the number of pages:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org –show-num-pages
809

To perform the query, simply run and
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org -z -d ./wikipedia-index

This query will fetch all pages of the *.wikipedia.org index into a ./wikipedia-index directory and keep the data compressed (-z flag). For a full set of options, you may run
./cdx-index-client.py –help

The script will print out an update of the progress:

2015-04-07 08:35:18,686: [INFO]: Fetching 989 pages of *.wikipedia.org
2015-04-07 08:35:45,734: [INFO]: 1 page(s) of 989 finished
2015-04-07 08:35:46,577: [INFO]: 2 page(s) of 989 finished
2015-04-07 08:35:46,579: [INFO]: 3 page(s) of 989 finished

Adjusting Page Size:
It is also possible to adjust the page size to increase or decrease the number of “blocks” in the page. (Each block is a compressed chunk and can not be split further)
The pageSize query param can be used to set the page size in blocks (the default is 5 blocks per page). For example:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true
{“blocks”: 4942, “pages”: 989, “pageSize”: 5}

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true&pageSize=1
{“blocks”: 4942, “pages”: 4942, “pageSize”: 1}

In general, blocks / pageSize + 1 = pages. Adjusting the page size can help adjust the parallelization and load of the query as needed.

Capture Index JSON (CDXJ) Line Format
The raw index format (stored and returned from the query api) is as follows:

org,wikipedia)/ 20150227035757 {“url”: “http://www.wikipedia.org/”, “digest”: “PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ”, “length”: “11996”, “offset”: “853671193”, “filename”: “crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz”}

This format consists of a ‘url<space>timestamp<space>’ header followed by a json dictionary. The header is used to ensure the lines are sorted by url key and timestamp.

Adding the output=json option to the query will ensure the full line is json. Example:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org&output=json&limit=1
{“urlkey”: “org,wikipedia)/”, “timestamp”: “20150227035757”, “url”: “http://www.wikipedia.org/”, “length”: “11996”, “filename”: “crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz”, “digest”: “PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ”, “offset”: “853671193”}

Currently, the index contains the urlkey (a canonicalized, reverse-order form of the url), the timestamp, the url, and the WARC filename, offset and length, as well as a checksum (digest) of the content. The digest can be used to identify duplicate captures, but also adds significantly to the index size and may be removed in future versions. Other fields may be added to the json dictionary as needed also.

It is possible to only select certain fields from the query with the fl field. For example, the following query will return only the url:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://wikipedia.org/&fl=url
http://wikipedia.org/

or via command-line tool:

./cdx-index-client -c CC-MAIN-2015-11 http://wikipedia.org –fl url

Multiple fields can be also specified, eg. fl=url,length to return only url and warc record length.

For a full reference of available query params, consult the latest CDX Server API reference

Additional Java Tools
For Java users wishing to access the raw index, the IIPC webarchive-commons has support for reading the ZipNum format. Additionally, the openwayback-cdx-server provides the Java implementation of the original cdx server api. However, some modifications would be required to that codebase to support the cdx json format and it has not been tested with this index.

Building the Index / Running CDX Index Server
All the tools for building the index are also available at: https://github.com/ikreymer/webarchive-indexing

The index was built using EMR and the mrjob python library, and the indexing tools from pywb project. This should enable others to build the index in the future, or create customized versions of the index as needed. Please refer to the project for additional reference and do not hesitate to contact with any specific questions.

The service running at http://index.commoncrawl.org is also available at:

https://github.com/ikreymer/cc-index-server

To run locally, the secondary index (for binary search) for each collection will need to be fetched locally, while most of the index will be read from S3.

Request for Feedback and Future Plans
Although the index format is pretty well-tested, there is lots of room for customization, especially of the index query api, as well as what fields to include in the index. Feedback in the form of bug reports/feature requests/questions/suggestions about any aspect of the index is definitely welcome to make the index even more easy to use. Please do not hesitate to get back with any feedback about the index.

After some additional testing of the newly released indexes, we plan to build an index for older crawls as well. A cumulative index of all data ever crawled by CommonCrawl is also under consideration if there is enough interest. We look forward to hearing about any use cases or other feedback that you may have about the indexing project.

Please help us continue our support of great efforts like this by making a donation to the Common Crawl Foundation and follow us @CommonCrawl on Twitter for the latest in Big Open Data.

Analysis of the NCSU Library URLs in the Common Crawl Index

Jason RonalloLast week we announced the Common Crawl URL Index. The index has already proved useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.

Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl Index to look at NCSU Library URLs in the Common Crawl Index. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!

 

Common Crawl URL Index

The Common Crawl now has a URL index available. While the Common Crawl has been making a large corpus of crawl data available for over a year now, if you wanted to access the data you’d have to parse through it all yourself. While setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web, it still is rather expensive for most. Now with the URL index it is possible to query for domains you are interested in to discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.

Scott Robertson, who was responsible for putting the index together, writes in the github README about the file format used for the index and the algorithm for querying it. If you’re interested you can read the details there.

If you just want to see how to get the data now, the repository provides a couple python scripts for querying the index. I used the remote_read script. You’ll need to clone the git repository to get the script along with the library files:

1
git clone https://github.com/trivio/common_crawl_index.git

Then enter the cloned repository and make the file executable:

1
2
cd common_crawl_index
chmod u+x bin/remote_read

Since the data set is hosted for free as part of AWS open data sets, it appears that they allow anonymous access. This means that you may not have to sign up for an Amazon Web Services account. The current remote_read script does not have this anonymous access turned on, but there is an open issue and patch submitted to allow anonymous access. You may want to get that version of the remote_read script and use it until that issue is closed.

If you have an account you want to use, you’ll update these lines in remote_read with your own AWS key and secret.

1
2
3
4
5
6
  mmap = BotoMap(
    '<YOUR AWS KEY>',
    '<YOUR AWS SECRET>',
    'commoncrawl',
    '/projects/url-index/url-index.1356128792'
  )

Finally you’ll have to install boto:

1
pip install boto

Now you can run the script:

querying the Common Crawl URL index
1
bin/remote_read edu.ncsu.lib

Note that because of how the index is constructed you’ll be querying for domains in reverse order. This allows you scope your queries to match everything from a TLD down to a specific subdomain. This will return every URL matching under http://lib.ncsu.edu as well as any subdomains like http://d.lib.ncsu.edu.

As I write this, the index is only partial, while folks provide feedback on the index, so your current results may not reflect everything that is currently in the Common Crawl corpus.

NCSU Libraries’ URLs in the Common Crawl Index

You can see the results for my query for edu.ncsu.lib. Here’s a snippet from the beginning of the set:

Query for edu.ncsu.lib in Common Crawl URL IndexFull Text
1
2
3
4
5
6
7
edu.ncsu.lib.blogs/:http {'arcFileParition': 200, 'compressedSize': 2062, 'arcSourceSegmentId': 1346876860565, 'arcFileDate': 1346911439829, 'arcFileOffset': 1518768}
edu.ncsu.lib.d/:http {'arcFileParition': 2132, 'compressedSize': 855, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908147933, 'arcFileOffset': 2759941}
edu.ncsu.lib.d/collections/:http {'arcFileParition': 2132, 'compressedSize': 5165, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 81186482}
edu.ncsu.lib.d/collections/catalog/0228376:http {'arcFileParition': 2132, 'compressedSize': 5738, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 60135728}
edu.ncsu.lib.d/collections/catalog/bh2127pnc001:http {'arcFileParition': 2132, 'compressedSize': 6003, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 27791779}
edu.ncsu.lib.d/collections/catalog/unccmc00145-002-ff0003-002-004_0002:http {'arcFileParition': 2132, 'compressedSize': 6456, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346909064600, 'arcFileOffset': 7308443}
edu.ncsu.lib.databases/:http {'arcFileParition': 76, 'compressedSize': 5688, 'arcSourceSegmentId': 1346823846039, 'arcFileDate': 1346870337194, 'arcFileOffset': 37040682}

The result is a line delimited file with information about one URL on each line. A space separates the URL from some JSON-like data. (You’ll need to convert the single quotes to double quotes for it to parse as JSON, or just eval the data with Python if you are filled with trust or like to live dangerously.) Again, the URL hostname is in reverse order followed by the path in normal order and finally the protocol. The data is a pointer to the location for the file within a segment of the common crawl dataset. This information can be used toretrieve the page from AWS S3.

What I’m interested in is what NCSU Libraries URLs are represented in the index. In total the URL index has 4033 URLs that all look to be from a crawl in early September. Here’s the breakdown for subdomains:

Query for edu.ncsu.lib in Common Crawl URL IndexFull Text
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
 2278: www.lib.ncsu.edu
  801: repository.lib.ncsu.edu
  326: geodata.lib.ncsu.edu
  289: news.lib.ncsu.edu
  149: lib.ncsu.edu
   54: wikis.lib.ncsu.edu
   27: images.lib.ncsu.edu
   16: ncslaap.lib.ncsu.edu
   15: historicalstate.lib.ncsu.edu
   11: insidewood.lib.ncsu.edu
   10: ncarchitects.lib.ncsu.edu
    5: d.lib.ncsu.edu
    2: media.lib.ncsu.edu
    2: insight.lib.ncsu.edu
    2: hegel.lib.ncsu.edu
    2: libb.ncsu.edu
    2: library.ncsu.edu
    1: reserves.lib.ncsu.edu
    1: ww.lib.ncsu.edu
    1: linkinghub.elsevier.com.www.lib.ncsu.edu
    1: wais.lib.ncsu.edu
    1: vega.lib.ncsu.edu
    1: tricks.lib.ncsu.edu
    1: smithers.lib.ncsu.edu
    1: sirius.lib.ncsu.edu
    1: sfx.lib.ncsu.edu
    1: search.lib.ncsu.edu
    1: rppnt.lib.ncsu.edu
    1: rpp.lib.ncsu.edu
    1: web.ebscohost.com.www.lib.ncsu.edu
    1: isiknowledge.com.www.lib.ncsu.edu
    1: bst.sagepub.com.www.lib.ncsu.edu
    1: ncsulib4.lib.ncsu.edu
    1: ncsulib2.lib.ncsu.edu
    1: sag.sagepub.com.www.lib.ncsu.edu
    1: nclive.lib.ncsu.edu
    1: sciencedirect.com.www.lib.ncsu.edu
    1: ncarchitect.lib.ncsu.edu
    1: metcalf.lib.ncsu.edu
    1: springerlink.com.www.lib.ncsu.edu
    1: luna.lib.ncsu.edu
    1: libntapp1.lib.ncsu.edu
    1: jacob.lib.ncsu.edu
    1: proquest.umi.com.www.lib.ncsu.edu
    1: www3.interscience.wiley.com.www.lib.ncsu.edu
    1: pubs.acs.org.www.lib.ncsu.edu
    1: link.aps.org.www.lib.ncsu.edu
    1: avmajournals.avma.org.www.lib.ncsu.edu
    1: gopher.lib.ncsu.edu
    1: vetpathology.org.www.lib.ncsu.edu
    1: ftp.lib.ncsu.edu
    1: etd.lib.ncsu.edu
    1: ematb.lib.ncsu.edu
    1: dli.lib.ncsu.edu
    1: dewey.lib.ncsu.edu
    1: databases.lib.ncsu.edu
    1: wwwnew.lib.ncsu.edu
    1: blogs.lib.ncsu.edu
    1: bliss.lib.ncsu.edu
Total URLs: 4033
Total hostnames: 59

Analyzing the Results

The results here are interesting as I’m always trying to raise the discoverability of NCSU Libraries’ digital collections. At the top of the list is the main web site for NCSU Libraries. The hostnames www.lib.ncsu.edu and lib.ncsu.edu both point to the same resources. Looking closer we find that of the 2427 URLs there, many are for digital collections related pages. 636 are under the Special Collections Research Center, and some of these are pages for some legacy collections. 407 URLs are for pages in our collection guides application, many of them for individual guides or, strangely the EAD XML for the guides. Some of those collection guides do link to online digital collections.

The institutional repository (Dspace instances) is also well represented at the top of this list. The Technical Reports Repository accounts for 159 of those URLs, and the NCSU Institutional Repository accounts for just 3. The digital collections in the repository, mainly special collections, accounts for 626 URLs. 719 of the 801 repository URLs are directly to the PDFs. Evidently the PDFs rank higher than the landing pages.

NCSU Libraries has been providing Geospatial Data Services and paying attention to SEO for those pages for a long time, so it isn’t completely surprising that this directory of files has gotten indexed: http://geodata.lib.ncsu.edu. (Note that this server may not be accessible from off-campus.) Many of the URLs under www.lib.ncsu.edu are also GIS pages, so GIS data services and collections pages are even better represented–and human-friendly–than at first appears.

Other digital collections projects like Historical State, Inside Wood, North Carolina Architects & Builders, and NCSU Libraries’ Rare and Unique Materials are represented, but nowhere near exhaustively. Historical State now canonicalizes its URLs for individual resources to point to the Rare and Unique Materials site, but Common Crawl may not be paying attention that that hint. (Hopefully, at some point I’ll be able to do a similar analysis for historicalstate.lib.nsu.edu as I’ve done in the following.)

For http://d.lib.ncsu.edu these are the URLs listed:

So it appears that the Common Crawl probably hasn’t (at least in this half of the index!), decided to crawl this site to any extent. Instead it appears it is only deciding to crawl pages that have been linked up. Once the rest of the index comes out, I’ll have to take a look, and consider how to improve that number. The key though is obviously getting more links into the site.

Further down in the list there are a bunch of funny looking URLs. I think these are all proxy URLs for user authentication to restricted resources.

  • linkinghub.elsevier.com.www.lib.ncsu.edu
  • web.ebscohost.com.www.lib.ncsu.edu
  • isiknowledge.com.www.lib.ncsu.edu

http://gopher.lib.ncsu.edu no longer seems to exist, so I don’t know where they got that page.

Double Checking in Web Data Commons

While the Common Crawl URL index is useful if you need the whole page, in many cases just extracted embedded semantic markup may be enough. The Web Data Commons is already extracting Microdata and RDFa data, and makes indexes available, though it takes a bit more effort to parse through their indexes. (I’d like a service or script to query for an N-Quad context and get back all the related triples. Anyone know if there is already such a service? Do I have to write one?) They do have a helpful page on how to download the extracted data in whole or in part.

The http://d.lib.ncsu.edu/collections/ site publishes Microdata and Schema.org. Looking in the Web Data Commons Microdata index I found the the N-Quad file with triples extracted from ncsu.edu pages. They list only the same URLs as the Common Crawl URL index reports. This leads me to believe that these may be the only URLs in the Common Crawl index right now even though that index is incomplete.

What can libraries and archives do with this?

First, how much of your content is in the Common Crawl corpus? I’d be interested in hearing what your results are like.

We need to figure out how to get more cultural heritage content crawled and indexed by the Common Crawl. Without our stuff in the Common Crawl we are missing many opportunities to broaden the reach of our content. It doesn’t appear that Common Crawl accepts sitemaps. It works off of page rank and the link graph of popular sites. While my sites for rare and unique digital collections get most of their traffic from search engines, mainly Google, an increasing amount of traffic is due to referrals. Referrals, links from other sites, seem like the key for getting our stuff into the corpus. Efforts to add links to library special or digital collections to appropriate Wikipedia articles and the like would seem to be a good starting point.

Social sites are in the corpus and may also be a good way to get inbound links to our collections. There are 134,928+ Pinterest URLs in the Common Crawl index, and folks are actively pinning content from d.lib.ncsu.edu. Will the content pinned and repinned on Pinterest begin showing up in the crawl? Where else are crawlers likely to find links from people who make use of our content?

If more cultural heritage content is a part of the index, then there are all sorts of things we can begin to do. For web archiving projects it would be possible to begin with data in the corpus, potentially saving some crawling expense. New targeted search engines (or aggregations) can be created for different slices of content. Implement Microdata (or RDFa Lite) with Schema.org vocabularies and richer metadata can be extracted from your pages by the Web Data Commons and understood by many. This data can then be used in a variety of interfaces to save the time of the user in finding the content they really want.

What are some other ways that libraries, archives, and museums might be able to use the Common Crawl?

You can see the simple Ruby scripts I used for parsing the Common Crawl URL index out and the Web Data Commons N-Quads in this gist.

Common Crawl URL Index

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

Common Crawl URL Index
by Scott Robertson

Common Crawl is my goto data set. It’s a huge collection of pages crawled from the internet and made available completely unfettered. Their choice to largely leave the data alone and make it available “as is”, is brilliant.

It’s almost like I did the crawling myself, minus the hassle of creating a crawling infrastructure, renting space in a data center and dealing with spinning platters covered in rust that freeze up you when you least want them to. I exaggerate. In this day and age I would spend hours, days maybe weeks agonizing over cloud infrastructure choices and worrying about my credit card bills if I wanted to create something on that scale.

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster , would agree.

Which is great news! However if you wanted to extract only a small subset, say every page from Wikipedia you still would have to pay that few hundred dollars. The individual pages are randomly distributed in over 200,000 archive files, which you must download and unzip each one to find all the Wikipedia pages. Well you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

Keeping with Common Crawl tradition we’re making the entire index available as a giant download. Fear not, there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree so you can access parts of it randomly from S3 using byte range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.

Information about the format, and samples of accessing it using python are available on github. Feel free to post questions in the issue tracker and wikis there.

The index itself is located public datasets bucket at s3://commoncrawl/projects/url-index/url-index.1356128792.

This is the first release of the index. The main goals of the design is to allow querying of the index via byte-range queries and to make it easy to implement in any language. We hope you dear reader, will be encouraged to jump in and contribute code to access the index under your favorite language.

For now we’ve avoided clever encoding schemes and compression. We’re expecting that to change as the community has a chance to work with the data and contribute their expertise. Join the discussion we’re happy to have you.