Blog

April 2014 Crawl Data Available

Posted by on Jul 16, 2014 in Blog | Comments Off

The April crawl of 2014 is now available! The new dataset is over 183TB in size and contains approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply prepending either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.
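
For example, a few lines of Python can expand one of these listings into full paths (the file name warc.paths.gz below is just an illustrative placeholder for whichever listing you downloaded):

import gzip

# Placeholder listing file; use any of the gzipped path lists mentioned above.
with gzip.open('warc.paths.gz', 'rt') as listing:
    for line in listing:
        path = line.strip()
        print('s3://aws-publicdatasets/' + path)
        print('https://aws-publicdatasets.s3.amazonaws.com/' + path)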

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Navigating the WARC file format

Posted by on Apr 2, 2014 in Blog, Code, Guest Post | Comments Off

Wait, what’s WAT, WET and WARC?

Recently CommonCrawl has switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This not only includes the response itself, what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.

In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received an HTML page in response. We can also see the page was served from the nginx web server and that a special header, X-hacker, has been added purely for the purposes of advertising to a very specific audience of programmers who might look at the HTTP headers!

WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
WARC-Record-ID: 
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 23.0.160.82
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

HTTP/1.0 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Content-Encoding: gzip
Date: Wed, 04 Dec 2013 16:47:32 GMT
Content-Length: 18953
Connection: close


...HTML Content...
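
As a rough illustration of reading such records programmatically, here is a minimal Python sketch that iterates over the response records in a gzipped WARC file. It assumes the third-party warcio library (an assumption, not part of the crawl itself), and the file name is a placeholder:

from warcio.archiveiterator import ArchiveIterator

# Placeholder file name; substitute any WARC file from the crawl.
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            uri = record.rec_headers.get_header('WARC-Target-URI')
            # http_headers holds the stored HTTP header block shown above.
            server = record.http_headers.get_header('Server')
            print(uri, server)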

WAT Response Format

WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.
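
For instance, Python's standard json module will re-indent a record for inspection (record.json here is a placeholder for a single extracted WAT payload):

import json

# Placeholder file containing one WAT JSON payload.
with open('record.json') as f:
    data = json.load(f)
print(json.dumps(data, indent=2))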

The HTTP response metadata is most likely to be of interest to CommonCrawl users. The skeleton of the JSON format is outlined below.

  • Envelope
    • WARC-Header-Metadata
    • Payload-Metadata
      • HTTP-Response-Metadata
        • Headers
          • HTML-Metadata
            • Head
              • Title
              • Scripts
              • Metas
              • Links
            • Links
    • Container
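
As a hedged sketch of walking that skeleton, the Python below pulls the link list out of a single WAT payload. The dictionary keys follow the outline above; the per-link field name 'url' is an assumption:

import json

def extract_links(wat_json_text):
    # Follow Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata,
    # as outlined above; any missing level simply yields an empty list.
    envelope = json.loads(wat_json_text).get('Envelope', {})
    response = envelope.get('Payload-Metadata', {}).get('HTTP-Response-Metadata', {})
    return response.get('HTML-Metadata', {}).get('Links', [])

# Placeholder file containing one WAT JSON payload.
for link in extract_links(open('record.json').read()):
    print(link.get('url'))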

WET Response Format

As many tasks only require textual information, the CommonCrawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://advocatehealth.com/condell/emergencyservices3
WARC-Date: 2013-12-04T15:30:35Z
WARC-Record-ID: 
WARC-Refers-To: 
WARC-Block-Digest: sha1:3SJBHMFPOCUJEHJ7OMGVCRSHQTWLJUUS
Content-Type: text/plain
Content-Length: 5765


...Text Content...

Processing the file format

We’ve provided three introductory examples in Java for the Hadoop framework. The code also contains wrapper tools for making working with the Web Archive Commons library easier in Hadoop.

These introductory examples include:

  • Counting the number of times various tags are used across HTML on the internet using the WARC files
  • Counting the number of different server types found in the HTTP headers using the WAT files
  • Running a word count over the extracted plaintext found in the WET files (see the sketch below)
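
If you'd rather poke at the data locally before spinning up Hadoop, a minimal sketch of that last word-count case might look like the following Python. It again assumes the third-party warcio library and a placeholder file name:

from collections import Counter
from warcio.archiveiterator import ArchiveIterator

counts = Counter()
# Placeholder file name; substitute any WET file from the crawl.
with open('example.warc.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion':  # WET records are WARC conversion records
            text = record.content_stream().read().decode('utf-8', errors='replace')
            counts.update(text.lower().split())

print(counts.most_common(20))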

If you’re using a different language, there are a number of open source libraries that handle processing these WARC files and the content they contain. These include:

If in doubt, the tools provided as part of the IIPC’s Web Archive Commons library are the preferred implementation.

Stephen Merity

This is a guest blog post by Stephen Merity

Stephen Merity is a Computational Science and Engineering master’s candidate at Harvard University. His graduate work centers around machine learning and data analysis on large data sets. Prior to Harvard, Stephen worked as a software engineer for Freelancer.com and as a software engineer for online education start-up Grok Learning. Stephen has a Bachelor of Information Technology (Honours First Class with University Medal) from the University of Sydney in Australia.

March 2014 Crawl Data Now Available

Posted by on Mar 26, 2014 in Blog | Comments Off

The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-10/.

We went a little deeper on this crawl than during our 2013 crawls, so you'll see more pages per domain. We're working hard to get a few machines always crawling domains with large numbers of pages to go even deeper while still maintaining our politeness policy.

Thanks again to Blekko for their ongoing donation of URLs for our crawl.

Common Crawl’s Move to Nutch

Posted by on Feb 20, 2014 in Blog | Comments Off

Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.

Our old crawler was highly tuned to our data center environment where every machine was identical with large amounts of memory, hard drives and fast networking.

We needed something that would allow us to do web-scale crawls of billions of webpages and work in a cloud environment where we might run on heterogeneous machines with differing amounts of memory, CPU and disk space depending on price, where VMs might come and go, and where networking performance could vary.

About Nutch

Apache Nutch has an interesting past. In 2002 Mike Cafarella and Doug Cutting started the Nutch project in order to build a web crawler for the Lucene search engine. While they were looking for ways to scale Nutch to crawl the whole web, Google released a paper on GFS (the Google File System). Less than a year later, the Nutch Distributed File System was born, and in 2005 Nutch had a working implementation of MapReduce. This implementation would later become the foundation for Hadoop.

Benefits of Nutch

Nutch runs completely as a small number of Hadoop MapReduce jobs that delegate most of the core work of fetching pages, filtering and normalizing URLs, and parsing responses to plug-ins.

The plug-in architecture of Nutch allowed us to isolate most of the customizations we needed for our own particular processes into plug-ins without making changes to the Nutch code itself. This makes life a lot easier when it comes to merging in changes from the larger Nutch community which in turn simplifies maintenance.

The performance of Nutch is comparable to our old crawler. For our Spring 2013 crawl for instance, we’d regularly crawl at aggregate speeds of 40,000 pages per second. Our performance is limited largely by the politeness policy we set to minimize our impact on web servers and the number of simultaneous machines we run on.

Drawbacks

There are some drawbacks to Nutch. The URLs that Nutch fetches are determined ahead of time. This means that while you're fetching documents, it won't discover new URLs and immediately fetch them within the same job. Instead, after the fetch job is complete, you run a parse job, extract the URLs, add them to the crawl database and then generate a new batch of URLs to crawl.

Unfortunately when you’re dealing with billions of URLs, reading and writing this crawl database quickly becomes a large job. The Nutch 2.x branch is supposed to help with this, but it isn’t quite there yet.

Conclusion

Overall the transition to Nutch has been a fantastically positive experience for Common Crawl. We look forward to a long happy future with Nutch.

Notes

If you want to take a look at some of the changes we've made to Nutch, the code is available on GitHub at https://github.com/Aloisius/nutch in the cc branch. The official Nutch project is hosted at Apache at http://nutch.apache.org/.

Lexalytics Text Analysis Work with Common Crawl Data

Posted by on Feb 4, 2014 in Blog | Comments Off

Oskar Singer

This is a guest blog post by Oskar Singer

Oskar Singer is a Software Developer and Computer Science student at University of Massachusetts Amherst. He recently did some very interesting text analytics work during his internship at Lexalytics. The post below describes the work, how Common Crawl data was used, and includes a link to code.

At Lexalytics, I have been working with our head of software engineering, Paul Barba, on improving our accuracy with Twitter data for POS-tagging, entity extraction, parsing and ultimately sentiment analysis through building an interesting model-based approach for handling misspelled words.  

Our approach involves a spell checker that automatically corrects the input text internally for the benefit of the engine and outputs the original text for the benefit of the engine user, so this must be a different kind of automated spell-correction.

The First Attempt:

Our first attempt was to take the top scoring word from the list of unranked correction suggestions provided by Hunspell, an open-source spell checking library. We calculated each suggestion’s score as word frequency from Common Crawl data divided by string edit distance with consideration for keyboard distance.

The resulting corrections were scored against hand-corrected tweets by counting the number of tokens that differed. Hunspell scored worse than the original tweets. It corrected usernames and hashtags and gave totally unreasonable suggestions. My favorite Hunspell correction was the mapping from “ur” (as in the short-form for “your” or “you’re”) to “Ur” (as in the ancient Mesopotamian city-state).

Hunspell also missed mistakes like misused homophones, which did not count as a misspelling when considered in isolation. This last issue seemed to be the primary issue with our data, so the problem required a method with the ability to consider context.

The Second (and final) Attempt:

We title the next attempt “the Switchabalizer”, and it can be summarized as a multinomial, sliding-window, Naive-Bayes word classifier. On a high level, we classify each of the target words in a piece of text, based on the preceding and succeeding words, as itself or one of its homophones.

The training process starts with a list of bigrams from the Common Crawl data paired with their occurrence counts. We use this data to calculate P(w_{i-1} | w_i) = #(w_{i-1} w_i) / #(w_{i-1}) and P(w_{i+1} | w_i) = #(w_i w_{i+1}) / #(w_{i+1}), where w_i is the current word, w_{i-1} is the preceding word and w_{i+1} is the succeeding word. These probabilities are serialized and archived so they can be deserialized into C++ data structures instead of being recalculated for each instantiation of the spell check object. In other words, we're building a set of probabilities that each switchable "generated" the words preceding and succeeding w_i.

The inference process starts with a set S of sets and an inverted index. Each s ∈ S represents a group of commonly confused homophones (e.g. two, too, 2, to), and no word is a member of multiple s ∈ S. The inverted index maps each word w in the union of all s ∈ S to the s in which w holds membership. Each word w_i in the ordered sequence of words in a document is checked for an entry in the inverted index. If an entry V is found, the algorithm replaces w_i with argmax_{v ∈ V} P(v), where P(v) = P(w_{i-1} | v) + P(w_{i+1} | v).
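
To make the description concrete, here is a small Python sketch of the idea. It is our reading of the approach rather than Lexalytics' implementation: the toy counts, the single switchable set and the helper names are all stand-ins, and the scoring rule is the sum of the two conditional probabilities defined above:

# Toy counts standing in for the bigram/unigram statistics derived from Common Crawl.
unigram_counts = {'going': 1000, 'to': 1300, 'too': 200, 'two': 150, 'sleep': 500}
bigram_counts = {('going', 'to'): 900, ('going', 'too'): 5, ('going', 'two'): 1,
                 ('to', 'sleep'): 400, ('too', 'sleep'): 2, ('two', 'sleep'): 1}

# Sets of commonly confused homophones, plus the inverted index described above.
switchables = [{'two', 'too', 'to'}]
inverted = {w: s for s in switchables for w in s}

def score(prev_w, v, next_w):
    # P(w_{i-1} | v) = #(w_{i-1} v) / #(w_{i-1}) and P(w_{i+1} | v) = #(v w_{i+1}) / #(w_{i+1}),
    # summed as described in the post.
    before = bigram_counts.get((prev_w, v), 0) / max(unigram_counts.get(prev_w, 0), 1)
    after = bigram_counts.get((v, next_w), 0) / max(unigram_counts.get(next_w, 0), 1)
    return before + after

def switchabalize(tokens):
    out = list(tokens)
    for i, w in enumerate(tokens):
        if w in inverted:
            prev_w = tokens[i - 1] if i > 0 else ''
            next_w = tokens[i + 1] if i + 1 < len(tokens) else ''
            out[i] = max(inverted[w], key=lambda v: score(prev_w, v, next_w))
    return out

print(switchabalize(['going', 'two', 'sleep']))  # -> ['going', 'to', 'sleep']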

Testing:

As a matter of efficiency, we assumed that Wikipedia articles have perfect use of the target homophones. I wrote a Python script that took in text, randomly replaced target homophones with members of their switchable set, then output the result.

We ran the Switchabalizer on this data and compared to the original Wikipedia data. Comparing the corrections to the words changed by our test generator, Hunspell, even when forced to ignore usernames, had a 216% error rate (i.e. it made false corrections), and the Switchabalizer had a 20% error rate. Although the test data does not match the target data, the massive and varied data set provided by Common Crawl should ensure good results from the Switchabalizer on many types of data, hopefully even the near-nonsense from the bowels of Twitter.

Conclusion:

The Switchabalizer approach is clearly superior to a traditional spell checker for our targeted issues, but still requires significant testing, tuning and improvement. The following section provides some possibilities for improvement and expansion. We hope this approach can be of use to other people with the same problem, and we would like to thank Common Crawl for the fantastic resource that they provide!

Future Work:

Possible future experiments include further testing on different types of data, integration of higher-order n-gram features, implementation of a discriminative model, implementation for other languages, and corrections of common misspellings like “ur”, which cannot be included in sets of switchables without risking the model mapping words to non-words.

The commented Python scripts that generate the testing data and perform feature extraction/training/feature selection can be found on my github account at https://github.com/oskarsinger/PythonScriptsFromLexalytics/tree/master/AutomatedSpellCheck/


Winter 2013 Crawl Data Now Available

Posted by on Jan 8, 2014 in Blog | Comments Off

The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013 (see previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages and is 148TB in size. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2013-48/.

 In 2013, we made changes to our crawling and post-processing systems. As detailed in the previous blog post, we switched file formats to the international standard WARC and WAT files. We also began using Apache Nutch to crawl – stay tuned for an upcoming blog post on our use of Nutch. The new crawling method relies heavily on the generous data donations from blekko and we are extremely grateful for blekko’s ongoing support!

In 2014 we plan to crawl much more frequently and publish fresh datasets at least once a month.  

New Crawl Data Available!

Posted by on Nov 27, 2013 in Blog | 12 comments

We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).

We’ve made some changes to the data formats and the directory structure. Please see the details below and please share your thoughts and questions on the Common Crawl Google Group.

Format Changes

We have switched from ARC files to WARC files to better match what the industry has standardized on. WARC files allow us to include HTTP request information in the crawl data, add metadata about requests, and cross-reference the text extracts with the specific response that they were generated from. There are also many good open source tools for working with WARC files.

We have switched the metadata files from JSON to WAT files.  The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.


We have switched our text file format from Hadoop sequence files to WET files (WARC Encapsulated Text) that properly reference the original requests. This makes it far easier for your processes to disambiguate which text extracts belong to which specific page fetches.

Directory Structure

New crawl data is located in the aws-publicdatasets bucket under the base path /common-crawl/crawl-data/.

Under this base path, crawl data is organized hierarchically as follows:

  • CRAWL-NAME-YYYY-WW – the name of the crawl plus the year and week number in which it was initiated (e.g. CC-MAIN-2013-48)
    • segments
      • SEGMENTNAME – a segment directory, typically a unix timestamp
        • warc – contains the WARC files with the HTTP requests and responses for each fetch
          • CRAWL-NAME-YYYMMMDDSS-SEQ-MACHINE.warc.gz – individual WARC files
        • wat – contains WARC-encoded WAT files which describe the metadata of each request/response
          • CRAWL-NAME-YYYMMMDDSS-SEQ-MACHINE.warc.wat.gz – individual WAT files
        • wet – contains WARC-encoded WET files with text extractions from the HTTP responses
          • CRAWL-NAME-YYYMMMDDSS-SEQ-MACHINE.warc.wet.gz – individual WET files

The 2013 wide web crawl data is located at /common-crawl/crawl-data/CC-MAIN-2013-20/ which represents the main CC crawl initiated during the 20th week of 2013.

Resources

More information about WARC can be found at http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

Internet Archive publishes tools to process WARC and WAT files at https://github.com/internetarchive/ia-hadoop-tools and https://github.com/internetarchive/ia-web-commons

WET files can be treated as WARC files as they are simply conversion records as detailed in the WARC specification above.

More information about WAT files can be found at https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification.

  • Python WARC tools: http://code.hanzoarchives.com/warc-tools
  • Erlang WARC SDK: http://www.webarchivingbucket.com/#wsdk
  • A tool for exploring WARC files: https://wiki.umiacs.umd.edu/adapt/index.php/WarcManager
  • A handy collection of links to tools for working with WARC files: http://www.netpreserve.org/web-archiving/tools-and-software

Hyperlink Graph from Web Data Commons

Posted by on Nov 13, 2013 in Blog | Comments Off

The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.

Altogether, they found 128 billion hyperlinks connecting 3.5 billion pages.

They have published the resulting graph today, together with some results from their analysis of the graph.

http://webdatacommons.org/hyperlinkgraph/
http://webdatacommons.org/hyperlinkgraph/topology.html

To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public!

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

Posted by on Aug 14, 2013 in Blog, Media | Comments Off

Sebastian Spiegler is the head of the data team at SwiftKey and a volunteer at Common Crawl. Yesterday we posted Sebastian's statistical analysis of the 2012 Common Crawl corpus. Today we are following it up with a great video featuring Sebastian talking about why crawl data is valuable, his research, and why open data is important.

 The video is an excellent illustration of how startups can benefit from Common Crawl data and we hope that it inspires other startups to use our data!  

A Look Inside Our 210TB 2012 Web Corpus

Posted by on Aug 13, 2013 in Blog | 2 comments

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can, thanks to Sebastian Spiegler.

Sebastian is a highly talented data scientist who works at the London based startup SwiftKey and volunteers at Common Crawl.  He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.

 From the conclusion section of the paper:

The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainder are images, XML or code like JavaScript and cascading style sheets.

View or download a pdf of Sebastian's paper here. If you want to dive deeper, you can find the non-aggregated data at s3://aws-publicdatasets/common-crawl/index2012 and the code on GitHub.

Professor Jim Hendler Joins the Common Crawl Advisory Board!

Posted by on Mar 22, 2013 in Blog | Comments Off

I am extremely happy to announce that Professor Jim Hendler has joined the Common Crawl Advisory Board.  Professor Hendler is the Head of the Computer Science Department at Rensselaer Polytechnic Institute (RPI) and also serves as the Professor of Computer and Cognitive Science at RPI’s Tetherless World Constellation.

Jim Hendler is a highly respected leader and an early innovator of the Semantic Web. In fact, he has been writing about it for over a decade – since before most of us had even heard the term. The 2001 article in Scientific American that he coauthored with Tim Berners-Lee and Ora Lassila has been cited over 15,000 times and to this day is one of the very best explanations of the potential of the Semantic Web. He is one of the editors of Synthesis Lectures on the Semantic Web, where he recently published Aaron Swartz's A Programmable Web: An Unfinished Work. Aaron Swartz's book is available as a free download. I strongly encourage everyone to read it and to spread the word about it so it reaches as many people as possible.

 Professor Hendler is also a strong advocate for open government data and has pushed that movement forward through his work with the data.gov project and his Linking Open Government Data project.  His Twitter feed is an excellent source of information  about open government data and about all of the important and exciting work he does.

Having Professor Hendler’s insight and guidance will be a tremendous benefit to Common Crawl and everyone on the team is very excited that he has joined us!

URL Search Tool!

Posted by on Mar 5, 2013 in Blog | 4 comments

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs, since you can now run them only on the files of interest instead of the entire corpus.

URL Search

We are excited to see examples of URL Search in action. Are you working with Common Crawl data? Would you like to win $100 in AWS credit for sharing how URL Search makes your life easier? The first five people who share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!

Email a link to the GitHub repo to lisa@commoncrawl.org for consideration. The code must be accompanied by a ReadMe file that explains what it does. If you would like to write a guest blog post about your work, we would be happy to publish it on the Common Crawl blog.

The Winners of The Norvig Web Data Science Award

Posted by on Feb 25, 2013 in Blog, Code | Comments Off

We are very excited to announce that the winners of the Norvig Web Data Science Award are Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente!

The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science and is named in honor of distinguished computer scientist Peter Norvig.

There were many excellent submissions that demonstrated how you can extract valuable insight and knowledge from web crawl data. Be sure to check out the work of the winning team, Traitor – Associating Concepts Using The World Wide Web, and the other finalists on the award website. You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus. All code is open source and we are looking forward to seeing it reused and adapted for other projects. 

A huge thank you to our distinguished panel of judges: Peter Norvig, Ricardo Baeza-Yates, Hilary Mason, Jimmy Lin, and Evert Lammerts.

Analysis of the NCSU Library URLs in the Common Crawl Index

Posted by on Jan 16, 2013 in Blog, Code, Guest Post | Comments Off

Jason Ronallo

Last week we announced the Common Crawl URL Index. The index has already proved useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.

Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl URL Index to look at NCSU Library URLs in the crawl corpus. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!

Common Crawl URL Index

The Common Crawl now has a URL index available. While the Common Crawl has been making a large corpus of crawl data available for over a year now, if you wanted to access the data you had to parse through it all yourself. While setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the web, it is still rather expensive for most. Now with the URL index it is possible to query for domains you are interested in to discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.

Scott Robertson, who was responsible for putting the index together, writes in the github README about the file format used for the index and the algorithm for querying it. If you’re interested you can read the details there.

If you just want to see how to get the data now, the repository provides a couple python scripts for querying the index. I used the remote_read script. You’ll need to clone the git repository to get the script along with the library files:

git clone https://github.com/trivio/common_crawl_index.git

Then enter the cloned repository and make the file executable:

cd common_crawl_index
chmod u+x bin/remote_read

Since the data set is hosted for free as part of AWS open data sets, it appears that they allow anonymous access. This means that you may not have to sign up for an Amazon Web Services account. The current remote_read script does not have this anonymous access turned on, but there is an open issue and patch submitted to allow anonymous access. You may want to get that version of the remote_read script and use it until that issue is closed.

If you have an account you want to use, you’ll update these lines in remote_read with your own AWS key and secret.

  mmap = BotoMap(
    '<YOUR AWS KEY>',
    '<YOUR AWS SECRET>',
    'aws-publicdatasets',
    '/common-crawl/projects/url-index/url-index.1356128792'
  )

Finally you’ll have to install boto:

pip install boto

Now you can run the script:

querying the Common Crawl URL index
bin/remote_read edu.ncsu.lib

Note that because of how the index is constructed, you'll be querying for domains in reverse order. This allows you to scope your queries to match everything from a TLD down to a specific subdomain. This query will return every URL matching under http://lib.ncsu.edu as well as any subdomains like http://d.lib.ncsu.edu.
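
A tiny, purely illustrative helper makes the reversal concrete:

def to_index_key(hostname):
    # lib.ncsu.edu -> edu.ncsu.lib, so queries can be scoped from the TLD downwards.
    return '.'.join(reversed(hostname.split('.')))

print(to_index_key('lib.ncsu.edu'))    # edu.ncsu.lib
print(to_index_key('d.lib.ncsu.edu'))  # edu.ncsu.lib.d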

As I write this, the index is only partial while folks provide feedback on it, so your current results may not reflect everything that is currently in the Common Crawl corpus.

NCSU Libraries’ URLs in the Common Crawl Index

You can see the results for my query for edu.ncsu.lib. Here’s a snippet from the beginning of the set:

Query for edu.ncsu.lib in Common Crawl URL Index
edu.ncsu.lib.blogs/:http {'arcFileParition': 200, 'compressedSize': 2062, 'arcSourceSegmentId': 1346876860565, 'arcFileDate': 1346911439829, 'arcFileOffset': 1518768}
edu.ncsu.lib.d/:http {'arcFileParition': 2132, 'compressedSize': 855, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908147933, 'arcFileOffset': 2759941}
edu.ncsu.lib.d/collections/:http {'arcFileParition': 2132, 'compressedSize': 5165, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 81186482}
edu.ncsu.lib.d/collections/catalog/0228376:http {'arcFileParition': 2132, 'compressedSize': 5738, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 60135728}
edu.ncsu.lib.d/collections/catalog/bh2127pnc001:http {'arcFileParition': 2132, 'compressedSize': 6003, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 27791779}
edu.ncsu.lib.d/collections/catalog/unccmc00145-002-ff0003-002-004_0002:http {'arcFileParition': 2132, 'compressedSize': 6456, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346909064600, 'arcFileOffset': 7308443}
edu.ncsu.lib.databases/:http {'arcFileParition': 76, 'compressedSize': 5688, 'arcSourceSegmentId': 1346823846039, 'arcFileDate': 1346870337194, 'arcFileOffset': 37040682}

The result is a line-delimited file with information about one URL on each line. A space separates the URL from some JSON-like data. (You'll need to convert the single quotes to double quotes for it to parse as JSON, or just eval the data with Python if you are filled with trust or like to live dangerously.) Again, the URL hostname is in reverse order, followed by the path in normal order and finally the protocol. The data is a pointer to the location of the file within a segment of the Common Crawl dataset. This information can be used to retrieve the page from AWS S3.
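
If you'd rather not eval untrusted strings, Python's ast.literal_eval parses the single-quoted dicts safely. The sketch below reads saved remote_read output (the file name is a placeholder) and tallies URLs per hostname:

import ast
from collections import Counter

counts = Counter()
# Placeholder: remote_read output saved to a file, one record per line.
with open('ncsu_urls.txt') as f:
    for line in f:
        key, _, data = line.partition(' ')
        pointer = ast.literal_eval(data)      # e.g. pointer['arcFileOffset'] locates the page
        reversed_host = key.split('/', 1)[0]  # e.g. edu.ncsu.lib.d
        hostname = '.'.join(reversed(reversed_host.split('.')))
        counts[hostname] += 1

for hostname, n in counts.most_common():
    print(n, hostname)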

What I’m interested in is what NCSU Libraries URLs are represented in the index. In total the URL index has 4033 URLs that all look to be from a crawl in early September. Here’s the breakdown for subdomains:

Query for edu.ncsu.lib in Common Crawl URL Index
 2278: www.lib.ncsu.edu
  801: repository.lib.ncsu.edu
  326: geodata.lib.ncsu.edu
  289: news.lib.ncsu.edu
  149: lib.ncsu.edu
   54: wikis.lib.ncsu.edu
   27: images.lib.ncsu.edu
   16: ncslaap.lib.ncsu.edu
   15: historicalstate.lib.ncsu.edu
   11: insidewood.lib.ncsu.edu
   10: ncarchitects.lib.ncsu.edu
    5: d.lib.ncsu.edu
    2: media.lib.ncsu.edu
    2: insight.lib.ncsu.edu
    2: hegel.lib.ncsu.edu
    2: libb.ncsu.edu
    2: library.ncsu.edu
    1: reserves.lib.ncsu.edu
    1: ww.lib.ncsu.edu
    1: linkinghub.elsevier.com.www.lib.ncsu.edu
    1: wais.lib.ncsu.edu
    1: vega.lib.ncsu.edu
    1: tricks.lib.ncsu.edu
    1: smithers.lib.ncsu.edu
    1: sirius.lib.ncsu.edu
    1: sfx.lib.ncsu.edu
    1: search.lib.ncsu.edu
    1: rppnt.lib.ncsu.edu
    1: rpp.lib.ncsu.edu
    1: web.ebscohost.com.www.lib.ncsu.edu
    1: isiknowledge.com.www.lib.ncsu.edu
    1: bst.sagepub.com.www.lib.ncsu.edu
    1: ncsulib4.lib.ncsu.edu
    1: ncsulib2.lib.ncsu.edu
    1: sag.sagepub.com.www.lib.ncsu.edu
    1: nclive.lib.ncsu.edu
    1: sciencedirect.com.www.lib.ncsu.edu
    1: ncarchitect.lib.ncsu.edu
    1: metcalf.lib.ncsu.edu
    1: springerlink.com.www.lib.ncsu.edu
    1: luna.lib.ncsu.edu
    1: libntapp1.lib.ncsu.edu
    1: jacob.lib.ncsu.edu
    1: proquest.umi.com.www.lib.ncsu.edu
    1: www3.interscience.wiley.com.www.lib.ncsu.edu
    1: pubs.acs.org.www.lib.ncsu.edu
    1: link.aps.org.www.lib.ncsu.edu
    1: avmajournals.avma.org.www.lib.ncsu.edu
    1: gopher.lib.ncsu.edu
    1: vetpathology.org.www.lib.ncsu.edu
    1: ftp.lib.ncsu.edu
    1: etd.lib.ncsu.edu
    1: ematb.lib.ncsu.edu
    1: dli.lib.ncsu.edu
    1: dewey.lib.ncsu.edu
    1: databases.lib.ncsu.edu
    1: wwwnew.lib.ncsu.edu
    1: blogs.lib.ncsu.edu
    1: bliss.lib.ncsu.edu
Total URLs: 4033
Total hostnames: 59

Analyzing the Results

The results here are interesting as I'm always trying to raise the discoverability of NCSU Libraries' digital collections. At the top of the list is the main web site for NCSU Libraries. The hostnames www.lib.ncsu.edu and lib.ncsu.edu both point to the same resources. Looking closer, we find that of the 2427 URLs there, many are for digital collections related pages. 636 are under the Special Collections Research Center, and some of these are pages for some legacy collections. 407 URLs are for pages in our collection guides application, many of them for individual guides or, strangely, the EAD XML for the guides. Some of those collection guides do link to online digital collections.

The institutional repository (Dspace instances) is also well represented at the top of this list. The Technical Reports Repository accounts for 159 of those URLs, and the NCSU Institutional Repository accounts for just 3. The digital collections in the repository, mainly special collections, accounts for 626 URLs. 719 of the 801 repository URLs are directly to the PDFs. Evidently the PDFs rank higher than the landing pages.

NCSU Libraries has been providing Geospatial Data Services and paying attention to SEO for those pages for a long time, so it isn’t completely surprising that this directory of files has gotten indexed: http://geodata.lib.ncsu.edu. (Note that this server may not be accessible from off-campus.) Many of the URLs under www.lib.ncsu.edu are also GIS pages, so GIS data services and collections pages are even better represented–and human-friendly–than at first appears.

Other digital collections projects like Historical State, Inside Wood, North Carolina Architects & Builders, and NCSU Libraries' Rare and Unique Materials are represented, but nowhere near exhaustively. Historical State now canonicalizes its URLs for individual resources to point to the Rare and Unique Materials site, but Common Crawl may not be paying attention to that hint. (Hopefully, at some point I'll be able to do a similar analysis for historicalstate.lib.ncsu.edu as I've done in the following.)

For http://d.lib.ncsu.edu these are the URLs listed:

So it appears that the Common Crawl (at least in this half of the index!) probably hasn't decided to crawl this site to any extent. Instead it appears it is only crawling pages that have been linked up. Once the rest of the index comes out, I'll have to take a look and consider how to improve that number. The key, though, is obviously getting more links into the site.

Further down in the list there are a bunch of funny looking URLs. I think these are all proxy URLs for user authentication to restricted resources.

  • linkinghub.elsevier.com.www.lib.ncsu.edu
  • web.ebscohost.com.www.lib.ncsu.edu
  • isiknowledge.com.www.lib.ncsu.edu

http://gopher.lib.ncsu.edu no longer seems to exist, so I don’t know where they got that page.

Double Checking in Web Data Commons

While the Common Crawl URL index is useful if you need the whole page, in many cases just extracted embedded semantic markup may be enough. The Web Data Commons is already extracting Microdata and RDFa data, and makes indexes available, though it takes a bit more effort to parse through their indexes. (I’d like a service or script to query for an N-Quad context and get back all the related triples. Anyone know if there is already such a service? Do I have to write one?) They do have a helpful page on how to download the extracted data in whole or in part.

The http://d.lib.ncsu.edu/collections/ site publishes Microdata and Schema.org. Looking in the Web Data Commons Microdata index, I found the N-Quad file with triples extracted from ncsu.edu pages. They list only the same URLs as the Common Crawl URL index reports. This leads me to believe that these may be the only URLs in the Common Crawl index right now, even though that index is incomplete.

What can libraries and archives do with this?

First, how much of your content is in the Common Crawl corpus? I’d be interested in hearing what your results are like.

We need to figure out how to get more cultural heritage content crawled and indexed by the Common Crawl. Without our stuff in the Common Crawl we are missing many opportunities to broaden the reach of our content. It doesn’t appear that Common Crawl accepts sitemaps. It works off of page rank and the link graph of popular sites. While my sites for rare and unique digital collections get most of their traffic from search engines, mainly Google, an increasing amount of traffic is due to referrals. Referrals, links from other sites, seem like the key for getting our stuff into the corpus. Efforts to add links to library special or digital collections to appropriate Wikipedia articles and the like would seem to be a good starting point.

Social sites are in the corpus and may also be a good way to get inbound links to our collections. There are 134,928+ Pinterest URLs in the Common Crawl index, and folks are actively pinning content from d.lib.ncsu.edu. Will the content pinned and repinned on Pinterest begin showing up in the crawl? Where else are crawlers likely to find links from people who make use of our content?

If more cultural heritage content is a part of the index, then there are all sorts of things we can begin to do. For web archiving projects it would be possible to begin with data in the corpus, potentially saving some crawling expense. New targeted search engines (or aggregations) can be created for different slices of content. Implement Microdata (or RDFa Lite) with Schema.org vocabularies, and richer metadata can be extracted from your pages by the Web Data Commons and understood by many. This data can then be used in a variety of interfaces to save the time of the user in finding the content they really want.

What are some other ways that libraries, archives, and museums might be able to use the Common Crawl?

You can see the simple Ruby scripts I used for parsing the Common Crawl URL index out and the Web Data Commons N-Quads in this gist.

Common Crawl URL Index

Posted by on Jan 8, 2013 in Blog, Code, Guest Post | 8 comments

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

Common Crawl URL Index 
by Scott Robertson

Common Crawl is my goto data set. It’s a huge collection of pages crawled from the internet and made available completely unfettered. Their choice to largely leave the data alone and make it available “as is”, is brilliant.

It’s almost like I did the crawling myself, minus the hassle of creating a crawling infrastructure, renting space in a data center and dealing with spinning platters covered in rust that freeze up on you when you least want them to. I exaggerate. In this day and age I would spend hours, days, maybe weeks agonizing over cloud infrastructure choices and worrying about my credit card bills if I wanted to create something on that scale.

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, it all starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster, would agree.

Which is great news! However, if you wanted to extract only a small subset, say every page from Wikipedia, you still would have to pay that few hundred dollars. The individual pages are randomly distributed across over 200,000 archive files, each of which you would have to download and unzip to find all the Wikipedia pages. Well, you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the  archive based on their URL, domain, subdomain or even TLD (top level domain).

Keeping with Common Crawl tradition, we’re making the entire index available as a giant download. Fear not, there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree so you can access parts of it randomly from S3 using byte-range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.
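
Byte-range access needs nothing exotic. As a hedged illustration, an anonymous HTTP range request against the public bucket looks like the Python below; the requests library and the particular byte range are assumptions, and a real client would compute offsets from the b-tree rather than hard-coding them:

import requests

url = ('https://aws-publicdatasets.s3.amazonaws.com/'
       'common-crawl/projects/url-index/url-index.1356128792')
# Fetch an arbitrary 64 KB block from the start of the index file.
resp = requests.get(url, headers={'Range': 'bytes=0-65535'})
print(resp.status_code)   # 206 Partial Content when the range is honoured
print(len(resp.content))  # number of bytes returned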

Information about the format, along with samples of accessing it using Python, is available on GitHub. Feel free to post questions in the issue tracker and wikis there.

The index itself is located in the public datasets bucket at s3://aws-publicdatasets/common-crawl/projects/url-index/url-index.1356128792.

This is the first release of the index. The main goals of the design are to allow querying of the index via byte-range requests and to make it easy to implement in any language. We hope you, dear reader, will be encouraged to jump in and contribute code to access the index in your favorite language.

For now we’ve avoided clever encoding schemes and compression. We’re expecting that to change as the community has a chance to work with the data and contribute their expertise. Join the discussion; we’re happy to have you.