Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can, thanks to Sebastian Spiegler!
Sebastian is a highly talented data scientist who works at the London-based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.
From the conclusion section of the paper:
I am extremely happy to announce that Professor Jim Hendler has joined the Common Crawl Advisory Board. Professor Hendler is the Head of the Computer Science Department at Rensselaer Polytechnic Institute (RPI) and also serves as the Professor of Computer and Cognitive Science at RPI’s Tetherless World Constellation.
Jim Hendler is a highly respected leader and an early innovator of the Semantic Web. In fact, he has been writing about it for over a decade – since before most of us had even heard the term. The 2001 article in Scientific American that he coauthored with Tim Berners Lee and Ora Lassila has been cited over 15,000 times and to this day is one of the very best explanations of the potential of the Semantic Web. He is one of the editors of Synthesis Lectures on the Semantic Web where he recently published Aaron Swartz’s A Programmable Web: An Unfinished Work. Aaron Swartz’s book is available as a free download. I strongly encourage everyone to read it and to spread the word about it so it reaches as many people as possible.
Professor Hendler is also a strong advocate for open government data and has pushed that movement forward through his work with the data.gov project and his Linking Open Government Data project. His Twitter feed is an excellent source of information about open government data and about all of the important and exciting work he does.
Having Professor Hendler’s insight and guidance will be a tremendous benefit to Common Crawl and everyone on the team is very excited that he has joined us!
A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!
URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs, since you can now run them on only the files of interest instead of the entire corpus.
We are excited to see examples of URL Search in action. Are you working with Common Crawl data? Would you like to win $100 in AWS credit for sharing how URL Search makes your life easier? The first five people who share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!
Email a link to the GitHub repo to [email protected] for consideration. The code must be accompanied by a ReadMe file that explains. If you would like to write a guest blog post about your work we would be happy to publish it on the Common Crawl blog.
We are very excited to announce that the winners of the Norvig Web Data Science Award are Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente!
There were many excellent submissions that demonstrated how you can extract valuable insight and knowledge from web crawl data. Be sure to check out the work of the winning team, Traitor – Associating Concepts Using The World Wide Web, and the other finalists on the award website. You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus. All code is open source and we are looking forward to seeing it reused and adapted for other projects.
Last week we announced the Common Crawl URL Index. The index has already proved useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.
Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl URL Index to look at which NCSU Libraries URLs it contains. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!
Common Crawl URL Index
The Common Crawl now has a URL index available. While the Common Crawl has been making a large corpus of crawl data available for over a year now, if you wanted to access the data you’d have to parse through it all yourself. While setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web, it still is rather expensive for most. Now with the URL index it is possible to query for domains you are interested in to discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.
Scott Robertson, who was responsible for putting the index together, writes in the github README about the file format used for the index and the algorithm for querying it. If you’re interested you can read the details there.
If you just want to see how to get the data now, the repository provides a couple of Python scripts for querying the index. I used the remote_read script. You’ll need to clone the git repository to get the script along with the library files:
Then enter the cloned repository and make the file executable:
Since the data set is hosted for free as part of AWS open data sets, it appears that they allow anonymous access. This means that you may not have to sign up for an Amazon Web Services account. The current remote_read script does not have this anonymous access turned on, but there is an open issue and patch submitted to allow anonymous access. You may want to get that version of the remote_read script and use it until that issue is closed.
If you have an account you want to use, you’ll update these lines in remote_read with your own AWS key and secret.
Finally you’ll have to install boto:
Now you can run the script:
Note that because of how the index is constructed you’ll be querying for domains in reverse order. This allows you to scope your queries to match everything from a TLD down to a specific subdomain. This will return every URL matching under http://lib.ncsu.edu as well as any subdomains like http://d.lib.ncsu.edu.
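To make the reverse ordering concrete, here is a minimal Python sketch of how a URL might be turned into an index-style key. The helper names (`domain_prefix`, `to_index_key`) are hypothetical and not part of the remote_read script, and the exact key layout (reversed host, then path, then a colon and the protocol) is an assumption based on how the results are described in this post.

```python
from urllib.parse import urlsplit


def domain_prefix(domain):
    """Reverse the labels of a domain: 'lib.ncsu.edu' -> 'edu.ncsu.lib'."""
    return ".".join(reversed(domain.split(".")))


def to_index_key(url):
    """Build a reversed-host key for a URL (assumed layout: host, path, ':' + protocol).

    Query strings and fragments are ignored in this sketch.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"{domain_prefix(parts.hostname or '')}{path}:{parts.scheme}"


prefix = domain_prefix("lib.ncsu.edu")
key = to_index_key("http://d.lib.ncsu.edu/collections/")
print(prefix)
print(key)
# A subdomain key begins with the reversed parent domain, which is
# why one prefix query covers the domain and all of its subdomains.
print(key.startswith(prefix))
```

Because the most general label comes first, a simple string prefix match is enough to scope a query to a TLD, a domain, or a single subdomain.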
As I write this, the index is only partial, while folks provide feedback on the index, so your current results may not reflect everything that is currently in the Common Crawl corpus.
NCSU Libraries’ URLs in the Common Crawl Index
You can see the results for my query for edu.ncsu.lib. Here’s a snippet from the beginning of the set:
The result is a line-delimited file with information about one URL on each line. A space separates the URL from some JSON-like data. (You’ll need to convert the single quotes to double quotes for it to parse as JSON, or just eval the data with Python if you are filled with trust or like to live dangerously.) Again, the URL hostname is in reverse order followed by the path in normal order and finally the protocol. The data is a pointer to the location for the file within a segment of the Common Crawl dataset. This information can be used to retrieve the page from AWS S3.
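If you would rather not eval untrusted output, Python’s standard `ast.literal_eval` is a safer middle ground: it accepts only literals such as dicts, strings and numbers. The sketch below assumes each line has the shape described above; the sample line and its field names (`offset`, `size`) are made up for illustration and are not the index’s real field names.

```python
import ast


def parse_index_line(line):
    """Split one result line into (url_key, metadata_dict)."""
    # The first space separates the URL key from the JSON-like data.
    key, _, raw = line.strip().partition(" ")
    # literal_eval parses Python literals only, so it cannot execute code.
    metadata = ast.literal_eval(raw)
    return key, metadata


# Hypothetical sample line; the field names are illustrative only.
sample = "edu.ncsu.lib/:http {'offset': 1024, 'size': 2048}"
url_key, metadata = parse_index_line(sample)
print(url_key)
print(metadata["offset"])
```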
What I’m interested in is what NCSU Libraries URLs are represented in the index. In total the URL index has 4033 URLs that all look to be from a crawl in early September. Here’s the breakdown for subdomains:
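A per-subdomain tally of this kind can be sketched in a few lines of Python. This is not the script I actually used (those were Ruby), just an illustration; it assumes each result line starts with a reversed-host key followed by a space, and the sample lines are invented.

```python
from collections import Counter


def host_from_key(key):
    """Recover the normal hostname from a reversed-host key."""
    # Everything before the first '/' (or ':') is the reversed hostname.
    reversed_host = key.split("/", 1)[0].split(":", 1)[0]
    return ".".join(reversed(reversed_host.split(".")))


def count_hosts(lines):
    """Count how many result lines belong to each hostname."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key = line.split(" ", 1)[0]
        counts[host_from_key(key)] += 1
    return counts


# Invented sample lines standing in for real query results.
sample_lines = [
    "edu.ncsu.lib/:http {'offset': 1}",
    "edu.ncsu.lib/about/:http {'offset': 2}",
    "edu.ncsu.lib.d/collections/:http {'offset': 3}",
]
counts = count_hosts(sample_lines)
for host, total in counts.most_common():
    print(total, host)
```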
Analyzing the Results
The results here are interesting as I’m always trying to raise the discoverability of NCSU Libraries’ digital collections. At the top of the list is the main web site for NCSU Libraries. The hostnames www.lib.ncsu.edu and lib.ncsu.edu both point to the same resources. Looking closer we find that of the 2427 URLs there, many are for digital collections related pages. 636 are under the Special Collections Research Center, and some of these are pages for some legacy collections. 407 URLs are for pages in our collection guides application, many of them for individual guides or, strangely, the EAD XML for the guides. Some of those collection guides do link to online digital collections.
The institutional repository (DSpace instances) is also well represented at the top of this list. The Technical Reports Repository accounts for 159 of those URLs, and the NCSU Institutional Repository accounts for just 3. The digital collections in the repository, mainly special collections, account for 626 URLs. 719 of the 801 repository URLs are directly to the PDFs. Evidently the PDFs rank higher than the landing pages.
NCSU Libraries has been providing Geospatial Data Services and paying attention to SEO for those pages for a long time, so it isn’t completely surprising that this directory of files has gotten indexed: http://geodata.lib.ncsu.edu. (Note that this server may not be accessible from off-campus.) Many of the URLs under www.lib.ncsu.edu are also GIS pages, so GIS data services and collections pages are even better represented–and human-friendly–than at first appears.
Other digital collections projects like Historical State, Inside Wood, North Carolina Architects & Builders, and NCSU Libraries’ Rare and Unique Materials are represented, but nowhere near exhaustively. Historical State now canonicalizes its URLs for individual resources to point to the Rare and Unique Materials site, but Common Crawl may not be paying attention to that hint. (Hopefully, at some point I’ll be able to do a similar analysis for historicalstate.lib.ncsu.edu as I’ve done in the following.)
For http://d.lib.ncsu.edu these are the URLs listed:
This is the root page of a subdomain that includes a growing number of digital collections sites. This index page was just updated to be more than a single unstyled link.
The home page for NCSU Libraries’ Digital Collections: Rare and Unique Materials which includes over 63,700 resources. It has been a focus of my own work to try to improve the discoverability of this content on the open Web. I implemented embedded semantic markup, Microdata and Schema.org, on this site.
This is an image of Mary Travers singing live on stage. Looking in Google Analytics for this page as a landing page for referrals, Google is the top referrer. Since Google is not likely to have been crawled to discover this URL, it is more likely that the next referrer is responsible for this getting in the index. This post on the Peter, Paul & Mary Love Tumblr was reblogged and liked a number of times. That particular post is the only one from this Tumblr which is in the Common Crawl index.
This is an image of the Webb-Barron-Wells House in Wilson County, North Carolina. It has seen better days, and is not the best representation of many of the beautiful architectural photographs and drawings in the collection. This page is linked to from the Webb Surname DNA Project Album. None of this site’s pages are in the index, so that may not be where the link comes from.
This is a “First floor plumbing plan” from the Louis H. Asbury Papers, 1906-1975 with many fine drawings. I can’t seem to track down a referrer from Google Analytics that might have led to this link finding its way into the Common Crawl.
So it appears that the Common Crawl probably hasn’t (at least in this half of the index!) decided to crawl this site to any extent. Instead it appears it is only deciding to crawl pages that have been linked to. Once the rest of the index comes out, I’ll have to take a look, and consider how to improve that number. The key though is obviously getting more links into the site.
Further down in the list there are a bunch of funny looking URLs. I think these are all proxy URLs for user authentication to restricted resources.
http://gopher.lib.ncsu.edu no longer seems to exist, so I don’t know where they got that page.
Double Checking in Web Data Commons
While the Common Crawl URL index is useful if you need the whole page, in many cases just the extracted embedded semantic markup may be enough. The Web Data Commons is already extracting Microdata and RDFa data, and makes indexes available, though it takes a bit more effort to parse through their indexes. (I’d like a service or script to query for an N-Quad context and get back all the related triples. Anyone know if there is already such a service? Do I have to write one?) They do have a helpful page on how to download the extracted data in whole or in part.
The http://d.lib.ncsu.edu/collections/ site publishes Microdata and Schema.org. Looking in the Web Data Commons Microdata index I found the N-Quad file with triples extracted from ncsu.edu pages. They list only the same URLs as the Common Crawl URL index reports. This leads me to believe that these may be the only URLs in the Common Crawl index right now even though that index is incomplete.
What can libraries and archives do with this?
First, how much of your content is in the Common Crawl corpus? I’d be interested in hearing what your results are like.
We need to figure out how to get more cultural heritage content crawled and indexed by the Common Crawl. Without our stuff in the Common Crawl we are missing many opportunities to broaden the reach of our content. It doesn’t appear that Common Crawl accepts sitemaps. It works off of page rank and the link graph of popular sites. While my sites for rare and unique digital collections get most of their traffic from search engines, mainly Google, an increasing amount of traffic is due to referrals. Referrals, links from other sites, seem like the key for getting our stuff into the corpus. Efforts to add links to library special or digital collections to appropriate Wikipedia articles and the like would seem to be a good starting point.
Social sites are in the corpus and may also be a good way to get inbound links to our collections. There are 134,928+ Pinterest URLs in the Common Crawl index, and folks are actively pinning content from d.lib.ncsu.edu. Will the content pinned and repinned on Pinterest begin showing up in the crawl? Where else are crawlers likely to find links from people who make use of our content?
If more cultural heritage content is a part of the index, then there are all sorts of things we can begin to do. For web archiving projects it would be possible to begin with data in the corpus, potentially saving some crawling expense. New targeted search engines (or aggregations) can be created for different slices of content. Implement Microdata (or RDFa Lite) with Schema.org vocabularies and richer metadata can be extracted from your pages by the Web Data Commons and understood by many. This data can then be used in a variety of interfaces to save the time of the user in finding the content they really want.
What are some other ways that libraries, archives, and museums might be able to use the Common Crawl?
You can see the simple Ruby scripts I used for parsing the Common Crawl URL index output and the Web Data Commons N-Quads in this gist.
We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below, and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.
Common Crawl URL Index
by Scott Robertson
Common Crawl is my go-to data set. It’s a huge collection of pages crawled from the internet and made available completely unfettered. Their choice to largely leave the data alone and make it available “as is” is brilliant.
It’s almost like I did the crawling myself, minus the hassle of creating a crawling infrastructure, renting space in a data center and dealing with spinning platters covered in rust that freeze up on you when you least want them to. I exaggerate. In this day and age I would spend hours, days, maybe weeks agonizing over cloud infrastructure choices and worrying about my credit card bills if I wanted to create something on that scale.
If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster, would agree.
Which is great news! However, if you wanted to extract only a small subset, say every page from Wikipedia, you would still have to pay that few hundred dollars. The individual pages are randomly distributed across more than 200,000 archive files, each of which you must download and unzip to find all the Wikipedia pages. Well, you did, until now.
I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).
Keeping with Common Crawl tradition we’re making the entire index available as a giant download. Fear not, there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree so you can access parts of it randomly from S3 using byte range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.
Information about the format, and samples of accessing it using Python, are available on GitHub. Feel free to post questions in the issue tracker and wikis there.
The index itself is located in the public datasets bucket at s3://commoncrawl/projects/url-index/url-index.1356128792.
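For readers who have not used byte-range requests before, the sketch below shows the general idea using only Python’s standard library. It is an illustration, not the project’s reader: the HTTPS form of the bucket address, the helper names, and the 4096-byte read size are all assumptions, and the block layout of the index itself is documented in the README rather than here.

```python
import urllib.request

# Assumption: the S3 location above, rewritten as a public HTTPS URL.
INDEX_URL = (
    "https://commoncrawl.s3.amazonaws.com/"
    "projects/url-index/url-index.1356128792"
)


def build_range_request(url, start, length):
    """Create a request for `length` bytes beginning at byte `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive at both ends
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={start}-{end}")
    return request


def fetch_range(url, start, length):
    """Send the range request and return only the requested bytes."""
    with urllib.request.urlopen(build_range_request(url, start, length)) as resp:
        return resp.read()


# Build (but do not send) a request for the first 4096 bytes.
request = build_range_request(INDEX_URL, 0, 4096)
print(request.get_header("Range"))
```

Calling `fetch_range` performs the actual download, so a reader walking the tree only ever transfers the handful of blocks it needs rather than the whole file.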
This is the first release of the index. The main goals of the design are to allow querying of the index via byte-range queries and to make it easy to implement in any language. We hope you, dear reader, will be encouraged to jump in and contribute code to access the index under your favorite language.
For now we’ve avoided clever encoding schemes and compression. We’re expecting that to change as the community has a chance to work with the data and contribute their expertise. Join the discussion; we’re happy to have you.
I am very excited to announce that blekko is donating search data to Common Crawl!
blekko was founded in 2007 to pursue innovations that would eliminate spam in search results. blekko has created a new type of search experience that enlists human editors in its efforts to eliminate spam and personalize search. blekko has raised $55 million in VC and currently has 48 employees, including former Google and Yahoo! Search engineers.
For details of their donation and collaboration with Common Crawl see the post from their blog below. Follow blekko on Twitter and subscribe to their blog to keep abreast of their news (lots of cool stuff going on over there!) and be sure to check out their search.
From the blekko blog:
At blekko, we believe the web and search should be open and transparent — it’s number one in the blekko Bill of Rights. To make web data accessible, blekko gives away our search results to innovative applications using our API. Today, we’re happy to announce the ongoing donation of our search engine ranking metadata for 140 million websites and 22 billion webpages to the Common Crawl Foundation.
Common Crawl has built an open crawl of the web that can be accessed and analyzed by everyone. The goal is building a truly open web, with open access to information that enables more innovation in research, business, and education. Common Crawl will use blekko’s metadata to improve its crawl quality, while avoiding webspam, porn, and the influence of excessive SEO (search engine optimization). This will ensure that Common Crawl’s resources and engineering time are spent on webpages that are written by, and are useful to, humans.
We’re putting our full-fledged support behind Common Crawl’s crawl and mission with this donation. We’re not doing this because it makes us feel good (OK, it makes us feel a little good), or because it makes us look good (OK, it makes us look a little good), we’re helping Common Crawl because Common Crawl is taking strides towards our shared vision of an open and transparent Internet.
Just take a look at this excerpt from Common Crawl’s website:
“As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.”
Who could disagree with that?
I am very excited to announce the Norvig Web Data Science Award!
Common Crawl and SARA created the award to encourage research in web data science. The award is named in honor of distinguished computer scientist Peter Norvig. Peter is a highly respected leader in several computer science fields including: internet search, artificial intelligence, natural language processing and machine learning.
The award is open to students and researchers at public universities in the Netherlands. All submissions to the award will utilize Common Crawl’s corpus of web data. Applicants can submit their entries between today (November 15 2012) and January 15 2013. Submissions will be judged by Peter along with four other eminent computer scientists and the winner will be notified by February 15 2013. For full details about the award, please see the award website.
Everyone at Common Crawl and SARA is very excited to see what the Dutch students and researchers create using Common Crawl data! If you are affiliated with a public university in the Netherlands you should definitely apply. Those who are not affiliated with a Dutch university will still benefit from the award because the code for all submissions will be open source licensed. There are sure to be some phenomenal projects that will serve as inspiration for what you can do with Common Crawl data and provide code you can build on. Stay tuned for updates about the submissions and for the announcement of the winner in February 2013.
This is a guest blog post by Matthew Berk, Founder of Lucky Oyster.
Matthew has been on the front lines of search technology for the past decade. Previous to founding Lucky Oyster, he was Executive Vice President of Product Engineering at Marchex, where he oversaw a team of 100+ engineers. Berk was previously a founder and CTO for Open List, the first local search engine, which was sold for $13mm in 2006. Prior to this, Berk was a Research Director for Jupiter Research, where he focused on search technology and Web infrastructure. Mr. Berk holds a master’s degree from The Johns Hopkins University, where he was an Andrew W. Mellon Fellow in Humanistic Studies, and a BA, summa cum laude, from Cornell University. Berk lives in Seattle, WA, where he actively cultivates pearls for both fun and profit.
When I first came across the field of information retrieval in the 80’s and early 90’s (back when TREC began), vectors were all the rage, and the key units were terms, texts, and corpora. Through the 90’s and with the advent of hypertext and later the explosion of the Web, that metaphor shifted to pages, sites, and links, and approaches like HITS and Page Rank leveraged hyperlinking between documents and sites as key proxies for authority and relevance.
Today we’re at a crossroads, as the nature of the content we seek to leverage through search and discovery has shifted once again, with a specific gravity now defined by entities, structured metadata, and (social) connections. In particular, and based on my work with Common Crawl data specifically, content has shifted in three critical ways:
First, publication and authorship have now been completely democratized. And I’m not just talking about individuals writing blogs, but the way in which any of us can (and do), elect to “comment”, “post”, “like”, “pin”, “tag”, “recommend” or fifty other interaction events, thereby contributing to the vast corpus of interconnected data at any time, from any device. All of these tiny acts signify, but their meanings in aggregate are not yet fully understood.
Secondly, whereas the Web throughout its growth and development represented a vast public repository of usable information, we’re now seeing the locus of ownership shift speedily away from publicly accessible repositories, to highly guarded–and valued–walled gardens. The “deep Web” was nothing compared to the “social Graph” that’s now growing rampant. Want to understand why social is such a great priority at the formerly all-seeing eye of Google? Just look at the robots.txt files at facebook.com and graph.facebook.com. The latter is hair-raisingly stark:
Unlike throughout the broader Web, the owners of the great human Graph in which roughly 1/6 of the world population are participating have no need for SEO.
Finally, the content that’s now making its way online is radically different from the Web pages, articles, and local business listings we’re used to seeing. It’s highly structured, thanks to well-promoted models for metadata decoration like the Open Graph and Schema.org, and socially inflected to a degree that’s astonishing. For example: the songs we listen to on Pandora, the games we play online, the books we download to our Kindles, our training runs, hikes with the kids, recipes and their outcomes, and a wide variety of newly forged kinds of socially-vectored entities and activities.
Between Web search–which today by necessity includes reference (Wikipedia) and common entity search on the one hand, and the long-scrolling Wall on the other, there’s an undeveloped axis for a new model of social discovery. It’s very reminiscent of that shift from textual IR to the Web search we saw in the late 90’s.
All of this brings me back around to the Common Crawl mission and data set. Up until very recently, if you wanted to study the Web and its deeper nature and evolution, unless you were among a privileged few, access to sufficient Web crawl content was almost prohibitively expensive. The twin specters of storage and bandwidth alone, not to mention the computational horsepower required to study the data gathered, were more than enough to discourage almost anyone.
But today, thanks to groups like Common Crawl and Amazon Web Services, data and computational muscle are free and/or affordable in ways that make innovation–or even new understanding–possible at Web scale, by almost anyone. In the past few months, I’ve been leveraging these new tools to dig far deeper into the problems I laid out above than I ever imagined (see Study of ~1.3 Billion URLs: ~22% of Web Pages Reference Facebook and Data Mining the Web: $100 Worth of Priceless). And this is just the beginning….
My hope is that access to this data and these tools, including broader exposure to the kinds of work possible (see here), will inspire more and more companies, groups, and even tinkering engineers to push the envelope, and to make that ever greater graph of human knowledge and interaction ever more accessible, discoverable, and ultimately useful.
In that spirit, we welcome any questions about what we’re doing with the data, how we’re doing it, or what we aim to solve at Lucky Oyster; just send a friendly note to audacity at lucky oyster dot com. Or if you’re in Las Vegas for the upcoming re:Invent show, I’ll be presenting more of this material along with Lisa from Common Crawl.
We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries. Several people let us know that they were not able to complete their project in time to submit to the contest. We’re currently working with them to finish the projects outside of the contest and we’ll be showcasing some of those projects in the near future! All entries creatively showcased the immense potential of the Common Crawl data.
Huge thanks to everyone who participated, and to our prize and swag sponsors! A special thank you goes out to our panel of incredible judges, who had a very difficult time choosing the winner from among such great entries. All of the entries were in the Social Impact category, so the third grand prize (originally intended for the Job Trends category) goes to the runner-up in People’s Choice. And the grand prize winners are…
People’s Choice: Linking Entities to Wikipedia
It’s not surprising that this entry was so popular in the community! It seeks to determine the meaning of a word based on the probability that it can be mapped to each of the possible Wikipedia concepts. It creates very useful building blocks for a large range of uses and it’s also exhilarating to see how it can be tweaked and tuned towards specific questions.
People’s Choice: French Open Data
Another very popular entry, this work maps the ecosphere of French open data in order to identify the players, their importance, and their relationship. The resulting graph grants insight into the world of French open data and the excellent code could easily be adapted to explore terms other than “Open Data” and/or could create subsets based on language.
Social Impact: Online Sentiment Towards Congressional Bills
This entry correlated Common Crawl data and congressional data to look at the online conversation surrounding individual pieces of legislation. Contest judge Pete Warden’s comments about this work do a great job of summing up all the excitement about this project:
“There are millions of people talking online about laws that may change their lives, but their voices are usually lost in the noise of the crowd. This work can highlight how ordinary web users think about bills in congress, and give their views influence on decision-making. By moving beyond polls into detailed analysis of people’s opinions on new laws, it shows how open data can ‘democratize’ democracy itself!”
The four following projects – listed in alphabetical order – all came very close to winning.
We encourage you to check out the code created in the contest and see how you can use it to extract insight from the Common Crawl data! To learn how to get started, check out our wiki, which also includes some inspiration and ideas to get your creative juices flowing.
Thank you to all our wonderful sponsors!