Common Crawl URL Index

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to create this valuable tool. You can read his guest blog post below, and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

Common Crawl URL Index
by Scott Robertson

Common Crawl is my go-to data set. It’s a huge collection of pages crawled from the internet and made available completely unfettered. Their choice to largely leave the data alone and make it available “as is” is brilliant.

It’s almost like I did the crawling myself, minus the hassle of creating a crawling infrastructure, renting space in a data center, and dealing with spinning platters covered in rust that freeze up on you when you least want them to. I exaggerate. In this day and age I would spend hours, days, maybe weeks agonizing over cloud infrastructure choices and worrying about my credit card bills if I wanted to create something on that scale.

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, it all starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster, would agree.

Which is great news! However, if you want to extract only a small subset, say every page from Wikipedia, you still have to pay that few hundred dollars. The individual pages are randomly distributed across more than 200,000 archive files, each of which you must download and unzip to find the Wikipedia pages. Well, you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

Keeping with Common Crawl tradition we’re making the entire index available as a giant download. Fear not, there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree so you can access parts of it randomly from S3 using byte range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.

Information about the format, along with samples of accessing it using Python, is available on GitHub. Feel free to post questions in the issue tracker and wikis there.

The index itself is located in the public datasets bucket at s3://commoncrawl/projects/url-index/url-index.1356128792.
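If you’d like a feel for what byte-range access looks like in practice, here is a minimal Python sketch (ours, not the official sample code on GitHub) that pulls a slice of the index file over plain HTTP instead of downloading the whole thing. The HTTPS form of the bucket path and the 64 KB read size are assumptions; the real header layout and block format are documented in the GitHub repo.

# Minimal sketch: random access to the URL index on S3 via HTTP byte-range
# requests. The HTTPS mapping of the bucket path and the 64 KB read size
# are assumptions; see the GitHub repo for the actual on-disk format.
import urllib.request

INDEX_URL = ("https://commoncrawl.s3.amazonaws.com"
             "/projects/url-index/url-index.1356128792")

def fetch_range(url, start, length):
    """Fetch `length` bytes starting at byte `start`, nothing more."""
    headers = {"Range": "bytes=%d-%d" % (start, start + length - 1)}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Grab the first block, which holds the header describing the b-tree
# layout; subsequent requests would walk the tree block by block.
header = fetch_range(INDEX_URL, 0, 64 * 1024)
print("fetched", len(header), "bytes")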

This is the first release of the index. The main goals of the design are to allow querying of the index via byte-range requests and to make it easy to implement in any language. We hope you, dear reader, will be encouraged to jump in and contribute code to access the index in your favorite language.

For now we’ve avoided clever encoding schemes and compression. We expect that to change as the community has a chance to work with the data and contribute their expertise. Join the discussion; we’re happy to have you.

blekko donates search data to Common Crawl

I am very excited to announce that blekko is donating search data to Common Crawl! 

blekko was founded in 2007 to pursue innovations that would eliminate spam in search results. blekko has created a new type of search experience that enlists human editors in its efforts to eliminate spam and personalize search. blekko has raised $55 million in VC and currently has 48 employees, including former Google and Yahoo! Search engineers. 

For details of their donation and collaboration with Common Crawl, see the post from their blog below. Follow blekko on Twitter and subscribe to their blog to keep abreast of their news (lots of cool stuff going on over there!) and be sure to check out their search.

From the blekko blog:

At blekko, we believe the web and search should be open and transparent — it’s number one in the blekko Bill of Rights. To make web data accessible, blekko gives away our search results to innovative applications using our API. Today, we’re happy to announce the ongoing donation of our search engine ranking metadata for 140 million websites and 22 billion webpages to the Common Crawl Foundation.

Common Crawl has built an open crawl of the web that can be accessed and analyzed by everyone. The goal is building a truly open web, with open access to information that enables more innovation in research, business, and education. Common Crawl will use blekko’s metadata to improve its crawl quality, while avoiding webspam, porn, and the influence of excessive SEO (search engine optimization). This will ensure that Common Crawl’s resources and engineering time are spent on webpages that are written by, and are useful to, humans.

We’re putting our full-fledged support behind Common Crawl’s crawl and mission with this donation. We’re not doing this because it makes us feel good (OK, it makes us feel a little good), or because it makes us look good (OK, it makes us look a little good); we’re helping Common Crawl because Common Crawl is taking strides towards our shared vision of an open and transparent Internet.

Just take a look at this excerpt from Common Crawl’s website:

“As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.”

Who could disagree with that?


The Norvig Web Data Science Award

I am very excited to announce the Norvig Web Data Science Award!

Common Crawl and SARA created the award to encourage research in web data science. The award is named in honor of distinguished computer scientist Peter Norvig. Peter is a highly respected leader in several computer science fields, including internet search, artificial intelligence, natural language processing, and machine learning.

The award is open to students and researchers at public universities in the Netherlands. All submissions to the award will utilize Common Crawl’s corpus of web data. Applicants can submit their entries between today (November 15, 2012) and January 15, 2013. Submissions will be judged by Peter along with four other eminent computer scientists, and the winner will be notified by February 15, 2013. For full details about the award, please see the award website.

The Jury:
          • Peter Norvig
          • Ricardo Baeza-Yates
          • Hilary Mason
          • Jimmy Lin
          • Evert Lammerts 

Everyone at Common Crawl and SARA is very excited to see what the Dutch students and researchers create using Common Crawl data! If you are affiliated with a public university in the Netherlands you should definitely apply. Those who are not affiliated with a Dutch university will still benefit from the award, because the code for all submissions will be open source licensed. There are sure to be some phenomenal projects that will serve as inspiration for what you can do with Common Crawl data and provide code you can build on. Stay tuned for updates about the submissions and for the announcement of the winner in February 2013.


Towards Social Discovery – New Content Models; New Data; New Toolsets

This is a guest blog post by Matthew Berk, Founder of Lucky Oyster.

Matthew has been on the front lines of search technology for the past decade. Previous to founding Lucky Oyster, he was Executive Vice President of Product Engineering at Marchex, where he oversaw a team of 100+ engineers. Berk was previously a founder and CTO for Open List, the first local search engine, which was sold for $13mm in 2006. Prior to this, Berk was a Research Director for Jupiter Research, where he focused on search technology and Web infrastructure. Mr. Berk holds a master’s degree from The Johns Hopkins University, where he was an Andrew W. Mellon Fellow in Humanistic Studies, and a BA, summa cum laude, from Cornell University. Berk lives in Seattle, WA, where he actively cultivates pearls for both fun and profit.


When I first came across the field of information retrieval in the 80’s and early 90’s (back when TREC began), vectors were all the rage, and the key units were terms, texts, and corpora. Through the 90’s, with the advent of hypertext and later the explosion of the Web, that metaphor shifted to pages, sites, and links, and approaches like HITS and PageRank leveraged hyperlinking between documents and sites as key proxies for authority and relevance.

Today we’re at a crossroads, as the nature of the content we seek to leverage through search and discovery has shifted once again, with a specific gravity now defined by entities, structured metadata, and (social) connections. In particular, and based on my work with Common Crawl data specifically, content has shifted in three critical ways:

First, publication and authorship have now been completely democratized. And I’m not just talking about individuals writing blogs, but the way in which any of us can (and do) elect to “comment”, “post”, “like”, “pin”, “tag”, “recommend” or perform fifty other interaction events, thereby contributing to the vast corpus of interconnected data at any time, from any device. All of these tiny acts signify, but their meanings in aggregate are not yet fully understood.

Secondly, whereas the Web throughout its growth and development represented a vast public repository of usable information, we’re now seeing the locus of ownership shift speedily away from publicly accessible repositories, to highly guarded–and valued–walled gardens. The “deep Web” was nothing compared to the “social Graph” that’s now growing rampant. Want to understand why social is such a great priority at the formerly all-seeing eye of Google? Just look at the robots.txt files at facebook.com and graph.facebook.com. The latter is hair-raisingly stark:

User-agent: *
Disallow: /
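As a quick illustration (not part of the original argument), you can confirm that policy programmatically with Python’s standard robots.txt parser, which is handy when deciding what a well-behaved crawler may fetch:

from urllib import robotparser

# Check whether a generic crawler may fetch anything under graph.facebook.com.
rp = robotparser.RobotFileParser()
rp.set_url("https://graph.facebook.com/robots.txt")
rp.read()

# With "User-agent: *" and "Disallow: /", every path is off-limits.
print(rp.can_fetch("*", "https://graph.facebook.com/anything"))  # False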

Unlike the broader Web, the owners of the great human Graph in which roughly 1/6 of the world population participates have no need for SEO.

Finally, the content that’s now making its way online is radically different from the Web pages, articles, and local business listings we’re used to seeing. It’s highly structured, thanks to well-promoted models for metadata decoration like the Open Graph and Schema.org, and socially inflected to a degree that’s astonishing. For example: the songs we listen to on Pandora, the games we play online, the books we download to our Kindles, our training runs, hikes with the kids, recipes and their outcomes, and a wide variety of newly forged kinds of socially-vectored entities and activities.
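To make that concrete, here is a small sketch, not taken from the post, of pulling Open Graph properties out of a page’s markup with nothing but Python’s standard library; Schema.org microdata would need a fuller parser, but the og: tags alone show how much structure now rides along with ordinary pages.

from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict."""
    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            if prop.startswith("og:"):
                self.properties[prop] = attrs.get("content", "")

sample = '<meta property="og:title" content="Example"><meta property="og:type" content="article">'
parser = OpenGraphParser()
parser.feed(sample)
print(parser.properties)  # {'og:title': 'Example', 'og:type': 'article'}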

Between Web search–which today by necessity includes reference (Wikipedia) and common entity search–on the one hand, and the long-scrolling Wall on the other, there’s an undeveloped axis for a new model of social discovery. It’s very reminiscent of the shift from textual IR to Web search that we saw in the late 90’s.

All of this brings me back around to the Common Crawl mission and data set. Up until very recently, if you wanted to study the Web and its deeper nature and evolution, unless you were among a privileged few, access to sufficient Web crawl content was almost prohibitively expensive. The twin specters of storage and bandwidth alone, not to mention the computational horsepower required to study the data gathered, were more than enough to discourage almost anyone.

But today, thanks to groups like Common Crawl and Amazon Web Services, data and computational muscle are free and/or affordable in ways that make innovation–or even new understanding–possible at Web scale, by almost anyone. In the past few months, I’ve been leveraging these new tools to dig far deeper into the problems I laid out above than I ever imagined (see Study of ~1.3 Billion URLs: ~22% of Web Pages Reference Facebook and Data Mining the Web: $100 Worth of Priceless). And this is just the beginning….

My hope is that access to this data and these tools, including broader exposure to the kinds of work possible (see here), will inspire more and more companies, groups, and even tinkering engineers to push the envelope, and to make that ever greater graph of human knowledge and interaction ever more accessible, discoverable, and ultimately useful.

In that spirit, we welcome any questions about what we’re doing with the data, how we’re doing it, or what we aim to solve at Lucky Oyster; just send a friendly note to audacity at lucky oyster dot com. Or if you’re in Las Vegas for the upcoming re:Invent show, I’ll be presenting more of this material along with Lisa from Common Crawl.

Winners of the Code Contest!

We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries. Several people let us know that they were not able to complete their project in time to submit to the contest. We’re currently working with them to finish the projects outside of the contest and we’ll be showcasing some of those projects in the near future! All entries creatively showcased the immense potential of the Common Crawl data.

Huge thanks to everyone who participated, and to our prize and swag sponsors! A special thank you goes out to our panel of incredible judges, who had a very difficult time choosing the winners from among such great entries. All of the entries were in the Social Impact category, so the third grand prize (originally intended for the Job Trends category) goes to the runner-up in People’s Choice. And the grand prize winners are…

People’s Choice: Linking Entities to Wikipedia
It’s not surprising that this entry was so popular in the community! It seeks to determine the meaning of a word based on the probability that it can be mapped to each of the possible Wikipedia concepts. It creates very useful building blocks for a large range of uses, and it’s also exhilarating to see how it can be tweaked and tuned towards specific questions. A toy sketch of the mapping idea appears after the links below.

Project description
Code on GitHub
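As a rough illustration of the underlying idea (our toy sketch, not the contest entry’s code), the probability that a phrase refers to a given Wikipedia concept can be estimated from how often that phrase’s anchor text links to the corresponding article:

from collections import Counter, defaultdict

# Hypothetical (phrase, linked_article) pairs mined from anchor text.
anchor_links = [
    ("jaguar", "Jaguar_(animal)"),
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar_Cars"),
    ("python", "Python_(programming_language)"),
]

counts = defaultdict(Counter)
for phrase, article in anchor_links:
    counts[phrase][article] += 1

def link_probability(phrase, article):
    """P(article | phrase), estimated from link counts."""
    total = sum(counts[phrase].values())
    return counts[phrase][article] / total if total else 0.0

print(link_probability("jaguar", "Jaguar_Cars"))  # 0.666...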


People’s Choice: French Open Data

Another very popular entry, this work maps the ecosphere of French open data in order to identify the players, their importance, and their relationships. The resulting graph grants insight into the world of French open data, and the excellent code could easily be adapted to explore terms other than “Open Data” and/or to create subsets based on language.

Project description
Code on GitHub

Social Impact: Online Sentiment Towards Congressional Bills
This entry correlated Common Crawl data and congressional data to look at the online conversation surrounding individual pieces of legislation. Contest judge Pete Warden’s comments about this work do a great job of summing up all the excitement about this project:

“There are millions of people talking online about laws that may change their lives, but their voices are usually lost in the noise of the crowd. This work can highlight how ordinary web users think about bills in congress, and give their views influence on decision-making. By moving beyond polls into detailed analysis of people’s opinions on new laws, it shows how open data can ‘democratize’ democracy itself!”

Code on GitHub

Honorable Mentions
The following four projects – listed in alphabetical order – all came very close to winning.

Facebook Infection
Project description

Is Money the Root of All Evil?
Project description
Code on GitHub

Reverse Link!
Web app
Code on GitHub

Web Data Commons
Project description
Code on Assembla

We encourage you to check out the code created in the contest and see how you can use it to extract insight from the Common Crawl data! To learn how to get started, check out our wiki, which also includes some inspiration and ideas to get your creative juices flowing.

Thank you to all our wonderful sponsors!

Prize Sponsors:

Swag Sponsors:

Common Crawl Code Contest Extended Through the Holiday Weekend

Do you have a project that you are working on for the Common Crawl Code Contest that is not quite ready? If so, you are not the only one. A few people have emailed us to let us know their code is almost ready but they are worried about the deadline, so we have decided to extend the deadline through the holiday weekend.

Four extra days to enter the code contest! With the long weekend, you could get started today and have plenty of time to build something cool. Take a look around our wiki for information about the data, the Quick Start Amazon AMI, the Quick Start build from GitHub, and a page of Inspiration and Ideas. Playing with Common Crawl data would be a super fun weekend project, and the prize package is great.

Prize Package:

  • $1000 cash
  • $500 AWS credit
  • O’Reilly Data Science Kit
  • Nexus 7 tablet
  • GitHub pro account
  • Box full of awesome swag from: GitHub, Kaggle, EFF, Creative Commons, Hortonworks, and more
  • A 1/3 chance to win an all access pass to Strata + Hadoop World

Plus, every entrant gets $50 in AWS credit just for entering. What better way is there to spend a long weekend than exploring an awesome dataset, writing code, and throwing your hat in the ring for a great prize package? Have fun!

TalentBin Adds Prizes To The Code Contest

The prize package for the Common Crawl Code Contest now includes three Nexus 7 tablets thanks to TalentBin!

The prize packages for the contest are now:

  • $1000 in cash
  • $500 in AWS credit
  • O’Reilly Data Science Starter Kit
  • Nexus 7 tablet
  • Bag of awesome swag
  • A 1 in 3 chance of winning an all access pass to Strata + Hadoop World

We are excited to add the Nexus 7 tablets to the prize packages and very excited to be working with TalentBin. TalentBin makes an open web people search engine by scooping up all the interesting professional activities that folks engage in all across the web, interpreting that activity, and then mashing it up into composite professional profiles. And yup, you’re right, that’s a lot of unstructured data to make sense of.

But if you think about what you look like as a distilled mashup of all your professionally relevant activity on Twitter, Facebook, Meetup, GitHub, Stack Overflow, Quora, the US patent database, etc., you quickly realize that it’s a much richer and more accurate picture than the few lines of stuff on your LinkedIn profile. Kinda like this: http://youtu.be/Jvjpj88f-LU

Currently it is really popular with recruiting organizations trying to find the kind of folks who have really sparse LinkedIn profiles, or aren’t even on LinkedIn. If you are looking to hire (and who isn’t?) you should definitely check out TalentBin and follow them on Twitter.

If you haven’t participated in the Common Crawl Code Contest yet, now you have all the more reason to do so! The Nexus 7 tablet is fantastic. I have one and I love it. Play around with the Common Crawl data on AWS, get coding, and build something cool to win one of these great prize packages!

Amazon Web Services sponsoring $50 in credit to all contest entrants!

Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits? If you’re a developer interested in big datasets and learning new platforms like Hadoop, you truly have no reason not to try your hand at creating an entry to the code contest! Plus, three grand prize winners will get $500 in AWS credits, so you can continue to play around with the dataset and hone your skills even more.

Amazon Web Services has published a dedicated landing page for the contest, which takes you straight to the data. Whether or not you decide to enter the code contest, if you’re looking to play around with and get used to the tools available, an excellent way to do so is with the Amazon Machine Image.

AWS has been a great supporter of the code contest as well as of Common Crawl in general. We are deeply appreciative of all they’re doing to help spread the word about Common Crawl and make our dataset easily accessible!

Still time to participate in the Common Crawl code contest

There is still plenty of time left to participate in the Common Crawl code contest! The contest is accepting entries until August 30th, so why not spend some time this week playing around with the Common Crawl corpus and then submit your work to the contest?

Three prizes will be awarded, each with:

  • $1000 cash
  • $500 in AWS credit
  • O’Reilly Data Science Starter Kit
  • TCHO Chocolates
  • A box full of awesome swag including: a Kaggle hoodie, a GitHub coffee mug and stickers, a Hortonworks elephant, and several great t-shirts

One lucky winner will receive a full access pass to Strata + Hadoop World! Plus, every entrant will receive $50 in AWS credit just for entering!

If you are looking for inspiration, you can check out our video or the Inspiration and Ideas page of our wiki. There is lots of helpful information on our wiki to help you get started, including an Amazon Machine Image and a quick start guide. If you are looking for help with your work or a collaborator, you can post on the Discussion Group.

We are looking forward to seeing what you come up with!

Strata Conference + Hadoop World

This year’s Strata Conference teams up with Hadoop World for what promises to be a powerhouse convening in NYC from October 23-25. Check out their full announcement below and secure your spot today.

Strata + Hadoop World

Now in its second year in New York, the O’Reilly Strata Conference explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World to create the largest gathering of the Apache Hadoop community in the world.

Strata brings together decision makers using the raw power of big data to drive business strategy, and practitioners who collect, analyze, and manipulate that data—particularly in the worlds of finance, media, and government.

The keynotes they have lined up this year are fantastic: Doug Cutting on Beyond Batch, Rich Hickey on The Composite Database on 10/24, plus Mike Olson, Sharmila Shahani-Mulligan, Cathy O’Neil, and other great speakers. The sessions are also full of great topics; you can see the full schedule here. This year’s conference will include the launch of the Strata Data Innovation Awards. There is so much important work being done in the world of data that it is going to be a very difficult decision for the award committee, and I can’t wait to see who the award winners are. The entire three days of Strata + Hadoop World are going to be exciting and thought-provoking – you can’t afford to miss it.

P.S. We’re thrilled to have Strata as a prize sponsor of Common Crawl’s First Ever Code Contest. If you’ve been thinking about submitting an entry, you couldn’t ask for a better reason to do so: you’ll have the chance to win an all-access pass to Strata Conference + Hadoop World 2012!