Winners of the Code Contest!

We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries. Several people let us know that they were not able to complete their project in time to submit to the contest. We’re currently working with them to finish the projects outside of the contest and we’ll be showcasing some of those projects in the near future! All entries creatively showcased the immense potential of the Common Crawl data.

Huge thanks to everyone who participated, and to our prize and swag sponsors! A special thank you goes out to our panel of incredible judges, who had a very difficult time choosing the winners from among such great entries. All of the entries were in the Social Impact category, so the third grand prize (originally intended for the Job Trends category) goes to the runner-up in People’s Choice. And the grand prize winners are…


People’s Choice: Linking Entities to Wikipedia
It’s not surprising that this entry was so popular in the community!  It seeks to determine the meaning of a word based on the probability that it can be mapped to each of the possible Wikipedia concepts. It creates very useful building blocks for a large range of uses and it’s also exhilarating to see how it can be tweaked and tuned towards specific questions.

Project description
Code on GitHub
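
To make the idea behind this entry concrete, here is a minimal sketch of estimating how likely a phrase is to refer to each candidate Wikipedia concept. It is not the entrant's code: the function names, the toy observations, and the simple count-based estimate are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def build_link_probabilities(anchor_pairs):
    """Estimate P(concept | phrase) from (phrase, concept) observations.

    anchor_pairs is a hypothetical input: each pair says that some page
    used `phrase` as link text pointing at the Wikipedia page `concept`.
    """
    counts = defaultdict(Counter)
    for phrase, concept in anchor_pairs:
        counts[phrase.lower()][concept] += 1

    probabilities = {}
    for phrase, concept_counts in counts.items():
        total = sum(concept_counts.values())
        probabilities[phrase] = {
            concept: n / total for concept, n in concept_counts.items()
        }
    return probabilities


def most_likely_concept(probabilities, phrase):
    """Return the highest-probability concept for a phrase, or None."""
    candidates = probabilities.get(phrase.lower())
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Toy observations, invented for this example.
observed = [
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar"),
    ("python", "Python_(programming_language)"),
]
link_probabilities = build_link_probabilities(observed)
```

With these toy counts, "jaguar" maps to Jaguar_Cars with probability 2/3 and to Jaguar with probability 1/3. A real system would gather the counts from the crawl and would likely add smoothing and context.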

People’s Choice: French Open Data

Another very popular entry, this work maps the ecosphere of French open data in order to identify the players, their importance, and their relationship. The resulting graph grants insight into the world of French open data and the excellent code could easily be adapted to explore terms other than “Open Data” and/or could create subsets based on language.

Project description
Code on GitHub


Social Impact: Online Sentiment Towards Congressional Bills
This entry correlated Common Crawl data and congressional data to look at the online conversation surrounding individual pieces of legislation. Contest judge Pete Warden’s comments about this work do a great job of summing up all the excitement about this project:

“There are millions of people talking online about laws that may change their lives, but their voices are usually lost in the noise of the crowd. This work can highlight how ordinary web users think about bills in congress, and give their views influence on decision-making. By moving beyond polls into detailed analysis of people’s opinions on new laws, it shows how open data can ‘democratize’ democracy itself!”

Code on GitHub


Honorable Mentions
The four following projects – listed in alphabetical order – all came very close to winning.

Facebook Infection
Project description

Is Money the Root of All Evil?
Project description
Code on GitHub

Reverse Link!
Web app
Code on GitHub

Web Data Commons
Project description
Code on Assembla

We encourage you to check out the code created in the contest and see how you can use it to extract insight from the Common Crawl data! To learn how to get started, check out our wiki, which also includes some inspiration and ideas to get your creative juices flowing.

Thank you to all our wonderful sponsors!

Prize Sponsors:


Swag Sponsors:


Common Crawl Code Contest Extended Through the Holiday Weekend

Do you have a project that you are working on for the Common Crawl Code Contest that is not quite ready? If so, you are not the only one. A few people have emailed us to let us know their code is almost ready but they are worried about the deadline, so we have decided to extend the deadline through the holiday weekend.

Four extra days to enter the code contest! With the long weekend, you could get started today and have plenty of time to build something cool. Take a look around our wiki for information about the data, the Quick Start Amazon AMI, the Quick Start build from GitHub, and a page of Inspiration and Ideas. Playing with Common Crawl data would be a super fun weekend project and the prize package is great.

Prize Package:

  • $1000 cash
  • $500 AWS credit
  • O’Reilly Data Science Kit
  • Nexus 7 tablet
  • GitHub pro account
  • Box full of awesome swag from: GitHub, Kaggle, EFF, Creative Commons, Hortonworks, and more
  • A 1/3 chance to win an all access pass to Strata + Hadoop World

Plus, every entrant gets $50 in AWS credit just for entering. What better way is there to spend a long weekend than exploring an awesome dataset, writing code and throwing your hat in the ring for a great prize package? Have fun!







TalentBin Adds Prizes To The Code Contest

The prize package for the Common Crawl Code Contest now includes three Nexus 7 tablets thanks to TalentBin!

The prize packages for the contest are now:

  • $1000 in cash
  • $500 in AWS credit
  • O’Reilly Data Science Starter Kit
  • Nexus 7 tablet
  • Bag of awesome swag
  • A 1 in 3 chance of winning an all access pass to Strata + Hadoop World

We are excited to add the Nexus 7 tablets to the prize packages and very excited to be working with TalentBin. TalentBin makes an open web people search engine by scooping up all the interesting professional activities that folks engage in all across the web, interpreting that activity, and then mashing it up into composite professional profiles. And yup, you’re right, that’s a lot of unstructured data to make sense of.

But if you think about what you look like as a distilled mashup of all your professionally relevant activity on Twitter, Facebook, Meetup, GitHub, Stack Overflow, Quora, US patent database, etc. etc., you quickly realize that it’s a much richer and more accurate picture than the few lines of stuff on your LinkedIn profile.
Currently it is really popular with recruiting organizations trying to find the kind of folks who have really sparse LinkedIn profiles, or aren’t even on LinkedIn. If you are looking to hire (and who isn’t?) you should definitely check out TalentBin and follow them on Twitter.
If you haven’t participated in the Common Crawl Code Contest yet, now you have all the more reason to do so! The Nexus 7 tablet is fantastic. I have one and I love it. Play around with the Common Crawl data on AWS, get coding, and build something cool to win one of these great prize packages!

Amazon Web Services sponsoring $50 in credit to all contest entrants!

Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits? If you’re a developer interested in big datasets and learning new platforms like Hadoop, you truly have no reason not to try your hand at creating an entry to the code contest! Plus, three grand prize winners will get $500 in AWS credits, so you can continue to play around with the dataset and hone your skills even more.

Amazon Web Services has published a dedicated landing page for the contest, which takes you straight to the data. Whether or not you decide to enter the code contest, if you’re looking to play around with and get used to the tools available, an excellent way to do so is with the Amazon Machine Image.

AWS has been a great supporter of the code contest as well as of Common Crawl in general. We are deeply appreciative of all they’re doing to help spread the word about Common Crawl and make our dataset easily accessible!




Still time to participate in the Common Crawl code contest

There is still plenty of time left to participate in the Common Crawl code contest! The contest is accepting entries until August 30th, so why not spend some time this week playing around with the Common Crawl corpus and then submit your work to the contest?

Three prizes will be awarded, each with:

  • $1000 cash
  • $500 in AWS credit
  • O’Reilly Data Science Starter Kit
  • TCHO Chocolates
  • A box full of awesome swag including: a Kaggle hoodie, a GitHub coffee mug and stickers, a Hortonworks elephant, and several great t-shirts

One lucky winner will receive a full access pass to Strata + Hadoop World! Plus, every entrant will receive $50 in AWS credit just for entering!

If you are looking for inspiration, you can check out our video or the Inspiration and Ideas page of our wiki. There is lots of helpful information on our wiki to help you get started, including an Amazon Machine Image and a quick start guide. If you are looking for help with your work or a collaborator, you can post on the Discussion Group.

We are looking forward to seeing what you come up with!




Strata Conference + Hadoop World

This year’s Strata Conference teams up with Hadoop World for what promises to be a powerhouse convening in NYC from October 23-25. Check out their full announcement below and secure your spot today.

Strata + Hadoop World

Now in its second year in New York, the O’Reilly Strata Conference explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World to create the largest gathering of the Apache Hadoop community in the world.

Strata brings together decision makers using the raw power of big data to drive business strategy, and practitioners who collect, analyze, and manipulate that data, particularly in the worlds of finance, media, and government.

The keynotes they have lined up this year are fantastic! Doug Cutting on Beyond Batch, Rich Hickey on The Composite Database on 10/24, plus Mike Olson, Sharmila Shahani-Mulligan, Cathy O’Neil, and other great speakers. The sessions are also full of great topics. You can see the full schedule here. This year’s conference will include the launch of the Strata Data Innovation Awards. There is so much important work being done in the world of data that it is going to be a very difficult decision for the award committee, and I can’t wait to see who the award winners are. The entire three days of Strata + Hadoop World are going to be exciting and thought-provoking. You can’t afford to miss it.

P.S. We’re thrilled to have Strata as a prize sponsor of Common Crawl’s First Ever Code Contest. If you’ve been thinking about submitting an entry, you couldn’t ask for a better reason to do so: you’ll have the chance to win an all-access pass to Strata Conference + Hadoop World 2012!




Mat Kelcey Joins The Common Crawl Advisory Board

We are excited to announce that Mat Kelcey has joined the Common Crawl Board of Advisors! Mat has been extremely helpful to Common Crawl over the last several months and we are very happy to have him as an official Advisor to the organization.

Mat is a brilliant engineer with a knack for machine learning, information retrieval, natural language processing, and artificial intelligence. He is currently working on machine learning and natural language processing systems at Wavii. You can also learn more about him by taking a look at some of his code on GitHub. You can keep up with what is on Mat’s mind on Twitter or on his blog. If you frequent the Common Crawl Discussion Group you will see lots of helpful comments and advice from Mat.

Please join me in welcoming Mat and celebrating Common Crawl’s good fortune to have him as part of our team by posting a comment here, on the discussion group, or on Twitter.






Common Crawl’s Brand Spanking New Video and First Ever Code Contest!

At Common Crawl we’ve been busy recently! After announcing the release of 2012 data and other enhancements, we are now excited to share with you this short video that explains why we here at Common Crawl are working hard to bring web crawl data to anyone who wants to use it. We hope it gets you excited about our work too. Please help us share this by posting, forwarding, and tweeting widely! We want our message to be broadcast loud and clear: openly accessible web crawl data is a powerful resource for education, research, and innovation of every kind.

We also hope that by the end of the video, you’ll be so inspired that you’ll be left itching to get your hands on our terabytes of data. Which is exactly why we’re launching our FIRST EVER CODE CONTEST. We’re calling all open data and open web enthusiasts to help us demonstrate the power of web crawl data to inform Job Trends and offer Social Impact Analysis, two examples given in the video. If you’re up for the challenge, head over to our contest page to learn all the details of how to submit and get more ideas for ways to seek information from the corpus of data in these two very important fields of interest. The contest will be open for submission for just six weeks, until August 29th, and we’ve got some seriously awesome prizes and stellar judges lined up. So get coding!

2012 Crawl Data Now Available

I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages.

New Crawl Data
The 2012 Common Crawl corpus has been released in ARC file format.

JSON Crawl Metadata
In addition to the raw crawl content, the latest release publishes an extensive set of crawl metadata for each document in the corpus.  This metadata includes crawl statistics, charset information, HTTP headers, HTML META tags, anchor tags, and more.

Our hope is that researchers will be able to take advantage of this small-but-powerful data set both to answer high-level questions and to drill into a specific subset of data that they are interested in.

The crawl metadata is stored as JSON in Hadoop SequenceFiles on S3, colocated with ARC content files.  More information about Crawl Metadata can be found here, including a listing of all data points provided.
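
As a rough illustration of working with one such JSON record once it has been read out of a SequenceFile, the sketch below counts the hosts that a page links to. The field names ("charset", "http_headers", "links", "href") and the sample record are hypothetical, invented for this example; the real schema is the one listed on the Crawl Metadata page.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# A made-up metadata record. Every key here is an assumption used only
# for illustration, not the actual Common Crawl metadata schema.
sample_record = json.dumps({
    "charset": "ISO-8859-1",
    "http_headers": {"content-type": "text/html"},
    "links": [
        {"href": "http://example.com/a", "text": "A"},
        {"href": "http://example.org/b", "text": "B"},
        {"href": "http://example.com/c", "text": "C"},
    ],
})


def count_link_hosts(metadata_json):
    """Count outgoing-link hosts in one JSON metadata record."""
    record = json.loads(metadata_json)
    hosts = Counter()
    for link in record.get("links", []):
        host = urlparse(link.get("href", "")).netloc
        if host:  # skip relative or malformed links
            hosts[host] += 1
    return hosts


host_counts = count_link_hosts(sample_record)
```

Because the metadata is far smaller than the raw content, a pass like this can answer link-level questions without parsing any HTML.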

Text-Only Content
This release also features a text-only version of the corpus.  This version contains the page title, meta description, and all visible text content without HTML markup.  We’ve seen dramatic reductions in CPU consumption for applications that use the text-only files instead of extracting text from HTML.

In addition, the text content has been re-encoded from the document’s original character set into UTF-8.  This saves users from having to handle multiple character sets in their application.
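
For a sense of what this re-encoding saves you from doing yourself, here is a minimal sketch of converting bytes from a known source character set into UTF-8. The helper name and the sample string are assumptions for illustration, not part of the Common Crawl libraries.

```python
def to_utf8(raw_bytes, source_charset):
    """Decode bytes from source_charset and re-encode them as UTF-8.

    Undecodable bytes are replaced rather than raising, which is one
    possible policy; a stricter pipeline might choose to fail instead.
    """
    text = raw_bytes.decode(source_charset, errors="replace")
    return text.encode("utf-8")


# "café" as it would be stored by an ISO-8859-1 page.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
```

With the text-only files this step is already done, so an application can treat every document as UTF-8.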

More information about our Text-Only content can be found here.

Amazon AMI
Along with this release, we’ve published an Amazon Machine Image (AMI) to help both new and experienced users get up and running quickly.  The AMI includes a copy of our Common Crawl User Library, our Common Crawl Example Library, and launch scripts to show users how to analyze the Common Crawl corpus using either a local Hadoop cluster or Amazon Elastic MapReduce.

More information about our Amazon Machine Image can be found here.

We hope that everyone out there has an opportunity to try out the latest release. If you have questions that aren’t answered on the Get Started page or in the FAQ, head over to our discussion group and share your question with the community.

The Open Cloud Consortium’s Open Science Data Cloud

Common Crawl has started talking with the Open Cloud Consortium (OCC) about working together. If you haven’t already heard of the OCC, it is an awesome nonprofit organization managing and operating cloud computing infrastructure that supports scientific, environmental, medical and health care research. We’re very interested in facilitating the use of Common Crawl data by researchers and academics, so we are excited about the idea of working with the OCC.

The Open Cloud Consortium has four working groups, one of which is the Open Science Data Cloud (OSDC). The infrastructure of the OSDC has been designed to address the challenges inherent in transporting large datasets, to balance the needs of data management and data analysis, and to archive data. The OSDC is based on a shared community infrastructure where hardware and software are shared among researchers and projects at the scale where it is most efficient to centrally locate and process data.

The OSDC has carved out a space between small public infrastructures like AWS, and the very large, dedicated infrastructures needed for projects like the Large Hadron Collider. The OCC’s diagram describes the distinction it makes between small, medium, and very large infrastructures:


More details about the OCC and its working groups can be found in a highly informative paper [PDF] that was presented by several members of the OCC team at the 2010 ACM International Symposium on High Performance Distributed Computing. The paper gives a technical overview and describes some of the challenges faced by the Open Science Data Cloud. You can also find more information on the Open Cloud Consortium website and on the Open Science Data Cloud website.

We are excited about the important work being done by the Open Cloud Consortium and by the possibility of working closely with its Open Science Data Cloud working group. Stay tuned for more news as our partnership with the organization develops.