2012 Crawl Data Now Available

I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages.

New Crawl Data
The 2012 Common Crawl corpus has been released in ARC file format.

JSON Crawl Metadata
In addition to the raw crawl content, the latest release publishes an extensive set of crawl metadata for each document in the corpus.  This metadata includes crawl statistics, charset information, HTTP headers, HTML META tags, anchor tags, and more.

Our hope is that researchers will be able to take advantage of this small-but-powerful data set both to answer high-level questions and to drill into a specific subset of data that they are interested in.

The crawl metadata is stored as JSON in Hadoop SequenceFiles on S3, colocated with ARC content files.  More information about Crawl Metadata can be found here, including a listing of all data points provided.
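As a rough illustration of the kind of high-level question the metadata can answer, here is a minimal Python sketch that tallies charsets across a handful of JSON records. The field names (`url`, `charset`, `http_headers`) and the sample records are hypothetical, invented for this example; the real list of data points is in the Crawl Metadata documentation. Reading the SequenceFiles themselves is not shown.

```python
import json
from collections import Counter

# Hypothetical metadata records: field names and values are invented
# for illustration and are not the actual Common Crawl schema.
sample_records = [
    '{"url": "http://example.com/", "charset": "UTF-8", '
    '"http_headers": {"content-type": "text/html"}}',
    '{"url": "http://example.org/", "charset": "ISO-8859-1", '
    '"http_headers": {"content-type": "text/html"}}',
    '{"url": "http://example.net/", "charset": "UTF-8", '
    '"http_headers": {"content-type": "text/html"}}',
]


def charset_counts(json_lines):
    """Count how many records report each charset."""
    counts = Counter()
    for line in json_lines:
        record = json.loads(line)
        # Records without a charset field are grouped under "unknown".
        counts[record.get("charset", "unknown")] += 1
    return counts


print(charset_counts(sample_records))
```

The same pattern, applied inside a MapReduce job over the real metadata files, would let you answer this kind of question without touching the raw page content.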

Text-Only Content
This release also features a text-only version of the corpus.  This version contains the page title, meta description, and all visible text content without HTML markup.  We’ve seen dramatic reductions in CPU consumption for applications that use the text-only files instead of extracting text from HTML.

In addition, the text content has been re-encoded from the document’s original character set into UTF-8.  This saves users from having to handle multiple character sets in their application.
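For a sense of what this saves you, here is a minimal Python sketch of the per-document re-encoding step an application would otherwise need. It is an illustration only, not the code used to produce the text-only files; the function name `to_utf8` and the fallback behavior are assumptions made for this example.

```python
def to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Decode bytes using the declared charset, then re-encode as UTF-8."""
    try:
        text = raw.decode(declared_charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset label: fall back to a lossy
        # UTF-8 decode that substitutes a replacement character.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


# A page served as ISO-8859-1 is converted to UTF-8 bytes.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
print(utf8_bytes.decode("utf-8"))
```

With the text-only files, this handling is already done, so your application can treat every document as UTF-8.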

More information about our Text-Only content can be found here.

Amazon AMI
Along with this release, we’ve published an Amazon Machine Image (AMI) to help both new and experienced users get up and running quickly.  The AMI includes a copy of our Common Crawl User Library, our Common Crawl Example Library, and launch scripts to show users how to analyze the Common Crawl corpus using either a local Hadoop cluster or Amazon Elastic MapReduce.

More information about our Amazon Machine Image can be found here.

We hope that everyone out there has an opportunity to try out the latest release.  If you have questions that aren’t answered in the Get Started page or FAQ, head over to our discussion group and share your question with the community.

The Open Cloud Consortium’s Open Science Data Cloud

Common Crawl has started talking with the Open Cloud Consortium (OCC) about working together. If you haven’t already heard of the OCC, it is an awesome nonprofit organization managing and operating cloud computing infrastructure that supports scientific, environmental, medical and health care research. We’re very interested in facilitating the use of Common Crawl data by researchers and academics, so we are excited about the idea of working with the OCC.

The Open Cloud Consortium has four working groups, one of which is the Open Science Data Cloud (OSDC). The infrastructure of the OSDC has been designed to address the challenges inherent in transporting large datasets, to balance the needs of data management and data analysis, and to archive data. The OSDC is based on a shared community infrastructure where hardware and software are shared among researchers and projects at the scale where it is most efficient to centrally locate and process data.

The OSDC has carved out a space between small public infrastructures like AWS and the very large, dedicated infrastructures needed for projects like the Large Hadron Collider. The OCC’s diagram describes the distinction it makes between small, medium, and very large infrastructures:

 

More details about the OCC and its working groups can be found in a highly informative paper [PDF] that was presented by several members of the OCC team at the 2010 ACM International Symposium on High Performance Distributed Computing. The paper gives a technical overview and describes some of the challenges faced by the Open Science Data Cloud. You can also find more information on the Open Cloud Consortium website and on the Open Science Data Cloud website.

We are excited about the important work being done by the Open Cloud Consortium and by the possibility of working closely with its Open Science Data Cloud working group. Stay tuned for more news as our partnership with the organization develops.

OSCON 2012

We’re just one month away from one of the biggest and most exciting events of the year, O’Reilly’s Open Source Convention (OSCON). This year’s conference will be held July 16th-20th in Portland, Oregon. The date can’t come soon enough. OSCON is one of the most prominent confluences of “the world’s open source pioneers, builders, and innovators” and promises to stimulate, challenge, and amuse over the course of five action-packed days. It will feature an audience of 3,000 open-source enthusiasts, incredible speakers, more than a dozen tracks, and hundreds of workshops. It’s the place to be! So naturally, Common Crawl will be there to partake in the action.

Gil Elbaz, Common Crawl’s fearless founder and CEO of Factual, Inc., will lead a session called Hiding Data Kills Innovation on Wednesday, July 18th at 2:30pm, where he’ll discuss the relationship between data accessibility and innovation. Other members of the Common Crawl team will be there as well, and we’re looking forward to meeting, connecting, and sharing ideas with you! Keep an eye out for Gil’s session and be sure to come say hi.

If you haven’t registered, it’s not too late to secure a spot today. If you’ve already registered, we hope to see you there! We’re curious: what are some other sessions you’re looking forward to at this year’s OSCON?

Data 2.0 Summit

Next week a few members of the Common Crawl team are going to the Data 2.0 Summit in San Francisco. We are very much looking forward to the summit: it is the largest Cloud Data conference of 2012, and last year’s summit was a great experience. If you haven’t already registered, use the code below for a 20% discount.

The main theme of this year’s Data 2.0 is the question: Why is the next technology revolution a Data Revolution? There will be a great collection of entrepreneurs, investors, and executives (leaders in the areas of Cloud Data, Social Data, Big Data, and the API Economy) to discuss this question in presentations, panels, and casual conversations. Check out the list of speakers to get an idea of who will be present.

One of my favorite parts of the 2011 Data 2.0 Summit was the Startup Pitch Day. This year, the following 10 startups will compete for over $20,000 in prizes in front of a panel of VCs: Precog, FoodGenius, Wishery, Lumenous, MortarData, HG Data, Junar, SizeUp, Ginger.io, and Booshaka.

During the summit and the afterparty, there is sure to be a lot of talk about strategies for startups to monetize data, why investors fund data companies, why corporations are interested in acquiring data-centric tech startups, API infrastructure, accessing the Twitter firehose, mining the social web, big data technologies like Hadoop and MapReduce, NoSQL technologies like Cassandra and MongoDB, and the state of open government and open data initiatives.

Data openness and accessibility will definitely be a big part of the discussions.  Our Founder and Chairman, Gil Elbaz, will be on a panel called “How Open is the Open Web?” along with Bram Cohen of BitTorrent, Sid Stamm of Mozilla, Jatinder Singh of PARC, and Scott Burke of Yahoo.

Some of the highlights I am looking forward to in addition to Gil and Eva’s panels are:

  • “Data Science and Predicting the Future”: Anthony Goldbloom of Kaggle, Joe Lonsdale of Anduin Ventures, and Professor Alexander Gray of Skytree will discuss what makes data science different from big data and when data science best predicts the future.
  • “Social Data: Foundation of the Social Web” will have Daniel Tunkelang, principal data scientist at LinkedIn, along with speakers from Microsoft, Clearspring, and Walmart Labs and moderator Liz Gannes, discussing how social data, including social sharing, social news, and social connections, is changing how we search, advertise, and work.
  • “Big Data, Big Challenges: Where should big data innovate in 2012”: Max Schireson of 10gen, Walter Maguire of HP Vertica, and other panelists will talk about the challenges facing data scientists in 2012 and which searching, indexing, computing, and storing tools should be improved to overcome them.

If you can be in San Francisco on April 3rd you should definitely attend Data 2.0! If you are going to be there and want to talk about Common Crawl, drop us an email or send us a message on Twitter so we can make plans to meet up.

 

Get 20% off your Data 2.0 Summit Pass using the discount code “data2get2012” through March 30th, 2012 here: http://data2summit.com/register