I am very happy to announce that Common Crawl has released its 2012 crawl data, along with a number of significant enhancements to our example library and help pages.
New Crawl Data
The 2012 Common Crawl corpus has been released in ARC file format.
JSON Crawl Metadata
In addition to the raw crawl content, the latest release publishes an extensive set of crawl metadata for each document in the corpus. This metadata includes crawl statistics, charset information, HTTP headers, HTML META tags, anchor tags, and more.
Our hope is that researchers will be able to use this small-but-powerful data set both to answer high-level questions and to drill into the specific subsets of data that interest them.
The crawl metadata is stored as JSON in Hadoop SequenceFiles on S3, colocated with ARC content files. More information about Crawl Metadata can be found here, including a listing of all data points provided.
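To give a feel for working with per-document JSON metadata, here is a minimal Python sketch. It does not read SequenceFiles from S3; it assumes you have already pulled a (URL, JSON string) pair out of a metadata file. The field names (`charset_detected`, `meta_tags`, `links`) and the `summarize` helper are hypothetical, chosen only for illustration; the real data points are the ones listed in the Crawl Metadata documentation.

```python
import json

# Hypothetical metadata record. The field names below are illustrative
# assumptions, not the actual Common Crawl metadata schema.
sample_record = json.dumps({
    "http_headers": {"content-type": "text/html; charset=ISO-8859-1"},
    "charset_detected": "ISO-8859-1",
    "meta_tags": [{"name": "description", "content": "An example page"}],
    "links": [
        {"href": "http://example.com/a", "text": "A"},
        {"href": "http://example.org/b", "text": "B"},
    ],
})


def summarize(url, raw_json):
    """Reduce one JSON metadata record to a few fields of interest."""
    meta = json.loads(raw_json)
    # Pick the first META tag named "description", if there is one.
    description = next(
        (tag.get("content")
         for tag in meta.get("meta_tags", [])
         if tag.get("name") == "description"),
        None,
    )
    return {
        "url": url,
        "charset": meta.get("charset_detected"),
        "link_count": len(meta.get("links", [])),
        "description": description,
    }


summary = summarize("http://example.com/", sample_record)
print(summary)
```

Because each record is small compared with the page it describes, a filter like this can narrow the corpus to the documents of interest before any raw content is touched.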
This release also features a text-only version of the corpus. This version contains the page title, meta description, and all visible text content without HTML markup. We’ve seen dramatic reductions in CPU consumption for applications that use the text-only files instead of extracting text from HTML.
In addition, the text content has been re-encoded from the document’s original character set into UTF-8. This saves users from having to handle multiple character sets in their application.
More information about our Text-Only content can be found here.
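For readers who still work with the raw content, the following Python sketch shows the kind of character set handling that the text-only files make unnecessary. It is an illustration only, not the code used to build the corpus; the `to_utf8` helper and its fallback behavior are assumptions.

```python
def to_utf8(raw_bytes, declared_charset):
    """Re-encode bytes from a declared character set into UTF-8.

    Falls back to a lossy UTF-8 decode when the declared charset is
    unknown or does not match the bytes (an assumed policy for this sketch).
    """
    try:
        text = raw_bytes.decode(declared_charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: substitute U+FFFD
        # for anything that is not valid UTF-8.
        text = raw_bytes.decode("utf-8", errors="replace")
    return text.encode("utf-8")


# A page served as ISO-8859-1 stores "é" as the single byte 0xE9.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
print(utf8_bytes.decode("utf-8"))
```

With the text-only files, this step has already been done, so an application can treat every document as UTF-8.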
Along with this release, we’ve published an Amazon Machine Image (AMI) to help both new and experienced users get up and running quickly. The AMI includes a copy of our Common Crawl User Library, our Common Crawl Example Library, and launch scripts to show users how to analyze the Common Crawl corpus using either a local Hadoop cluster or Amazon Elastic MapReduce.
More information about our Amazon Machine Image can be found here.
We hope that everyone has an opportunity to try out the latest release. If you have questions that aren’t answered on the Get Started page or in the FAQ, head over to our discussion group and share them with the community.