We’re just one month away from one of the biggest and most exciting events of the year, O’Reilly’s Open Source Convention (OSCON). This year’s conference will be held July 16th-20th in Portland, Oregon. The date can’t come soon enough. OSCON is one of the most prominent confluences of “the world’s open source pioneers, builders, and innovators” and promises to stimulate, challenge, and amuse over the course of five action-packed days. It will feature an audience of 3,000 open-source enthusiasts, incredible speakers, more than a dozen tracks, and hundreds of workshops. It’s the place to be! So naturally, Common Crawl will be there to partake in the action.
Gil Elbaz, Common Crawl’s fearless founder and CEO of Factual, Inc., will lead a session called Hiding Data Kills Innovation on Wednesday, July 18th at 2:30pm, where he’ll discuss the relationship between data accessibility and innovation. Other members of the Common Crawl team will be there as well, and we’re looking forward to meeting, connecting, and sharing ideas with you! Keep an eye out for Gil’s session and be sure to come say hi.
If you haven’t registered, it’s not too late to secure a spot today. If you’ve already registered, we hope to see you there! We’re curious: what are some other sessions you’re looking forward to at this year’s OSCON?
We’re looking for students who want to try out the Hadoop platform and get a technical report published.
(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)
Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.
So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.
As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide a citable paper for them to reference.
Leave a comment now if you’re interested! Then once you’ve talked with your advisor, follow up to your comment, and we’ll be available to help point you in the right direction technically.
Step 2: Turn your new skills on the Common Crawl corpus, available on Amazon Web Services.
Step 3: Reflect on the process and what you find. Compile these valuable insights into a publication. The possibilities are limitless; here are some fun titles we’d love to see come to life:
“Identifying the most used Wikipedia articles with Hadoop and the Common Crawl corpus”
“Six degrees of Kevin Bacon: an exploration of open web data”
“A Hip-Hop family tree: From Akon to Jay-Z with the Common Crawl data”
Here are some other interesting topics you could explore:
Using this data can we ask “how many Jack Blacks are there in the world?”
What is the average price for a camera?
How much can you trust HTTP headers? It’s extremely common that the response headers provided with a webpage are contradictory to the actual page — things like what language it’s in or the byte encoding. Browsers use these headers as hints but need to examine the actual content to make a decision about what that content is. It’s interesting to understand how often these two contradict.
How much is enough? Some questions we ask of data, such as “what’s the most common word in the English language”, actually don’t need much data at all to answer. So what is the point of a dataset of this size? What value can someone extract from the full dataset? How does this value change with a 50% sample, a 10% sample, a 1% sample? For a particular problem, how should this sample be drawn?
Train a text classifier to identify topicality. Extract meta keywords from Common Crawl HTML data, then construct a training corpus of topically-tagged documents to train a text classifier for a news application.
Identify political sites and their leanings. Cluster and visualize their networks of links (you could use Blekko’s /conservative and /liberal tag lists as a starting point).
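For the HTTP header question in the list above, a comparison could start from something like the following. This is only a minimal sketch: the function names are hypothetical, it looks at character encoding alone (not language), and it assumes the header value and the raw page bytes are already in hand for each crawled page.

```python
import re

def charset_from_header(content_type):
    """Return the charset declared in a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', content_type or "", re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_html(body):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    # Only the first few kilobytes are inspected; non-ASCII bytes are dropped.
    head = body[:4096].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else None

def headers_contradict(content_type, body):
    """True only when both sources declare a charset and the two differ."""
    declared = charset_from_header(content_type)
    embedded = charset_from_html(body)
    return declared is not None and embedded is not None and declared != embedded

page = b'<html><head><meta charset="utf-8"></head><body>hi</body></html>'
print(headers_contradict("text/html; charset=ISO-8859-1", page))
```

Counting how often headers_contradict is true across a sample of pages would give a first estimate of the disagreement rate. Pages where only one side declares a charset are deliberately not counted as contradictions here.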
So, again — if you think this might be fun, leave a comment now to mark your interest. Talk with your advisor, post a follow up to your comment, and we’ll be in touch!
Big Data Week aims to connect data enthusiasts, technologists, and professionals across the globe through a series of meetups between April 19th-28th. The idea is to build community among groups working on big data and to spur conversations about relevant topics ranging from technology to commercial use cases. With big data an increasingly hot topic, it’s becoming ever more important for data scientists, technologists, and wranglers to work together to establish best practices and build upon each others’ innovations.
With 50 meetups spread across England, Australia, and the U.S., there is plenty happening between April 19-28. If you’re in the SF Bay Area, here are a few noteworthy events that may be of interest to you!
Bio + Tech | Bio Hackers and Founders Meetup on Tuesday, April 24th, 7pm at Giordano in the Mission. This will be a great chance to network with a diverse group of professionals from across the fields of science, data, and medicine.
Introduction to Hadoop on Tuesday, April 24th, 6:30pm at Swissnex. This is a full event, but you can join the waiting list.
InfoChimps Presents Ironfan on Thursday, April 26th, 7pm at SurveyMonkey in Palo Alto. Hear Flip Kromer, CTO of Infochimps, present on Ironfan, which makes provisioning and configuring your Big Data infrastructure simple.
Data Science Hackathon on Saturday, April 28th. This international hackathon aims to demonstrate the possibilities and power of combining Data Science with Open Source, Hadoop, Machine Learning, and Data Mining tools.
Next week a few members of the Common Crawl team are going to the Data 2.0 Summit in San Francisco. We are very much looking forward to the summit – it is the largest Cloud Data conference of 2012 and last year’s summit was a great experience. If you haven’t already registered, use the code below for a 20% discount.
The main theme of this year’s Data 2.0 is the question: Why is the next technology revolution a Data Revolution? There will be a great collection of entrepreneurs, investors, and executives – leaders in the areas of Cloud Data, Social Data, Big Data, and the API Economy – to discuss this question in presentations, panels and casual conversations. Check out the list of speakers to get an idea of who will be present.
During the summit and the afterparty, there is sure to be a lot of talk about strategies for startups to monetize data, why investors fund data companies, why corporations are interested in acquiring data-centric tech startups, API infrastructure, accessing the Twitter firehose, mining the social web, big data technologies like Hadoop and MapReduce, NoSQL technologies like Cassandra and MongoDB, and the state of open government and open data initiatives.
Data openness and accessibility will definitely be a big part of the discussions. Our Founder and Chairman, Gil Elbaz, will be on a panel called “How Open is the Open Web?” along with Bram Cohen of BitTorrent, Sid Stamm of Mozilla, Jatinder Singh of PARC, and Scott Burke of Yahoo.
Some of the highlights I am looking forward to in addition to Gil and Eva’s panels are:
“Social Data: Foundation of the Social Web” will have Daniel Tunkelang, principal data scientist at LinkedIn, along with speakers from Microsoft, Clearspring and Walmart Labs and moderator Liz Gannes discussing how social data, including social sharing, social news, and social connections, is changing how we search, advertise, and work.
“Big Data, Big Challenges: Where should big data innovate in 2012” will have Max Schireson of 10gen, Walter Maguire of HP Vertica, and other panelists talking about the challenges facing data scientists in 2012 and which searching, indexing, computing, and storing tools should be improved to overcome them.
If you can be in San Francisco on April 3rd you should definitely attend Data 2.0! If you are going to be there and want to talk about Common Crawl, drop us an email or send us a message on Twitter so we can make plans to meet up.
The following is a guest blog post by Pete Warden, a member of the Common Crawl Advisory Board. Pete is a British-born programmer living in San Francisco. After spending over a decade as a software engineer, including 5 years at Apple, he’s now focused on a career as a mad scientist. He is currently gathering, analyzing and visualizing the flood of web data that’s recently emerged, trying to turn it into useful information without trampling on people’s privacy. Pete is the current CTO of Jetpac, a site for sharing travel photos, tips, and guides among friends. Passionate about large-scale data processing and visualization, he writes regularly on the topic on his blog and as a regular contributor to O’Reilly Radar.
Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It’s an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It’s mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.
Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It’s currently released as an Amazon Public Data Set, which means you don’t pay for access from Amazon servers, so I’ll show you how on their Elastic MapReduce service.
I’m grateful to Ben Nagy for the original Ruby code I’m basing this on. I’ve made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you’re interested in the Java equivalent, I recommend this alternative five-minute guide.
1 – Fetch the example code from github
You’ll need git to get the example source code. If you don’t already have it, there’s a good guide to installing it here:
Buckets are a bit like top-level folders in Amazon’s S3 storage system. They need to have globally-unique names which don’t clash with any other Amazon user’s buckets, so when you see me using com.petewarden as a prefix, replace that with something else unique, like your own domain name. Click on the S3 tab at the top of the page and then click the Create Bucket button at the top of the left pane, and enter com.petewarden.commoncrawl01input for the first bucket. Repeat with the following three other buckets:
The last part of their names is meant to indicate what they’ll be used for. ‘scripts’ will hold the source code for your job, ‘input’ the files that are fed into the code, ‘output’ will hold the results of the job, and ‘logging’ will have any error messages it generates.
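The naming scheme can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, com.example is a placeholder prefix, and the code only builds the names and their s3:// URLs; it does not create anything on Amazon.

```python
def bucket_names(prefix, project="commoncrawl01"):
    """Build the four bucket names used in this walkthrough from a unique prefix."""
    roles = ("input", "output", "scripts", "logging")
    return {role: f"{prefix}.{project}{role}" for role in roles}

def s3_url(bucket, path=""):
    """Prefix a bucket name (and an optional path inside it) with s3://."""
    return f"s3://{bucket}/{path}" if path else f"s3://{bucket}"

# Replace com.example with something globally unique, such as your own domain.
names = bucket_names("com.example")
for role, name in names.items():
    print(role, s3_url(name))
```

Keeping the role as the last part of the name makes it easy to tell the four buckets apart in the S3 console.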
5 – Upload files to your buckets
Select your ‘scripts’ bucket in the left-hand pane, and click the Upload button in the center pane. Select extension_map.rb, extension_reduce.rb, and setup.sh from the folder on your local machine where you cloned the git project. Click Start Upload, and it should only take a few seconds. Do the same steps for the ‘input’ bucket and the example_input.txt file.
6 – Create the Elastic MapReduce job
The EMR service actually creates a Hadoop cluster for you and runs your code on it, but the details are mostly hidden behind their user interface. Click on the Elastic MapReduce tab at the top, and then the Create New Job Flow button to get started.
7 – Describe the job
The Job Flow Name is only used for display purposes, so I normally put something that will remind me of what I’m doing, with an informal version number at the end. Leave the Create a Job Flow radio button on Run your own application, but choose Streaming from the drop-down menu.
8 – Tell it where your code and data are
This is probably the trickiest stage of the job setup. You need to put in the S3 URL (the bucket name prefixed with s3://) for the inputs and outputs of your job. Input Location should be the root folder of the bucket where you put the example_input.txt file, in my case ‘s3://com.petewarden.commoncrawl01input’. Note that this one is a folder, not a single file, and it will read whichever files are in that bucket below that location.
The Output Location is also going to be a folder, but the job itself will create it, so it mustn’t already exist (you’ll get an error if it does). This even applies to the root folder on the bucket, so you must have a non-existent folder suffix. In this example I’m using ‘s3://com.petewarden.commoncrawl01output/01/’.
The Mapper and Reducer fields should point at the source code files you uploaded to your ‘scripts’ bucket: ‘s3://com.petewarden.commoncrawl01scripts/extension_map.rb’ for the Mapper and ‘s3://com.petewarden.commoncrawl01scripts/extension_reduce.rb’ for the Reducer. You can leave the Extra Args field blank and click Continue.
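Since this stage is easy to get wrong, it can help to sanity-check the four locations before typing them into the form. The sketch below is not part of the example project: check_streaming_settings is a hypothetical helper that encodes only the rules described above (every location is an s3:// URL, the output needs a folder suffix, and the mapper and reducer are different scripts).

```python
from urllib.parse import urlparse

def check_streaming_settings(settings):
    """Return a list of problems found in the four S3 locations of a streaming job."""
    problems = []
    for field in ("input", "output", "mapper", "reducer"):
        value = settings.get(field, "")
        if urlparse(value).scheme != "s3":
            problems.append(f"{field}: expected an s3:// URL, got {value!r}")
    # The job creates the output folder itself, so it cannot be the bucket root.
    if urlparse(settings.get("output", "")).path.strip("/") == "":
        problems.append("output: add a folder suffix such as /01/")
    if settings.get("mapper") == settings.get("reducer"):
        problems.append("mapper and reducer point at the same script")
    return problems

settings = {
    "input": "s3://com.petewarden.commoncrawl01input",
    "output": "s3://com.petewarden.commoncrawl01output/01/",
    "mapper": "s3://com.petewarden.commoncrawl01scripts/extension_map.rb",
    "reducer": "s3://com.petewarden.commoncrawl01scripts/extension_reduce.rb",
}
print(check_streaming_settings(settings))
```

The check cannot tell whether the output folder already exists on S3; it only catches the mistakes that are visible in the URLs themselves.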
9 – Choose how many machines you’ll run on
The defaults on this screen should be fine, with m1.small instance types everywhere, two instances in the core group, and zero in the task group. Once you get more advanced, you can experiment with different types and larger numbers, but I’ve kept the inputs to this example very small, so it should only take twenty minutes on the default three-machine cluster, which will cost you less than 30 cents. Click Continue.
10 – Set up logging
Hadoop can be a hard beast to debug, so I always ask Elastic MapReduce to write out copies of the log files to a bucket so I can use them to figure out what went wrong. On this screen, leave everything else at the defaults but put the location of your ‘logging’ bucket for the Amazon S3 Log Path, in this case ‘s3://com.petewarden.commoncrawl01logging’. A new folder with a unique name will be created for every job you run, so you can specify the root of your bucket. Click Continue.
11 – Specify a boot script
The default virtual machine images Amazon supplies are a bit old, so we need to run a script when we start each machine to install missing software. We do this by selecting the Configure your Bootstrap Actions button, choosing Custom Action for the Action Type, and then putting in the location of the setup.sh file we uploaded, e.g. ‘s3://com.petewarden.commoncrawl01scripts/setup.sh’. After you’ve done that, click Continue.
12 – Run your job
The last screen shows the settings you chose, so take a quick look to spot any typos, and then click Create Job Flow. The main screen should now contain a new job, with the status ‘Starting’ next to it. After a couple of minutes, that should change to ‘Bootstrapping’, which takes around ten minutes, and then running the job, which only takes two or three.
Debugging all the possible errors is beyond the scope of this post, but a good start is poking around the contents of the logging bucket, and looking at any description the web UI gives you.
Once the job has successfully run, you should see a few files beginning ‘part-‘ inside the folder you specified on the output bucket. If you open one of these up, you’ll see the results of the job.
This job is just a ‘Hello World’ program for walking the Common Crawl data set in Ruby. It simply counts the frequency of mime types and URL suffixes, and I’ve only pointed it at a small subset of the data. What’s important is that this gives you a starting point to write your own Ruby algorithms to analyse the wealth of information that’s buried in this archive. Take a look at the last few lines of extension_map.rb to see where you can add your own code, and edit example_input.txt to add more of the data set once you’re ready to sink your teeth in.
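To make the shape of a streaming job concrete, here is a rough sketch of the mapper and reducer halves of a URL-suffix count. It is not Ben Nagy’s Ruby code: it is written in Python (streaming jobs accept any executable that reads lines on standard input and writes lines to standard output), the function names are hypothetical, and it assumes one URL per input line rather than the archive files the real job walks.

```python
import os
from itertools import groupby
from urllib.parse import urlparse

def map_urls(lines):
    """Mapper: emit 'suffix<TAB>1' for every URL, one URL per input line."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        suffix = os.path.splitext(urlparse(url).path)[1].lower() or "(none)"
        yield f"{suffix}\t1"

def reduce_counts(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

urls = [
    "http://example.com/a.html",
    "http://example.com/b.HTML",
    "http://example.com/logo.png",
    "http://example.com/",
]
# Hadoop sorts mapper output by key before the reducer sees it; sorted() stands in here.
for line in reduce_counts(sorted(map_urls(urls))):
    print(line)
```

In a real streaming script each half would live in its own file and be fed sys.stdin instead of a list. The reducer relies on its input arriving sorted by key, which is why it only compares neighbouring lines.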
Big thanks again to Ben Nagy for putting the code together, and if you’re interested in understanding Hadoop and Elastic MapReduce in more detail, I created a video training session that might be helpful. I can’t wait to see all the applications that come out of the Common Crawl data set, so get coding!
Please read the announcement and check out the detailed information on the website. I am sure you will agree that this is important work and that you will find their results interesting.
we are happy to announce WebDataCommons.org, a joint project of Freie
Universität Berlin and the Karlsruhe Institute of Technology to extract all
Microformat, Microdata and RDFa data from the Common Crawl web corpus, the
largest and most up-to-date web corpus that is currently available to the
WebDataCommons.org provides the extracted data for download in the form of
RDF-quads. In addition, we produce basic statistics about the extracted
Up till now, we have extracted data from two Common Crawl web corpora: One
corpus consisting of 2.5 billion HTML pages dating from 2009/2010 and a
second corpus consisting of 1.4 billion HTML pages dating from February
The 2009/2010 extraction resulted in 5.1 billion RDF quads which describe
1.5 billion entities and originate from 19.1 million websites.
The February 2012 extraction resulted in 3.2 billion RDF quads which
describe 1.2 billion entities and originate from 65.4 million websites.
More detailed statistics about the distribution of formats, entities and
websites serving structured data, as well as growth between 2009/2010 and
2012 is provided on the project website:
It is interesting to see from the statistics that the RDFa and Microdata
deployment has grown a lot over the last years, but that Microformat data
still makes up the majority of the structured data that is embedded into
HTML pages (when looking at the amount of quads as well as the amount of
We hope that Web Data Commons will be useful to the community by:
+ easing the access to Microdata, Microformat and RDFa data, as you do not
need to crawl the Web yourself anymore in order to get access to a fair
portion of the structured data that is currently available on the Web.
+ laying the foundation for the more detailed analysis of the deployment of
the different technologies.
+ providing seed URLs for focused Web crawls that dig deeper into the
websites that offer a specific type of data.
Web Data Commons is a joint effort of Christian Bizer and Hannes Mühleisen
(Web-based Systems Group at Freie Universität Berlin) and Andreas Harth and
Steffen Stadtmüller (Institute AIFB at the Karlsruhe Institute of
Lots of thanks to:
+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data
+ the PlanetData and the LOD2 EU research projects which supported the
For the future, we plan to update the extracted datasets on a regular basis
as new Common Crawl corpora are becoming available. We also plan to provide
the extracted data in the form of CSV-tables for common entity types
(e.g. product, organization, location, …) in order to make it easier to
mine the data.
Christian Bizer, Hannes Mühleisen, Andreas Harth and Steffen Stadtmüller
As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts. We have a stellar line-up of advisory board members who will lend their passion and expertise in numerous fields as we grow our vision. Together with our dedicated Board of Directors, we feel the organization is more prepared than ever to usher in an exciting new phase for Common Crawl and a new wave of innovation in education, business, and research.
Here is a brief introduction to the men and women who have generously agreed to donate their time and brainpower to Common Crawl. Full bios are available on our Advisory Board page.
Our legal counsel, Kevin DeBré, is a well respected Intellectual Property (IP) attorney who has continually worked at the forefront of the evolving IP landscape. Glenn Otis Brown brings additional legal expertise as well as a long history of working at the forefront of tech and the open web, including currently serving as Director of Business Development for Twitter and on the board of Creative Commons. Another strong advocate for openness, Joi Ito, is Director of the MIT Media Lab and Creative Commons Board Chair, who brings with him years of innovative work as a thought-leader in the field.
We look forward to the advice of Jen Pahlka, founder and Executive Director at Code for America. Jen has led Code for America through a remarkable two years of growth to become a high-impact success, and we are delighted to have her insight on growing a non-profit as well as her experience working with government. Eva Ho, VP of Marketing & Operations at Factual who has also served on the boards of several nonprofits, brings additional insight into nonprofit management, as well as valuable experience around big data.
Big data is critical to our work of maintaining an open crawl of the web, and we are fortunate to have numerous experts who can advise on this critical area. Kurt Bollacker is the Digital Research Director of the Long Now Foundation and he formerly served as Technical Director at Internet Archive and Chief Scientist at Metaweb. Pete Skomoroch is a highly respected data scientist, currently employed by LinkedIn, who brings with him substantial knowledge about machine learning and search. Boris Shimanovsky is a prolific, lifelong programmer and Director of Engineering at Factual. Pete Warden, also a programmer, is the current CTO of Jetpac and a highly respected expert in large-scale data processing and visualization.
Danny Sullivan, widely considered a leading “search engine guru,” will bring valuable guidance and insight as Common Crawl grows and develops. Bill Michels is another member of our team with extensive experience in search from his years at Yahoo! which include working as Director of Yahoo! BOSS. We are very lucky to have Peter Norvig, Director of Research at Google and a Fellow of the American Association for Artificial Intelligence and the Association for Computing Machinery.
We are delighted that such an array of talented people see the importance in the work we do, and are honored to have their guidance as we look forward to a year of growth and milestones for Common Crawl.
Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services’ Public Data Sets. This is great news because it means that the Common Crawl data corpus is now much more readily accessible and visible to the public. The greater accessibility and visibility is a significant help in our mission of enabling a new wave of innovation, education, and research.
Amazon Web Services (AWS) provides a centralized repository of public data sets that can be integrated in AWS cloud-based applications. AWS makes available such estimable large data sets as the mapping of the Human Genome and the US Census. Previously, such data was often prohibitively difficult to access and use. With the Amazon Elastic Compute Cloud, it takes a matter of minutes to begin computing on the data.
Demonstrating their commitment to an open web, AWS hosts public data sets at no charge for the community, so users pay only for the compute and storage they use for their own applications. What this means for you is that our data – all 5 billion web pages of it – just got a whole lot slicker and easier to use.
We greatly appreciate Amazon’s support for the open web in general, and we’re especially appreciative of their support for Common Crawl. Placing our data in the public data sets not only benefits the larger community, but it also saves us money. As a nonprofit in the early phases of existence, this is crucial.
A huge thanks to Amazon for seeing the importance in the work we do and for so generously supporting our shared goal of enabling increased open innovation!
As a sign of many more good things to come in 2012, Founder Gil Elbaz and Board Member Nova Spivack appeared on this week’s episode of This Week in Startups. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing. Underlying their conversation is an exploration of how Common Crawl’s open crawl of the web is a powerful asset for educators, researchers, and entrepreneurs.
Some of my favorite moments from the show include:
In a great sound bite at the beginning of the show, Jason observes that Common Crawl is in many ways the “Wikipedia of the search engine.” (8:50)
When the question is posed whether or not Common Crawl may eventually charge some fee for our data and tools, Nova’s response that Common Crawl is “better if it’s free… [We] want this to be like the public library system” captures the spirit of Common Crawl’s mission and our commitment to the open web. (32:00)
When asked about projects and applications that would benefit from Common Crawl, Gil makes a compelling case for organizations that can use Common Crawl as a teaching tool. If someone wants to teach Hadoop at scale, for example, it’s essential for them to have a realistic corpus to work with, and Common Crawl can provide that. (46:18)
Those are just a few of the highlights, but I highly recommend watching the episode in its entirety for even more insights from Gil and Nova as we gear up for big things ahead for Common Crawl!
Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information in the Amazon cloud, a net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.
When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.
With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?
This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.
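For readers new to the pattern, the word counter can be sketched in memory before any cluster is involved. This is an illustration of the map, shuffle and reduce phases only, with hypothetical function names; it is not the HelloWorld code from this tutorial and it does nothing in parallel.

```python
import re
from collections import defaultdict

def map_phase(document):
    """Map: turn one document into (word, 1) pairs."""
    return [(word, 1) for word in re.findall(r"[a-z']+", document.lower())]

def shuffle_phase(pairs):
    """Shuffle: group every value under its key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)

def word_count(documents):
    """Run the three phases in sequence over a list of document strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(pairs).items())

print(word_count(["Hello world", "hello Hadoop"]))
```

Because each call to map_phase looks at one document and each call to reduce_phase looks at one key, a framework such as Hadoop can spread those calls across many machines without changing the logic.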
Once you’ve had a taste of what’s possible when open source meets open data, we’d like you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!
Ready to get started? Watch our screencast and follow along below:
Step 1 – Install Git and Eclipse
We first need to install a few important tools to get started:
Eclipse (for writing Hadoop code)
How to install (Windows and OS X):
Download the “Eclipse IDE for Java developers” installer package located at:
Next, start Eclipse. Open the File menu then select “Project” from the “New” menu. Open the “Java” folder and select “Java Project from Existing Ant Buildfile”. Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file. Eclipse will find the right targets, and tick the “Link to the buildfile in the file system” box, as this will enable you to share the edits you make to it in Eclipse with git.
We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it’s named “Hello World”) and select “Properties” from the menu that appears. Navigate to the Builders tab in the left hand panel of the Properties window, then click “New”. Select “Ant Builder” from the dialog which appears, then click OK.
To configure our new Ant builder, we need to specify three pieces of information here: where the buildfile is located, where the root directory of the project is, and which ant build target we wish to execute. To set the buildfile, click the “Browse File System” button under the “Buildfile:” field, and find the build.xml file which you found earlier. To set the root directory, click the “Browse File System” button under the “Base Directory:” field, and select the folder into which you checked out our code. To specify the target, enter “dist” without the quotes into the “Arguments” field. Click OK and close the Properties window.
Finally, right click on the base project folder and select “Build Project”, and Ant will assemble a JAR, ready for use in Elastic MapReduce.
Step 3 – Get an Amazon Web Services account (if you don’t have one already) and find your security credentials
If you don’t already have an account with Amazon Web Services, you can sign up for one at the following URL:
Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then click the “Upload” button, and select the JAR you just built. It should be located here:
<your checkout dir>/dist/lib/HelloWorld.jar
Step 5 – Create an Elastic MapReduce job based on your new JAR
Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it, and as it so happens, that’s pretty easy to do too! Visit the following URL:
and click the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.
The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:
/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz
Thus, by passing these arguments to the JAR we uploaded, we’re telling Hadoop to:
1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)
2. Log into Amazon S3 with your AWS access codes
3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010
4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)
Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)
If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:
2010/01/07/18 – All files from this particular crawler run (6PM, January 7th 2010)
2010/ – All crawl files from 2010
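The prefixes above can be generated from a timestamp rather than typed by hand. In this sketch crawl_prefix is a hypothetical helper; the hour and year forms follow the two examples shown, while the month and day forms (and their trailing slashes) are an assumption extrapolated from them.

```python
from datetime import datetime

def crawl_prefix(moment, depth="hour"):
    """Format a timestamp as a YYYY/MM/DD/HH directory prefix, truncated to `depth`."""
    formats = {
        "year": "%Y/",
        "month": "%Y/%m/",      # assumed form, not shown in the examples above
        "day": "%Y/%m/%d/",     # assumed form, not shown in the examples above
        "hour": "%Y/%m/%d/%H",
    }
    return moment.strftime(formats[depth])

run = datetime(2010, 1, 7, 18)
print(crawl_prefix(run))
print(crawl_prefix(run, "year"))
```

A shorter prefix matches more of the crawl, so the running time and cost of the job grow as the depth moves from hour towards year.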
Don’t worry about the continue fields for now, just accept the default values. If you’re offered the opportunity to use debugging, I recommend enabling it to be able to see your job in action. Once you’ve clicked through them all, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.
Step 6 – Watch the show
Now just wait and watch as your job runs through the Hadoop flow; you can look for errors by using the Debug button. Within about 10 minutes, your job will be complete. You can view results in the S3 Browser panel, located here. If you download these files and load them into a text editor, you can see what came out of the job. You can take this sort of data and add it into a database, or create a new Hadoop OutputFormat to export into XML, which you can render into HTML with an XSLT; the possibilities are pretty much endless.
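As one example of post-processing, the downloaded files could be merged and ranked locally. The sketch below assumes each row of the CSV output is a word followed by its count, which you should confirm against your own files; top_words is a hypothetical helper, and in-memory strings stand in for the downloaded files.

```python
import csv
import io

def top_words(csv_files, limit=10):
    """Merge word counts from several CSV file objects and return the largest."""
    totals = {}
    for handle in csv_files:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            word, count = row
            totals[word] = totals.get(word, 0) + int(count)
    # Highest count first; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:limit]

# In-memory stand-ins for two downloaded result files.
part_a = io.StringIO("the,120\ncrawl,7\n")
part_b = io.StringIO("the,80\nweb,15\n")
print(top_words([part_a, part_b], limit=2))
```

With real files, each entry in the list would be an opened file, for example open(path, newline="") for every downloaded result file.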
Step 7 – Start playing!
If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to GitHub or Gitorious and send a message to our user group about it: we promise we’ll look at it.