This year’s Strata Conference teams up with Hadoop World for what promises to be a powerhouse convening in NYC from October 23-25. Check out their full announcement below and secure your spot today.
Now in its second year in New York, the O’Reilly Strata Conference explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World to create the largest gathering of the Apache Hadoop community in the world.
Strata brings together decision makers using the raw power of big data to drive business strategy, and practitioners who collect, analyze, and manipulate that data—particularly in the worlds of finance, media, and government.
P.S. We’re thrilled to have Strata as a prize sponsor of Common Crawl’s First Ever Code Contest. If you’ve been thinking about submitting an entry, you couldn’t ask for a better reason to do so: you’ll have the chance to win an all-access pass to Strata Conference + Hadoop World 2012!
We’re looking for students who want to try out the Hadoop platform and get a technical report published.
(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)
Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.
So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.
As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus, and your technical report could inspire others and provide a citable paper for them to reference.
Leave a comment now if you’re interested! Then, once you’ve talked with your advisor, post a follow-up to your comment, and we’ll be available to help point you in the right direction technically.
Step 2: Turn your new skills on the Common Crawl corpus, available on Amazon Web Services.
Step 3: Reflect on the process and what you find. Compile these valuable insights into a publication. The possibilities are limitless; here are some fun titles we’d love to see come to life:
“Identifying the most used Wikipedia articles with Hadoop and the Common Crawl corpus”
“Six degrees of Kevin Bacon: an exploration of open web data”
“A Hip-Hop family tree: From Akon to Jay-Z with the Common Crawl data”
Here are some other interesting topics you could explore:
Using this data, can we answer the question “how many Jack Blacks are there in the world?”
What is the average price for a camera?
How much can you trust HTTP headers? It’s extremely common for the response headers served with a webpage to contradict the page itself on things like which language it’s in or what byte encoding it uses. Browsers use these headers as hints but need to examine the actual content to decide what that content is. It would be interesting to measure how often the two disagree.
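A first pass at that header-versus-content comparison can be sketched in a few lines of Java. This is a rough illustration, not production parsing: the CharsetCheck class, its regexes, and the sample strings are all our own inventions, and real crawl data would call for a proper HTML parser and full header handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetCheck {
    // Pull a charset out of a Content-Type header value, e.g. "text/html; charset=ISO-8859-1".
    static String headerCharset(String contentType) {
        Matcher m = Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE).matcher(contentType);
        return m.find() ? m.group(1).toLowerCase() : null;
    }

    // Pull a charset out of the page body's <meta> tag, if any.
    static String metaCharset(String html) {
        Matcher m = Pattern.compile("<meta[^>]*charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group(1).toLowerCase() : null;
    }

    public static void main(String[] args) {
        // A made-up page whose header and <meta> tag disagree about the encoding.
        String header = "text/html; charset=ISO-8859-1";
        String body = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head></html>";
        boolean contradicts = headerCharset(header) != null
                && metaCharset(body) != null
                && !headerCharset(header).equals(metaCharset(body));
        System.out.println("header=" + headerCharset(header)
                + " meta=" + metaCharset(body)
                + " contradicts=" + contradicts);
    }
}
```

Run over the corpus, a check like this (with real parsing behind it) would give you a disagreement rate per language or per top-level domain.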
How much is enough? Some questions we ask of data — such as “what’s the most common word in the English language?” — actually don’t need much data at all to answer. So what is the point of a dataset of this size? What value can someone extract from the full dataset? How does this value change with a 50% sample, a 10% sample, or a 1% sample? And for a particular problem, how should that sample be drawn?
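To get a feel for the sampling question, here’s a toy Java experiment. Everything in it (the SampleSize class, the synthetic corpus, the vocabulary) is invented for illustration, not part of any Common Crawl tooling: it builds a fake corpus where one word clearly dominates, then asks for the most common word at 100%, 10%, and 1% samples.

```java
import java.util.*;

public class SampleSize {
    // Most frequent token in a list.
    static String topWord(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Keep roughly the given fraction of the corpus, chosen at random.
    static List<String> sample(List<String> words, double fraction, Random rng) {
        List<String> out = new ArrayList<>();
        for (String w : words) if (rng.nextDouble() < fraction) out.add(w);
        return out;
    }

    public static void main(String[] args) {
        // A toy "corpus" in which "the" is by construction the most common word.
        List<String> corpus = new ArrayList<>();
        Random rng = new Random(42);
        String[] vocab = {"the", "the", "the", "of", "and", "data", "web"};
        for (int i = 0; i < 100_000; i++) corpus.add(vocab[rng.nextInt(vocab.length)]);

        // For this question the answer stabilizes long before we use the full corpus.
        for (double f : new double[]{1.0, 0.1, 0.01}) {
            System.out.println((int) (f * 100) + "% sample -> " + topWord(sample(corpus, f, rng)));
        }
    }
}
```

For a question this easy, the 1% sample gives the same answer as the full data; the interesting research question is which Common Crawl questions behave this way and which genuinely need the whole corpus.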
Train a text classifier to identify topicality. Extract meta keywords from Common Crawl HTML data, then construct a training corpus of topically-tagged documents to train a text classifier for a news application.
Identify political sites and their leanings, then cluster and visualize their networks of links (you could use Blekko’s /conservative and /liberal tag lists as a starting point).
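For the text-classifier idea above, the very first step of pulling meta keywords out of raw HTML might be sketched like this in Java. The MetaKeywords class and its regex are illustrative only; real crawl data is messy enough that you would want a tolerant HTML parser instead.

```java
import java.util.*;
import java.util.regex.*;

public class MetaKeywords {
    // Pull the comma-separated keywords out of a page's <meta name="keywords"> tag.
    static List<String> keywords(String html) {
        Matcher m = Pattern.compile(
                "<meta\\s+name=[\"']keywords[\"']\\s+content=[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE).matcher(html);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (String k : m.group(1).split(",")) out.add(k.trim().toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        // A made-up page; each (keywords, page text) pair becomes one training example.
        String page = "<html><head><meta name=\"keywords\" content=\"Politics, Elections, News\"></head></html>";
        System.out.println(keywords(page)); // [politics, elections, news]
    }
}
```

Mapped over the corpus, this yields a large topically-tagged document set essentially for free, which is exactly what a news classifier needs for training data.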
So, again — if you think this might be fun, leave a comment now to mark your interest. Talk with your advisor, post a follow up to your comment, and we’ll be in touch!
Big Data Week aims to connect data enthusiasts, technologists, and professionals across the globe through a series of meetups between April 19th-28th. The idea is to build community among groups working on big data and to spur conversations about relevant topics ranging from technology to commercial use cases. With big data an increasingly hot topic, it’s becoming ever more important for data scientists, technologists, and wranglers to work together to establish best practices and build upon each other’s innovations.
With 50 meetups spread across England, Australia, and the U.S., there is plenty happening between April 19-28. If you’re in the SF Bay Area, here are a few noteworthy events that may be of interest to you!
Bio + Tech | Bio Hackers and Founders Meetup on Tuesday, April 24th, 7pm at Giordano in the Mission. This will be a great chance to network with a diverse group of professionals from across the fields of science, data, and medicine.
Introduction to Hadoop on Tuesday, April 24th, 6:30pm at Swissnex. This is a full event, but you can join the waiting list.
InfoChimps Presents Ironfan on Thursday, April 26th, 7pm at SurveyMonkey in Palo Alto. Hear Flip Kromer, CTO of Infochimps, present on Ironfan, which makes provisioning and configuring your Big Data infrastructure simple.
Data Science Hackathon on Saturday, April 28th. This international hackathon aims to demonstrate the possibilities and power of combining Data Science with Open Source, Hadoop, Machine Learning, and Data Mining tools.
Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information in the Amazon cloud: a net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.
When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.
With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?
This is the very question we hope to answer with this blog post. The example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter; the twist is that we’ll be running it against the Internet.
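Before running anything on a cluster, it can help to see the MapReduce pattern in miniature. The sketch below is plain Java with no Hadoop dependency, and the WordCountSketch class is our own illustration rather than the tutorial’s actual code: a map() emits (word, 1) pairs for each document and a reduce() sums them, which is conceptually what the word-counting job does to each crawled page.

```java
import java.util.*;

public class WordCountSketch {
    // "Map" phase: emit (word, 1) for every word in a document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // "Reduce" phase: sum the counts emitted for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) totals.merge(p.getKey(), p.getValue(), Integer::sum);
        return totals;
    }

    public static void main(String[] args) {
        // Two tiny "pages" standing in for crawled documents; Hadoop would run map() on each in parallel.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        emitted.addAll(map("Hello world"));
        emitted.addAll(map("Hello open data world"));
        System.out.println(reduce(emitted)); // {data=1, hello=2, open=1, world=2}
    }
}
```

Hadoop’s real value is that the map calls run in parallel across many machines and the framework handles grouping the pairs by key before the reduce; the logic, though, is no more complicated than this.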
Once you’ve got a taste of what’s possible when open source meets open data, we’d like you to remix this code. Show us what you can do with Common Crawl, and stay tuned as we feature some of the results!
Ready to get started? Watch our screencast and follow along below:
Step 1 – Install Git and Eclipse
We first need to install a few important tools to get started:
Git (for checking out our example code)
Eclipse (for writing Hadoop code)
How to install (Windows and OS X):
Download the “Eclipse IDE for Java developers” installer package located at:
Step 2 – Build the HelloWorld JAR

Next, start Eclipse. Open the File menu, then select “Project” from the “New” menu. Open the “Java” folder and select “Java Project from Existing Ant Buildfile”. Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file. Eclipse will find the right build targets. Tick the “Link to the buildfile in the file system” box, as this will enable you to share the edits you make to it in Eclipse with git.
We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it’s named “Hello World”) and select “Properties” from the menu that appears. Navigate to the Builders tab in the left hand panel of the Properties window, then click “New”. Select “Ant Builder” from the dialog which appears, then click OK.
To configure our new Ant builder, we need to specify three pieces of information: where the buildfile is located, where the root directory of the project is, and which Ant build target we wish to execute. To set the buildfile, click the “Browse File System” button under the “Buildfile:” field, and find the build.xml file you located earlier. To set the root directory, click the “Browse File System” button under the “Base Directory:” field, and select the folder into which you checked out our code. To specify the target, enter “dist” (without the quotes) into the “Arguments” field. Click OK and close the Properties window.
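For reference, here’s a hypothetical minimal build.xml of the shape Eclipse is driving here; the “dist” target compiles the sources and packages them into the JAR. The real buildfile in your checkout is authoritative, and the directory names below are assumptions:

```xml
<project name="HelloWorld" default="dist">
  <!-- Illustrative sketch only; paths are assumed, not taken from the actual checkout. -->
  <target name="compile">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes" includeantruntime="false"/>
  </target>
  <target name="dist" depends="compile">
    <mkdir dir="dist/lib"/>
    <jar destfile="dist/lib/HelloWorld.jar" basedir="build/classes"/>
  </target>
</project>
```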
Finally, right click on the base project folder and select “Build Project”, and Ant will assemble a JAR, ready for use in Elastic MapReduce.
Step 3 – Get an Amazon Web Services account (if you don’t have one already) and find your security credentials
If you don’t already have an account with Amazon Web Services, you can sign up for one at the following URL:
Step 4 – Upload the JAR to Amazon S3

Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, click the “Upload” button, and select the JAR you just built. It should be located here:
<your checkout dir>/dist/lib/HelloWorld.jar
Step 5 – Create an Elastic MapReduce job based on your new JAR
Now that the JAR is uploaded to S3, all we need to do is point Elastic MapReduce at it, and as it happens, that’s pretty easy to do too! Visit the following URL:
and click the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.
The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:
/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz
Thus, by passing these arguments to the JAR we uploaded, we’re telling Hadoop to:
1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)
2. Log into Amazon S3 with your AWS access codes
3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010
4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)
Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)
If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:
2010/01/07/18 – All files from this particular crawler run (6PM, January 7th 2010)
2010/ – All crawl files from 2010
Don’t worry about the remaining fields for now; just accept the default values. If you’re offered the opportunity to use debugging, I recommend enabling it so you can see your job in action. Once you’ve clicked through them all, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.
Step 6 – Watch the show
Now just wait and watch as your job runs through the Hadoop flow; you can look for errors using the Debug button. Within about 10 minutes, your job will be complete. You can view the results in the S3 Browser panel. If you download the output files and load them into a text editor, you can see what came out of the job. You could load this sort of data into a database, or create a new Hadoop OutputFormat to export XML that you can render into HTML with an XSLT; the possibilities are pretty much endless.
Step 7 – Start playing!
If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to GitHub or Gitorious and send a message to our user group about it: we promise we’ll look at it.