Learn how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents. Check out the full blog post where this video originally appeared.
As a sign of many more good things to come in 2012, Founder Gil Elbaz and Board Member Nova Spivack appeared on this week’s episode of This Week in Startups. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing. Underlying their conversation is an exploration of how Common Crawl’s open crawl of the web is a powerful asset for educators, researchers, and entrepreneurs.
Some of my favorite moments from the show include:
- In a great sound bite at the beginning of the show, Jason observes that Common Crawl is in many ways the “Wikipedia of the search engine.” (8:50)
- When the question is posed whether or not Common Crawl may eventually charge some fee for our data and tools, Nova’s response that Common Crawl is “better if it’s free… [We] want this to be like the public library system” captures the spirit of Common Crawl’s mission and our commitment to the open web. (32:00)
- When asked about projects and applications that would benefit from Common Crawl, Gil makes a compelling case for organizations that can use Common Crawl as a teaching tool. If someone wants to teach Hadoop at scale, for example, it’s essential for them to have a realistic corpus to work with, and Common Crawl can provide that. (46:18)
Those are just a few of the highlights, but I highly recommend watching the episode in its entirety for even more insights from Gil and Nova as we gear up for big things ahead for Common Crawl!
Founder Gil Elbaz and Board Member Nova Spivack appeared on This Week in Startups on January 10, 2012. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing. Underlying their conversation is an exploration of how Common Crawl’s open crawl of the web is a powerful asset for educators, researchers, and entrepreneurs.
Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information stored in the Amazon cloud, a net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.
When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, the paper made speedy analysis of data at massive scale a reality, turning many orthodox notions of data analysis on their head.
With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?
This is the very question we hope to answer with this blog post. The example we’ll use is a riff on the canonical Hadoop Hello World program, a simple word counter, with one twist: we’ll be running it against the Internet.
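To make the word-counter idea concrete before touching Hadoop, here is a minimal, single-machine sketch of the MapReduce pattern in plain Python. It is not the HelloWorld code from our repository; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the tokenizing rule are assumptions chosen for illustration only.

```python
import re
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    # Assumption: a "word" is a run of lowercase letters, digits or apostrophes.
    for word in re.findall(r"[a-z0-9']+", document.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: collapse all values for one word into a single total."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    pages = ["the open web", "The web is open data"]
    print(word_count(pages))
```

In Hadoop the map and reduce steps run on many machines at once and the framework handles the shuffle; the sketch only shows how the three stages fit together.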
When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!
Ready to get started? Watch our screencast and follow along below:
Step 1 – Install Git and Eclipse
We first need to install a few important tools to get started:
Eclipse (for writing Hadoop code)
How to install (Windows and OS X):
Download the “Eclipse IDE for Java developers” installer package located at:
How to install (Linux):
Run the command that matches your distribution’s package manager in a terminal (yum on Fedora or Red Hat, apt-get on Debian or Ubuntu):
# sudo yum install eclipse
# sudo apt-get install eclipse
Git (for retrieving our sample application)
How to install (Windows)
Install the latest .EXE from:
How to install (OS X)
Install the appropriate .DMG from:
How to install (Linux)
Run the command that matches your distribution’s package manager in a terminal (yum on Fedora or Red Hat, apt-get on Debian or Ubuntu):
# sudo yum install git
# sudo apt-get install git
Step 2 – Check out the code and compile the HelloWorld JAR
Now that you’ve installed the packages you need to play with our code, run the following command from a terminal/command prompt to pull down the code:
# git clone git://github.com/ssalevan/cc-helloworld.git
Next, start Eclipse. Open the File menu, then select “Project” from the “New” menu. Open the “Java” folder and select “Java Project from Existing Ant Buildfile”. Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file. Eclipse will find the right targets. Tick the “Link to the buildfile in the file system” box, as this will enable you to share the edits you make to it in Eclipse with git.
We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it’s named “Hello World”) and select “Properties” from the menu that appears. Navigate to the Builders tab in the left hand panel of the Properties window, then click “New”. Select “Ant Builder” from the dialog which appears, then click OK.
To configure our new Ant builder, we need to specify three pieces of information here: where the buildfile is located, where the root directory of the project is, and which ant build target we wish to execute. To set the buildfile, click the “Browse File System” button under the “Buildfile:” field, and find the build.xml file which you found earlier. To set the root directory, click the “Browse File System” button under the “Base Directory:” field, and select the folder into which you checked out our code. To specify the target, enter “dist” without the quotes into the “Arguments” field. Click OK and close the Properties window.
Finally, right click on the base project folder and select “Build Project”, and Ant will assemble a JAR, ready for use in Elastic MapReduce.
Step 3 – Get an Amazon Web Services account (if you don’t have one already) and find your security credentials
If you don’t already have an account with Amazon Web Services, you can sign up for one at the following URL:
Once you’ve registered, visit the following page and copy down your Access Key ID and Secret Access Key:
This information can be used by any Amazon Web Services client to authorize things that cost money, so be sure to keep this information in a safe place.
Step 4 – Upload the HelloWorld JAR to Amazon S3
Uploading the JAR we just built to Amazon S3 is a lot simpler than it sounds. First, visit the following URL:
Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then click the “Upload” button, and select the JAR you just built. It should be located here:
<your checkout dir>/dist/lib/HelloWorld.jar
Step 5 – Create an Elastic MapReduce job based on your new JAR
Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it, and as it so happens, that’s pretty easy to do too! Visit the following URL:
and click the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.
The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:
s3n://<your bucket name>/HelloWorld.jar
then add the following arguments to it:
org.commoncrawl.tutorial.HelloWorld <your aws access key id> <your aws secret key> 2010/01/07/18/1262876244253_18.arc.gz s3n://<your bucket name>/helloworld-out
CommonCrawl stores its crawl information as GZipped ARC-formatted files (http://www.archive.org/web/researcher/ArcFileFormat.php), and each one is indexed using the following strategy:
/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz
Thus, by passing these arguments to the JAR we uploaded, we’re telling Hadoop to:
1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)
2. Log into Amazon S3 with your AWS access codes
3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010
4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)
Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)
If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:
2010/01/07/18 – All files from this particular crawler run (6PM, January 7th 2010)
2010/ – All crawl files from 2010
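The prefix notation above can be sketched in a few lines of Python. This is only an illustration of the YYYY/MM/DD/HH layout described earlier: the helper names `crawl_prefix` and `select_keys` are hypothetical, and every file name in the sample list other than the one used in this tutorial is made up.

```python
from datetime import datetime


def crawl_prefix(when, granularity="hour"):
    """Build a directory prefix following the YYYY/MM/DD/HH layout."""
    parts = [
        f"{when.year:04d}",
        f"{when.month:02d}",
        f"{when.day:02d}",
        f"{when.hour:02d}",
    ]
    depth = {"year": 1, "month": 2, "day": 3, "hour": 4}[granularity]
    return "/".join(parts[:depth])


def select_keys(keys, prefix):
    """Return the keys that live under the given directory prefix."""
    # Normalize to a trailing slash so "2010/01/07/1" cannot match hour 18.
    directory = prefix.rstrip("/") + "/"
    return [key for key in keys if key.startswith(directory)]


if __name__ == "__main__":
    run = datetime(2010, 1, 7, 18)
    keys = [
        "2010/01/07/18/1262876244253_18.arc.gz",
        "2010/01/07/19/example_a.arc.gz",  # hypothetical file name
        "2011/02/01/00/example_b.arc.gz",  # hypothetical file name
    ]
    print(crawl_prefix(run))
    print(select_keys(keys, crawl_prefix(run, "year")))
```

The broader the prefix, the more files the job reads, so the running time and cost grow accordingly.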
Don’t worry about the fields on the remaining screens of the wizard for now; just accept the default values as you click “Continue”. If you’re offered the opportunity to use debugging, I recommend enabling it to be able to see your job in action. Once you’ve clicked through them all, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.
Step 6 – Watch the show
Now just wait and watch as your job runs through the Hadoop flow; you can look for errors by using the Debug button. Within about 10 minutes, your job will be complete. You can view results in the S3 Browser panel, located here. If you download these files and load them into a text editor, you can see what came out of the job. You can take this sort of data and add it into a database, or create a new Hadoop OutputFormat to export into XML, which you can render into HTML with XSLT. The possibilities are pretty much endless.
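As one example of post-processing, the sketch below merges several downloaded output files into a single tally. It assumes each row has the form `word,count`, which is a guess at the CSV layout rather than something this post specifies, and `merge_counts` is a hypothetical helper name. Check a real output file and adjust the parsing before relying on it.

```python
import csv
import io
from collections import Counter


def merge_counts(csv_streams):
    """Sum per-word counts across several open CSV files."""
    totals = Counter()
    for stream in csv_streams:
        for row in csv.reader(stream):
            # Assumption: each valid row is exactly "word,count".
            if len(row) != 2:
                continue
            word, raw_count = row
            try:
                count = int(raw_count)
            except ValueError:
                continue  # skip rows whose count is not an integer
            totals[word] += count
    return totals


if __name__ == "__main__":
    # In-memory stand-ins for two downloaded output files.
    part_one = io.StringIO("web,3\nopen,1\n")
    part_two = io.StringIO("web,2\ndata,4\nmalformed-row\n")
    for word, count in merge_counts([part_one, part_two]).most_common():
        print(word, count)
```

With real files, pass open file objects (for example from `open(path, newline="")`) instead of the `io.StringIO` stand-ins.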
Step 7 – Start playing!
If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to GitHub or Gitorious and send a message to our user group about it: we promise we’ll look at it.
We have started a Common Crawl discussion list to enable discussions and encourage collaboration between the community of coders, hackers, data scientists, developers and organizations interested in working with open web crawl data. Please join our discussion mailing list to:
- Discuss challenges
- Share ideas for projects and products
- Look for collaborators and partners
- Offer advice and share methods
- Ask questions and get advice from others
- Show off cool stuff you build
- Keep up to date on the latest news from Common Crawl
The Common Crawl discussion list uses Google Groups and you can sign up here.
It was wonderful to see our first blog post and the great piece by Marshall Kirkpatrick on ReadWriteWeb generate so much interest in Common Crawl last week! There were many questions raised on Twitter and in the comment sections of our blog, RWW and Hacker News. In this post we respond to the most common questions. Because it is a long blog post, we have provided a navigation list of questions below. Thanks for all the support and please keep the questions coming!
- Is there a sample dataset or sample .arc file?
- Is it possible to get a list of domain names?
- Is the code open source?
- Where can people obtain access to the Hadoop classes and other code?
- Where can people learn more about the stack and the processing architecture?
- How do you deal with spam and deduping?
- Why should anyone care about five billion pages when Google has so many more?
- How frequently is the crawl data updated?
- How is the metadata organized and stored?
- What is the cost for a simple Hadoop job over the entire corpus?
- Is the data available by torrent?
Is there a sample dataset or sample .arc file?
We are currently working to create a sample dataset so people can consume and experiment with a small segment of the data before dealing with the entire corpus. One commenter suggested that we create a focused crawl of blogs and RSS feeds, and I am happy to say that is just what we had in mind. Stay tuned: We will be announcing the sample dataset soon and posting a sample .arc file on our website even sooner!
Is your code open source?
Anything required to access the buckets or the Common Crawl data that we publish is open source, and any utility code that we develop as part of the crawl is also going to be made open source. However, the crawl infrastructure depends on our internal MapReduce and HDFS file system, and it is not yet in a state that would be useful to third parties. In the future, we plan to break more parts of the internal source code into self-contained pieces to be released as open source.
Where can people access the Hadoop classes and other code?
We have a GitHub repository that was temporarily down due to some accidental check-ins. It is now back up and can be found here and on the Accessing the Data page of our website.
Where can people learn more about the stack and the processing architecture?
We plan to make the details of our internal infrastructure available in a detailed blog post as soon as time allows. We are using all of our engineering brainpower to optimize the crawler, but we expect to have the bandwidth for additional technical documentation and communication soon. Meanwhile, you can check out a presentation given at a Hadoop user group by Ahad Rana on SlideShare.
How do you deal with spam and deduping?
We use shingling and simhash to do fuzzy deduping of the content we download. The corpus in S3 has not been filtered for spam, because it is not clear whether we should really remove spammy content from the crawl. For example, individuals who want to build a spam filter need access to a crawl with spam. This might be an area in which we can work with the open-source community to develop spam lists/filters.
In addition, we do not have the resources necessary to police the accuracy of any spam filters we develop and currently can only rely on algorithmic means of determining spam, which can sometimes produce false positives.
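For readers unfamiliar with the fuzzy deduping mentioned above, here is a minimal sketch of word shingling combined with a simhash fingerprint. It is a generic textbook-style version, not the code used in our crawler; the shingle size, the 64-bit width and the choice of SHA-256 as the underlying hash are all assumptions made for the example.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def simhash(text, bits=64, size=3):
    """Compute a simhash fingerprint; bits must be a multiple of 8, at most 256."""
    weights = [0] * bits
    for shingle in shingles(text, size):
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    first = "common crawl provides an open repository of web crawl data"
    second = "common crawl provides an open repository of web crawl data today"
    print(hamming_distance(simhash(first), simhash(second)))
```

Documents whose fingerprints differ in only a few bits are candidates for being near-duplicates; the cut-off for "a few" is a tuning decision that depends on the corpus.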
Why should anyone care about five billion pages when Google has so many more?
Although this question was not common like the others addressed in this post, I would like to respond to a comment on our blog:
“If 5 bln. is just the total number of different URLs you’ve downloaded, then it ain’t much. Google’s index was 1 billion way back in 2000, They’ve downloaded a trillion URLs by 2008. And they say most of is junk, that is simply not worth indexing.”
We are not trying to replace Google; our goal is to provide a high-quality, open corpus of web crawl data.
We agree that many of the pages on the web are junk, and we have no inclination to crawl a larger number of pages just for the sake of having a larger number. Five billion pages is a substantial corpus and, though we may expand the size in the near future, we are focused on quality over quantity.
Also, when Google announced they had a trillion URLs, that was the number of URLs they were aware of, not the number of pages they had downloaded. We have 15 billion URLs in our database, but we don’t currently download them all because those additional 10 billion are—in our judgment—not nearly as important as the five billion we do download. One outcome from our focus on the crawl’s quality is our system of ranking pages, which allows us to determine how important a page is and which of the five billion pages that make up our corpus are among the most important.
How frequently is the crawl data updated?
We spent most of 2011 tweaking the algorithms to improve the freshness and quality of the crawl. We will soon start the improved crawler. In 2012 there will be fresher and more consistent updates – we expect to crawl continuously and update the S3 buckets once a month.
We hope to work with the community to determine what additional metadata and focused crawls would be most valuable and what subsets of web pages should be crawled with the highest frequency.
How is the metadata organized and stored?
The page rank and other metadata we compute are not part of the S3 corpus, but we do collect this information and expect to make it available in a separate S3 bucket in Hadoop SequenceFiles format. On the subject of page ranking, please be aware that the page rank we compute for pages may not have a high degree of correlation to Google’s PageRank, since we do not use their PageRank algorithm.
What is the cost for a simple Hadoop job over the entire corpus?
- The Common Crawl corpus is approximately 40TB.
- Crawl data is stored on S3 in the form of 100MB compressed archives.
- There are between 400K and 500K such files in the corpus.
- If you open multiple S3 streams in parallel, maintain an average 1MB/sec throughput per S3 stream and start 10 parallel streams per Mapper, you should sustain a throughput of 10 MB/sec.
- If you run one Mapper per EC2 small instance and start 100 instances, you would have an aggregate throughput of ~3TB/hour.
- At that rate you would need 16 hours to scan 50TB of data – a total of 1600 machine hours.
- 1600 machine hours at $0.085 per hour will cost ~$136.
- The cost of any subsequent aggregation/data consolidation jobs and the cost of storing your final data on S3 brings you to a total cost of approximately $150.
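The arithmetic in the list above can be checked with a short Python sketch. The function names are hypothetical, and decimal units (1 TB = 1,000,000 MB) are an assumption. Multiplying the raw figures gives about 3.6 TB/hour, of which the ~3 TB/hour used above is a more conservative rounding, and 1600 machine hours at $0.085 comes to $136, in the same range as the estimate above.

```python
def aggregate_throughput_tb_per_hour(instances, streams_per_mapper,
                                     mb_per_sec_per_stream):
    """Total read throughput across all instances, in TB per hour."""
    mb_per_sec = instances * streams_per_mapper * mb_per_sec_per_stream
    # Assumption: decimal units, 1 TB = 1,000,000 MB.
    return mb_per_sec * 3600 / 1_000_000


def machine_hours(wall_clock_hours, instances):
    """Billable hours when every instance runs for the whole job."""
    return wall_clock_hours * instances


def compute_cost(hours, price_per_hour):
    """Cost in dollars for the given number of machine hours."""
    return hours * price_per_hour


if __name__ == "__main__":
    # Figures taken from the list above: 100 small instances, 10 streams
    # per mapper, 1 MB/sec per stream, 16 hours, $0.085 per machine hour.
    print(aggregate_throughput_tb_per_hour(100, 10, 1.0))
    hours = machine_hours(16, 100)
    print(hours, round(compute_cost(hours, 0.085), 2))
```

Changing the instance count leaves the machine-hour total roughly constant under this simple model, so more instances mainly shorten the wall-clock time rather than lower the bill.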
Is the data available by torrent?
Do you mean the distribution of a subset of the data via torrents, or do you mean the distribution of updates to the crawl via torrents? The current data set is 40+ TB in size, and it seems to us to be too big to be distributed via this mechanism, but perhaps we are wrong. If you have some ideas about how we could go about doing this, and whether or not it would require significant bandwidth resources on our part, we would love to hear from you.
A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web. He knew that decreasing storage and bandwidth costs, along with the increasing ease of crunching big data, made building and maintaining an open repository of web crawl data feasible. More important than the fact that it could be built was his powerful belief that it should be built. The web is the largest collection of information in human history, and web crawl data provides an immensely rich corpus for scientific research, technological advancement, and business innovation. Gil started the Common Crawl Foundation to take action on the belief that it is crucial to our information-based society that web crawl data be open and accessible to anyone who desires to utilize it.
That was the inspiration phase of Common Crawl – one person with a passion for openness forming a new foundation to work towards democratizing access to web information, thereby driving a new wave of innovation. Common Crawl quickly moved into the building phase, as Gil found others who shared his belief in the open web. In 2008, Carl Malamud and Nova Spivack joined Gil to form the Common Crawl board of directors. Talented engineer Ahad Rana began developing the technology for our crawler and processing pipeline. Today, thanks to the robust system that Ahad has built, we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.
Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.
Over the next several months, we will be expanding our website and using this blog to describe our technology and data, communicate our philosophy, share ideas, and report on the products of our collaborations. We will also be working to build up a GitHub repository of code that has been and can be used to work with Common Crawl data. Most important, we will be talking with the community of people who share our interests. Thinking about an application you’d like to see built on Common Crawl data? Have Hadoop scripts that could be adapted to find insightful information in the crawl data? Know of a stimulating meetup, conference or hackathon we should attend? We want to hear from you!
This is the phase where the original vision truly comes to life, and the ideas Gil Elbaz had years ago will be converted to new products and insights. To say it is an exciting time is a tremendous understatement.
Hear Common Crawl’s founder discuss why data accessibility is crucial to increasing rates of innovation and share ideas on how to facilitate increased access to data.