<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>CommonCrawl</title>
	<atom:link href="http://commoncrawl.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://commoncrawl.org</link>
	<description></description>
	<lastBuildDate>Wed, 15 Feb 2012 22:24:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Common Crawl&#8217;s Advisory Board</title>
		<link>http://commoncrawl.org/common-crawls-advisory-board/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=common-crawls-advisory-board</link>
		<comments>http://commoncrawl.org/common-crawls-advisory-board/#comments</comments>
		<pubDate>Wed, 15 Feb 2012 20:42:39 +0000</pubDate>
		<dc:creator>Allison Domicone</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[advisory board]]></category>

		<guid isPermaLink="false">http://commoncrawl.org/?p=12603</guid>
		<description><![CDATA[As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts. We have a stellar line-up of advisory board members who will lend their passion and expertise in numerous fields as we grow our vision. Together with [...]]]></description>
			<content:encoded><![CDATA[<p>As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts. We have a stellar line-up of advisory board members who will lend their passion and expertise in numerous fields as we grow our vision. Together with our dedicated <a title="Board of Directors" href="http://commoncrawl.org/about/board-of-directors/">Board of Directors</a>, we feel the organization is more prepared than ever to usher in an exciting new phase for Common Crawl and a new wave of innovation in education, business, and research.</p>
<p>Here is a brief introduction to the men and women who have generously agreed to donate their time and brainpower to Common Crawl. Full bios are available on our <a href="../about/advisory-board/">Advisory Board page</a>.</p>
<p>Our legal counsel, <a href="http://www.linkedin.com/pub/kevin-debre/0/16/900" target="_blank">Kevin DeBré</a>, is a well respected Intellectual Property (IP) attorney who has continually worked at the forefront of the evolving IP landscape. <a href="http://www.linkedin.com/pub/glenn-otis-brown/13/448/704" target="_blank">Glenn Otis Brown</a> brings additional legal expertise as well as a long history of working at the forefront of tech and the open web, including currently serving as Director of Business Development for Twitter and on the board of Creative Commons. Another strong advocate for openness, <a href="http://en.wikipedia.org/wiki/Joi_Ito">Joi Ito</a>, is Director of the MIT Media Lab and Creative Commons Board Chair, who brings with him years of innovative work as a thought-leader in the field.</p>
<p>We look forward to the advice of <a href="http://en.wikipedia.org/wiki/Jennifer_Pahlka">Jen Pahlka</a>, founder and Executive Director at Code for America. Jen has led Code for America through a remarkable two years of growth to become a high-impact success, and we are delighted to have her insight on growing a non-profit as well as her experience working with government. <a href="http://www.linkedin.com/in/evaho1" target="_blank">Eva Ho</a>, VP of Marketing &amp; Operations at Factual who has also served on the boards of several nonprofits, brings additional insight into nonprofit management, as well as valuable experience around big data.</p>
<p>Big data is critical to our work of maintaining an open crawl of the web, and we are fortunate to have numerous experts who can advise on this critical area. <a href="http://en.wikipedia.org/wiki/Kurt_Bollacker">Kurt Bollacker</a> is the Digital Research Director of the Long Now Foundation and he formerly served as Technical Director at Internet Archive and Chief Scientist at Metaweb. <a href="http://www.linkedin.com/in/peterskomoroch">Pete Skomoroch</a> is a highly respected data scientist, currently employed by LinkedIn, who brings with him substantial knowledge about machine learning and search. <a href="http://www.linkedin.com/in/shimanovsky">Boris Shimanovsky</a> is a prolific, lifelong programmer and Director of Engineering at Factual. <a href="http://www.linkedin.com/in/petewarden">Pete Warden</a>, also a programmer, is the current CTO of Jetpac and a highly respected expert in large-scale data processing and visualization.</p>
<p><a href="http://en.wikipedia.org/wiki/Danny_Sullivan_(technologist)">Danny Sullivan</a>, widely considered a leading &#8220;search engine guru,&#8221; will bring valuable guidance and insight as Common Crawl grows and develops. <a href="http://www.linkedin.com/in/billmichels">Bill Michels</a> is another member of our team with extensive experience in search from his years at Yahoo! which include working as Director of Yahoo! BOSS. We are very lucky to have <a href="http://en.wikipedia.org/wiki/Peter_Norvig">Peter Norvig</a>, Director of Research at Google and a Fellow of the American Association for Artificial Intelligence and the Association for Computing Machinery.</p>
<p>We are delighted that such an array of talented people see the importance in the work we do, and are honored to have their guidance as we look forward to a year of growth and milestones for Common Crawl.</p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/common-crawls-advisory-board/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Article: Tech Entrepreneur Gil Elbaz made it big in L.A.</title>
		<link>http://commoncrawl.org/12579/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=12579</link>
		<comments>http://commoncrawl.org/12579/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 17:36:27 +0000</pubDate>
		<dc:creator>Allison Domicone</dc:creator>
				<category><![CDATA[Media]]></category>

		<guid isPermaLink="false">http://commoncrawl.org/?p=12579</guid>
		<description><![CDATA[February 5, 2012]]></description>
			<content:encoded><![CDATA[<p><a href="http://commoncrawl.org/12579/latimeslogosmall/" rel="attachment wp-att-12580"><img class="alignnone  wp-image-12580" title="LATimeslogoSmall" src="http://commoncrawl.org/wp-content/uploads/2012/02/LATimeslogoSmall.png" alt="" width="350" height="54" /></a></p>
<p>Published in Los Angeles Times: <a href="http://www.latimes.com/business/la-fi-himi-elbaz-20120205,0,3052440.story" target="_blank">Tech Entrepreneur Gil Elbaz made it big in L.A.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/12579/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Common Crawl on AWS Public Data Sets</title>
		<link>http://commoncrawl.org/common-crawl-on-aws-public-data-sets/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=common-crawl-on-aws-public-data-sets</link>
		<comments>http://commoncrawl.org/common-crawl-on-aws-public-data-sets/#comments</comments>
		<pubDate>Mon, 23 Jan 2012 18:19:34 +0000</pubDate>
		<dc:creator>Lisa Green</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12491</guid>
		<description><![CDATA[Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services&#8217; Public Data Sets. This is great news because it means that the Common Crawl data corpus is now much more readily accessible and visible to the public. The greater accessibility and visibility is a significant help in our mission [...]]]></description>
			<content:encoded><![CDATA[<div><img class="size-full wp-image-12496 alignleft" title="AWS_LOGO_CMYK" src="http://www.commoncrawl.org/wp-content/uploads/2012/01/AWS_LOGO_CMYK.png" alt="" width="207" height="75" />
</div>
<p>Common Crawl is thrilled to announce that <a href="http://aws.amazon.com/datasets/4174056542375970">our data is now hosted on Amazon Web Services&#8217; Public Data Sets</a>. This is great news because it means that the Common Crawl data corpus is now much more readily accessible and visible to the public. The greater accessibility and visibility is a significant help in our mission of enabling a new wave of innovation, education, and research.</p>
<div>
<p>Amazon Web Services (AWS) provides a centralized repository of public data sets that can be integrated in AWS cloud-based applications. AWS makes available such estimable large data sets as the mapping of the Human Genome and the US Census. Previously, such data was often prohibitively difficult to access and use. With the Amazon Elastic Compute Cloud, it takes a matter of minutes to begin computing on the data.</p>
<p>Demonstrating their commitment to an open web, AWS hosts public data sets at no charge for the community, so users pay only for the compute and storage they use for their own applications. What this means for you is that our data &#8211; all 5 billion web pages of it &#8211; just got a whole lot slicker and easier to use.</p>
<p>We greatly appreciate Amazon’s support for the open web in general, and we’re especially appreciative of their support for Common Crawl. Placing our data in the public data sets not only benefits the larger community, but it also saves us money. As a nonprofit in the early phases of existence, this is crucial.</p>
<p>A huge thanks to Amazon for seeing the importance in the work we do and for so generously supporting our shared goal of enabling increased open innovation!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/common-crawl-on-aws-public-data-sets/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Article: SemanticWeb.com</title>
		<link>http://commoncrawl.org/article-semanticweb-com/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=article-semanticweb-com</link>
		<comments>http://commoncrawl.org/article-semanticweb-com/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 18:50:34 +0000</pubDate>
		<dc:creator>Lisa Green</dc:creator>
				<category><![CDATA[Media]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12521</guid>
		<description><![CDATA[January 20, 2012]]></description>
			<content:encoded><![CDATA[<p><a href="http://semanticweb.com/common-crawl-founder-gil-elbaz-speaks-about-new-relationship-with-amazon-semantic-web-projects-using-its-corpus-and-why-open-web-crawls-matter-to-developing-big-data-expertise_b26109"><img class="size-full wp-image-12522 alignleft" title="semanticweb.com logo" src="http://www.commoncrawl.org/wp-content/uploads/2012/01/semanticweb.com-logo.jpg" alt="" width="180" height="180" /></a></p>
<p>Published on SemanticWeb.com: <a href="http://semanticweb.com/common-crawl-founder-gil-elbaz-speaks-about-new-relationship-with-amazon-semantic-web-projects-using-its-corpus-and-why-open-web-crawls-matter-to-developing-big-data-expertise_b26109">Common Crawl Founder Gil Elbaz Speaks About New Relationship With Amazon, Semantic Web Projects Using Its Corpus, And Why Open Web Crawls Matter To Developing Big Data Expertise</a></p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/article-semanticweb-com/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Article: I Programmer</title>
		<link>http://commoncrawl.org/article-i-programmer/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=article-i-programmer</link>
		<comments>http://commoncrawl.org/article-i-programmer/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 00:04:37 +0000</pubDate>
		<dc:creator>Allison Domicone</dc:creator>
				<category><![CDATA[Media]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12454</guid>
		<description><![CDATA[November 14, 2011]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.i-programmer.info/news/136-open-source/3320-common-crawl.html"><img class="alignnone size-full wp-image-12471" title="i-programmer" src="http://www.commoncrawl.org/wp-content/uploads/2012/01/i-programmer.png" alt="" width="298" height="75" /></a></p>
<p>Published on &#8220;I Programmer&#8221;: <a href="http://www.i-programmer.info/news/136-open-source/3320-common-crawl.html" target="_blank">Common Crawl &#8211; now everyone can be Google</a></p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/article-i-programmer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Article: Read Write Web</title>
		<link>http://commoncrawl.org/article-read-write-web/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=article-read-write-web</link>
		<comments>http://commoncrawl.org/article-read-write-web/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 00:02:56 +0000</pubDate>
		<dc:creator>Allison Domicone</dc:creator>
				<category><![CDATA[Media]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12452</guid>
		<description><![CDATA[November 7, 2011]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php"><img class="alignnone size-full wp-image-12519" title="readwriteweb_logo" src="http://www.commoncrawl.org/wp-content/uploads/2012/01/readwriteweb_logo.jpg" alt="" width="371" height="109" /></a></p>
<p>Published on Read Write Web: <a href="http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php" target="_blank">New 5 Billion Page Web Index with Page Rank Now Available for Free from Common Crawl</a></p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/article-read-write-web/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Video Tutorial: MapReduce for the Masses</title>
		<link>http://commoncrawl.org/video-tutorial-zero-to-hadoop-in-five-minutes-with-common-crawl/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=video-tutorial-zero-to-hadoop-in-five-minutes-with-common-crawl</link>
		<comments>http://commoncrawl.org/video-tutorial-zero-to-hadoop-in-five-minutes-with-common-crawl/#comments</comments>
		<pubDate>Thu, 19 Jan 2012 23:23:54 +0000</pubDate>
		<dc:creator>Allison Domicone</dc:creator>
				<category><![CDATA[Media]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12435</guid>
		<description><![CDATA[Learn how you can harness the power of MapReduce. ]]></description>
			<content:encoded><![CDATA[<p>Learn how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents. Check out the full <a title="MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl" href="http://www.commoncrawl.org/mapreduce-for-the-masses/">blog post</a> where this video originally appeared.</p>
<p><iframe src="http://www.youtube.com/embed/y4GZ0Ey9DVw" width="625" height="469"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/video-tutorial-zero-to-hadoop-in-five-minutes-with-common-crawl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gil Elbaz and Nova Spivack on This Week in Startups</title>
		<link>http://commoncrawl.org/gil-elbaz-and-nova-spivack-on-this-week-in-startups/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gil-elbaz-and-nova-spivack-on-this-week-in-startups</link>
		<comments>http://commoncrawl.org/gil-elbaz-and-nova-spivack-on-this-week-in-startups/#comments</comments>
		<pubDate>Thu, 12 Jan 2012 17:50:37 +0000</pubDate>
		<dc:creator>Allison Domicone</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[gil elbaz]]></category>
		<category><![CDATA[nova spivack]]></category>
		<category><![CDATA[this week in startups]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12373</guid>
		<description><![CDATA[As a sign of many more good things to come in 2012, Founder Gil Elbaz and Board Member Nova Spivack appeared on this week&#8217;s episode of This Week in Startups. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger [...]]]></description>
			<content:encoded><![CDATA[<p>As a sign of many more good things to come in 2012, Founder Gil Elbaz and Board Member Nova Spivack appeared on this week&#8217;s episode of <a href="http://thisweekin.com/thisweekin-startups/gil-elbaz-and-nova-spivack-of-common-crawl-on-this-week-in-startups-222/" target="_blank">This Week in Startups</a>. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing. Underlying their conversation is an exploration of how Common Crawl&#8217;s open crawl of the web is a powerful asset for educators, researchers, and entrepreneurs.</p>
<p>Some of my favorite moments from the show include:</p>
<ul>
<li>In a great sound bite from Jason at the beginning of the show, he observes that Common Crawl is in many ways the &#8220;Wikipedia of the search engine.&#8221; (8:50)</li>
<li>When the question is posed whether or not Common Crawl may eventually charge some fee for our data and tools, Nova&#8217;s response that Common Crawl is &#8220;better if it&#8217;s free&#8230; [We] want this to be like the public library system&#8221; captures the spirit of Common Crawl&#8217;s mission and our commitment to the open web. (32:00)</li>
<li>When asked about projects and applications that would benefit from Common Crawl, Gil makes a compelling case for organizations that can use Common Crawl as a teaching tool. If someone wants to teach Hadoop at scale, for example, it&#8217;s essential for them to have a realistic corpus to work with &#8212; and Common Crawl can provide that. (46:18)</li>
</ul>
<p>Those are just a few of the highlights, but I highly recommend watching the episode in its entirety for even more insights from Gil and Nova as we gear up for big things ahead for Common Crawl!</p>
<p><iframe src="http://www.youtube.com/embed/cjtZW6hR_o0" frameborder="0" width="560" height="315"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/gil-elbaz-and-nova-spivack-on-this-week-in-startups/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Video: This Week in Startups &#8211; Gil Elbaz and Nova Spivack</title>
		<link>http://commoncrawl.org/video-this-week-in-startups/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=video-this-week-in-startups</link>
		<comments>http://commoncrawl.org/video-this-week-in-startups/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 23:19:38 +0000</pubDate>
		<dc:creator>Allison Domicone</dc:creator>
				<category><![CDATA[Media]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12432</guid>
		<description><![CDATA[January 10, 2012]]></description>
			<content:encoded><![CDATA[<p>Founder Gil Elbaz and Board Member Nova Spivack appeared on <a href="http://thisweekin.com/thisweekin-startups/gil-elbaz-and-nova-spivack-of-common-crawl-on-this-week-in-startups-222/" target="_blank">This Week in Startups</a> on January 10, 2012. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing. Underlying their conversation is an exploration of how Common Crawl’s open crawl of the web is a powerful asset for educators, researchers, and entrepreneurs.</p>
<p><iframe src="http://www.youtube.com/embed/cjtZW6hR_o0" width="625" height="469"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/video-this-week-in-startups/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl</title>
		<link>http://commoncrawl.org/mapreduce-for-the-masses/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=mapreduce-for-the-masses</link>
		<comments>http://commoncrawl.org/mapreduce-for-the-masses/#comments</comments>
		<pubDate>Fri, 16 Dec 2011 20:21:06 +0000</pubDate>
		<dc:creator>Steve Salevan</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[crawl]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[howto]]></category>

		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12242</guid>
		<description><![CDATA[Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information in the Amazon cloud, a net total of 5 billion crawled pages. In this blog post, we&#8217;ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset [...]]]></description>
			<content:encoded><![CDATA[<p>Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information in the Amazon cloud, a net total of 5 billion crawled pages. In this blog post, we&#8217;ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.</p>
<p>When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, the paper finally made speedy analysis of data at massive scale a reality, turning many orthodox notions of data analysis on their head.</p>
<p>With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?</p>
<p>This is the very question we hope to answer with this blog post, and the example we&#8217;ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we&#8217;ll be running it against the Internet.</p>
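<p>To make the word-counter idea concrete before we bring Hadoop into it, here is a minimal sketch of the same map-then-reduce pattern in plain Python. It is not the code in our HelloWorld JAR and it uses no Hadoop APIs; the function names and the two sample sentences are made up purely for illustration.</p>

```python
import re
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in re.findall(r"[a-z0-9']+", document.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(documents):
    """Run the map step over every document, then reduce the combined pairs."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    return reduce_counts(pairs)


# Two tiny stand-in "pages"; a real job would read crawled documents instead.
counts = word_count(["the open web", "the open crawl of the web"])
print(counts)
```

<p>In a real MapReduce job the map step runs on many machines at once and the framework groups the pairs by word before the reduce step sees them; the sketch collapses all of that into a single process.</p>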
<p>When you&#8217;ve got a taste of what&#8217;s possible when open source meets open data, we&#8217;d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!</p>
<p>Ready to get started?  Watch our screencast and follow along below:</p>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/y4GZ0Ey9DVw?fs=1&#038;feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p style="text-align: center"><span style="text-decoration: underline"><strong>Step 1 &#8211; Install Git and Eclipse</strong></span></p>
<p>We first need to install a few important tools to get started:</p>
<p><strong>Eclipse (for writing Hadoop code)</strong></p>
<p><span style="text-decoration: underline">How to install (Windows and OS X):</span></p>
<p>Download the &#8220;Eclipse IDE for Java developers&#8221; installer package located at:</p>
<p style="text-align: left"><a title="http://www.eclipse.org/downloads/" href="http://www.eclipse.org/downloads/" target="_blank">http://www.eclipse.org/downloads/</a></p>
<p><span style="text-decoration: underline">How to install (Linux):</span></p>
<p>Run the following command in a terminal:</p>
<p><em>RHEL/Fedora</em></p>
<pre># sudo yum install eclipse</pre>
<p><em>Ubuntu/Debian</em></p>
<pre># sudo apt-get install eclipse</pre>
<p><strong><span style="text-decoration: underline">Git (for retrieving our sample application)</span></strong></p>
<p><span style="text-decoration: underline">How to install (Windows)</span></p>
<p>Install the latest .EXE from:</p>
<p style="text-align: left"><a title="http://code.google.com/p/msysgit/downloads/list" href="http://code.google.com/p/msysgit/downloads/list" target="_blank">http://code.google.com/p/msysgit/downloads/list</a></p>
<p><span style="text-decoration: underline">How to install (OS X)</span></p>
<p>Install the appropriate .DMG from:</p>
<p style="text-align: left"><a title="http://code.google.com/p/git-osx-installer/downloads/list" href="http://code.google.com/p/git-osx-installer/downloads/list" target="_blank">http://code.google.com/p/git-osx-installer/downloads/list</a></p>
<p><span style="text-decoration: underline">How to install (Linux)</span></p>
<p>Run the following command in a terminal:</p>
<p><em>RHEL/Fedora</em></p>
<pre># sudo yum install git</pre>
<p><em>Ubuntu/Debian</em></p>
<pre># sudo apt-get install git</pre>
<p style="text-align: center"><span style="text-decoration: underline"><strong>Step 2 &#8211; Check out the code and compile the HelloWorld JAR</strong></span></p>
<p>Now that you&#8217;ve installed the packages you need to play with our code, run the following command from a terminal/command prompt to pull down the code:</p>
<pre># git clone git://github.com/ssalevan/cc-helloworld.git</pre>
<p>Next, start Eclipse.  Open the File menu, then select &#8220;Project&#8221; from the &#8220;New&#8221; menu.  Open the &#8220;Java&#8221; folder and select &#8220;Java Project from Existing Ant Buildfile&#8221;.  Click Browse, then locate the folder containing the code you just checked out (if you didn&#8217;t change the directory when you opened the terminal, it should be in your home directory) and select the &#8220;build.xml&#8221; file.  Eclipse will find the right targets.  Tick the &#8220;Link to the buildfile in the file system&#8221; box, as this will enable you to share the edits you make to it in Eclipse with git.</p>
<p>We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it&#8217;s named &#8220;Hello World&#8221;) and select &#8220;Properties&#8221; from the menu that appears.  Navigate to the Builders tab in the left hand panel of the Properties window, then click &#8220;New&#8221;.  Select &#8220;Ant Builder&#8221; from the dialog which appears, then click OK.</p>
<p>To configure our new Ant builder, we need to specify three pieces of information here: where the buildfile is located, where the root directory of the project is, and which ant build target we wish to execute.  To set the buildfile, click the &#8220;Browse File System&#8221; button under the &#8220;Buildfile:&#8221; field, and find the build.xml file which you found earlier.  To set the root directory, click the &#8220;Browse File System&#8221; button under the &#8220;Base Directory:&#8221; field, and select the folder into which you checked out our code.  To specify the target, enter &#8220;dist&#8221; without the quotes into the &#8220;Arguments&#8221; field.  Click OK and close the Properties window.</p>
<p>Finally, right click on the base project folder and select &#8220;Build Project&#8221;, and Ant will assemble a JAR, ready for use in Elastic MapReduce.</p>
<p style="text-align: center"><span style="text-decoration: underline"><strong>Step 3 &#8211; Get an Amazon Web Services account (if you don’t have one already) and find your security credentials</strong></span></p>
<p>If you don&#8217;t already have an account with Amazon Web Services, you can sign up for one at the following URL:</p>
<p style="text-align: left"><a title="https://aws-portal.amazon.com/gp/aws/developer/registration/index.html" href="https://aws-portal.amazon.com/gp/aws/developer/registration/index.html" target="_blank">https://aws-portal.amazon.com/gp/aws/developer/registration/index.html</a></p>
<p>Once you&#8217;ve registered, visit the following page and copy down your Access Key ID and Secret Access Key:</p>
<p style="text-align: left"><a title="https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key" href="https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key" target="_blank">https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key</a></p>
<p>This information can be used by any Amazon Web Services client to authorize things that cost money, so be sure to keep this information in a safe place.</p>
<p style="text-align: center"><strong><span style="text-decoration: underline">Step 4 &#8211; Upload the HelloWorld JAR to Amazon S3</span></strong></p>
<p>Uploading the JAR we just built to Amazon S3 is a lot simpler than it sounds. First, visit the following URL:</p>
<p style="text-align: left"><a title="https://console.aws.amazon.com/s3/home" href="https://console.aws.amazon.com/s3/home" target="_blank">https://console.aws.amazon.com/s3/home</a></p>
<p>Next, click &#8220;Create Bucket&#8221;, give your bucket a name, and click the &#8220;Create&#8221; button. Select your new S3 bucket in the left-hand pane, then click the &#8220;Upload&#8221; button, and select the JAR you just built. It should be located here:</p>
<p style="text-align: center"><strong>&lt;your checkout dir&gt;</strong>/dist/lib/HelloWorld.jar</p>
<p style="text-align: center"><span style="text-decoration: underline"><strong>Step 5 &#8211; Create an Elastic MapReduce job based on your new JAR</strong></span></p>
<p>Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it, and as it so happens, that&#8217;s pretty easy to do too! Visit the following URL:</p>
<p style="text-align: left"><a title="https://console.aws.amazon.com/elasticmapreduce/home" href="https://console.aws.amazon.com/elasticmapreduce/home" target="_blank">https://console.aws.amazon.com/elasticmapreduce/home</a></p>
<p>and click the &#8220;Create New Job Flow&#8221; button. Give your new flow a name, and tick the &#8220;Run your own application&#8221; box. Select &#8220;Custom JAR&#8221; from the &#8220;Choose a Job Type&#8221; menu and click the &#8220;Continue&#8221; button.</p>
<p>The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:</p>
<p style="text-align: center">s3n://<strong>&lt;your bucket name&gt;</strong>/HelloWorld.jar</p>
<p>then add the following arguments to it:</p>
<p>org.commoncrawl.tutorial.HelloWorld <strong>&lt;your aws access key id&gt;</strong> <strong>&lt;your aws secret key&gt;</strong> 2010/01/07/18/1262876244253_18.arc.gz s3n://<strong>&lt;your bucket name&gt;</strong>/helloworld-out</p>
<p>Common Crawl stores its crawl information as GZipped ARC-formatted files (<a title="http://www.archive.org/web/researcher/ArcFileFormat.php" href="http://www.archive.org/web/researcher/ArcFileFormat.php" target="_blank">http://www.archive.org/web/researcher/ArcFileFormat.php</a>), and each one is indexed using the following strategy:</p>
<p style="text-align: center">/<strong>YYYY</strong>/<strong>MM</strong>/<strong>DD</strong>/<strong>the hour that the crawler ran in 24-hour format</strong>/*.arc.gz</p>
<p>Thus, by passing these arguments to the JAR we uploaded, we&#8217;re telling Hadoop to:</p>
<p>1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)</p>
<p>2. Log into Amazon S3 with your AWS access codes</p>
<p>3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010</p>
<p>4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)</p>
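<p>If you would rather generate those arguments than type them by hand, the layout above can be sketched in a few lines of Python. This is only an illustration: the helper names, the placeholder credentials, and the bucket name are all made up, and nothing here talks to AWS.</p>

```python
from datetime import datetime


def crawl_prefix(when):
    """Return the YYYY/MM/DD/HH prefix for the hour in which the crawler ran."""
    return when.strftime("%Y/%m/%d/%H")


def job_arguments(access_key_id, secret_key, input_path, bucket):
    """Assemble the five arguments in the order the tutorial passes them."""
    return [
        "org.commoncrawl.tutorial.HelloWorld",  # class whose main() is run
        access_key_id,                          # placeholder, never a real key
        secret_key,                             # placeholder, never a real key
        input_path,                             # ARC file (or prefix) to read
        "s3n://" + bucket + "/helloworld-out",  # where the results are written
    ]


# 6:00PM on January 7th, 2010, the crawler run used in this post.
prefix = crawl_prefix(datetime(2010, 1, 7, 18))
args = job_arguments(
    "EXAMPLE_KEY_ID",
    "EXAMPLE_SECRET",
    prefix + "/1262876244253_18.arc.gz",
    "my-bucket",
)
print(" ".join(args))
```

<p>Printing the joined list gives a single line you can paste into the arguments field, after swapping the placeholders for your own values.</p>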
<p><em>Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)</em></p>
<p>If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:</p>
<p><strong>2010</strong>/<strong>01</strong>/<strong>07</strong>/<strong>18</strong> - All files from this particular crawler run (6PM, January 7th 2010)</p>
<p><strong>2010/ </strong>- All crawl files from 2010</p>
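<p>As a rough mental model, directory prefix notation behaves like a starts-with filter over file paths. The sketch below assumes exactly that and nothing more; the second and third paths are invented for the example, and the real selection happens inside Hadoop and S3 rather than in code like this.</p>

```python
def select_by_prefix(keys, prefix):
    """Keep only the keys that fall under the given directory prefix."""
    # Normalise to a trailing slash so "2010" cannot match "20100/...".
    normalized = prefix.rstrip("/") + "/"
    return [key for key in keys if key.startswith(normalized)]


keys = [
    "2010/01/07/18/1262876244253_18.arc.gz",  # the file used in this post
    "2010/01/07/19/hypothetical_19.arc.gz",   # invented for illustration
    "2009/12/31/23/hypothetical_23.arc.gz",   # invented for illustration
]

one_run = select_by_prefix(keys, "2010/01/07/18")
whole_year = select_by_prefix(keys, "2010/")
print(one_run)
print(whole_year)
```

<p>The shorter the prefix, the more files it matches, and the longer (and more expensive) your job will be.</p>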
<p>Don&#8217;t worry about the fields on the remaining wizard screens for now; just accept the default values. If you’re offered the opportunity to use debugging, I recommend enabling it so that you can see your job in action. Once you&#8217;ve clicked through them all, click the &#8220;Create Job Flow&#8221; button and your Hadoop job will be sent to the Amazon cloud.</p>
<p style="text-align: center"><span style="text-decoration: underline"><strong>Step 6 &#8211; Watch the show</strong></span></p>
<p>Now just wait and watch as your job runs through the Hadoop flow; you can look for errors by using the Debug button. Within about 10 minutes, your job will be complete. You can view the results in the S3 console, in the helloworld-out directory of your bucket. If you download these files and load them into a text editor, you can see what came out of the job. You can take this sort of data and add it into a database, or create a new Hadoop OutputFormat to export into XML which you can render into HTML with an XSLT; the possibilities are pretty much endless.</p>
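<p>As one example of what you might do with the downloaded files, here is a small Python sketch that totals the counts and lists the most frequent words. It assumes each row has exactly two columns, a word followed by its count; check that against your own output before relying on it. The sample data is made up.</p>

```python
import csv
import io
from collections import Counter


def top_words(csv_text, limit):
    """Sum the counts per word from word,count rows and return the top entries."""
    totals = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        word, count = row
        totals[word] += int(count)
    return totals.most_common(limit)


# Made-up sample standing in for the contents of one or more output files.
sample = "web,3\ncrawl,5\nopen,2\nweb,4\n"
ranking = top_words(sample, 2)
print(ranking)
```

<p>The same word can show up in more than one output file, which is why the sketch adds counts together instead of overwriting them.</p>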
<p style="text-align: center"><span style="text-decoration: underline"><strong>Step 7 &#8211; Start playing!</strong></span></p>
<p>If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to <a title="GitHub" href="http://github.com" target="_blank">GitHub</a> or <a title="Gitorious" href="http://gitorious.org" target="_blank">Gitorious</a> and send a message to our <a title="user group" href="http://groups.google.com/group/common-crawl" target="_blank">user group</a> about it: we promise we’ll look at it.</p>
]]></content:encoded>
			<wfw:commentRss>http://commoncrawl.org/mapreduce-for-the-masses/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
	</channel>
</rss>
