<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for CommonCrawl</title>
	<atom:link href="http://commoncrawl.org/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://commoncrawl.org</link>
	<description></description>
	<lastBuildDate>Thu, 16 Feb 2012 02:53:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>Comment on Answers to Recent Community Questions by Audris</title>
		<link>http://commoncrawl.org/answers-to-recent-community-questions/#comment-426</link>
		<dc:creator>Audris</dc:creator>
		<pubDate>Thu, 16 Feb 2012 02:53:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12082#comment-426</guid>
		<description>No. I wanted to pursue http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/
but have had no chance to do that yet. That post mentions copying the data first:
&quot;The first thing was to get the data into a hadoop cluster. 
It&#039;s made up of 300,000 100mb gzipped arc files stored in S3.
I wrote a dead simple 
distributed copy to do this.&quot;
and:
&quot;Pulling from S3 to EC2 is network bound so I ran using the 
MultithreadedMapRunner to ensure I could get as much throughput as possible.&quot;
</description>
		<content:encoded><![CDATA[<p>No. I wanted to pursue <a href="http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/" rel="nofollow">http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/</a><br />
but have had no chance to do that yet. That post mentions copying the data first:<br />
&#8220;The first thing was to get the data into a hadoop cluster.<br />
It&#8217;s made up of 300,000 100mb gzipped arc files stored in S3.<br />
I wrote a dead simple<br />
distributed copy to do this.&#8221;<br />
and:<br />
&#8220;Pulling from S3 to EC2 is network bound so I ran using the<br />
MultithreadedMapRunner to ensure I could get as much throughput as possible.&#8221;</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Answers to Recent Community Questions by Marcus</title>
		<link>http://commoncrawl.org/answers-to-recent-community-questions/#comment-424</link>
		<dc:creator>Marcus</dc:creator>
		<pubDate>Sat, 11 Feb 2012 19:55:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12082#comment-424</guid>
		<description>Did you determine how to increase throughput?</description>
		<content:encoded><![CDATA[<p>Did you determine how to increase throughput?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on SlideShare: Building a Scalable Web Crawler with Hadoop by Building a Scalable Web Crawler with Hadoop &#171; Another Word For It</title>
		<link>http://commoncrawl.org/slideshare-building-a-scalable-web-crawler-with-hadoop/#comment-421</link>
		<dc:creator>Building a Scalable Web Crawler with Hadoop &#171; Another Word For It</dc:creator>
		<pubDate>Fri, 27 Jan 2012 21:31:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12456#comment-421</guid>
		<description>[...] Building a Scalable Web Crawler with Hadoop [...]</description>
		<content:encoded><![CDATA[<p>[...] Building a Scalable Web Crawler with Hadoop [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Common Crawl on AWS Public Data Sets by Stephan</title>
		<link>http://commoncrawl.org/common-crawl-on-aws-public-data-sets/#comment-420</link>
		<dc:creator>Stephan</dc:creator>
		<pubDate>Fri, 27 Jan 2012 06:16:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12491#comment-420</guid>
		<description>Bittorrent is not an option? :-)</description>
		<content:encoded><![CDATA[<p>Bittorrent is not an option? :-)</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by snyff</title>
		<link>http://commoncrawl.org/mapreduce-for-the-masses/#comment-419</link>
		<dc:creator>snyff</dc:creator>
		<pubDate>Thu, 26 Jan 2012 05:39:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12242#comment-419</guid>
		<description>If anyone is interested in direct access to the data, you can use aws (https://github.com/timkay/aws):

% perl aws get &quot;x-amz-request-payer:requester&quot;   commoncrawl-crawl-002/2010/01/07/18/1262876244253_18.arc.gz &gt; 1262876244253_18.arc.gz

The file is around 100 MB.</description>
		<content:encoded><![CDATA[<p>If anyone is interested in direct access to the data, you can use aws (https://github.com/timkay/aws):</p>
<p>% perl aws get "x-amz-request-payer:requester"   commoncrawl-crawl-002/2010/01/07/18/1262876244253_18.arc.gz &gt; 1262876244253_18.arc.gz</p>
<p>The file is around 100 MB.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by State of Data #82 &#171; Dr Data&#039;s Blog</title>
		<link>http://commoncrawl.org/mapreduce-for-the-masses/#comment-416</link>
		<dc:creator>State of Data #82 &#171; Dr Data&#039;s Blog</dc:creator>
		<pubDate>Fri, 20 Jan 2012 06:45:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12242#comment-416</guid>
		<description>[...] #architecture – Zero to Hadoop in 5 minutes [...]</description>
		<content:encoded><![CDATA[<p>[...] #architecture – Zero to Hadoop in 5 minutes [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by George</title>
		<link>http://commoncrawl.org/mapreduce-for-the-masses/#comment-415</link>
		<dc:creator>George</dc:creator>
		<pubDate>Thu, 19 Jan 2012 03:16:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12242#comment-415</guid>
		<description>Great step by step. The hardest part was working through the constraints on case and spaces. Once I learned to use all lowercase and no spaces, everything went smoothly. My process ran for 12 minutes on the example archive.</description>
		<content:encoded><![CDATA[<p>Great step by step. The hardest part was working through the constraints on case and spaces. Once I learned to use all lowercase and no spaces, everything went smoothly. My process ran for 12 minutes on the example archive.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by Peter Quirk</title>
		<link>http://commoncrawl.org/mapreduce-for-the-masses/#comment-412</link>
		<dc:creator>Peter Quirk</dc:creator>
		<pubDate>Sat, 14 Jan 2012 00:13:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12242#comment-412</guid>
		<description>I got this working a couple of weeks ago; the only problem was that the bucket name needed to be all lowercase. The output data format is supposed to be CSV, but it looks like no CSV format I&#039;ve ever seen. The record separator is a single quote and the field separator is a comma. Most CSV-compatible tools cannot process this format. Is there a way to condition the output so that other tools can digest it more easily?</description>
		<content:encoded><![CDATA[<p>I got this working a couple of weeks ago; the only problem was that the bucket name needed to be all lowercase. The output data format is supposed to be CSV, but it looks like no CSV format I&#8217;ve ever seen. The record separator is a single quote and the field separator is a comma. Most CSV-compatible tools cannot process this format. Is there a way to condition the output so that other tools can digest it more easily?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by Luke Stanley</title>
		<link>http://commoncrawl.org/mapreduce-for-the-masses/#comment-411</link>
		<dc:creator>Luke Stanley</dc:creator>
		<pubDate>Fri, 13 Jan 2012 15:42:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12242#comment-411</guid>
		<description>I&#039;m very grateful for this how-to, and at the same time I&#039;m frustrated that it&#039;s 5 minutes of intense clicking and prodding just to have a few Java files run on Amazon. (Albeit manipulating some amazing data sets).
It would have been good to see a little coverage of the code and its output. If the setup were a Sikuli script or a shell script, it would be easier too.
Surely there are tools that allow command line deployment and show the output in real time too?
I really thought this would be for the masses; it isn&#039;t clear that it is yet. If I get around to playing with this it would be interesting to compare how long it ACTUALLY takes to do the same thing as the video from start to finish, and to perhaps screen capture THAT more authentic process (despite ). There must be some web app alternative where code can just be edited and deployed without fuss? (Just given Amazon login data etc.) What about Jython also?</description>
		<content:encoded><![CDATA[<p>I&#8217;m very grateful for this how-to, and at the same time I&#8217;m frustrated that it&#8217;s 5 minutes of intense clicking and prodding just to have a few Java files run on Amazon. (Albeit manipulating some amazing data sets).<br />
It would have been good to see a little coverage of the code and its output. If the setup were a Sikuli script or a shell script, it would be easier too.<br />
Surely there are tools that allow command line deployment and show the output in real time too?<br />
I really thought this would be for the masses; it isn&#8217;t clear that it is yet. If I get around to playing with this it would be interesting to compare how long it ACTUALLY takes to do the same thing as the video from start to finish, and to perhaps screen capture THAT more authentic process (despite ). There must be some web app alternative where code can just be edited and deployed without fuss? (Just given Amazon login data etc.) What about Jython also?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl by MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl &#171; Another Word For It</title>
		<link>http://commoncrawl.org/mapreduce-for-the-masses/#comment-410</link>
		<dc:creator>MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl &#171; Another Word For It</dc:creator>
		<pubDate>Fri, 13 Jan 2012 00:32:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.commoncrawl.org/?p=12242#comment-410</guid>
		<description>[...] MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [...]</description>
		<content:encoded><![CDATA[<p>[...] MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
