How to install Nutch on an AWS EC2 Cluster

In order to install Nutch on an Amazon EC2 cluster, you need a good understanding of both Nutch and Hadoop; providing that understanding is the goal of this post.

We decided to create this tutorial because we ran into many basic problems for which no clear solution was documented on the web.

Version

Here we explain how to install Nutch 1.9 on Debian Wheezy; there is no guarantee that the following instructions will work with a different setup. Be very careful about getting the right versions, otherwise you may run into lots of incomprehensible issues.

Note: this tutorial also works with Nutch 1.10.

Goal

In our case we are interested in directly getting the raw HTML of the pages of crawled web sites. We do not need to index the pages. Our choices were made with this goal in mind; you will have to decide whether they are the best for your case.

Why Nutch?

If you are interested in why we chose Nutch over another crawler / scraper, you can read our post: Choosing a Web Crawler.

Why Nutch 1.9 instead of 2.x?

Nutch 2.x is a rewrite from scratch. The biggest change is the integration of Apache Gora, which makes it possible to connect to several databases such as HBase, Cassandra, etc. However, Nutch 2.x is slower and has fewer features than Nutch 1.x.
This is why we chose 1.x.
Moreover, since our only goal is to get the HTML code, we do not store it in a database; we prefer to store it directly in files.

Install Nutch 1.9

If you would like to implement specific behavior in Nutch, such as custom parsing, it is easier to modify the source code than to write a plugin. On the other hand, you lose the ability to reuse your changes with other Nutch versions.

For all the instructions that follow, do not forget to replace all environment variables with your own values, especially in the XML code.

If you decided to install from source, do not forget to install ant and to specify your plugin folder path in $NUTCH_HOME/conf/nutch-site.xml.
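The plugin folder path is set through the plugin.folders property; a minimal sketch, where the path is a placeholder you must adapt:

```xml
<!-- In $NUTCH_HOME/conf/nutch-site.xml; the path below is an example. -->
<property>
  <name>plugin.folders</name>
  <value>/home/admin/nutch/build/plugins</value>
</property>
```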

You can define URL filters in order to crawl only specific websites using $NUTCH_HOME/conf/regex-urlfilter.txt.
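For example, rules that accept only one (hypothetical) domain and reject everything else could look like this:

```text
# $NUTCH_HOME/conf/regex-urlfilter.txt — illustrative rules
# accept anything under example.com
+^https?://([a-z0-9.-]*\.)?example\.com/
# reject everything else
-.
```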

To manage the starting URLs, create files with one URL per line in a directory, and give that directory to Nutch as the seed.
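For instance, a seed directory can be created like this (the directory name and URLs are just examples):

```shell
# Create a seed directory containing plain files with one URL per line.
mkdir -p urls
printf '%s\n' 'http://example.com/' 'http://example.org/' > urls/seed.txt
```

The directory (here urls) is then passed to the crawl script or to Nutch inject as the seed directory.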

If you would like to get the best performance from your crawler, you can add the following parameters to $NUTCH_HOME/conf/nutch-default.xml, but beware that this removes the politeness that Nutch offers!
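The original list of parameters is not reproduced here; as an illustration, the following two nutch-default.xml properties govern per-host politeness, and values like these (aggressive examples, not recommendations) effectively disable it:

```xml
<!-- Illustrative values: lowering the delay and raising the number of
     threads per queue removes Nutch's per-host politeness. -->
<property>
  <name>fetcher.server.delay</name>
  <value>0.0</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
</property>
```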

Then edit $NUTCH_HOME/src/bin/crawl and change numSlaves=1 at line 54 to your real number of slaves. You can also increase numTasks at line 59 from * 2 to * 10, and numThreads at line 68 from 50 to 100.
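As an illustration (the exact line numbers and defaults may differ slightly between Nutch versions), the edits amount to something like:

```diff
-numSlaves=1
+numSlaves=4                          # your real number of slaves
-numTasks=`expr $numSlaves \* 2`
+numTasks=`expr $numSlaves \* 10`
-numThreads=50
+numThreads=100
```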

If your goal is like ours, that is, to get only the HTML or text without indexing, you can disable Solr. To do so, comment out the Solr operations in the crawl script (link inversion, indexing on Solr, and cleanup on Solr) and skip the Solr installation part.
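In the Nutch 1.9 crawl script these steps are invoked through __bin_nutch; commenting them out looks roughly like this (the exact arguments are from memory and may differ in your copy):

```diff
-  __bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
+  # __bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
-  __bin_nutch index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
+  # __bin_nutch index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
-  __bin_nutch clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
+  # __bin_nutch clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
```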

Then compile:

ant runtime

Solr 3.4

Be sure to install Solr 3.4 and not 4.6, as explained in the following link.

Hadoop S3 filesystems

S3n: A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3.

S3 (alias S3b): A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem – you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

S3a: A successor to the S3 Native (s3n) filesystem, S3a uses Amazon's own libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit) and higher-performance operations, among other things. The filesystem is intended as a replacement for / successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL scheme.

If you would like to stop the script before the end, you can do it properly, at the end of the current segment, by creating a .STOP file in the directory from which the script was launched.
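Concretely, from the directory where the crawl script was started:

```shell
# The crawl script checks for a file named .STOP between iterations
# and exits cleanly at the end of the current segment if it exists.
touch .STOP
```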

Restart crawling

If you would like to restart crawling, you should resume where you were rather than refetch everything from the beginning. If you stopped the script properly (.STOP), you basically just have to skip the Nutch inject command and it will work. If you did not, there are several cases:

If your crawldb is not locked (does not contain a .locked file), you can simply skip the inject command as well.

If it is locked, you can try removing the .locked file and then skipping the inject command as before.

If you get an error with the above method, your crawldb is lost, and you have to regenerate it by running Nutch updatedb on all finished segments and then running Nutch dedup. If you are lucky it will work, but it will take a while. Personally, I have never tried this method.
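A sketch of that recovery (all paths are placeholders and the exact invocations should be checked against your setup):

```shell
# Hypothetical recovery: drop the lock, rebuild the crawldb from every
# finished segment, then deduplicate.
hadoop fs -rm crawl/crawldb/.locked    # only if present
for seg in $(hadoop fs -ls crawl/segments | awk '{print $NF}'); do
  bin/nutch updatedb crawl/crawldb "$seg"
done
bin/nutch dedup crawl/crawldb
```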

Extraction

Nutch allows you to get the data back very easily by launching Nutch readseg. It works in both local and deploy mode; you just need to adapt the path to your case.
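A sketch of such a dump (the segment name and output directory are placeholders):

```shell
# Dump a segment to plain text; the -no* flags skip the parts we do
# not need, so mainly the fetched content and parsed text remain.
bin/nutch readseg -dump crawl/segments/20150504123456 dump_dir \
  -nogenerate -noparse -noparsedata
```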

Backup to S3

You can cp the segments folder to S3 in parallel with Nutch, because the segment of one iteration is never used again after that iteration. The crawldb, however, is updated at each iteration, so you need to wait for the crawldb update to finish and copy it before letting Nutch start the next iteration.
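A sketch of such a backup with distcp (the bucket name and paths are placeholders):

```shell
# Copy a finished segment to S3 while Nutch keeps crawling.
hadoop distcp crawl/segments/20150504123456 \
  s3://my-bucket/backup/segments/20150504123456
```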

If you proceed as above and write with Hadoop to s3 (and not s3n), unfortunately only Hadoop will be able to read the data in your bucket, and you will not be able to use other tools such as s3cmd.

IMPORTANT

Never use the -p option of the distcp command with S3: it preserves, among other things, user, group, and permissions, but S3 does not manage permissions at the file or directory level, so it just slows down the copy for nothing. Moreover, if you copy in the other direction, from S3 to HDFS, with -p, it will fail at map 100%!

Amazon

We use the image ami-7ffae53a, which is a Debian Wheezy PVM image.

We use spot instances in order to reduce the cost.

We use the m3.medium instance type, because t1 and t2 instances are made for very low CPU usage.