Thursday, January 30, 2014

Book Review: Web Crawling and Data Mining with Apache Nutch

In our space, we found that some of the most current healthcare-related information lives on the internet. We harvest that information as input to our healthcare masterfile. Our crawlers run against hundreds of websites, which makes for a fairly large web harvester, and that is what drove me to explore Nutch with Cassandra: Crawling the web with Cassandra.

When Web Crawling and Data Mining with Apache Nutch came out, I was eager to have a read. The first quarter of the book is largely introductory. It walks you through the basics of operating Nutch and the layers in the design: Injecting, Generating, Fetching, Parsing, Scoring and Indexing (with Solr).
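
To make those layers concrete, here is a toy sketch of the cycle in plain Python. It is not Nutch code and not taken from the book: the fake in-memory "web", the `CrawlDb` class, the function names, and the inlink-count "score" are all my own stand-ins, chosen only to show how the phases hand data to each other.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("alpha page", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("beta page", ["http://a.example/"]),
    "http://c.example/": ("gamma page", []),
}


@dataclass
class CrawlDb:
    """Toy stand-in for a crawl database (an assumption, not Nutch's format)."""
    status: dict = field(default_factory=dict)  # URL -> "unfetched" | "fetched"
    score: dict = field(default_factory=dict)   # URL -> number of inlinks seen


def inject(db, seeds):
    # Injecting: seed the crawl database with starting URLs.
    for url in seeds:
        db.status.setdefault(url, "unfetched")


def generate(db, top_n):
    # Generating: pick the next batch of URLs that still need fetching.
    todo = sorted(u for u, s in db.status.items() if s == "unfetched")
    return todo[:top_n]


def fetch(urls):
    # Fetching: "download" each URL (here, a lookup in the fake web).
    return {u: FAKE_WEB[u] for u in urls if u in FAKE_WEB}


def parse(fetched):
    # Parsing: split raw content into text and outlinks.
    return {u: {"text": text, "outlinks": links} for u, (text, links) in fetched.items()}


def update(db, parsed):
    # Scoring (very crude): count inlinks and register newly discovered URLs.
    for url, doc in parsed.items():
        db.status[url] = "fetched"
        for link in doc["outlinks"]:
            db.status.setdefault(link, "unfetched")
            db.score[link] = db.score.get(link, 0) + 1


def index(search_index, db, parsed):
    # Indexing: push parsed documents into a search index (a dict here).
    for url, doc in parsed.items():
        search_index[url] = {"text": doc["text"], "score": db.score.get(url, 0)}


def crawl(seeds, rounds=3, top_n=10):
    db, search_index = CrawlDb(), {}
    inject(db, seeds)
    for _ in range(rounds):
        batch = generate(db, top_n)
        if not batch:
            break  # nothing left to fetch
        parsed = parse(fetch(batch))
        update(db, parsed)
        index(search_index, db, parsed)
    return search_index
```

Calling `crawl(["http://a.example/"])` walks all three fake pages in two rounds. The point is only the shape of the loop: generate a batch, fetch it, parse it, fold the results back into the crawl database, and repeat.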

For me, the book got a bit more interesting when it covered the Nutch plugin architecture. HINT: Take a look at the overall architecture diagram on page 34 before you start reading!

The book then covers deployment and scaling. A fair amount of time is spent on Solr deployment and scaling (via sharding), which in and of itself may be valuable if you are a Solr shop. That is less true if you are Elasticsearch (ES) fans; in fact, it was one of the reasons why we moved to ES ;)
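
For anyone new to the idea, sharding just means splitting the index across several servers and deciding, per document, which one it lives on. Below is a minimal sketch of hash-based routing. It is my own illustration, not Solr's router and not the book's setup; `shard_for` and `partition` are hypothetical helper names.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number in [0, num_shards)."""
    # A stable hash (unlike Python's built-in hash()) gives the same
    # answer on every machine and every run.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids, num_shards):
    """Group document IDs by the shard they would be routed to."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

The catch with this simple modulo scheme is that changing `num_shards` moves most documents to a different shard, which is part of why resharding by hand is painful.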

About midway through the book, the real fun starts when the author covers how to run Nutch with/on Hadoop. This includes detailed instructions on Hadoop installation and configuration. This is followed by a chapter on persistence mechanisms, which uses Gora to abstract away the actual storage.
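
The appeal of that kind of abstraction is that the crawler code talks to one interface while the backend can be swapped underneath it. Here is a rough Python sketch of the idea. To be clear, this is not Gora's API (Gora is a Java library); `PageStore`, `InMemoryStore`, `SqliteStore`, and `record_fetch` are names I made up for illustration.

```python
import json
import sqlite3
from typing import Optional, Protocol


class PageStore(Protocol):
    """The only interface the crawler code is allowed to depend on."""

    def put(self, url: str, page: dict) -> None: ...
    def get(self, url: str) -> Optional[dict]: ...


class InMemoryStore:
    """Backend 1: a plain dict, handy for tests."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, url: str, page: dict) -> None:
        self._rows[url] = page

    def get(self, url: str) -> Optional[dict]:
        return self._rows.get(url)


class SqliteStore:
    """Backend 2: a SQLite table, with pages serialized as JSON."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
        )

    def put(self, url: str, page: dict) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (url, json.dumps(page)),
        )
        self._conn.commit()

    def get(self, url: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT body FROM pages WHERE url = ?", (url,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def record_fetch(store: PageStore, url: str, content: str) -> None:
    # Crawler-side code: it never needs to know which backend it was given.
    store.put(url, {"content": content, "status": "fetched"})
```

Swapping `InMemoryStore()` for `SqliteStore()` requires no change to `record_fetch`, which is the same benefit you get from keeping Nutch's storage behind an abstraction layer.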

Overall, this is a solid book, especially if you are new to the space and need detailed, line-by-line instructions to get up and running. To kick it up a notch, it would have been nice to have a smattering of use cases and real-world examples, but given that the book is only about a hundred pages, it does a good job of balancing utility with color commentary.