Nutch – Plugin Tutorial

May 17, 2012

In one of my previous posts about Nutch, I already mentioned plugins. The plugin system is central to how Nutch works and allows you to customize Nutch to your needs in a flexible and maintainable way. Everybody who wants to use Nutch for more than just playing around will have to write their own plugin at one point or another. Admittedly, many settings can be changed in ‘nutch-default.xml’ or, rather, ‘nutch-site.xml’. But imagine you would like to add a new field to the index: you run some custom analysis on the parsed content of a web page, store the result in a new variable and pass it to Solr as an additional field. This fairly plain example should be a common scenario, and it requires a plugin.

Of course, instead of writing a plugin you could alter the existing Nutch code to achieve the desired behavior. However, you would then run into maintainability issues as soon as you need to move to a newer version of Nutch, and developing a new plugin is easier and faster anyway. I have to say, though, that “easier” does not mean “easy”. It turns out that writing a plugin is not trivial and requires a series of steps. This is why I will walk you through the process of creating a plugin and thereby hopefully help you get your own plugin started. Please note that this tutorial is based on Nutch 1.4. It may not work with other versions of Nutch.

The Use Case
As already mentioned above, I simply would like to add a new field to the index. This new field should indicate the length of the parsed content of the respective web page and is therefore called “pageLength”.

Where to Start
Every plugin needs to extend one or more already existing interfaces called extension points. A list of the most important extension points can be found here. For our use case, we need to extend ‘IndexingFilter’, since we want to customize the indexing process.
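To make the idea of an extension point more concrete, here is a minimal, self-contained sketch of how a chain of filters could be applied to a document one after another. It is a simplified model and not Nutch's actual code: `DocFilter`, `FilterChainSketch` and `runAll` are hypothetical names, and a plain `Map` stands in for the real document class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FilterChainSketch {

    // Hypothetical stand-in for an extension point such as IndexingFilter:
    // each extension receives the document and returns it, possibly modified.
    interface DocFilter {
        Map<String, Object> filter(Map<String, Object> doc, String parsedText);
    }

    // Applies every registered extension in order. A filter that returns
    // null drops the document and ends the chain (assumption of this sketch).
    static Map<String, Object> runAll(List<DocFilter> filters,
                                      Map<String, Object> doc,
                                      String parsedText) {
        for (DocFilter f : filters) {
            doc = f.filter(doc, parsedText);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<DocFilter> filters = new ArrayList<>();
        // First extension: stores the parsed text itself.
        filters.add((doc, text) -> { doc.put("content", text); return doc; });
        // Second extension: our use case, the length of the parsed text.
        filters.add((doc, text) -> { doc.put("pageLength", text.length()); return doc; });

        Map<String, Object> doc = runAll(filters, new LinkedHashMap<>(), "hello nutch");
        System.out.println(doc);
    }
}
```

The point of the sketch is only that each plugin contributes one small step and does not need to know about the others.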

Required Files
As a first step, you need to create all the necessary new files. Let's say we call the plugin “myPlugin”. Then you need to create the new folder $NUTCH_HOME/src/plugin/myPlugin. Next, simply copy all the files from the urlmeta plugin ($NUTCH_HOME/src/plugin/urlmeta) into the myPlugin folder. Now rename and delete the appropriate files and directories to get the following structure (you can do this within Eclipse as well as directly on the file system):

myPlugin/
    plugin.xml
    build.xml
    ivy.xml
    src/
        java/
            org/
                apache/
                    nutch/
                        indexer/
                            AddField.java

plugin.xml
Your plugin.xml file should look like this (for comparison: you can find another example on the PluginCentral):

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="myPlugin" name="Add Field to Index"
        version="1.0.0" provider-name="your name">

    <runtime>
        <library name="myPlugin.jar">
            <export name="*"/>
        </library>
    </runtime>

    <extension id="org.apache.nutch.indexer.myPlugin"
               name="Add Field to Index"
               point="org.apache.nutch.indexer.IndexingFilter">
        <implementation id="myPlugin"
                        class="org.apache.nutch.indexer.AddField"/>
    </extension>
</plugin>

build.xml & ivy.xml
While the ivy.xml-file can stay exactly the same, you need to change one phrase in the build.xml to get the following result:

<?xml version="1.0" encoding="UTF-8"?>
<project name="myPlugin" default="jar">
    <import file="../build-plugin.xml"/>
</project>

AddField.java
This is where you can finally get creative and implement the desired customization of Nutch.

package org.apache.nutch.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class AddField implements IndexingFilter {

    private static final Log LOG = LogFactory.getLog(AddField.class);
    private Configuration conf;

    //implements the filter-method which gives you access to important Objects like NutchDocument
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) {
        String content = parse.getText();
        //adds the new field to the document
        doc.add("pageLength", content.length());
        return doc;
    }

    //Boilerplate
    public Configuration getConf() {
        return conf;
    }

    //Boilerplate
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
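If you want to try out the core logic without running a crawl, one option is to pull it into a small helper that does not depend on any Nutch class. The following is a minimal sketch under a few assumptions: `PageLengthSketch` and its methods are hypothetical names, a plain `Map` stands in for `NutchDocument`, and treating a missing (null) text as length 0 is my own choice, not something Nutch prescribes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PageLengthSketch {

    // Name of the index field used throughout this tutorial.
    static final String FIELD = "pageLength";

    // Length of the parsed text; null is treated as empty (assumption).
    static int pageLength(String parsedText) {
        return parsedText == null ? 0 : parsedText.length();
    }

    // Hypothetical stand-in for doc.add(...): a Map replaces NutchDocument.
    static Map<String, Object> addPageLength(Map<String, Object> doc, String parsedText) {
        doc.put(FIELD, pageLength(parsedText));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = addPageLength(new LinkedHashMap<>(), "Apache Nutch");
        System.out.println(doc.get(FIELD));
    }
}
```

Inside the plugin, filter() could then call pageLength(parse.getText()) and pass the result to doc.add(...).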

Final Configurations
Before you can see the new plugin up and running and hopefully be happy about its performance, there are a couple more configuration steps to take care of. The first one is to add the plugin to $NUTCH_HOME/src/plugin/build.xml. To do so, simply add the following line to the ‘Build & Deploy’ section of the file, next to the entries for the other plugins.

<ant dir="myPlugin" target="deploy"/>

Usually you could compile Nutch at this point and start using the plugin. However, since we want to add a new field to the index, we need to let Nutch know about our intentions. This is why you need to add the code below to the <fields> section of $NUTCH_HOME/conf/schema.xml. In case it's not the first time you run Nutch using Solr, you also need to add the code to $SOLR_HOME/…/solr/conf/schema.xml.

<field name="pageLength" type="long" stored="true" indexed="true"/>

Also, go ahead and add the following line to $NUTCH_HOME/conf/solrindex-mapping.xml.

<field dest="pageLength" source="pageLength"/>

Side note: If you installed Nutch correctly, then performing all these changes in $NUTCH_HOME/conf/ is sufficient as Nutch will propagate them to $NUTCH_HOME/runtime/local/conf/ as soon as you build Nutch.
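Another thing worth checking at this stage: Nutch only activates plugins whose id matches the regular expression in the plugin.includes property. Below is a sketch of a matching override for $NUTCH_HOME/conf/nutch-site.xml. Everything before |myPlugin is what I believe the default value looks like in Nutch 1.4, so treat it as an assumption and copy the actual value from your own nutch-default.xml.

```xml
<!-- Sketch only: the part of the value before "|myPlugin" is an assumed default. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|myPlugin</value>
</property>
```

If the plugin then shows up among the registered plugins in hadoop.log, the expression matched.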

Now, as a last step, you need to build Nutch by running Ant with $NUTCH_HOME/build.xml (for example, by calling ant inside $NUTCH_HOME).

Wrap Up
I tried to keep the use case as simple as possible, as there are many configuration tasks that need to be taken care of. But once you understand the fundamentals of the plugin concept of Nutch as well as how to get a plugin working, you should also be capable of implementing even very comprehensive and challenging plugins – if you know how to program, of course. In case you are still facing problems when trying to get your first plugin running, leave me a comment below and I will be happy to help. On the other hand, if the tutorial was helpful for you, it would be great to hear about that as well.

Additional Useful Resources
– An insightful blog post that helped me a lot when I was developing my first plugin
– The Apache Nutch PluginCentral
– Take a look at an already implemented plugin that works: $NUTCH_HOME/src/plugin/urlmeta

18 comments on Nutch – Plugin Tutorial

Hi Florian,
I followed the instructions but it’s not working… I also added some Log.debug messages in the code, however I guess the plugin is not being invoked since they are not in hadoop.log …
I’m using nutch 1.4, my directory structure is:
src > java > com > ds > Socialq.java

Below my plugin.xml

Note: in hadoop.log it appears in the list of registered plugins…
Could you help me? I really don’t know why it’s not working…

Hi Antonio,
I never changed the directory structure, but that should be ok.
Is there an error message in the log? If so, what does it say?
Is there a yourplugin.jar in the runtime/local/plugins? Also, I think the .jar name and the plugin name need to be the same…
If all that seems to be ok, then – as you already suspected – my guess would be that the plugin.xml file is not configured correctly. Here the id- and the class-attribute would be different for you: e.g. “com.ds.Socialq”
Also, notice that in my example I had to name the project in build.xml “myPlugin”.
Hope that helps!

Hello again,
I tried changing the code of the urlmeta plugin (adding some log messages) and rebuilt to see if it works. Well… it did not work either: it appeared among the registered plugins, but the messages do not show up in the hadoop log.

Hi Antonio,
maybe your code for the log messages does not work? Simply System.out.println(“…”) also shows up in hadoop.log…
Does Nutch work without any additional plugins, meaning is it setup correctly? Did you add the urlmeta plugin within nutch-site.xml?
Are you sure, that there is no error shown in hadoop.log?
These are all guesses of course…
My tutorial is based on Nutch 1.4, so there is no other configuration.
If you just can’t get it to work, you could implement the plugin exactly as I am doing it and then alter that working version step by step. Also consider presenting your issue to the Nutch-User mailing list.
Good luck 😉

I replicated your Tutorial Step by Step and get the same error as Antonio. What I have seen is the following warning in hadoop.log:

2012-08-12 19:55:50,331 WARN mapred.JobClient – No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

Does this tell you something?

Also another remark. You call your plugin class file AddFieldToIndex.java but your class AddField, which is also referenced in plugin.xml. I believe this won’t work so I called it AddField consistently in my implementation.

Hi Alaak, thanks for your feedback. Your guess is absolutely correct. My java-File actually is named “AddField” and not “AddFieldToIndex” as I claim in the part where I explain the necessary file system structure. I already corrected that bug. Hope it works now!
If not, then please check your Nutch version, as I created this tutorial with Nutch 1.4 and am not sure whether it also works with the newest version of Nutch.
About the warning message: I don’t know for sure, but it is only a warning message?! So if you don’t get an error elsewhere, you should be fine.

Hi Florian,
I have had some trouble as i’m using external libraries along with the nutch APIs. I added these in a lib folder inside my plugin folder. Also my plugin needs to use a file that has to be read from the current directory. If this is indeed possible, can you please help me out? 🙂 Thanks.

Hi Ananth, both is possible. However, I don’t really see how your questions relate to Nutch in particular as they look like general programming questions to me.
Libraries: You can either add external libraries within Eclipse (right click on project -> build path -> configure build path -> add external JARs) or whatever IDE you are using (easy way) or integrate your libraries into the build process of Nutch by adapting build.xml to your needs. I didn’t look into the second option so far, which is why I can’t give you more detailed information there.
Read file: simply read the file within the java code of the plugin using a relative or absolute file path?! I don’t see your problem there…
Hope that helps, Florian

Hi,
I have read all the steps mentioned by you. It’s very good and helpful.

I am Working on : Refining(reordering) the results(urls) provided by the NUTCH…

Installation:
I have installed Nutch 2.1 and solr 4.0. I have crawled the seed urls using crawl command successfully.
I can see all the crawled details in my Mysql database “webpage” table.
I have also successfully installed the solr 4.0. I can see solr UI at localhost:8984/solr.

Queries:

1) When I run a query in Solr, which Nutch plugin determines the ranking of the displayed results? I can see the OPICScoringFilter plugin in the Nutch directory.
Does this plugin rank the results in Solr?

2)There are many plugins in the NUTCH.. So what is the sequence of their execution? Means, which plugins get called after which plugin? How to find it?

3)Basically i want to refine the results provided by the NUTCH….So which plugin should i extend ? How can i do that??

Hi Florian,
I’m going to develop a custom plugin for nutch and I’m wondering if there is a way to compile the plugin without copying the code in the plugin folder.
I’m developing this plugin for a complex project where I’m customizing also other products and all my code is managed by SVN.
I need to create the nutch plugin source folder in my svn folder and so I can’t create it in the nutch folder.
Because I think that this is a very common problem, I’m wondering if you have already found a solution.

Hi,
I see that this plugin runs on the Nutch 1.x but gives error with Nutch 2.x. What needs to be changed in order to execute it on Nutch 2.x?

About Me

Hi, I'm Florian Hartl. My main interests are data science, software engineering, health, and meaning. Originally from Bavaria, Germany, I currently live in Santa Monica where I work as a Data Scientist.