TL;DR: If I wanted to learn about the Nutch pipeline at a high level and then write a custom parser / indexer of my own, where would a good starting point be?

I have used the latest 1.x Nutch to crawl a few specific websites and have been disappointed with the results. Even after experimenting with the new HTML microdata capabilities that came with the Any23 updates incorporated into Nutch, I am still not (yet) excited. The bottom line is that website data is not well structured and not very friendly to algorithmic consumption (but you already knew that). To that end, I am interested in developing custom parsers per internet domain in an effort to capture domain-specific data. It currently looks like plugin.includes does not allow a per-domain approach for the parser / indexer. Could someone point me toward a high-level view of the Nutch data pipeline, and then toward where to get started creating custom parsers that might support a per-domain approach?

Thanks,
David

The interfaces related to extending the Nutch parser/indexer are actually very simple. However, finding up-to-date documented samples is not. Luckily, Nutch comes with plenty of plugins built in, so my suggestion would be to pick one and dive into its implementation. Then just copy its folder and use it as a skeleton, replacing the specific logic (and plugin metadata).

The first question you need to ask yourself is whether you really want to write a Parser/Indexer or just an HtmlParseFilter/IndexingFilter. I suspect that the default behaviour of the Nutch Parser and Indexer is useful for you, and you just want to add more functionality (that is what Any23 is doing). You can chain Filters, so your code could also leverage the Any23 logic, for example.

I think the best approach for domain-specific parsers is to have a custom parser that maps from the URL to the specific code. This can be just one big if/else, or a Map of domain->code (possibly using functional programming), or you can even make this map configurable in some file.
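The Map of domain->code idea could look roughly like the sketch below. It is plain Java with no Nutch types, so it only illustrates the dispatch itself; the class name DomainDispatch, the extractor lambdas, and the example.com / example.org hosts are all made up for illustration. In a real plugin the lookup would presumably sit inside your filter's or parser's entry method and use the URL that Nutch passes in.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class DomainDispatch {
    // Hypothetical per-domain extractors: each takes the raw page text and
    // returns whatever that domain-specific code extracts from it.
    static final Map<String, Function<String, String>> EXTRACTORS = Map.of(
        "example.com", html -> "example:" + html.length(),
        "example.org", html -> "org:" + html.toUpperCase(Locale.ROOT));

    // Used for any host that has no dedicated extractor.
    static final Function<String, String> DEFAULT = html -> "default";

    // Normalise the host: lower-case it and drop a leading "www.".
    // URI.create throws IllegalArgumentException on a malformed URL.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "";
        }
        host = host.toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Look up the extractor for the URL's host and apply it to the page text.
    static String extract(String url, String html) {
        return EXTRACTORS.getOrDefault(hostOf(url), DEFAULT).apply(html);
    }

    public static void main(String[] args) {
        System.out.println(extract("https://www.example.com/page", "<html>"));
        System.out.println(extract("http://other.net/", "<html>"));
    }
}
```

Loading the map from a file instead of hard-coding it would only change how EXTRACTORS is built; the lookup in extract stays the same.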

Once you have more specific questions/problems, I suggest you email [EMAIL PROTECTED]. [EMAIL PROTECTED] is intended for discussing code contributions to Nutch, as far as I understand, and I think fewer people see your messages here. (Also, more people will benefit from your questions there...)

In summary, from my experience, writing any one of these plugins is really easy (discounting your own complex logic, of course): you just implement one or a few methods, change some plugin XML file, and add your extension to the global build (Ant) files. But to really understand how the passed data looks, and what you can do with it, debugging (in local mode) is the ultimate tool, and in the end is much more time-efficient than looking for information on the web. This is partly because a lot of the data is passed in Map-like form, so even the JavaDoc doesn't really tell you what will be there (it depends on what plugins you have configured, and how you configured those plugins...).


Thank you for all the tips. I think I need to better understand the pipeline of parsers and if/how their order in plugin.includes matters.

> On Feb 11, 2018, at 1:18 AM, Yossi Tamari <[EMAIL PROTECTED]> wrote:
>
> Hi David,
>
> [...]
>
> The documentation starting point is the Wiki
> (https://wiki.apache.org/nutch/). For your specific question, this is the
> most relevant page: https://wiki.apache.org/nutch/AboutPlugins.
>
> One (old) example of writing a custom parser can be found here:
> http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I suggest you
> Google for more information as needed, but always keep in mind that things
> may have changed over time.
>
> [...]
>
> Yossi.
>
>> -----Original Message-----
>> From: David Ferrero [mailto:[EMAIL PROTECTED]]
>> Sent: 11 February 2018 04:00
>> To: [EMAIL PROTECTED]
>> Subject: Custom Parser / Indexer Starting points

The order of entries in plugin.includes does not matter. To define the order of HtmlParseFilters, use the property htmlparsefilter.order. To define the order of Parsers, use the file conf/parse-plugins.xml. Note that once a single Parser returns a result, the following parsers will not be run.
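To make the difference between the two orderings concrete, here is a toy model in plain Java. It is not Nutch code: OrderingSketch, the lambda "parsers", and the string results are all invented for illustration, and it only mirrors the behaviour described above (parsers are tried in order and the first result wins; filters are chained in their configured order).

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class OrderingSketch {
    // Hypothetical parsers, in configured order. An empty Optional means
    // "this parser produced no result for this content".
    static final List<Function<String, Optional<String>>> PARSERS = List.of(
        c -> c.startsWith("%PDF") ? Optional.of("pdf-parser") : Optional.empty(),
        c -> Optional.of("html-parser"),
        c -> Optional.of("fallback-parser"));

    // Hypothetical filters, in configured order. Each one enriches the
    // output of the one before it.
    static final List<Function<String, String>> FILTERS = List.of(
        s -> s + " +microdata",
        s -> s + " +custom");

    // Parsers: stop at the first one that returns a result.
    static Optional<String> runParsers(
            List<Function<String, Optional<String>>> parsers, String content) {
        for (Function<String, Optional<String>> parser : parsers) {
            Optional<String> result = parser.apply(content);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    // Filters: every filter runs, each seeing the previous filter's output.
    static String runFilters(List<Function<String, String>> filters, String parsed) {
        String result = parsed;
        for (Function<String, String> filter : filters) {
            result = filter.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        String parsed = runParsers(PARSERS, "<html>").orElse("unparsed");
        System.out.println(runFilters(FILTERS, parsed));
    }
}
```

In this model the third parser is never reached, because the second one always returns a result; that is the practical consequence of the "first result wins" rule for anything listed after a catch-all parser.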


> -----Original Message-----
> From: David Ferrero [mailto:[EMAIL PROTECTED]]
> Sent: 12 February 2018 06:23
> To: [EMAIL PROTECTED]
> Subject: Re: Custom Parser / Indexer Starting points
