* can output lists of discovered homepage URLs for seed lists with a static fetch interval
* can output blacklists for hosts that have too many DNS failures, to filter them from the CrawlDB using the domainblacklist-urlfilter
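For example (a sketch; the tool and flag names are illustrative of the patch's dump tool, not confirmed), writing the discovered homepages to a seed list could look like:

bin/nutch dumphostdb crawl/hostdb crawl/homepages -dumpHomepages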

For JEXL expressions, all host metadata fields are available. The following statistics fields are also available (see the example after this list):

unfetched – number of unfetched records
fetched – number of fetched records
gone – number of 404s
redirTemp – number of temporary redirects
redirPerm – number of permanent redirects
redirs – total number of redirects (redirTemp + redirPerm)
notModified – number of not-modified records
ok – number of usable pages (fetched + notModified)
numRecords – total number of records
dnsFailures – number of DNS failures
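For example, selecting healthy hosts with a JEXL expression over these fields could look like this (a sketch; the tool and flag names are illustrative, following the expression support added later in this issue):

bin/nutch dumphostdb crawl/hostdb crawl/goodhosts -expr "ok > 100 && dnsFailures == 0"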

Activity

Markus Jelsma
added a comment - 10/May/12 11:38 Initial patch. This introduces a HostDB that keeps track of host information such as its homepage, CrawlDB statistics and DNS status, and allows for metadata to be added. The dump tool can produce output suitable for the DomainBlacklistURLFilter. With it, you can automatically get rid of unknown hosts that pollute your CrawlDB.
Comments are appreciated as usual!

Tejas Patil
added a comment - 12/May/13 11:35 Hi Markus Jelsma,
The initial patch is good. This feature would be a good addition to Nutch.
I did some minor changes to it (NUTCH-1325.trunk.v2.patch), mainly to make it work with the current trunk.
Sorry for bringing this up (after one entire year). Would it be OK if I take this work forward?
If "yes", then kindly provide me more details about the stuff in "TODO":
(1) The DumpHostDb class doesn't have a reducer and there was this comment there:
reduce unknown hosts to single unknown domain if possible. Enable via configuration
host_a.example.org,host_a.example.org ==> example.org
In the example, both hosts are the same. Are these OK:
host_a.example.org, host_b.example.org ==> example.org
x.xyz.org, a.abc.org ==> unknown
(2) In the UpdateHostDb class, map() method:
TODO: fix multi redirects: host_a => host_b/page => host_c/page/whatever
http://www.ferienwohnung-armbruster.de/
http://www.ferienwohnung-armbruster.de/website/
http://www.ferienwohnung-armbruster.de/website/willkommen.php
We cannot reresolve redirects for host objects as CrawlDatum metadata is
not available. We also cannot reliably use the reducer in all cases since
redirects may be across hosts or even domains. The example above has
redirects that will end up in the same reducer. During that phase,
however, we do not know which URL redirects to the next URL.
The example does not show the case where the redirections are across different hosts.

Markus Jelsma
added a comment - 13/May/13 15:24 Hi Tejas - you're right for (1), it should indeed be host_a.example.org, host_b.example.org ==> example.org but not x.xyz.org, a.abc.org ==> unknown. The reducer should take the domain + suffix as key and then emit the domain if ALL hosts are unknown. If you emit a domain when most but not all hosts are unknown, the DomainBlacklistURLFilter will remove the entire domain from the CrawlDB and WebgraphDB.
The example for (2) does not include cross-domain redirects but the problem is similar. I think it works fine for now because multi-redirects are not very common on the entire internet.
A larger problem is the filterNormalize() method. It actually receives a hostname, not a URL. And to pass URL filters we must prepend the URL scheme to make it look like a URL. I use the http:// scheme but not all hosts may allow that scheme. We have a modified domain filter that optionally takes a scheme so we can force HTTPS for specific domains. Those domains are filtered out because HTTP is not allowed.
I think I've got a slightly newer version of the tools but don't know what actually changed in the past year. I'll try to diff and upload it.

Tejas Patil
added a comment - 15/Dec/13 02:34 Hi Markus Jelsma,
I stopped by this Jira (after a long time!) with the intention of getting it to a stage where we could have it inside trunk.
You had replied to my two concerns.
For (1):
host_a.example.org, host_b.example.org ==> example.org
This might NOT be a good idea.
(a) The websites for, say, "cs.uci.edu" and "bio.uci.edu" might be hosted independently. It can be argued that they should be considered different hosts.
(b) I am not sure about the standards, but if something like "uci.cs.edu" is valid (the subdomain is a suffix of the domain) then there would be a problem when we resolve "uci.cs.edu" and "ucla.cs.edu" to "cs.edu".
For (2): "I use the http:// scheme but not all hosts may allow that scheme. We have a modified domain filter that optionally takes a scheme so we can force HTTPS for specific domains. Those domains are filtered out because HTTP is not allowed."
Do you have any suggestions to work this out?

Markus Jelsma
added a comment - 16/Dec/13 13:21 Hi Tejas,
(1):
The current mapper is:
if (datum.numFailures() >= failureThreshold) {
  // TODO: also write to external storage, i.e. memcache
  context.write(key, emptyText);
}
If we change this to:
context.write(key, new IntWritable(datum.numFailures()));
then in the reducer we can check whether all hosts have failed and, if so, emit the domain name. If one host hasn't failed, we have to emit all the failed host names, as sketched below.
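A minimal sketch of that reducer idea (the class name, the value encoding and the configuration key are hypothetical, not from the actual patch); the mapper is assumed to emit (domain, "hostname<TAB>numFailures") pairs:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UnknownDomainReducer extends Reducer<Text, Text, Text, NullWritable> {

  private int failureThreshold;
  private final List<String> failedHosts = new ArrayList<String>();

  @Override
  protected void setup(Context context) {
    // "hostdb.failure.threshold" is a hypothetical configuration key
    failureThreshold = context.getConfiguration().getInt("hostdb.failure.threshold", 3);
  }

  @Override
  protected void reduce(Text domain, Iterable<Text> hosts, Context context)
      throws IOException, InterruptedException {
    failedHosts.clear();
    int total = 0;
    for (Text value : hosts) {
      total++;
      // each value is assumed to be "hostname<TAB>numFailures"
      String[] parts = value.toString().split("\t");
      if (Integer.parseInt(parts[1]) >= failureThreshold) {
        failedHosts.add(parts[0]);
      }
    }
    if (failedHosts.size() == total) {
      // every host under this domain failed: blacklist the whole domain
      context.write(domain, NullWritable.get());
    } else {
      // some hosts are alive: blacklist only the individual failed hosts,
      // otherwise the DomainBlacklistURLFilter would drop live hosts too
      for (String host : failedHosts) {
        context.write(new Text(host), NullWritable.get());
      }
    }
  }
}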
(2):
Perhaps we can retry with https:// and other schemes if the first attempt fails with http://. It is ugly but should work.
Cheers,

Markus Jelsma
added a comment - 02/Jan/14 10:34 Hi Tejas - I think most seems fine now and I like the changes you've made so far, and I cannot come up with a better solution right now for the https:// scheme filtering issue.
Are there any other issues we didn't think about? Anyone else?

Markus Jelsma
added a comment - 22/Jan/14 09:38 Hi Tejas - I am fine with the changes you uploaded yesterday. The filtering thing is still as ugly as it was, although it works for https now.

Markus Jelsma
added a comment - 22/Jan/14 11:56 Thanks a lot, Tejas, for spending your time on fixing the loose ends I left in the original work. I'll migrate your code back to our own Nutch.
Great work!

Tejas Patil
added a comment - 22/Jan/14 13:10 Hi Markus Jelsma,
Thanks for the correction. This feature would not have been possible without you in the first place. Apart from being a good addition to Nutch, HostDb has also helped in getting a simple design for the Sitemap feature (NUTCH-1465).
Cheers !!!

Tejas Patil
added a comment - 06/Mar/14 16:39 It would take me a few weeks before I can work on this one. The reason being: I have recently left school and started working at a company. There is some legal paperwork that I would have to finish off to work on open source projects (even if it's during my free time).

Gui Forget
added a comment - 08/Oct/14 04:21 Attaching NUTCH-1325-trunk-v5.patch with the following changes:
* Updates from the latest trunk
* Fix the DNS failures reported by Markus
* Ensure the stats and homepage URL are propagated
* Fix a concurrency issue where the reducer would finish before the resolvers completed their job
Attaching NUTCH-1325-v4-v5.patch to make it easier to see my code changes.

Markus Jelsma
added a comment - 26/Mar/15 19:32 Hello Lewis - these are the new parameters:
+ public static final String HOSTDB_NUMERIC_FIELDS = "hostdb.numeric.fields";
+ public static final String HOSTDB_STRING_FIELDS = "hostdb.string.fields";
List the CrawlDatum metadata fields that are numeric and that you want stats on in hostdb.numeric.fields, and the string fields in hostdb.string.fields. Run updatehostdb and you will get HostDatum metadata for the selected fields. A newer version also supports median stats on numerics. I am going to use these within Memex soon!
I also plan an upgrade for dumphostdb so it can be used to let Nutch automatically restrict the crawl to metadata field values, such as only English pages, or to crawl only pages that are within a (numeric) threshold, for instance for illegal content, abusive stuff, whatever.
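In nutch-site.xml that could look like this (the field values "score" and "Content-Type" are only examples, not defaults shipped with the patch):

<property>
  <name>hostdb.numeric.fields</name>
  <value>score</value>
</property>
<property>
  <name>hostdb.string.fields</name>
  <value>Content-Type</value>
</property>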

Markus Jelsma
added a comment - 27/Mar/15 11:06 Here's a new patch with support for streaming medians! The original source is not ASL-licensed, so this is an issue. I also added support for JEXL expressions to the dump tool. Now you can select data from the db using proper expressions.

Markus Jelsma
added a comment - 27/Mar/15 11:57 Here's another patch, solving an NPE. I also removed parameters from the dump tool. Since we now have the expression engine, you have more freedom and there is no need to have -numFailedThreshold N or whatever parameter. See the dump tool's source for which variables are available.

Markus Jelsma
added a comment - 21/Jan/16 12:30 Updated patch to use TDigest for streaming percentiles. And because I can now specify the percentile (50 for the median), why not allow for configurable percentiles.
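A minimal standalone sketch of the idea, assuming Ted Dunning's t-digest library (com.tdunning.math.stats); this is not the patch code:

import com.tdunning.math.stats.TDigest;

public class PercentileSketch {
  public static void main(String[] args) {
    // the compression factor trades accuracy against memory
    TDigest digest = TDigest.createDigest(100);
    for (int i = 1; i <= 10000; i++) {
      digest.add(i); // values arrive one at a time, streaming
    }
    // the percentile is configurable: 0.5 is the median, 0.95 the 95th
    System.out.println("median ~ " + digest.quantile(0.5));
    System.out.println("p95 ~ " + digest.quantile(0.95));
  }
}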

Markus Jelsma
added a comment - 21/Jan/16 15:09 Yes, they are very useful for finding websites that, for example, overall score positively on custom text or structure classifiers: give me all hosts that in general talk about music, politics or illicit topics. Also, the dump tool can generate a wide variety of blacklists, e.g. for not crawling (generating) certain hosts, not indexing them, or removing them completely. Of course, if you erase hosts from your CrawlDB, you must keep the blacklist around, or they will come back at some point.