
I followed the script (with minor variations) in the wiki at
http://wiki.apache.org/nutch/Crawl
However, I think I found another bug. Apply this patch and the indexer
will index pages with a status of STATUS_FETCH_NOTMODIFIED as well as
STATUS_FETCH_SUCCESS.
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java (revision 802632)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java (working copy)
@@ -84,8 +84,10 @@
       if (CrawlDatum.hasDbStatus(datum))
         dbDatum = datum;
       else if (CrawlDatum.hasFetchStatus(datum)) {
-        // don't index unmodified (empty) pages
-        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+        /*
+         * Where did this person get the idea that unmodified pages are empty?
+        // don't index unmodified (empty) pages
+        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
           fetchDatum = datum;
       } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                  CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
     }
     if (!parseData.getStatus().isSuccess() ||
-        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+        (fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
       return;
     }
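
The effect of that second hunk is that a record now survives the
indexing filter when its parse succeeded and its fetch status is either
SUCCESS or NOTMODIFIED. A minimal sketch of the resulting check,
simplified from IndexerMapReduce.reduce() (not the literal patched code):

    byte status = fetchDatum.getStatus();
    boolean indexable = parseData.getStatus().isSuccess()
        && (status == CrawlDatum.STATUS_FETCH_SUCCESS
            || status == CrawlDatum.STATUS_FETCH_NOTMODIFIED);
    if (!indexable) {
      return;  // nothing is emitted for this key, so the page never reaches the index
    }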
Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (revision 802632)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (working copy)
@@ -124,11 +124,15 @@
       reqStr.append("\r\n");
     }
-    reqStr.append("\r\n");
     if (datum.getModifiedTime() > 0) {
       reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
       reqStr.append("\r\n");
     }
+    else if (datum.getFetchTime() > 0) {
+      reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getFetchTime()));
+      reqStr.append("\r\n");
+    }
+    reqStr.append("\r\n");
     byte[] reqBytes= reqStr.toString().getBytes();
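
Note what the removed line fixes: in the original code the
header-terminating blank line was appended before If-Modified-Since, so
the conditional header landed after the end of the headers and servers
would ignore it. The patch moves the blank line to the end and also
falls back to the last fetch time when no modified time has been
recorded. A standalone sketch of the resulting logic, using the same
CrawlDatum and HttpDateFormat names as the patch (not the literal code):

    // Pick the timestamp for the conditional request, preferring the
    // server-reported modification time over our own last fetch time.
    long since = datum.getModifiedTime() > 0
        ? datum.getModifiedTime()
        : datum.getFetchTime();
    if (since > 0) {
      reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(since));
      reqStr.append("\r\n");
    }
    reqStr.append("\r\n");  // the blank line that terminates the request headers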
On Tue, Aug 11, 2009 at 5:35 AM, Alex McLintock <alex.mclintock@gmail.com> wrote:
> I've been wondering about this problem. When you did the invertlinks
> and index steps did you do it just on the current/most recent segment
> or all the segments?
>
> Presumably this is why you tried to do a merge?
>
> Alex
>
> 2009/8/10 Paul Tomblin <ptomblin@xcski.com>:
>> After applying the patch I sent earlier, I got it so that it correctly
>> skips downloading pages that haven't changed. And after doing the
>> generate/fetch/updatedb loop, and merging the segments with mergesegs,
>> dumping the segment file seems to show that it still has the old
>> content as well as the new content. But when I then ran the
>> invertlinks and index steps, the resulting index consists of very small
>> files compared to the files from the previous crawl, indicating that
>> it only indexed the stuff that it had newly fetched. I tried the
>> NutchBean, and sure enough it could only find things I knew were on
>> the newly loaded pages, and couldn't find things that occur hundreds
>> of times on the pages that haven't changed. "merge" doesn't seem to
>> help, since the resulting merged index is still the same size as
>> before merging.
>
--
http://www.linkedin.com/in/paultomblin