Currently Nutch doesn't automatically adjust its re-fetch period, regardless of whether individual pages change seldom or frequently. The goal of these changes is to extend the current codebase to support various possible adjustments to re-fetch times and intervals, and specifically a re-fetch schedule which tries to adapt the period between consecutive fetches to the period of content changes.

These patches also implement a check for whether the content has changed since the last fetch; protocol plugins are changed to make use of this information, so that unmodified content doesn't have to be fetched and processed.

* change Page to use a 1-byte float, representing fetchInterval in seconds.

* implement a pluggable FetchSchedule, which adjusts fetchInterval and nextFetchTime

* change FetchListTool and UpdateDatabaseTool to use them. NOTE: it appears there was a bug in FetchListTool, where the fetchlist entries recorded in segments would have their fetchTime increased by 1 week. This is not needed, only pages in WebDB need this.

* improve status reporting throughout all plugins.

* change plugins to detect if the content is unchanged. If possible, plugins will not fetch such content, but in any case they will set their status accordingly.
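
To make the idea of a pluggable schedule concrete, here is a minimal sketch of how an adaptive schedule could adjust fetchInterval and nextFetchTime. It is not the code from the attached patch: the names PageState, SimpleFetchSchedule, FixedSchedule and AdaptiveSchedule, as well as the increase/decrease factors and the interval bounds, are assumptions chosen for illustration only.

```java
/** Hypothetical page record holding only the two fields discussed above. */
final class PageState {
    float fetchInterval;  // seconds between consecutive fetches
    long nextFetchTime;   // epoch milliseconds

    PageState(float fetchInterval, long nextFetchTime) {
        this.fetchInterval = fetchInterval;
        this.nextFetchTime = nextFetchTime;
    }
}

/** Assumed shape of a pluggable schedule; the real interface may differ. */
interface SimpleFetchSchedule {
    void update(PageState page, long fetchTimeMillis, boolean contentChanged);
}

/** Keeps the interval untouched and only moves nextFetchTime forward. */
final class FixedSchedule implements SimpleFetchSchedule {
    @Override
    public void update(PageState page, long fetchTimeMillis, boolean contentChanged) {
        page.nextFetchTime = fetchTimeMillis + (long) (page.fetchInterval * 1000.0);
    }
}

/** Shrinks the interval when content changed, grows it when it did not. */
final class AdaptiveSchedule implements SimpleFetchSchedule {
    private final float decFactor;    // assumed tuning knob, e.g. 0.5 = halve
    private final float incFactor;    // assumed tuning knob, e.g. 0.5 = +50%
    private final float minInterval;  // lower bound in seconds
    private final float maxInterval;  // upper bound in seconds

    AdaptiveSchedule(float decFactor, float incFactor, float minInterval, float maxInterval) {
        this.decFactor = decFactor;
        this.incFactor = incFactor;
        this.minInterval = minInterval;
        this.maxInterval = maxInterval;
    }

    @Override
    public void update(PageState page, long fetchTimeMillis, boolean contentChanged) {
        float interval = contentChanged
                ? page.fetchInterval * (1.0f - decFactor)
                : page.fetchInterval * (1.0f + incFactor);
        // Clamp so that a page is never fetched too often or abandoned.
        interval = Math.max(minInterval, Math.min(maxInterval, interval));
        page.fetchInterval = interval;
        page.nextFetchTime = fetchTimeMillis + (long) (interval * 1000.0);
    }
}

final class FetchScheduleDemo {
    public static void main(String[] args) {
        PageState page = new PageState(1000f, 0L);
        SimpleFetchSchedule schedule = new AdaptiveSchedule(0.5f, 0.5f, 60f, 1200f);
        schedule.update(page, 10_000L, true);   // content changed: interval shrinks
        System.out.println(page.fetchInterval + " s, next at " + page.nextFetchTime);
        schedule.update(page, 20_000L, false);  // unchanged: interval grows
        System.out.println(page.fetchInterval + " s, next at " + page.nextFetchTime);
    }
}
```

With a schedule like this, frequently changing pages drift toward the minimum interval and static pages toward the maximum, while the fixed variant reproduces a constant re-fetch period.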

> Adaptive re-fetch interval. Detecting unmodified content
> --------------------------------------------------------
>
> Key: NUTCH-61
> URL: http://issues.apache.org/jira/browse/NUTCH-61
> Project: Nutch
> Type: New Feature
> Components: fetcher
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Attachments: 20050606.diff
>
> Currently Nutch doesn't adjust automatically its re-fetch period, no matter if individual pages change seldom or frequently. The goal of these changes is to extend the current codebase to support various possible adjustments to re-fetch times and intervals, and specifically a re-fetch schedule which tries to adapt the period between consecutive fetches to the period of content changes.
> Also, these patches implement checking if the content has changed since last fetching; protocol plugins are also changed to make use of this information, so that if content is unmodified it doesn't have to be fetched and processed.

This is a very interesting patch, but I have some questions:
- If a page isn't modified, you don't refetch the page.
- If you don't refetch the page, is it still in the old segments?
- If it's in the old segments, the segment data will keep growing; how do
you analyze which segments can be deleted?
- If it's in the old segments, the usable index will get larger and
larger. Because there is a limitation (optimally 2 KB of RAM per page), will
this decrease performance or increase RAM usage?

Sorry for my performance question; this patch is very interesting and usable.

Thanks for your answer,
Ferenc

[hidden email] wrote:
> Dear Andrzej,
>
> This is a very interesting patch, but I have some questions:
> - If a page isn't modified, you don't refetch the page.

Correct - that's the whole point of this patch.

> - If you don't refetch the page, is it still in the old segments?

Yes, the page should be in some old segment. This question brings an
interesting dilemma - should I add an option to forcefully refetch a
page, in case you lost the old segment data? Hmmm... In the current code
there is an option "-adddays", but with adjustable interval this doesn't
make much sense.

> - If it's in the old segments, the segment data will keep growing; how do
> you analyze which segments can be deleted?

Well, there is no good answer to that even with the current code... You
can use the mergesegs tool to keep only the latest versions of pages. But I
agree, this patch makes the problem of handling old segments more serious
- how to "phase out" older segments.

> - If it's in the old segments, the usable index will get larger and
> larger. Because there is a limitation (optimally 2 KB of RAM per page), will
> this decrease performance or increase RAM usage?

The index (Lucene index) will not be larger - the deduplication process
takes care of that. Only the latest version of the content will show up
in the index, and for identical content only the one reachable via the
shortest URL.
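
The two rules in that answer (latest version per URL, then shortest URL per identical content) can be sketched as follows. This is only an illustration of the selection logic, not the actual Nutch deduplication code; the Doc class, its fields, and the use of a precomputed content hash are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index entry: URL, a hash of the content, and the fetch time. */
final class Doc {
    final String url;
    final String contentHash;
    final long fetchTime;

    Doc(String url, String contentHash, long fetchTime) {
        this.url = url;
        this.contentHash = contentHash;
        this.fetchTime = fetchTime;
    }
}

final class DedupSketch {
    static List<Doc> dedup(List<Doc> docs) {
        // Pass 1: for each URL keep only the most recently fetched version.
        Map<String, Doc> latestByUrl = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc prev = latestByUrl.get(d.url);
            if (prev == null || d.fetchTime > prev.fetchTime) {
                latestByUrl.put(d.url, d);
            }
        }
        // Pass 2: among documents with identical content keep the shortest URL.
        Map<String, Doc> shortestByHash = new LinkedHashMap<>();
        for (Doc d : latestByUrl.values()) {
            Doc prev = shortestByHash.get(d.contentHash);
            if (prev == null || d.url.length() < prev.url.length()) {
                shortestByHash.put(d.contentHash, d);
            }
        }
        return new ArrayList<>(shortestByHash.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("http://a.com/index.html", "h1", 1L)); // old version
        docs.add(new Doc("http://a.com/index.html", "h2", 2L)); // newer version
        docs.add(new Doc("http://a.com/", "h2", 2L));           // same content, shorter URL
        for (Doc d : dedup(docs)) {
            System.out.println(d.url + " " + d.contentHash);
        }
    }
}
```

Because old versions are dropped in the first pass, keeping a page in an old segment does not by itself make the searchable index grow.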

>
> Sorry for my performance question; this patch is very interesting and usable.

For a file system, we can directly get the modified date and store it in the db.

The plugins will look at the content date, and if it is different they will index it.

Otherwise they will not fetch it.

This can be a solution for file-based content.

(The thing is, it does away entirely with the fetch interval and decides based only on the file modification date.)
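
As a rough illustration of this suggestion, the check could look like the sketch below. It assumes the previously seen modification time is stored somewhere (here simply passed in as a long) and uses the standard java.nio.file API; the class and method names are hypothetical and not part of any Nutch plugin.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileModifiedCheck {
    /**
     * Returns true when the file's current modification time differs from
     * the one recorded at the previous fetch (the stored value is assumed
     * to come from the db).
     */
    static boolean needsRefetch(Path file, long storedModifiedMillis) {
        try {
            long current = Files.getLastModifiedTime(file).toMillis();
            return current != storedModifiedMillis;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("modcheck", ".txt");
        long stored = Files.getLastModifiedTime(tmp).toMillis();
        // Nothing touched the file since we recorded the time.
        System.out.println("refetch needed: " + needsRefetch(tmp, stored));
        Files.delete(tmp);
    }
}
```

Note that this only covers content whose source exposes a reliable modification date; it says nothing about how often to look at the file in the first place.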

This patch already supports this. Anyway, it needs to be significantly re-worked to fit into the current development version.

Is there a patch updated for the current branch, or should I take a stab at this?

I'm working on this; the patch will be available in a couple of days. I could then use your help with review and testing... ;-)

> Attachments: 20050606.diff, 20051230.txt

This patch is updated to the current trunk/. The default configuration works as before, and uses DefaultFetchSchedule.

If there are no major objections I will commit it shortly.

> Attachments: 20050606.diff, 20051230.txt, 20060227.txt

Not an objection, but a simple comment.
Why not make FetchSchedule a new ExtensionPoint, with DefaultFetchSchedule and AdaptiveFetchSchedule as fetch schedule plugins?

The main reason was that currently most of the "pluggable" extensions that result in running a single selected plugin are handled using a simple Factory pattern, as opposed to the ChainedFilter pattern, where we use extension points.

I guess the original reason was that implementations would almost always consist of a single class, so it didn't make sense to complicate it and require the whole plugin infrastructure ... It would be the same in this case (just a single class), so I followed the same pattern.

It's easy to change this to use an extension point, if people prefer it this way.
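
For readers unfamiliar with the distinction, a simple factory of the kind described here might look roughly like this: one configuration property names one implementation class, which is instantiated by reflection. Everything in this sketch is an assumption for illustration, including the property key "sketch.schedule.class" and the NamedSchedule types; it is not the factory from the patch.

```java
import java.util.Properties;

/** Hypothetical single-implementation extension selected by configuration. */
interface NamedSchedule {
    String name();
}

final class FixedNamedSchedule implements NamedSchedule {
    @Override
    public String name() { return "fixed"; }
}

final class AdaptiveNamedSchedule implements NamedSchedule {
    @Override
    public String name() { return "adaptive"; }
}

final class NamedScheduleFactory {
    /** Assumed property key; a real configuration would define its own. */
    static final String KEY = "sketch.schedule.class";

    static NamedSchedule create(Properties conf) {
        // Fall back to the fixed implementation when nothing is configured.
        String className = conf.getProperty(KEY, FixedNamedSchedule.class.getName());
        try {
            return Class.forName(className)
                    .asSubclass(NamedSchedule.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(create(conf).name());
        conf.setProperty(KEY, AdaptiveNamedSchedule.class.getName());
        System.out.println(create(conf).name());
    }
}
```

The trade-off is the one mentioned above: a factory needs only a class on the classpath and one property, whereas an extension point would bring in plugin descriptors and the plugin infrastructure for what is usually a single class.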

> Attachments: 20050606.diff, 20051230.txt, 20060227.txt, nutch-61-417287.patch