In my crawler system I have set the fetch interval to 30 days. I initially set my user agent to something like "....", and many URLs were rejected. After changing my user agent to an appropriate name, I want to re-fetch those URLs that were rejected initially.
The problem is that those URLs now have the db_gone status, with a retry interval of 45 days, so the generator won't pick them up. In this case, how can I fetch the URLs with db_gone status?

Does Nutch have any built-in option to crawl only those db_gone URLs?

Or do I need to write a separate MapReduce program to collect those URLs and use freegen to generate segments for them?
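If you do go the manual route, you may not need a full MapReduce job: a plain-text dump of the CrawlDb can be filtered with standard tools and handed to freegen. A hedged sketch follows; the exact dump record layout (URL line followed by a `Status: 3 (db_gone)` line) is an assumption based on typical `readdb -dump` output, and the paths `crawl/crawldb` and `crawl/segments` are placeholders:

```shell
# Step 1 (not run here): dump the CrawlDb to plain text.
# bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Sample of what two dump records roughly look like (assumed format):
cat > crawldb_dump.txt <<'EOF'
http://example.com/ok	Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Jul 11 09:00:00 UTC 2011

http://example.com/rejected	Version: 7
Status: 3 (db_gone)
Fetch time: Mon Jul 11 09:00:00 UTC 2011
EOF

# Step 2: remember the URL of the current record; print it whenever the
# record's Status line mentions db_gone.
awk '/^http/ { url = $1 }
     /db_gone/ { print url }' crawldb_dump.txt > gone_urls.txt

cat gone_urls.txt

# Step 3 (not run here): generate a segment from just those URLs.
# bin/nutch freegen gone_urls.txt crawl/segments
```

Check the filtered list before running freegen, since the awk filter depends entirely on the assumed dump format.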

1 Answer
You just need to configure nutch-site.xml with a different refetch interval.

ADDITION

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what its status is.
  </description>
</property>

That will not select the already-crawled URLs older than this new fetch interval; instead it would only apply this property to further URLs that we crawl.
–
sriram Jul 11 '11 at 9:29

Well, the way it works is that there are two intervals: the refetch interval, which Nutch throttles up from the default toward the hard refetch interval (by default 90 days). No matter what, once the hard refetch interval is reached, the page will be refetched.
–
millebii Jul 11 '11 at 11:57
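For reference, the two intervals described above correspond to two properties in nutch-site.xml. A sketch, with the 30-day default and 90-day ceiling used as illustrative values (values are in seconds):

```xml
<!-- Soft interval: a page is normally eligible for refetch after 30 days. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>

<!-- Hard ceiling: after 90 days every page is refetched, whatever its status. -->
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
</property>
```

Lowering db.fetch.interval.max is therefore the knob that forces pages stuck in states like db_gone to be retried sooner.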