
Agreed. What about a property fetcher.min.crawl.delay? The crawl delay will then be between this value and that of fetcher.max.crawl.delay. There is already a fetcher.server.min.delay but it's used as delay between "fetches without fetch" (eg. robots.txt denied).
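
For illustration, the bounded delay proposed here could be sketched as below. This is only a rough sketch, not code from this PR: the class `CrawlDelayClamp`, its method, and the use of milliseconds are all hypothetical, and it assumes that a delay above the maximum is simply capped, which may not match what the fetcher actually does.

```java
/** Hypothetical helper for illustration; not part of the project's code base. */
final class CrawlDelayClamp {

    private CrawlDelayClamp() {
    }

    /**
     * Bounds the crawl-delay requested by robots.txt to the range
     * [minDelayMs, maxDelayMs]. All values are in milliseconds (an
     * assumption made for this sketch).
     */
    static long clamp(long robotsDelayMs, long minDelayMs, long maxDelayMs) {
        if (minDelayMs > maxDelayMs) {
            throw new IllegalArgumentException("min delay must not exceed max delay");
        }
        // Raise too-short delays to the minimum, cap too-long ones at the maximum.
        return Math.max(minDelayMs, Math.min(robotsDelayMs, maxDelayMs));
    }
}
```

With a minimum of 1000 ms and a maximum of 30000 ms, `CrawlDelayClamp.clamp(500, 1000, 30000)` returns 1000, while a requested delay of 5000 ms is returned unchanged.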


There is already a fetcher.server.min.delay but it's used as delay between "fetches without fetch" (eg. robots.txt denied).

No, it is used in normal fetches (asap==false). Robots denied have the opposite value of asap=true.
It is used when multithreading on a single queue to define an alternative value to the normal crawl delay. This is inherited from Nutch and I agree that it is confusing. The name of the param doesn't help either (server?)

Let me have a think about it. Do we need a special case for multithreading, or would fetcher.min.crawl.delay be used no matter what? It could always be set to 0 or a low value by users who want an aggressive setting. If so, should we declare it in the same way and place as the max value?

Your thoughts and suggestions are welcome as usual


Ok, sorry, you're right.

Do we need a special case for multithreading

It could make sense if there is a custom per-queue number of threads (fetcher.maxThreads.queueId): then fetcher.server.min.delay is for queues for which you explicitly want aggressive crawling, while the other is to guarantee a minimum delay for the polite queues.


If users are concerned with being too aggressive even if permitted by robots, would they set a min value different from the default delay they specify? What about having a boolean parameter instead indicating that if a delay from robots is found and is lower than the default, we should use the latter anyway? It would be set to false by default.

As for SimpleFetcherBolt, its logic differs from FetcherBolt's when it comes to the max values (though that won't necessarily always be the case), and we could have the same logic as FetcherBolt when it comes to min values. In fact, we could have a new config indicating whether we should skip a URL if the crawl-delay set by a server is above our max value (as currently done in FetcherBolt) or enforce the max value we set (as in SimpleFetcherBolt).


or just fetcher.server.delay.force and fetcher.max.crawl.delay.force?
It will be important to document the behaviour of these configs in the default config, the code and the wiki, as their meaning is not necessarily obvious, nor do the two share the same semantics.
Thanks!

On 21 March 2018 at 13:37, Sebastian Nagel ***@***.***> wrote:
would they set a min value different from the default delay they specify?
Rarely. And it makes even more sense if there is also a boolean controlling whether to enforce the max. configured delay.
How should we name the new boolean properties? fetcher.server.delay.enforce and fetcher.max.crawl.delay.enforce, respectively?


… delay,
configurable behavior if crawl-delay exceeds max. crawl delay
- add config property fetcher.server.delay.force (default false)
  if true: the value of fetcher.server.delay is used even
  if a shorter crawl-delay is specified in robots.txt
- add config property fetcher.max.crawl.delay.force (default false)
  if true: the value of fetcher.max.crawl.delay is used even if
  the robots.txt requests a longer crawl-delay
  if false: URLs in queues with an overlong crawl-delay are skipped
- avoid repeated log messages by logging only if robots.txt crawl-delay
  differs from queue delay
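
The behaviour described in this commit message could be sketched roughly as follows. This is not the code from the commit: the class `CrawlDelayPolicy`, its field names, the use of milliseconds, and the convention that a negative value means "no crawl-delay in robots.txt" are all assumptions made for this illustration. Only the two `force` flags and their true/false semantics are taken from the description above.

```java
import java.util.OptionalLong;

/** Hypothetical sketch of the described delay resolution; not the actual implementation. */
final class CrawlDelayPolicy {

    private final long serverDelayMs;         // stands in for fetcher.server.delay
    private final long maxCrawlDelayMs;       // stands in for fetcher.max.crawl.delay
    private final boolean forceServerDelay;   // stands in for fetcher.server.delay.force
    private final boolean forceMaxCrawlDelay; // stands in for fetcher.max.crawl.delay.force

    CrawlDelayPolicy(long serverDelayMs, long maxCrawlDelayMs,
            boolean forceServerDelay, boolean forceMaxCrawlDelay) {
        this.serverDelayMs = serverDelayMs;
        this.maxCrawlDelayMs = maxCrawlDelayMs;
        this.forceServerDelay = forceServerDelay;
        this.forceMaxCrawlDelay = forceMaxCrawlDelay;
    }

    /**
     * Returns the delay to apply to a queue, or an empty value when the URLs
     * of that queue should be skipped. A negative robotsDelayMs means that
     * robots.txt specified no crawl-delay (an assumption of this sketch).
     */
    OptionalLong resolve(long robotsDelayMs) {
        if (robotsDelayMs < 0) {
            // Nothing requested by the server: fall back to the configured delay.
            return OptionalLong.of(serverDelayMs);
        }
        if (robotsDelayMs > maxCrawlDelayMs) {
            // Overlong crawl-delay: cap it if forced, otherwise skip the URLs.
            return forceMaxCrawlDelay
                    ? OptionalLong.of(maxCrawlDelayMs)
                    : OptionalLong.empty();
        }
        if (forceServerDelay && robotsDelayMs < serverDelayMs) {
            // A shorter crawl-delay is ignored when the server delay is forced.
            return OptionalLong.of(serverDelayMs);
        }
        return OptionalLong.of(robotsDelayMs);
    }
}
```

With both flags left at false, a robots.txt crawl-delay shorter than the configured delay is honoured and an overlong one yields an empty result, meaning the URLs are skipped. Setting a flag to true replaces the corresponding case with the configured value.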
