Hi,
I would like to put a brake on spiders that are hammering a site with
dynamic content generation. They should still get to see the content,
just not generate excessive load. I therefore constructed a map to
identify spiders, which works well, and then tried to
limit_req_zone $binary_remote_addr zone=slow:10m ...;

if ($is_spider) {
    limit_req zone=slow;
}
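For reference, the map is roughly along these lines (a simplified
sketch matching on the User-Agent header; the real patterns are longer,
and the bot names here are just examples):

map $http_user_agent $is_spider {
    # empty value = not a spider
    default                       "";
    # case-insensitive regex match against the User-Agent header
    ~*(googlebot|bingbot|yandex)  1;
}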
Unfortunately, limit_req is not allowed inside "if", and I don't see an
obvious way to achieve this effect otherwise.
If you have any tips, that would be much appreciated!
Kind regards,
--Toni++

Hello,
On Mon, Oct 14, 2013 at 09:25:24AM -0400, Sylvia wrote:
> Doesn't the robots.txt "Crawl-Delay" directive satisfy your needs?
I already have it in there, but I don't know how long it takes for such
a directive, or any change to robots.txt for that matter, to take
effect. Judging from the logs, the delay between changing robots.txt
and a change in robot behaviour must be several days, as I cannot see
any effect so far.
> Normal spiders should obey robots.txt; if they don't, they can be banned.
Banning Google is not a good idea, no matter how abusive they might be,
and they incidentally operate one of the robots that keep hammering the
site. I'd much prefer a technical solution that enforces such limits
over relying on convention.
I'd also like to limit the request frequency over an entire pool, so
that I can say "clients from this pool can make requests only at this
frequency, combined, not per client IP". It doesn't buy me anything to
limit each individual search robot to a decent frequency if I then get
hammered by 1000 search robots in parallel, each one observing the
request limit. Right?
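If I read http://nginx.org/r/limit_req_zone correctly, the rate is
enforced per key value, so the key variable effectively defines the
pool. Something like this is what I'm after (untested; zone names and
the bot pattern are made up):

# per client: every IP gets its own counter
limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

# pooled: every request that maps to the same key shares one counter
map $http_user_agent $pool {
    default     "";
    ~*somebot   "bots";
}
limit_req_zone $pool zone=pooled:1m rate=1r/s;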
Kind regards,
--Toni++

On Mon, Oct 14, 2013 at 01:59:23PM +0200, Toni Mueller wrote:
Hi there,
> I therefore constructed a map to
> identify spiders, which works well, and then tried to
>
> limit_req_zone $binary_remote_addr zone=slow:10m ...;
>
> if ($is_spider) {
> limit_req zone=slow;
> }
>
> If you have any tips, that would be much appreciated!

This is untested, but follows the docs at
http://nginx.org/r/limit_req_zone:
In your map, let $is_spider be empty if it is not a spider ("default",
presumably), and be something else if it is a spider (possibly
$binary_remote_addr if every client should be counted individually, or
something else if you want to group some spiders together).
Then define
limit_req_zone $is_spider zone=slow:10m ...;
instead of what you currently have.
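Putting it together, something like this (still untested; the spider
patterns and the rate are placeholders):

# http{} context
map $http_user_agent $is_spider {
    # empty key value: the request is not accounted against the limit
    default                 "";
    # one shared key: all spiders draw from one combined request budget
    ~*(googlebot|bingbot)   "spiders";
}

limit_req_zone $is_spider zone=slow:1m rate=30r/m;

server {
    location / {
        limit_req zone=slow burst=5;
    }
}

Requests with an empty key value are not accounted, so ordinary clients
are unaffected.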
f
--
Francis Daly francis@daoine.org

Hi Francis,
On Mon, Oct 14, 2013 at 03:23:03PM +0100, Francis Daly wrote:
> In your map, let $is_spider be empty if it is not a spider ("default",
> presumably), and be something else if it is a spider (possibly
> $binary_remote_addr if every client should be counted individually, or
> something else if you want to group some spiders together).
Thanks a bunch! This works like a charm!
Kind regards,
--Toni++