reduce_limit error

Devs,

I'm checking in a patch that should cut down on the number of mailing
list questions asking why a particular reduce function is hella slow.
Essentially the patch throws an error if the reduce function's return
value is at least half the size of the values array that was passed
in, i.e. if the output is not shrinking fast enough. (The check is
skipped if the return value is below a fixed size, 200 bytes for now.)
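For reference, the check described above can be sketched in plain
JavaScript (a sketch of the described heuristic, not the actual
query-server code; measuring sizes via JSON string length is an
assumption):

```javascript
// Sketch of the described check: error when the serialized reduce
// output is at least 200 bytes AND fails to shrink to less than half
// the serialized size of the input values.
function reduceLimitExceeded(inputValues, reduceOutput) {
  var inputSize = JSON.stringify(inputValues).length;
  var outputSize = JSON.stringify(reduceOutput).length;
  if (outputSize < 200) {
    return false;               // small outputs always pass
  }
  return outputSize * 2 > inputSize;
}
```

A reduce returning a small scalar (a count, a sum) always passes; one
returning a growing hash or list trips the check once it crosses the
200-byte floor.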

I expect this heuristic will need fine tuning. Ideally we'd never
raise the error on "good" reduces, and always raise it on
"ill-conceived" ones. If you hit the error on a reduce that seems like
it should be considered good, please ping the list so we have an
opportunity to fine-tune.

There is a config option:

[query_server_config]
reduce_limit = true

changing this to false will revert to the old behavior. Ideally very
few applications will require this config change, so if you find
yourself changing this setting, it's a sign you should mail the list.

Re: reduce_limit error

Thanks for the note, Chris. It might be nice if we could figure out a
way not to print "wazzup" N*1000 times when the test suite runs :-)
Other than that, cool! Should make support a good bit easier.

Adam

On May 4, 2009, at 6:08 PM, Chris Anderson wrote:

> Devs,
>
> I'm checking in a patch that should cut down on the number of mailing
> list questions asking why a particular reduce function is hella slow.
> Essentially the patch throws an error if the reduce function return
> value is not at least half the size of the values array that was
> passed in. (The check is skipped if the size is below a fixed amount,
> 200 bytes for now).
>
> I expect this heuristic will need fine tuning. Ideally we'd never
> raise the error on "good" reduces, and always raise it on
> "ill-conceived" ones. If you hit the error on a reduce that seems like
> it should be considered good, please ping the list so we have an
> opportunity to fine-tune.
>
> There is a config option:
>
> [query_server_config]
> reduce_limit = true
>
> changing this to false will revert to the old behavior. Ideally very
> few applications will require this config change, so if you find
> yourself changing this setting, it's a sign you should mail the list.
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io

Re: reduce_limit error

On Mon, May 4, 2009 at 3:23 PM, Adam Kocoloski <[hidden email]> wrote:
> Thanks for the note, Chris. It might be nice if we could figure out a way
> not to print "wazzup" N*1000 times when the test suite runs :-) Other than
> that, cool! Should make support a good bit easier.
>

Yes, that's an interesting side effect of Erlang's dump-everything
crash reporting. I'm not sure what to do about that, but a more
general fix could be handy.

Maybe it'd be worth it in this particular case to convert to an
exit(normal) after sending the error message out to linked processes.

> Adam
>
> On May 4, 2009, at 6:08 PM, Chris Anderson wrote:
>
>> Devs,
>>
>> I'm checking in a patch that should cut down on the number of mailing
>> list questions asking why a particular reduce function is hella slow.
>> Essentially the patch throws an error if the reduce function return
>> value is not at least half the size of the values array that was
>> passed in. (The check is skipped if the size is below a fixed amount,
>> 200 bytes for now).
>>
>> I expect this heuristic will need fine tuning. Ideally we'd never
>> raise the error on "good" reduces, and always raise it on
>> "ill-conceived" ones. If you hit the error on a reduce that seems like
>> it should be considered good, please ping the list so we have an
>> opportunity to fine-tune.
>>
>> There is a config option:
>>
>> [query_server_config]
>> reduce_limit = true
>>
>> changing this to false will revert to the old behavior. Ideally very
>> few applications will require this config change, so if you find
>> yourself changing this setting, it's a sign you should mail the list.
>>
>> --
>> Chris Anderson
>> http://jchrisa.net
>> http://couch.io
>

Re: reduce_limit error

On Mon, May 04, 2009 at 03:08:38PM -0700, Chris Anderson wrote:
> I'm checking in a patch that should cut down on the number of mailing
> list questions asking why a particular reduce function is hella slow.
> Essentially the patch throws an error if the reduce function return
> value is not at least half the size of the values array that was
> passed in. (The check is skipped if the size is below a fixed amount,
> 200 bytes for now).

I think that 200 byte limit is too low, as I have now had to turn off the
reduce_limit on my server for this:

RestClient::RequestFailed: 500 reduce_overflow_error (Reduce output must
shrink more rapidly. Current output: '[{"v4/24": 480,"v4/20": 10,"v4/26":
10,"v4/19": 3,"v4/27": 23,"v4/18": 1,"v4/28": 32,"v4/32": 424,"v4/25":
17,"v4/30": 28,"v4/22": 15,"v4/16": 200,"v4/29": 74,"v4/21": 1,"v4/14":
41,"v4/12": 1,"v4/13": 1,"v4/17": 4,"v4/11": 1}]')

I'd have thought a threshold of 4KB would be safe enough?

Re: reduce_limit error

> On Mon, May 04, 2009 at 03:08:38PM -0700, Chris Anderson wrote:
>> I'm checking in a patch that should cut down on the number of mailing
>> list questions asking why a particular reduce function is hella slow.
>> Essentially the patch throws an error if the reduce function return
>> value is not at least half the size of the values array that was
>> passed in. (The check is skipped if the size is below a fixed amount,
>> 200 bytes for now).
>
> I think that 200 byte limit is too low, as I have now had to turn off the
> reduce_limit on my server for this:
>
> RestClient::RequestFailed: 500 reduce_overflow_error (Reduce output must
> shrink more rapidly. Current output: '[{"v4/24": 480,"v4/20": 10,"v4/26":
> 10,"v4/19": 3,"v4/27": 23,"v4/18": 1,"v4/28": 32,"v4/32": 424,"v4/25":
> 17,"v4/30": 28,"v4/22": 15,"v4/16": 200,"v4/29": 74,"v4/21": 1,"v4/14":
> 41,"v4/12": 1,"v4/13": 1,"v4/17": 4,"v4/11": 1}]')
>
> I'd have thought a threshold of 4KB would be safe enough?
>

That looks an awful lot like a "wrong" kind of reduce function. Is
there a reason why you don't just emit map keys like "v4/24" and use a
normal row-counting reduce? It looks like this reduce would eventually
overwhelm the interpreter, as your set of hash keys looks like it may
grow without bounds as it encounters more data.
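The alternative being suggested might look roughly like this (a
sketch; the doc.prefix field name is an assumption, and emit is passed
in as a parameter so the snippet runs outside CouchDB, where emit() is
a built-in):

```javascript
// Map: emit one row per prefix. Inside CouchDB the global emit()
// would be used; here it is a parameter to keep the sketch
// self-contained.
function prefixMap(doc, emit) {
  if (doc.prefix) {             // assumed field, e.g. "v4/24"
    emit(doc.prefix, 1);
  }
}

// Reduce: a plain row-counting reduce. The output is always a single
// number, so it can never trip the reduce_limit check. Query with
// ?group=true to get one count per prefix.
function prefixReduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}
```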

Perhaps I'm wrong. 200 bytes is a bit small, but I'd be worried that
with 4kb users wouldn't get a warning until they had moved a "bad"
reduce to production data.

If your reduce is ok even on giant data sets, maybe you can experiment
with the minimum value in share/server/views.js line 52 that will
allow you to proceed.

Re: reduce_limit error

On Tue, May 05, 2009 at 01:19:10PM -0700, Chris Anderson wrote:
> It looks like this reduce would eventually
> overwhelm the interpreter, as your set of hash keys looks like it may
> grow without bounds as it encounters more data.

As you can probably see, it's counting IP address prefixes, and it's
bounded. Even encountering all possible IPv4 prefixes (/0 to /32) and IPv6
(/0 to /128), there will never be any more than 162 keys in the hash.
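For context, the reduce being described presumably looks something
like the following (a reconstruction from the error output above; the
exact shape of the map values is an assumption):

```javascript
// Merge per-row prefixes into one bounded hash of counts. The key set
// is bounded (at most ~162 possible prefixes), so the output cannot
// grow without limit, but it still exceeds the 200-byte floor.
function prefixCountReduce(keys, values, rereduce) {
  var counts = {};
  values.forEach(function (v) {
    if (rereduce) {
      // v is a {prefix: count} hash from a previous reduce pass
      for (var p in v) {
        counts[p] = (counts[p] || 0) + v[p];
      }
    } else {
      // v is a prefix string emitted by the map, e.g. "v4/24"
      counts[v] = (counts[v] || 0) + 1;
    }
  });
  return counts;
}
```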

> Perhaps I'm wrong. 200 bytes is a bit small, but I'd be worried that
> with 4kb users wouldn't get a warning until they had moved a "bad"
> reduce to production data.

It's not so much a warning as a hard error :-)

> If your reduce is ok even on giant data sets, maybe you can experiment
> with the minimum value in share/server/views.js line 52 that will
> allow you to proceed.

In my case, I'm happy to turn off the checking entirely. I was just
following the request in default.ini:

; If you think you're hitting reduce_limit with a "good" reduce function,
; please let us know on the mailing list so we can fine tune the heuristic.

Re: reduce_limit error

> On Tue, May 05, 2009 at 01:19:10PM -0700, Chris Anderson wrote:
>> It looks like this reduce would eventually
>> overwhelm the interpreter, as your set of hash keys looks like it may
>> grow without bounds as it encounters more data.
>
> As you can probably see, it's counting IP address prefixes, and it's
> bounded. Even encountering all possible IPv4 prefixes (/0 to /32) and IPv6
> (/0 to /128), there will be never be any more than 162 keys in the hash.
>
>> Perhaps I'm wrong. 200 bytes is a bit small, but I'd be worried that
>> with 4kb users wouldn't get a warning until they had moved a "bad"
>> reduce to production data.
>
> It's not so much a warning as a hard error :-)
>
>> If your reduce is ok even on giant data sets, maybe you can experiment
>> with the minimum value in share/server/views.js line 52 that will
>> allow you to proceed.
>
> In my case, I'm happy to turn off the checking entirely. I was just
> following the request in default.ini:
>
> ; If you think you're hitting reduce_limit with a "good" reduce function,
> ; please let us know on the mailing list so we can fine tune the heuristic.
>

Gotcha, your reduce seems ok given the bounded nature of the data set.
Still I'm not clear why you don't just have a map with keys like:

Re: reduce_limit error

On Tue, May 05, 2009 at 03:21:55PM -0700, Chris Anderson wrote:
> Gotcha, your reduce seems ok given the bounded nature of the data set.
> Still I'm not clear why you don't just have a map with keys like:

It's because this is a query I expect to run often (e.g. once every page
hit), and I am imagining it will be cheaper to pick a single object out of
the root node, rather than walk all over the view btree, grouping keys and
re-reducing.

Re: reduce_limit error

> On Mon, May 04, 2009 at 03:08:38PM -0700, Chris Anderson wrote:
>> I'm checking in a patch that should cut down on the number of mailing
>> list questions asking why a particular reduce function is hella slow.
>> Essentially the patch throws an error if the reduce function return
>> value is not at least half the size of the values array that was
>> passed in. (The check is skipped if the size is below a fixed amount,
>> 200 bytes for now).
>
> I think that 200 byte limit is too low, as I have now had to turn off the
> reduce_limit on my server for this:

I'm using a reduce function to sort data so that clients can query
for the most recent piece of data. For example:

function most_recent_reading_map(doc) {
  if (doc.type === "TemperatureReading") {
    emit(doc.station_id, doc);
  }
}

function most_recent_reading_reduce(keys, values) {
  var sorted = values.sort(function (a, b) {
    return b.created_at.localeCompare(a.created_at);
  });
  return sorted[0];
}

The main reason I might do this is to simplify client logic, but
another valid reason is to prevent sending and processing
unnecessarily large chunks of JSON.

This kind of reduce function may fall foul of the
reduce_overflow_error, but only if the document is greater than 200
bytes. So, I'm echoing the opinion that 200 bytes is too low. I also
believe that throwing an exception is a bit draconian as it could
result in an unjustified failure in production. I think a warning
would be more appropriate.

Re: reduce_limit error

You can turn it off (caveat emptor);

; Changing reduce_limit to false will disable reduce_limit.
; If you think you're hitting reduce_limit with a "good" reduce function,
; please let us know on the mailing list so we can fine tune the heuristic.
[query_server_config]
reduce_limit = true

> On Tue, May 5, 2009 at 8:50 PM, Brian Candler<[hidden email]> wrote:
>> On Mon, May 04, 2009 at 03:08:38PM -0700, Chris Anderson wrote:
>>> I'm checking in a patch that should cut down on the number of mailing
>>> list questions asking why a particular reduce function is hella slow.
>>> Essentially the patch throws an error if the reduce function return
>>> value is not at least half the size of the values array that was
>>> passed in. (The check is skipped if the size is below a fixed amount,
>>> 200 bytes for now).
>>
>> I think that 200 byte limit is too low, as I have now had to turn off the
>> reduce_limit on my server for this:
>
> I'm using a reduce function to sort data in so that clients can query
> for the most recent piece of data. For example
>
> function most_recent_reading_map(doc) {
>   if (doc.type === "TemperatureReading") {
>     emit(doc.station_id, doc);
>   }
> }
>
> function most_recent_reading_reduce(keys, values) {
>   var sorted = values.sort(function (a, b) {
>     return b.created_at.localeCompare(a.created_at);
>   });
>   return sorted[0];
> }
>
> The main reason I might to do this is to simplify client logic, but
> another valid reason is to prevent sending and processing
> unnecessarily large chunks of JSON.
>
> This kind of reduce function may fall foul of the
> reduce_overflow_error, but only if the document is greater than 200
> bytes. So, I'm echoing the opinion that 200 bytes is too low. I also
> believe that throwing an exception is a bit draconian as it could
> result in an unjustified failure in production. I think a warning
> would be more appropriate.
>
> Paul
>

Re: reduce_limit error

> On Tue, May 5, 2009 at 8:50 PM, Brian Candler<[hidden email]> wrote:
>> On Mon, May 04, 2009 at 03:08:38PM -0700, Chris Anderson wrote:
>>> I'm checking in a patch that should cut down on the number of mailing
>>> list questions asking why a particular reduce function is hella slow.
>>> Essentially the patch throws an error if the reduce function return
>>> value is not at least half the size of the values array that was
>>> passed in. (The check is skipped if the size is below a fixed amount,
>>> 200 bytes for now).
>>
>> I think that 200 byte limit is too low, as I have now had to turn off the
>> reduce_limit on my server for this:
>
> I'm using a reduce function to sort data in so that clients can query
> for the most recent piece of data. For example
>
> function most_recent_reading_map(doc) {
>   if (doc.type === "TemperatureReading") {
>     emit(doc.station_id, doc);
>   }
> }
>
> function most_recent_reading_reduce(keys, values) {
>   var sorted = values.sort(function (a, b) {
>     return b.created_at.localeCompare(a.created_at);
>   });
>   return sorted[0];
> }
>

You should never accumulate a list in a reduce function...

If you want to create a compressed final JSON output, the thing to do
would be to run a list function on a group reduce query, and have it
make the final aggregate. That way you don't end up with an infinitely
long overflowing list in your reduce values.
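The list-function approach could be sketched like this (hypothetical;
getRow() and send() are the standard primitives CouchDB provides to
list functions):

```javascript
// A _list function run over a ?group=true reduce query: build the
// final compact JSON here instead of growing a structure inside
// reduce. getRow() and send() are free variables supplied by
// CouchDB's list-function environment.
function aggregateList(head, req) {
  var out = {};
  var row;
  while ((row = getRow()) !== null) {
    out[row.key] = row.value;   // one compact entry per group key
  }
  send(JSON.stringify(out));
}
```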

> The main reason I might to do this is to simplify client logic, but
> another valid reason is to prevent sending and processing
> unnecessarily large chunks of JSON.
>
> This kind of reduce function may fall foul of the
> reduce_overflow_error, but only if the document is greater than 200
> bytes. So, I'm echoing the opinion that 200 bytes is too low. I also
> believe that throwing an exception is a bit draconian as it could
> result in an unjustified failure in production. I think a warning
> would be more appropriate.
>
> Paul
>

Re: reduce_limit error

>> function most_recent_reading_map(doc) {
>>   if (doc.type === "TemperatureReading") {
>>     emit(doc.station_id, doc);
>>   }
>> }
>>
>> function most_recent_reading_reduce(keys, values) {
>>   var sorted = values.sort(function (a, b) {
>>     return b.created_at.localeCompare(a.created_at);
>>   });
>>   return sorted[0];
>> }
```
>>
>
> you should never accumulate a list in a reduce function...
>
> if you want to create a compressed final JSON output, the thing to do
> would be to run a list function on a group reduce query, and have it
> make the final aggregate. that way you don't end up with an infinitely
> long overflowing list in your reduce values.

Given that the reduce function returns a single value, I don't
understand why you consider it to be accumulating a list. I see it as
being roughly equivalent to returning a very large scalar.

As the argument against having the output of a reduce function grow
too fast is based on degraded performance with a large dataset, I ran
a test case with 1 million docs. The query returns in about 0.03s,
which is significantly faster than a group_level based query against a
dataset of a similar size.

Re: reduce_limit error

On Sun, Aug 16, 2009 at 11:18:32AM -0700, Chris Anderson wrote:
> you should never accumulate a list in a reduce function...

He's not. He's just returning a single document, which is the one with the
largest value of created_at.

IMO that's a pretty reasonable thing to do, because it's not unbounded
expansion: the largest reduce value will be the largest single document in
the database. Note that we explicitly say it's OK to emit an entire document
as a value in a k/v index, i.e. emit(null, doc), and his reduce function is
just picking one of those 'values'.
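In runnable form, the bounded pick-the-newest reduce under discussion
might look like this (created_at is taken from the earlier example;
assuming ISO-style timestamps that compare lexicographically):

```javascript
// Keep only the value with the largest created_at. The same logic
// works for both reduce and rereduce, since inputs and output share
// one shape, and the output is bounded by the size of one document.
function mostRecentReduce(keys, values, rereduce) {
  return values.reduce(function (best, v) {
    return v.created_at > best.created_at ? v : best;
  });
}
```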

Re: reduce_limit error

> On Sun, Aug 16, 2009 at 05:55:56PM +0100, Robert Newson wrote:
>> You can turn it off (caveat emptor);
>>
>> ; Changing reduce_limit to false will disable reduce_limit.
>> ; If you think you're hitting reduce_limit with a "good" reduce function,
>> ; please let us know on the mailing list so we can fine tune the heuristic.
>> [query_server_config]
>> reduce_limit = true
>
> I think that it should be a limit in bytes, not a true/false. I have also
> had cases where I generate a limited-sized reduce value which is a bit
> larger than 200 bytes.
>

Oh I see that now - returning the first item in the list makes more sense.

As far as a byte limit for the threshold, that would be entirely
possible and not that hard to patch. I'd love to see a patch along
these lines. I like the suggestion to make the threshold configurable
instead of just an on/off switch.
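If the threshold does become configurable, default.ini might end up
with something along these lines (purely hypothetical; no such form
exists as of this thread):

```ini
[query_server_config]
; hypothetical: a size threshold in bytes rather than a boolean switch
reduce_limit = 4096
```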