if flush times out in the final stage (janitor re-creating vbuckets), it returns success, causing clients to see TMPFAIL after flush succeeds (WAS: there are reports that even after invoking FLUSH nodes return TMPFAIL...)

Description

"reported by SDK team"
After a flush (through REST), server nodes keep returning TMPFAIL for some time. This is not expected; it is annoying from an application's perspective and can cause problems with automated tests.

update:

I've updated the subject. Indeed this is possible. But IMHO, given that clients should always be prepared to handle TMPFAIL from any operation, I've lowered the priority to Minor.

further update:

I disagree on "always be able to handle TMPFAIL", especially in the case of running tests on the SDK side. At the moment we ask our users to handle TMPFAIL directly. That's intentional and makes sense as a pressure-relief valve in steady state, but around these flushes, and especially in unit tests, it would be best if the cluster could block either the operation request or the flush response until the flush is complete.

Attachments

Issue Links

relates to

MB-8956[RN'd] - if flush times out final stage (janitor creating vbuckets back) it returns success causing clients to see TMPFAIL after flush succeeds (WAS: there are reports that even after invoking FLUSH nodes return TMPFAIL...)

Matt Ingenthron
added a comment - 12/Nov/12 1:58 AM Sergey: this issue originally came from you.
Can you:
1) let us know if you still see this issue and
2) let us know if you'd opened a bug previously?
Thanks.

Aleksey Kondratenko (Inactive)
added a comment - 13/Nov/12 6:04 PM Sergey, I see potential for confusion. Is this related to the slow flush you're seeing? Are you sure you're getting a 200 back?
Please describe your case better.
As for the workaround, it's basically the same as for any TMPFAIL error: just keep retrying.
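The "just keep re-trying" workaround can be sketched as a retry loop with a capped backoff. This is an illustrative sketch only: the client object, its `set` method, and the `TempFailError` name are hypothetical stand-ins for whatever SDK is in use, not a real Couchbase client API.

```python
import time

class TempFailError(Exception):
    """Hypothetical stand-in for an SDK's temporary-failure error."""

def set_with_retry(client, key, value, attempts=10, base_delay=0.05):
    """Retry a store operation on TMPFAIL with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return client.set(key, value)
        except TempFailError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            # back off: 0.05s, 0.1s, 0.2s, ... capped at 1s
            time.sleep(min(base_delay * 2 ** attempt, 1.0))
```

The cap matters: without it, a cluster that stays in TMPFAIL for the full flush window would push individual waits into tens of seconds.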

Aleksey Kondratenko (Inactive)
added a comment - 07/Mar/13 1:25 PM Matt also said the following over email:
BUT recent testing by our team indicates the fsync is not the underlying cause. UPDATE, as of 3/7, MB-6232 has now been moved from Backlog to 2.0.2. Need to reopen MB-7160 with current findings
I'm waiting for those details that could indicate there's some other reason for the TMPFAILs.

Matt Ingenthron
added a comment - 21/Mar/13 5:02 PM This may or may not be directly related to the specific issue reported here. It seems to be a contributing factor anyway.
Trond recently carried out a series of experiments to test whether the flush slowness was due to the underlying MB-6232, as previously suspected. MB-6232 describes how fsyncs and serially working through vbuckets cause much of the creation/deletion slowness we see.
To isolate this and find a solution for a particular deployment, a number of tests were run. In the last experiment, two tests were run:
1) Reduce the number of vbuckets to 8, but otherwise use a real filesystem, disks, etc. (~2 sec)
2) Use ramfs (~9.5 sec)
1) Reduce the number of vbuckets:
I uninstalled 2.0.1-build 170 from my box and removed /opt/couchbase before reinstalling it and stopping it immediately. Then I edited /opt/couchbase/bin/couchbase-server and added (right under the license header):
COUCHBASE_NUM_VBUCKETS=8
export COUCHBASE_NUM_VBUCKETS
This configures the server to use only 8 vbuckets instead of the default 1024. The number HAS to be a power of two, and should not be less than the number of nodes in your cluster (otherwise some nodes will have no vbuckets at all).
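Both constraints on the vbucket count are easy to check up front. The following is an illustrative sketch (not part of any Couchbase tooling), using the standard bit trick that a positive power of two has exactly one bit set:

```python
def valid_num_vbuckets(n: int, num_nodes: int) -> bool:
    """Check a candidate COUCHBASE_NUM_VBUCKETS value.

    It must be a power of two, and should be at least the node count,
    or some nodes end up holding no vbuckets.
    """
    is_power_of_two = n > 0 and (n & (n - 1)) == 0
    return is_power_of_two and n >= num_nodes

print(valid_num_vbuckets(8, 4))     # True: power of two, >= 4 nodes
print(valid_num_vbuckets(12, 4))    # False: 12 is not a power of two
print(valid_num_vbuckets(8, 16))    # False: fewer vbuckets than nodes
```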
Running the same program as before (which flushes an empty bucket) now takes ~2 secs on a filesystem mounted with "barrier=1" and less than a second (0.7 s) on a filesystem mounted with "barrier=0".
2) Using ramfs
I created the ramfs by running:
mount -t ramfs -o size=100m ramfs /tmp/couchbase_data
chown couchbase:couchbase /tmp/couchbase_data
Then installed couchbase and had it use /tmp/couchbase_data for storage.
Running a flush takes ~9.5 sec for an empty bucket. Reducing the number of vbuckets to 8, as described in 1), brought it down to 0.36 s.
— — —
In an earlier set of experiments, there was more measurement of where time was going. I've edited this slightly from Trond:
Test rig: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 4GB memory.
The system uses ext4 as the filesystem, with 2GB of memory given to Couchbase (and the two sample databases installed during installation).
The test was run with the following PHP script, intended to simulate the series of flushes performed as unit tests run:
<?php
$counter = 1;
$errors = 0;
$cb = new Couchbase("localhost", "Administrator", "password");
while ($counter < 11) {
    try {
        echo "\rFlushing: " . $counter . " Errors: " . $errors;
        $cb->flush();
    } catch (CouchbaseException $e) {
        $errors++;
    }
    $counter++;
}
?>
When I initially ran the script it reported:
trond@desktop:~/compile/sdk/php$ time php -c php.ini example/buckets.php
Flushing: 10 Errors: 8
real 5m36.613s
user 0m0.012s
sys 0m0.020s
Please note that we can't really "trust" the numbers above when there were errors, because I don't check why a call failed (the next one could, for instance, fail because a flush is already running, etc.).
When I remounted the filesystem with barrier=0 and I got:
trond@desktop:~/compile/sdk/php$ time php -c php.ini example/buckets.php
Flushing: 10 Errors: 0
real 3m4.609s
user 0m0.028s
sys 0m0.008s
At least all of the flush calls succeed now, but look at the time: it takes more than 3 minutes to run 10 flush commands. For a user running 500 unit tests, each separated by a flush, just the flush commands would add roughly 2 1/2 hours to the test cycle (and depending on how much data they add, it may take longer).
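The "roughly 2 1/2 hours" figure follows directly from the measured run above; a quick back-of-envelope check:

```python
# 10 flushes of an empty bucket took "real 3m4.609s" with barrier=0.
ten_flush_seconds = 3 * 60 + 4.609

# Scale to 500 unit tests, each preceded by one flush.
per_flush = ten_flush_seconds / 10          # ~18.5 s per flush
five_hundred = per_flush * 500              # total flush time for 500 tests

print(f"{per_flush:.1f} s per flush")
print(f"{five_hundred / 3600:.2f} h for 500 flushes")
```

That is ~18.5 s per flush of an *empty* bucket, or about 2.56 hours of pure flush time across a 500-test cycle.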
What I do find interesting is that when I run top while doing this, I see beam.smp (the Erlang VM) using 100-170% CPU (i.e. almost two cores), whereas memcached is relatively idle. From what I can see, the flush sets all of the vbuckets to "dead" before activating them again.
I know we don't have stats for everything we do, but running the "timings" cbstats command shows (omitting the distribution printout):
set_vb_cmd (2048 total)
Avg : ( 27us)
del_vb_cmd (2048 total)
Avg : ( 44us)
disk_vbstate_snapshot (1622 total)
Avg : ( 6ms)
disk_vb_del (1024 total)
Avg : ( 3ms)
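As a sanity check, multiplying each operation's count by its average latency and summing (assuming the operations run serially) reproduces the rough total discussed next:

```python
# cbstats "timings" figures quoted above: (count, average latency in seconds)
timings = {
    "set_vb_cmd": (2048, 27e-6),            # 2048 ops @ 27 us
    "del_vb_cmd": (2048, 44e-6),            # 2048 ops @ 44 us
    "disk_vbstate_snapshot": (1622, 6e-3),  # 1622 ops @ 6 ms
    "disk_vb_del": (1024, 3e-3),            # 1024 ops @ 3 ms
}

totals = {op: count * avg for op, (count, avg) in timings.items()}
grand_total = sum(totals.values())

print(f"disk_vbstate_snapshot alone: {totals['disk_vbstate_snapshot']:.2f} s")
print(f"total if serial: {grand_total:.2f} s")
```

The snapshot calls alone account for ~9.7 s of the ~13 s total, matching the observation below.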
If I add all of these up (assuming they all run in sequence) we end up at roughly 13 secs, but "time" on the php process doing a single flush reports:
real 0m23.613s
user 0m0.016s
sys 0m0.012s
So there are about 10 secs not accounted for, and this seems to vary (starting php and doing a single set takes 0m0.033s, so it's not PHP runtime overhead).
I'm not entirely sure what "disk_vbstate_snapshot" is used for, but it accounts for roughly 9 of the 13 secs.
When I disabled that function in ep-engine, the example program above running those 10 flushes gave:
Flushing: 10 Errors: 0
real 0m22.411s
user 0m0.020s
sys 0m0.012s
Interestingly enough, beam.smp is now down to ~100% CPU. For fun I ran the script with 500 flushes (to see if it changed over time), and we're down to less than 18 minutes of flushing (which isn't that bad).
I know we added the call to disk_vbstate_snapshot for a reason, so just removing it isn't an option. I am, however, pretty sure we don't need to snapshot 1600+ times during a flush (I would have guessed two would do the trick: one when all vbuckets are disabled, one when they are all back up again). It is ns_server, not ep-engine, that needs to ensure the cluster map is consistent after a potential crash during the process. To protect itself from coming back up in a "hosed" configuration after a crash mid-flush, ns_server could simply persist a flag saying a flush is in progress before it starts, then clear the flag once it receives the response that all of the vbuckets are dead.
Personally, I think we should change the code so that ns_server sends a SINGLE message to ep-engine listing ALL of the vbuckets it wants to shut down, and gets a single return message when that is done (we could then take a SINGLE vbucket snapshot once all of them are disabled "atomically"), followed by a SINGLE enable message (again returning in a single snapshot_vbucket_state).
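The proposed change can be illustrated by counting messages and snapshots in each scheme. This is a toy model of the idea only, not the real ns_server/ep-engine protocol:

```python
NUM_VBUCKETS = 1024

def per_vbucket_flush(n: int):
    """Current scheme: one dead + one active transition per vbucket,
    with a state snapshot after each transition."""
    messages = 2 * n
    snapshots = 2 * n
    return messages, snapshots

def batched_flush(n: int):
    """Proposed scheme: one 'all dead' message and one 'all active'
    message, with a single snapshot after each batch completes."""
    messages = 2
    snapshots = 2
    return messages, snapshots

print(per_vbucket_flush(NUM_VBUCKETS))  # (2048, 2048)
print(batched_flush(NUM_VBUCKETS))      # (2, 2)
```

With disk_vbstate_snapshot averaging ~6 ms, going from ~1600 snapshots to 2 removes nearly all of the ~9.7 s that call contributes per flush.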
Right now we also add FLUSH markers to our TAP connections. Given that ns_server orchestrates the entire process, it could just shut down the TAP connections before running the flush command and restart the replication chains when it's done (the database should be empty anyway, and we wouldn't have to add special flush logic to the TAP streams).
It would be interesting to know whether all of this message passing and constant vbucket state change accounts for the heavy CPU usage during the flush.

Aleksey Kondratenko (Inactive)
added a comment - 08/Oct/13 8:41 PM Don't get the "or" above. This is not fixed yet. It is fixable relatively easily (assuming a 204 is OK). But I won't be crying if it doesn't make 3.0.

Aleksey Kondratenko (Inactive)
added a comment - 08/Oct/13 11:30 PM Matt, let's be specific here. This ticket is only about returning a 204 when flush takes longer than 30 seconds.
There's another ticket on my team to investigate the reported slowness in cases where we know it's not fsyncs. Once we investigate, we may or may not be able to improve it.