event handler: ok
[user:info,2012-09-12T17:06:06.245,ns_1@10.3.121.13:'ns_memcached-default':ns_memcached:terminate:625]Shutting down bucket "default" on 'ns_1@10.3.121.13' for server shutdown
[ns_server:info,2012-09-12T17:06:06.261,ns_1@10.3.121.13:'ns_memcached-default':ns_memcached:terminate:636]This bucket shutdown is not due to bucket deletion. Doing nothing
[ns_server:debug,2012-09-12T17:06:06.276,ns_1@10.3.121.13:<0.28651.34>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

Filipe Manana (Inactive)
added a comment - 11/Oct/12 5:51 PM Tony, do you think you can save all the files (database, indexes, etc) from the moment the crash happens?
I think that would be more helpful than pasting a stack trace every time it happens.

Right now we think this might be related to another bug spotted on Erlang VMs. When started with async threads and using "raw" file descriptors, if the process that opened the file is shut down or crashes abnormally, the file descriptor is leaked.

damien
added a comment - 12/Oct/12 4:00 PM Right now we think this might be related to another bug spotted on Erlang VMs. When started with async threads and using "raw" file descriptors, if the process that opened the file is shut down or crashes abnormally, the file descriptor is leaked.
One possible fix is to change the Erlang startup parameters to turn off async file IO, so this:
> erl +A 16 +sbt u +P 327680 +K true
becomes:
> erl +sbt u +P 327680 +K true
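For reference, here is what each of the flags in the lines above does (a sketch annotating the command already shown; these are standard Erlang/OTP `erl` flags, not Couchbase-specific ones):

```shell
# Erlang VM startup flags used above:
#   +A N     size of the async IO thread pool; omitting +A disables
#            async file IO (the proposed fix)
#   +sbt u   scheduler thread binding type: unbound
#   +P N     maximum number of Erlang processes
#   +K true  enable kernel poll (epoll/kqueue)
erl +sbt u +P 327680 +K true
```

With `+A` removed, file IO is performed on the scheduler threads instead of a separate thread pool, which is why blocking IO becomes a concern (see the next comment).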

Aleksey Kondratenko (Inactive)
added a comment - 12/Oct/12 4:24 PM Just keep in mind that without async IO we'll have massive timeouts all over the place. We hit exactly this problem in the early days of 1.6.0.

Farshid Ghods (Inactive)
added a comment - 17/Oct/12 8:15 PM According to Karan, he has not seen any crashes or rebalance timeouts on clusters doing a lot of rebalancing:
1- 2.0 cluster doing views, 2 buckets, 30M items, 2 ddocs, 2 views, 8 nodes, under 30k ops/sec and 300 queries per second
2- 20+ node cluster running 1.8.x key-value use cases
The performance team is also running more performance tests to verify that the +S 12 option works (without +A).
We have not yet run XDCR system tests with these settings.


damien
added a comment - 23/Oct/12 1:50 PM Mounting evidence suggests this bug is caused by the +A Erlang startup setting, which turns on async IO for port drivers by using a pool of threads to perform the IO. We still don't understand the root cause, but it appears there is a race condition/cache coherency problem with how port drivers are freed in the VM.
Reassigning to Alk, as he will check in the change to disable the async threads and bump up the number of schedulers to mitigate problems with timeouts due to blocking IO.
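Concretely, the change described above might look like the following edit to the startup line (a sketch only: the scheduler count of 12 is taken from the +S 12 runs the performance team mentioned, and the exact flags in Couchbase's start scripts may differ):

```shell
# Before: 16 async IO threads (the suspected trigger of the port-driver race)
erl +A 16 +sbt u +P 327680 +K true

# After (sketch): async pool removed; +S 12 bumps/pins the scheduler count
# so blocking file IO on schedulers is less likely to cause timeouts
erl +S 12 +sbt u +P 327680 +K true
```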

Added to RN: Couchbase Server intermittently crashed during rebalance due to Erlang virtual machine issues; we now disable asynchronous threads and perform garbage collection more often to avoid timeouts and process crashes.