Sam Bisbee
added a comment - 27/Mar/12 19:21 FYI, this issue was found and resolved in the Debian package a while ago (v0.10.1-2). Here's a link to the patch file: http://patch-tracker.debian.org/patch/series/view/couchdb/0.11.0-2.3/init.patch
Cheers.

Wendall Cada
added a comment - 27/Mar/12 19:31 The Debian patch would need to be reworked a bit, but I like the approach of checking for the parent heart process and waiting until it exits before returning a success. However, I don't know if it's overkill, or if it even matters.

Wendall Cada
added a comment - 28/Mar/12 10:43 See: use-sname-rpc-not-kill.patch
Here is what I figured out while testing. The whole concept of using a PID file and kill -1 $PID with erlang is just not going to work consistently.
Here is a way to replicate what sometimes happens when a restart (stop/start) is issued and beam hasn't stopped yet.
For example, try: couchdb -b && couchdb -d && couchdb -b
Apache CouchDB has started, time to relax.
Apache CouchDB is not running.
Apache CouchDB has started, time to relax.
$ echo `cat /var/run/couchdb/couchdb.pid`
10229
$ ps -A | grep beam.smp
10193 pts/2 00:00:00 beam.smp
However, adding -sname couchdb to the command options results in the second start failing silently, although couchdb does stop. A stale PID is left in the PID file by the second start command.
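To illustrate the stale-PID condition, here is a minimal sketch of a check that compares the PID file against the process table. It is not taken from the patch or from the couchdb script; the function name `pidfile_status` and its output strings are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the couchdb script): report whether
# the PID recorded in a PID file refers to a live process.
pidfile_status() {
    pidfile=$1
    # A missing or empty file means nothing has been recorded.
    if [ ! -s "$pidfile" ]; then
        echo "missing"
        return 0
    fi
    pid=$(cat "$pidfile")
    # kill -0 delivers no signal; it only tests whether the PID exists
    # and whether the caller may signal it. A process owned by another
    # user would therefore also be reported as "stale" here.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
}
```

Calling `pidfile_status /var/run/couchdb/couchdb.pid` after the sequence above would be expected to report a stale entry, assuming the recorded PID no longer exists.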
I then modified start_couchdb so that it actually checks whether the process ID returned from the erl command is running, and then waits 2 seconds so the PID file can hit the disk. I also modified stop_couchdb to eliminate the use of kill -1 and to wait for the process to actually exit. Now everything works as intended, no matter what bizarre scenario is encountered.
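The "wait for the process to actually exit" step can be sketched roughly as follows. This is only an illustration of polling for exit, not the code in the patch; `wait_for_exit` and its default timeout are assumptions, and how the shutdown itself is requested is left out.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch): poll until the given PID is
# gone, or give up after a timeout in seconds (default 10, an assumption).
wait_for_exit() {
    pid=$1
    timeout=${2:-10}
    elapsed=0
    # kill -0 sends no signal; it only checks that the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # still running after the timeout
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0            # process has exited
}
```

A stop routine could report success only once `wait_for_exit "$pid"` returns 0, so that a following start never races with a beam process that is still shutting down.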
So for just pure stupid, I can do this:
for i in {1..5}; do couchdb -d; couchdb -b; done
The last command is a start and sure enough, couchdb is running and has restarted completely five times.
Same stupid in reverse:
for i in {1..5}; do couchdb -b; couchdb -d; done
CouchDB is stopped.
Now clearly there is going to be an issue with the use of sname when multiple couchdb instances are up and running, but I think it will be worthwhile to fix. Every resource I have read, along with my own experience with Erlang, says that using kill to shut down is just waiting for problems.
I've temporarily appended the pid to start and stop messages for clarity on what's happening.

Wendall Cada
added a comment - 14/Mar/13 15:31 Randall tested and gives it a +1 in this post to the dev list http://mail-archives.apache.org/mod_mbox/couchdb-dev/201303.mbox/%3CCAAL6JQjuiSQOkjuF6jLoZB_ee3Ki7bouxeF1grW9CQ6dFTms8A%40mail.gmail.com%3E

Wendall Cada
added a comment - 14/Mar/13 16:09 I'm thinking that we may want to land this as well. https://github.com/apache/couchdb/commit/410f4c980e6f3dbb02f0432280523e19210bb83e
This would need a solution for Windows, maybe taskkill, but I'd defer to Dave on this.

Dave Cottlehuber
added a comment - 14/Mar/13 19:01 Re the kill-9 branch: I don't want to distribute a third-party binary (or a dependency on one) for Windows. @jan, what's the problem we are trying to fix here? The noisy logs around _restart? Or something else that I'm missing?

Dave Cottlehuber
added a comment - 14/Mar/13 23:26 We discussed options on IRC. taskkill has been part of Windows from XP onwards, so this would leave Windows 2000 out. I'm not crying over this one. It doesn't require admin privileges, i.e. you can kill your own procs. So +1 to moving ahead with an amended version of this.
Ref: http://ss64.com/nt/taskkill.html and http://technet.microsoft.com/en-us/library/bb491009.aspx
The full taskkill.exe should be specified.

Wendall Cada
added a comment - 18/Mar/13 15:59 My opinion is that this has been a blocker for 2+ years. It makes issuing a restart on a production service a risky operation, one that can leave the service in a state where it cannot be started without manually killing all of the orphans left behind.
I agree that the other patch should be a separate ticket. I'll do so right now.