You are never gonna keep it down

Checking Health

Start by getting the list of etcd servers in your CF deployment:

bosh vms <your deployment> | grep etcd

Adjust the following script for your etcd hosts: change the values in {} to match the job/index values of your VMs. Run it on any hm9000 server, since it will already have the necessary certs if you are using self-signed certs for etcd:
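A minimal sketch of such a check, assuming the etcd v2 stats endpoints (/v2/stats/self and /v2/stats/leader) on client port 4001. The host names, domain, and cert paths below are hypothetical placeholders, not values from your deployment, so replace them with your own:

```shell
#!/usr/bin/env bash
# Sketch only: host names, domain, port and cert paths are assumptions.
ETCD_HOSTS="${ETCD_HOSTS:-etcd-z1-0 etcd-z1-1 etcd-z2-0}"   # hypothetical {job}-{index} names
ETCD_DOMAIN="${ETCD_DOMAIN:-cf-etcd.service.cf.internal}"   # hypothetical domain
ETCD_PORT="${ETCD_PORT:-4001}"                              # assumed client port
CERT_DIR="${CERT_DIR:-/var/vcap/jobs/hm9000/config/certs}"  # hypothetical cert location

# Build the stats URL for one host and one role ("self" or "leader").
stats_url() {
  echo "https://$1.${ETCD_DOMAIN}:${ETCD_PORT}/v2/stats/$2"
}

# Ask every host for its own state and for its view of the leader.
check_etcd() {
  local host role
  for host in ${ETCD_HOSTS}; do
    for role in self leader; do
      echo "== ${host} ${role}"
      curl -s --max-time 5 \
        --cacert "${CERT_DIR}/etcd_ca.crt" \
        --cert "${CERT_DIR}/etcd_client.crt" \
        --key "${CERT_DIR}/etcd_client.key" \
        "$(stats_url "${host}" "${role}")" || echo "request failed"
      echo
    done
  done
}
```

Source the file and call check_etcd. In etcd v2 the self stats should include a state field and a leaderInfo block, which is what tells you whether a node knows who the leader is.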

For one or more nodes that are not the leader and don't know who the leader is

Delete the /var/vcap/store/etcd/member directory and all of its subdirectories and files

Run monit start etcd and wait for it to come up clean, watching the logs with tail -f /var/vcap/sys/log/etcd/*.log

Re-run the script to validate
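The steps above can be sketched as a small helper, assuming it runs as root on the affected etcd VM. The function names are hypothetical, and the initial monit stop is an assumption that the steps above do not spell out. DRY_RUN defaults to 1 so that nothing is deleted until you have read the output:

```shell
#!/usr/bin/env bash
# Sketch only. With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
MEMBER_DIR="/var/vcap/store/etcd/member"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

reset_etcd_member() {
  run monit stop etcd          # assumption: stop etcd before touching its data
  run rm -rf "${MEMBER_DIR}"   # drop the stale member state
  run monit start etcd
  echo "now watch: tail -f /var/vcap/sys/log/etcd/*.log"
}
```

Call reset_etcd_member once to review the commands, then again with DRY_RUN=0. monit stop may return before the process has exited, so confirm etcd is really down before the delete runs for real.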

You have 450+ runners and etcd runs out of file descriptors

This does happen on large deployments, since the default open-file ulimit is 1024 on every stemcell we've looked at so far. CF v243 uses etcd release v66, which doesn't handle ulimits correctly. This looks like it may be addressed in the newer etcd releases used in newer CF versions.
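To confirm this is what you are hitting, it can help to compare etcd's open descriptor count with its limit. A minimal sketch, assuming a Linux /proc filesystem; fd_usage is a hypothetical helper name, not part of etcd or BOSH:

```shell
#!/usr/bin/env bash
# Print the open file descriptor count and the soft nofile limit for a PID.
fd_usage() {
  local pid="$1"
  local open limit
  # Each entry in /proc/<pid>/fd is one open descriptor.
  open="$(ls "/proc/${pid}/fd" | wc -l | tr -d ' ')"
  # The fourth field of the "Max open files" row is the soft limit.
  limit="$(awk '/^Max open files/ {print $4}' "/proc/${pid}/limits")"
  echo "pid=${pid} open_fds=${open} soft_limit=${limit}"
}
```

As root on an etcd VM, something like fd_usage "$(pgrep -o -x etcd)" should show how close the process is to its limit, assuming the process is named exactly etcd.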

To work around the current problem:

bosh ssh into each etcd VM, then:

sudo -i
monit stop etcd

Verify etcd is down (it is fine for the etcd metrics server to remain up):

ps -ef | grep etcd

Once etcd is stopped on all nodes, discard the etcd cluster db files by deleting the member directory and all of its subdirectories and files:

rm -rf /var/vcap/store/etcd/member

Modify limits.conf:

vim /etc/security/limits.conf

Add in the following:

* soft nofile 4096
* hard nofile 4096

Modify /var/vcap/jobs/etcd/bin/etcd_ctl around line 82 to add the ulimit just before calling the etcd executable: