Recovering a Failed Controller Node

What happened?

In an OpenStack deployment with a single controller, using Galera and RabbitMQ for the clustered database and message queue, the controller failed. I still had the message queue and database on the other nodes, but thanks to a looming timeline, pressure from management, and my own mistakes, I ended up with an SD card backing some key logical volumes. The SD card died, and thus I am rebuilding the controller from nothing.

It's important to note that the first step was to fix the automation that built things touching the SD card. This work was completed by some of my colleagues about a month ago (thanks, guys!). I was under pressure to get something done in a hurry, and instead of pushing back, I made the wrong choice and succumbed to the pressure. Now, I must pay. This activity is the cost of tech debt: rushing to a deadline instead of doing it the right way the first time and letting the timeline slip.

Steps to Recovery: Mise en place

First, we need to reinstall the OS with PV/VG/LV configured correctly. Since the deployer profile for this host was fixed after it was built, this part was easy. Next, I ran our Ansible automation for installing OpenStack. Then, I stopped all the OpenStack services, including RabbitMQ and MySQL.

Steps to Recovery: Galera

I configured Galera on the host as described in a previous post and started the service, but recovery did not start. I checked the grastate.dat files on the other hosts in the cluster, and while they were both in sync and at the same change ID/UUID, both had safe_to_bootstrap: 0. I set safe_to_bootstrap: 1 in the grastate.dat file on one of the hosts; normally you'd pick the one with the highest change number (seqno), but since they matched, either node would do. Next, I started recovery again on the host being rebuilt. The newly rebuilt host joined the cluster, which I was able to monitor locally in logs by setting wsrep_debug=ON in the host's config, and remotely on the bootstrap node and the other cluster member with standard show status like 'wsrep%'; commands in the mysql client.
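The grastate.dat check and edit above can be sketched as follows. This is a sketch, assuming the default MariaDB datadir (/var/lib/mysql); the GRASTATE variable lets you dry-run it against a copy of the file first.

```shell
# Sketch of the safe_to_bootstrap fix; assumes the default MariaDB datadir.
# Point GRASTATE at a copy of the file to dry-run it.
GRASTATE=${GRASTATE:-/var/lib/mysql/grastate.dat}
if [ -f "$GRASTATE" ]; then
  # Confirm the uuid and seqno match across the surviving nodes before choosing one.
  grep -E 'uuid|seqno|safe_to_bootstrap' "$GRASTATE"
  # Mark the chosen node as safe to bootstrap the cluster.
  sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' "$GRASTATE"
fi
```

After bootstrapping, show status like 'wsrep_cluster_size'; on any member should tick up as the rebuilt host joins.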

Just to make myself feel a little less nervous, I ran some simple sanity checks on all the Galera nodes and made sure I got the same result:

mysql nova -e 'select count(*) from instances;'

What a relief!

Steps to Recovery: RabbitMQ

Again, I will refer to a previous post on configuration for reference. Fortunately, I did the right thing and folded the changes mentioned there into our Ansible automation for RabbitMQ, which appends them to /etc/rabbitmq/rabbitmq-env.conf.

Since the node that failed was the master of the cluster, and anything lingering in the queues was not worth saving, I simply stopped RabbitMQ everywhere, blew away the mnesia directory, and rebuilt the cluster from scratch.
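The rebuild boils down to the sequence below. This is a sketch with placeholder node names (rabbit1 is the first node brought back up); because the rm is destructive, the run wrapper defaults to only printing commands (DRY_RUN=1).

```shell
# Rebuild a RabbitMQ cluster from scratch; node names are placeholders.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# On every cluster member: stop the service and wipe the mnesia state.
run systemctl stop rabbitmq-server
run rm -rf /var/lib/rabbitmq/mnesia
# On the first node: start fresh; it becomes the new cluster seed.
run systemctl start rabbitmq-server
# On each remaining node: start, then join the seed.
run systemctl start rabbitmq-server
run rabbitmqctl stop_app
run rabbitmqctl join_cluster rabbit@rabbit1
run rabbitmqctl start_app
# Verify on any node.
run rabbitmqctl cluster_status
```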

Steps to Recovery: OpenStack

Next, I needed to run through the configs and make sure that when I started OpenStack services again, I wouldn't explode the world. For the most part, this was a no-op, but it's better to check than to wish I had checked after the damage is done. The first difference I found was in recently merged changes that referred to a separate instance of RabbitMQ for notifications; this was present on the new node, but not on the old nodes. The other difference was in the specification of transport_url, which referred to only a single host instead of the whole RabbitMQ cluster. Once I checked and fixed the configs, I ran through restarting services, and my first attempt to deploy an instance failed with this error:
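For reference, the transport_url check looks like this; hostnames and credentials are placeholders. A clustered transport_url lists every RabbitMQ node so oslo.messaging can fail over:

```shell
# Find transport_url lines across the OpenStack configs and eyeball them.
# A clustered URL should name every RabbitMQ member, e.g.:
#   transport_url = rabbit://user:pass@rabbit1:5672,user:pass@rabbit2:5672,user:pass@rabbit3:5672/
grep -rn '^transport_url' /etc/nova /etc/cinder /etc/neutron /etc/glance 2>/dev/null || true
```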

cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid backend was found. No weighed backends available

I checked openstack volume service list and found duplicate backends for every storage node and backend: one up, one down. I generated a list of the bad ones with openstack volume service list -f value -c Binary -c Host -c State | awk '/down/ {print $1 " " $2}', and then used cinder-manage service remove BINARY HOST, as suggested in this post on Ask OpenStack.
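Glued together, the cleanup is one pipeline. This sketch assumes cinder-manage runs on a host with access to Cinder's config and database; the guard makes it a no-op anywhere the openstack CLI isn't installed.

```shell
# Remove every cinder service the API reports as down.
command -v openstack >/dev/null || { echo "openstack CLI not found; skipping"; exit 0; }
openstack volume service list -f value -c Binary -c Host -c State \
  | awk '/down/ {print $1, $2}' \
  | while read -r binary host; do
      cinder-manage service remove "$binary" "$host"
    done
```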

I tried to deploy an instance again, but no joy.

So let's dial it back and just try creating some volumes. I was able to reproduce the same error. I narrowed it down by specifying the AZ I wanted, and I picked one that is sparse. Next, I bounced nova, cinder, and neutron services on the compute/storage nodes in that AZ. I was finally able to create a volume.

The lesson here is that I probably would have saved myself some grief by restarting the nova, cinder, and neutron agents on the compute/storage nodes immediately after resurrecting my dead controller.

I tried creating a VM again, and got a similar error. I scratched my head a bit on this one. I started walking through the process, and specifically what I was asking Cinder to do: I wanted to create a volume from an image. I double-checked the configs for Glance, and found that there's an autogenerated UUID specifying the swift container to use as a backend. I was able to dig the previous container out of the glance database with a quick query:

MariaDB [glance]> select * from image_locations LIMIT 5;

Once I corrected the swift config, I thought that might resolve the issue, but I was skeptical, since it seems possible that the image_locations table is consulted on each query, not just the container named in the config; this would make sense if there were multiple glance backends in play. I tried again and was again unsuccessful.

I took another step back. I had been attempting to create VMs in Horizon this whole time. I first tried creating volumes in each AZ via the CLI, which succeeded. Next, I tried creating VMs via the CLI, which also succeeded. So that's great, but why? When you create a VM from an image via the CLI, it doesn't create a persistent volume on the default backend (at least with our config), but creates an LVM ephemeral volume instead.

So I could create volumes on the default backend (NetApp over FC) from the CLI, and I could create VMs using LVM storage via the CLI, but I still could not create a VM using the default backend; even worse, I got different errors depending on placement. In one AZ, I got a 'not enough hosts available' error, and in the others, I got 'Block device mapping invalid' errors. I found stuck cinder-scheduler-fanout queues in RabbitMQ, so I restarted every node in the RabbitMQ cluster and restarted the nova and cinder scheduler services on the controller. I also restarted the nova, cinder, and neutron services on a compute node I was watching during testing. This was somewhat successful: it cleared the queues and got rid of my 'not enough hosts available' errors, but I was still unable to provision VMs from Horizon.
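Spotting the stuck queues is straightforward with rabbitmqctl; the grep pattern below is an assumption about the fanout queue naming, so adjust it to what list_queues actually shows. The guard skips the check where rabbitmqctl isn't installed.

```shell
# Look for fanout queues with piled-up messages and no consumers.
command -v rabbitmqctl >/dev/null || { echo "rabbitmqctl not found; skipping"; exit 0; }
rabbitmqctl list_queues name messages consumers | grep -i fanout || true
```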

I spent a little time looking for differences in the things I was doing. When I created a volume via the CLI, I was specifying the type. I checked the cinder config and found I had missed the default_volume_type=xxx option when scrubbing configs. I added the value, scrubbed the configs again just in case, and restarted services on the controller.
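A quick way to confirm the option made it in; the path is the stock cinder.conf location, and the CONF variable lets you point the check at a copy instead.

```shell
# Check that default_volume_type is set in cinder.conf.
CONF=${CONF:-/etc/cinder/cinder.conf}
if [ -f "$CONF" ] && grep -q '^default_volume_type' "$CONF"; then
  grep '^default_volume_type' "$CONF"
else
  echo "default_volume_type not set in $CONF"
fi
```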

Finally, my test instances in each AZ were able to launch successfully from both the CLI and Horizon.

Verification, Continued

Now, I finally begin working through the remainder of my verification list.

Volume CRUD worked fine in all availability zones. We use cross_az_attach=false in /etc/nova/nova.conf to prevent block traffic from traversing between AZs, so it was important to validate Nova/Cinder within each one.
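Per-AZ validation can be scripted as a simple create/show/delete loop. The AZ names are placeholders, and the guard makes this a no-op without the openstack CLI.

```shell
# Smoke-test volume CRUD in each availability zone; AZ names are placeholders.
command -v openstack >/dev/null || { echo "openstack CLI not found; skipping"; exit 0; }
for az in AZ1 AZ2 AZ3; do
  openstack volume create --size 1 --availability-zone "$az" "smoke-$az"
  openstack volume show "smoke-$az" -f value -c status
  openstack volume delete "smoke-$az"
done
```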

It was late by this point in the testing and I was getting tired, so I made a mistake. I accidentally validated the cross_az_attach=false parameter, and got this helpful message:

openstack server add volume 868fbde4-FFFF-FFFF-baae-b132ebe844ed 086f3021-4ea8-FFFF-FFFF-6aa27ef4a069
Invalid volume: Instance 186 and volume 086f3021-4ea8-FFFF-FFFF-6aa27ef4a069 are not in the same availability_zone. Instance is in AZ1. Volume is in AZ3 (HTTP 400) (Request-ID: req-736732d7-FFFF-FFFF-839c-92b23cbca6a8)

The last thing to do was add a stack. I fired off a simple single-server test stack to validate that Heat was working, then checked the output of openstack server list, openstack stack list, and openstack stack show $mystackid.
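The test stack was roughly the shape below; the template contents, image, flavor, and network names are placeholders, not our actual stack.

```shell
# Launch a minimal single-server Heat stack; resource names are placeholders.
command -v openstack >/dev/null || { echo "openstack CLI not found; skipping"; exit 0; }
cat > /tmp/test-stack.yaml <<'EOF'
heat_template_version: 2018-08-31
resources:
  test_server:
    type: OS::Nova::Server
    properties:
      image: cirros
      flavor: m1.tiny
      networks:
        - network: private
EOF
openstack stack create -t /tmp/test-stack.yaml test-stack --wait
openstack stack show test-stack -f value -c stack_status
```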

Lastly, I installed and checked our custom daemon to sync with IPAM, and was able to validate that test nodes I created were getting pushed to IPAM and therefore DNS.

It is with great joy that I can finally publish this post. I wrote this as I worked through the recovery. When it got a little rough I was glad to have notes. Now, I am glad to be done.