How do I debug a volume stuck in the deleting state?

Today I ran into a scenario where my Icehouse OpenStack installation had existing Cinder volumes that became stuck in the deleting state when I tried to delete them.

After several hours of probing the system, I was able to identify the root cause. It seems that when the Cinder volumes were created, the cinder-volume service was associated with one hostname, but somewhere along the line the hostname of the node hosting the Cinder service changed (e.g. from "cinderHost" to "cinderHost.local"). I don't know when this happened, but when the Cinder service was next restarted, its service name changed to reflect the new hostname.

By itself, this is not horrible: new volumes could be created and deleted with the new Cinder service. The problem was that the existing volumes have data stored in the Cinder database that reflects the service name in use when each volume was created. Since that old cinder-volume service name was no longer in service when the delete was attempted, the volume became stuck in the deleting state.

My question is, how would you debug this efficiently if it were to happen again? I was surprised not to find any ERROR messages in any of the Cinder logs. Enabling Cinder debug logging didn't really help either, since it didn't print anything that looked alarming: nothing about being unable to reach the desired cinder-volume service.

Is the design of cinder such that it just queues requests targeted for a disabled volume service and it does not view the unavailability of the service as an error?

1 answer

Is this a "cinder design" problem? Maybe. It depends on how you are using it.

If your Cinder backend is LVM, then no. Why your hostname changed, I don't know, but that is the problem. All OpenStack knows is that a new cinder-volume service has started checking in and the old one isn't responding. Since LVM is isolated to a single node and is not distributed storage, you wouldn't want volumes remapped to another service automatically; that would be bad.

If you are using NFS, Ceph, GlusterFS, etc., then it is a problem with the Cinder architecture. With those distributed backends, any of the cinder-volume services could in principle manage any of the Cinder volumes, because it is all the same storage. The current (Juno) Cinder architecture doesn't work like that. The volume service that created the volume also manages the volume for its lifetime, so if that service goes down, you can't manage that volume anymore. The workaround is to have several cinder-volume services running with the same hostname. That introduces some potential race issues, but they are rare. This issue is addressed in Kilo.
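Because each volume stays tied to the service that created it, a quick check is to compare the hosts recorded on the volumes against the hosts of the cinder-volume services that are currently up. The following is only a rough sketch under some assumptions: the function name and the two input files are hypothetical, each file holds one host name per line, and you have filled them in yourself (for example from the host column of the volumes table in the Cinder database and from the output of cinder service-list). It matches host names as exact strings.

```shell
# Sketch: list volume hosts that have no matching live cinder-volume service.
#
# Assumed inputs (hypothetical files, one host name per line):
#   $1  hosts recorded on the volumes
#   $2  hosts of the cinder-volume services currently reported as up
orphaned_hosts() {
    awk '
        # First file given to awk: remember every live service host.
        FILENAME == ARGV[1] { up[$0] = 1; next }
        # Second file: print each volume host with no live service, once.
        !($0 in up) && !seen[$0]++
    ' "$2" "$1"
}
```

In the scenario from the question, with volumes recorded under cinderHost and the only live service named cinderHost.local, this would print cinderHost.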

Just like anything else, if your hostname changes, you can't expect the project to absorb that change, nor would you want to.

To debug this issue, turn on verbose and debug in the Cinder conf and restart the cinder-scheduler service; it may provide useful information. If it doesn't, you should see the Cinder RabbitMQ queue climbing: rabbitmqctl list_queues | awk '$2 > 0'
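To narrow that queue listing down to the volume service, something like the sketch below may help. It assumes the queues for the volume service have names beginning with cinder-volume (for example cinder-volume.cinderHost); that prefix and the function name are assumptions, so check the names on your own broker first. It also skips the banner lines that rabbitmqctl may print around the listing.

```shell
# Sketch: read `rabbitmqctl list_queues name messages` output on stdin and
# print only cinder-volume queues that have messages waiting.
# The "cinder-volume" name prefix is an assumption about queue naming.
cinder_volume_backlog() {
    awk '
        # Keep rows whose second column is a number greater than zero;
        # the numeric check drops banner lines such as "Listing queues ...".
        $1 ~ /^cinder-volume/ && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1, $2 }
    '
}
```

It would be used as rabbitmqctl list_queues name messages | cinder_volume_backlog. A count that keeps growing on a queue named after the old host suggests requests are being sent to a service that is no longer consuming them.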