Have you ever had OpenStack do something to your instance that put it in an unbootable state? Did YOU do something to your instance that put it into an unbootable state?

Modern IaaS wisdom teaches us to treat instances like "cattle": we should be able to blow any of them away and replace it at any time. However, we still have dev environments, jump boxes, and the like that get treated as "pets". When those instances get in trouble, we panic.

In today's story, we happen upon an OpenStack admin who decided to try migrating such an instance from one compute node to another to better distribute the memory load. That brings us to another axiom: test OpenStack's migrate feature on a throwaway VM BEFORE attempting to move an instance you care about.

So imagine the ensuing panic when said migration failed with a 401 error. Gulp.

As with any other SNAFU involving nova-compute, the first step is to figure out which compute node the instance is running on and what its virsh domain name is:
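That lookup can be sketched roughly as follows (the `instance-000004d2` domain name here is illustrative; yours will differ):

```shell
# As an admin, the extended server attributes reveal where the instance lives.
# (UUID taken from this story; substitute your own.)
nova show 928907ae-4711-4863-9add-cff4f0ff161e | grep OS-EXT-SRV-ATTR
# OS-EXT-SRV-ATTR:host          -> the compute node
# OS-EXT-SRV-ATTR:instance_name -> the virsh domain name, e.g. instance-000004d2

# Then SSH to that compute node and try to start the domain directly:
virsh list --all
virsh start instance-000004d2
```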

Not yet. It looks like the OS has decided that something is amiss - possibly a corrupted root filesystem?!? All you need to do is type the root password and... oh wait, this is a cloud image. You don't KNOW the root password!

NOTE: It was brought to my attention after posting this that the next logical step is nova rescue, which boots the instance from a rescue image with the original boot disk attached as a secondary device, so I can perform whatever repairs I need. Try that first. If nova rescue does not work for your particular situation, read on.
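For completeness, the rescue workflow looks roughly like this (the rescue image name is an assumption; any bootable image you have registered will do):

```shell
# Put the instance into rescue mode; it reboots from the rescue image
# with its original root disk attached as a secondary device.
nova rescue --image CentOS-7-GenericCloud 928907ae-4711-4863-9add-cff4f0ff161e

# ...log in via the console, mount the original disk, fix things...

# When done, boot back from the original disk:
nova unrescue 928907ae-4711-4863-9add-cff4f0ff161e
```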

If this were your desktop, you'd simply pop the CentOS 7 DVD into the drive and attempt recovery. Let's do that!

Back on the compute node, use virsh to add a cd-rom drive to your instance:
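One way to do that is with `virsh attach-disk` (the ISO path and the `instance-000004d2` domain name are assumptions for illustration):

```shell
# Shut the domain down first so the change takes effect on next boot.
virsh shutdown instance-000004d2

# Attach the install ISO as a read-only IDE CD-ROM, persisted in the config.
virsh attach-disk instance-000004d2 \
  /var/lib/libvirt/images/CentOS-7-x86_64-DVD.iso hdc \
  --type cdrom --mode readonly --config

# To make the CD-ROM boot first: run `virsh edit instance-000004d2` and add
# <boot dev='cdrom'/> above <boot dev='hd'/> inside the <os> element.
```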

Then you can go into Horizon, click the Console link for your instance, and operate the console from there. From the console, click Send CtrlAltDel to restart your instance and boot from the ISO.

You may be tempted to say "YAY! I can finally fix the filesystem and boot my instance - almost there." Then some jerk keeps restarting your instance before you can run fsck or xfs_repair. That prankster is nova-compute. To tell it to "cut it out", simply reset the state on the instance after you hit Send CtrlAltDel.

# nova reset-state --active 928907ae-4711-4863-9add-cff4f0ff161e

Do what you need to do: set the root password (this may help later), restart your instance from the local disk, and fix what's wrong.
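From the install media's rescue environment, the repair might look something like this (the device and mount paths are assumptions; check `lsblk` output first):

```shell
# Identify the root filesystem on the instance's disk.
lsblk

# xfs_repair must be run against an UNMOUNTED filesystem.
xfs_repair /dev/vda1

# Mount it and chroot in to set a known root password.
mount /dev/vda1 /mnt
chroot /mnt passwd root
umount /mnt
```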

The underlying problem that caused all of this turned out to be twofold. First, xfs_repair found some errors in the root filesystem and promptly fixed them. Second, a block device I was using for data storage hadn't detached cleanly. In fact, early in the process I went to Horizon and detached that block device when virsh start didn't initially work, planning to reattach it once I determined all was well with the OS. However, during boot the OS was still trying to mount the device per its /etc/fstab, and that wasn't apparent from what I was seeing at the console.
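A simple guard against this class of boot failure is adding `nofail` to the data volume's /etc/fstab entry, so a missing device doesn't halt boot (the device path and mount point here are assumptions):

```
# /etc/fstab - data volume entry, tolerant of the device being absent at boot
/dev/vdb1  /data  xfs  defaults,nofail  0  0
```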

When finished, make sure you cleanly power down your instance, go back to the compute node, and use virsh to remove the changes you made to get the cdrom drive working. Then start your instance back up from the Horizon UI.
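The cleanup mirrors the setup (again, the domain name and CD-ROM target are illustrative):

```shell
# Detach the CD-ROM drive from the persistent config.
virsh detach-disk instance-000004d2 hdc --config

# If you added <boot dev='cdrom'/> via `virsh edit`, remove it there too.
```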

Also - You should probably reset that root password.

Anyway, this may have gone from Buffalo to New York by way of Chicago, but at least now we know what lengths we can go to when something goes south on an instance we care about (and ideally, we SHOULDN'T care about any of them).