by Gustavo Niemeyer

Recovering a bootable EBS image

Scott Moser has just announced this week that the new Ubuntu images which boot out of an EBS-based root filesystem in EC2, and thus will persist across reboots, are available for testing.

As usual with something that just left the oven and is explicitly labeled for testing purposes, there was a minor bug in the first iteration of images which was even mentioned in the announcement itself. The bug, if not worked around as specified in the announcement, will prevent the image from rebooting.

Having an bootable EBS image which can’t reboot is a quite interesting (and ironic) problem. You have an image which persists, but suddenly you have no way to see what is inside the image anymore because you can’t boot it. Naturally, even if the said bug didn’t exist in the first place, it’s fairly easy to get into such a situation accidentally if you’re fiddling with the image configuration.

So, in this post we’ll see how to recover from a situation where a bootable EBS image can’t boot.

Getting started

To start this up, we’ll boot one of the EBS images which Scott mentioned in his announcement: ami-8bec03e2. As we see in the output of ec2-describe-images, this is an EBS-based image for i386:

There we go. We got an instance allocated in the availability zone us-east-1c. It’s important to keep track of this information, since EBS volumes are zone-specific.

As part of the above command, we must have been allocated an EBS volume automatically, and it should be attached to the instance we just started. We can investigate it with the ec2-describe-volumes command:

Now, we’ll get into the running instance and do some arbitrary modifications, just as a way to demonstrate that the data we don’t want to lose actually survives the recovering operation. Note that the domain name is obtained with the ec2-describe-instances command.

ubuntu@domU-12-31-39-0E-A0-03:~$ sudo reboot
Broadcast message from ubuntu@domU-12-31-39-0E-A0-03
(/dev/pts/0) at 20:18 …
The system is going down for reboot NOW!

Note that we didn’t actually fix the problem reported by Scott, so our machine won’t really reboot. If we wait a while, we can even see that the problem is exactly what was reported in the announcement (note it really takes a bit for the output to be synced up):

Alright, now what? Machine is dead.. and can’t reboot. How do we get to our important data?

Fixing the problem

The first thing we do is to stop the instance. Do not terminate it, or you’ll lose the EBS volume! After stopping it, we’ll detach the EBS volume that was being used as the root filesystem, so that we can attach somewhere else.

Now, we need to attach this volume in an image which actually boots, so that we can fix it. For this experiment, we’ll pick one of the daily Lucid images, but we could use any other working image really. Just remind that the image must be running in the same availability zone as our previous instance, since the EBS volume won’t be accessible otherwise.

With the instance running and the EBS root device attached with an alternative device name, we can then login to fix the original problem which prevented the image from booting correctly. In our case, we’ll simply do what Scott suggested in the announcement.

Done! Our EBS volume is now correct, and it should boot alright. We’ll detach the volume from the temporary instance we created, and will reattach it back to the old bootable EBS instance which is stopped. Note that we won’t yet terminate the temporary instance, because we may need it in case something else is still wrong, and we are already paying to use it for the hour anyway. We just have to remind ourselves to terminate it once we’re fully done.

Okay! It should all be good now. It’s time to restart our instance, and see if it is working. Note that since you stopped and started the instance, the public domain name most probably has changed, and thus we need to find it out again with ec2-describe-instances once the instance is running.

Concluding, in this post we have seen how to fix a bootable EBS machine which can’t actually boot. The technique consists of detaching the volume from the stopped instance, attaching it to a temporary instance, fixing the image, and then reattaching it back to the original image. This back and forth of EBS volumes is quite useful in many circumstances, so keep it in your tool belt.