Bug Description

On instance provision, if an exception stops the build and the failed build is then deleted, the files pulled from Glance are not cleaned up at all.

Over time, these can pile up and become a very large problem, since there are a lot of junk disks. From what I observed, the disk image is brought down to the host machine and has been scanned into the SR, as there is a VDI record. It's just not removed. We should also add the instance id to the VDI name-description so that we can track which VDIs are associated with which instances. At this point, there's no good way to track and clean up this cruft from the failed builds.

For reference, I'm running rev 1265. The exception I ran into was (nova): TRACE: RemoteError: FixedIpNotFoundForInstance Instance 1 has zero fixed ips. This was due to not having added IPs to the DB yet.

I'm a fan of providing this via the Admin API because then any operational team can decide what to do with that information. We don't have an admin API client (something we need?)

I don't love the periodic task strategy (it feels like bailing out water when you could be finding the leak), but it might be prudent to make a task now, and then a blueprint for converting it to the admin API and for an admin API client which could be run in a cron job?

I'm not certain why I didn't think about this before, but the cleanup method you used in your fix won't work in the situation where the cause of the VM spawn failure is loss of connectivity with the hypervisor. I was testing this by killing XenAPI on the hypervisor, and while this is a more specific and rare case, it might be worth looking at.
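
To illustrate the concern, here is a minimal sketch of one way to cope with cleanup that itself fails because the hypervisor is unreachable: remember the instance and retry later. This is not the code from the fix; `spawn_with_cleanup`, `retry_pending`, and the `pending` set are hypothetical names, and `spawn` and `cleanup` stand in for whatever the driver actually calls.

```python
def spawn_with_cleanup(spawn, cleanup, pending, instance_id):
    """Run spawn; on failure try to clean up, and remember failed cleanups.

    spawn, cleanup: callables taking an instance id (stand-ins for driver code).
    pending: a set of instance ids whose cleanup still has to be retried.
    """
    try:
        spawn(instance_id)
    except Exception:
        try:
            cleanup(instance_id)
        except Exception:
            # Cleanup failed too (e.g. the hypervisor is unreachable),
            # so record the instance for a later retry instead of losing it.
            pending.add(instance_id)
        raise  # the original spawn failure is still reported to the caller


def retry_pending(cleanup, pending):
    """Retry cleanup for every remembered instance; keep the ones that fail."""
    for instance_id in sorted(pending):  # sorted() copies, so discarding is safe
        try:
            cleanup(instance_id)
        except Exception:
            continue
        pending.discard(instance_id)
```

Something like `retry_pending` could be driven by a periodic task or by an admin API call, whichever approach ends up being chosen.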

You're right.
I'm not a great fan of periodic tasks either; ideally, an operational admin API would allow clients such as the dashboard to perform maintenance operations on hypervisors. However, I'm not sure whether there is a place in the OS API for this kind of operation.

At least for the xenapi backend, it would also be worth tagging VDIs created by nova, maybe with a parameter in other-config. This way the cleanup operation, whichever way it is implemented, will remove only orphaned disks created by nova (there might be some orphaned VDIs on the SR used for other purposes).
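
As a rough sketch of how the tagging idea could drive cleanup, the function below picks out only the VDIs that carry a nova tag and whose instance no longer exists. It works on plain dicts shaped like VDI records rather than talking to XenAPI; the other-config keys `nova_managed` and `nova_instance_id` and the function name are assumptions made for illustration, not existing nova conventions.

```python
# Hypothetical other-config keys; nova does not necessarily use these names.
NOVA_TAG_KEY = "nova_managed"
INSTANCE_KEY = "nova_instance_id"


def find_orphaned_nova_vdis(vdi_records, live_instance_ids):
    """Return the refs of nova-tagged VDIs whose instance no longer exists.

    vdi_records: dict mapping a VDI ref to a record dict that holds an
                 'other_config' dict (record shape assumed for this sketch).
    live_instance_ids: iterable of instance ids that nova still knows about.
    """
    live = {str(i) for i in live_instance_ids}
    orphans = []
    for ref, record in vdi_records.items():
        other_config = record.get("other_config", {})
        # Skip anything nova did not tag: it may belong to another user of the SR.
        if other_config.get(NOVA_TAG_KEY) != "true":
            continue
        # A tagged VDI with no instance id, or with an unknown one, is an orphan.
        if other_config.get(INSTANCE_KEY) not in live:
            orphans.append(ref)
    return sorted(orphans)
```

For example, with one tagged VDI for a live instance, one tagged VDI for a deleted instance, and one untagged VDI, only the second would be returned, so disks on the SR used for other purposes are left alone.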