I recently had a very serious problem.
I called "onevm stop" on a VM to hibernate it into a checkpoint file.
Then I tried to call "onevm resume" to bring it back online.
However, the resume process went wrong.
There can be several reasons for it to go wrong.
For example, libvirt would fail if there is another volume attached to the VM.
But this is not relevant to this thread (I am planning on starting a
new one on this soon).
The key point here is that, as soon as the restore fails, the
OpenNebula code triggers the DEPLOY_FAILURE action in the LCM (LifeCycleManager).
This can be found in src/vmm/VirtualMachineManagerDriver.cc:
399     else if ( action == "RESTORE" )
400     {
401         Nebula &ne = Nebula::instance();
402         LifeCycleManager *lcm = ne.get_lcm();
403
404         if (result == "SUCCESS")
405         {
406             lcm->trigger(LifeCycleManager::DEPLOY_SUCCESS, id);
407         }
408         else
409         {
410             string info;
411
412             getline(is,info);
413
414             os.str("");
415             os << "Error restoring VM, " << info;
416
417             vm->log("VMM",Log::ERROR,os);
418
419             lcm->trigger(LifeCycleManager::DEPLOY_FAILURE, id);
420         }
421     }
The LCM then eventually deletes the images directory, and the user
loses all the precious data he/she has accumulated so far, with
no way to get it back!
So I desperately need to prevent OpenNebula from deleting the precious images.
A quick hack I did was to comment out line 419 above so that the
LCM is not triggered at all. But I am sure this is not clean and we
need more than this.
I am thinking that maybe one needs a way to distinguish a freshly booting VM
from a resuming VM. For now, OpenNebula treats them the same and both are
in the BOOT state.
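A minimal sketch of what I mean, not actual OpenNebula code. The names BootKind, FailureAction, kind_from_driver_action and on_boot_failure are all hypothetical and only illustrate the idea of choosing a different failure path for a restore:

```cpp
// Hypothetical sketch: none of these types or functions exist in OpenNebula.
#include <cassert>
#include <string>

// How the VM entered the BOOT state (assumed distinction, not in the code today).
enum class BootKind { FreshDeploy, Restore };

// What to do when the driver reports that booting failed.
enum class FailureAction { CleanupAndFail, KeepFilesAndFail };

// Map the driver action string to the kind of boot. "RESTORE" is the
// action string used in VirtualMachineManagerDriver.cc; anything else
// is treated here as a fresh deployment.
inline BootKind kind_from_driver_action(const std::string& action)
{
    return action == "RESTORE" ? BootKind::Restore : BootKind::FreshDeploy;
}

// A freshly deployed VM holds no user state yet, so cleanup loses nothing.
// A failed restore still has the checkpoint and disk images on the host,
// so those files are kept.
inline FailureAction on_boot_failure(BootKind kind)
{
    return kind == BootKind::Restore ? FailureAction::KeepFilesAndFail
                                     : FailureAction::CleanupAndFail;
}
```

With something like this, the else branch of the RESTORE handler could ask on_boot_failure(BootKind::Restore) what to do instead of always going down the DEPLOY_FAILURE path.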
So please let me know if what I reported is a bug and if this can be
fixed in the future.
I could submit this on the dev site as well.
Thank you very much.

Associated revisions

bug #265: Failure actions will NOT remove VM files in the host. Host files will be removed from the remote host upon VM resubmission or deletion. This will let sysadmins easily debug any failure or perform forensic analysis.

bug #265: Failure actions will NOT remove VM files in the host. Host files will be removed from the remote host upon VM resubmission or deletion. This will let sysadmins easily debug any failure or perform forensic analysis. (cherry picked from commit c6a8c1fbdcc1d11df23f8ead30a1fd0df3d2630e)

Agree. But the delete action will definitely remove the images, right? When a VM fails, the user still needs to issue the delete command anyway, correct? I guess they don't have to, but the failed VM keeps showing up in the list of VMs, which can be very annoying. So this is not a perfect solution.

Shi

Ruben S. Montero wrote:

Well, we do not want to leave cluster worker nodes with disk images from failed VMs. This will fill the worker node FS and force the admin to delete those volumes/images manually.

Yes, I had not realized that a VM will end up in a failed state and that you have to delete it (and hence delete the images) anyway... Note that delete came after the life-cycle implementation, so back in 1.2, when we did not have a delete action, there was no other means of deleting failed VM images... Have you applied the patch to your system?

This is now implemented. When a failure occurs, VM files are not removed from the host. Host files are cleaned up when the VM is finally deleted or resubmitted. Now sysadmins can easily debug any problems or keep VM images for forensic analysis.
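The resulting behaviour can be summarised with a small sketch. This is only an illustration of the policy described above, not the code from the commit; VmEvent, HostFiles and handle_event are hypothetical names:

```cpp
// Hypothetical model of the cleanup policy; not taken from the OpenNebula sources.
#include <cassert>

// Events that may affect the VM files stored on the host.
enum class VmEvent { Failure, Delete, Resubmit };

// Whether the VM directory (images, checkpoint) still exists on the host.
struct HostFiles
{
    bool present = true;
};

// Apply the policy: a failure leaves the files in place for debugging or
// recovery; only an explicit delete or resubmit removes them.
inline void handle_event(VmEvent event, HostFiles& files)
{
    switch (event)
    {
        case VmEvent::Failure:
            break;                  // keep files on the host
        case VmEvent::Delete:
        case VmEvent::Resubmit:
            files.present = false;  // clean up the host
            break;
    }
}
```

The files survive any number of failures, so the window for recovering a checkpoint or disk image closes only when someone explicitly deletes or resubmits the VM.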