Why do you suppose the VM job became unmanageable? When is "...later"?

Is there anything I can do about any of this?

These VM jobs are more of a pain than anything else in the projects I participate in. They are unstable and lock up my computers, forcing a hard reboot. I hope it's worth it to the three projects that use them, because it sure isn't worth the trouble from my side of the fence.

I am having the same issues. I upgraded to the latest BOINC (7.10.2 x64) with VB. Must be a version problem. Four jobs are all stuck with "VM job unmanageable, restarting later". Restarting BOINC gets them running for a short while before the same message appears again. VB is 5.2.12.

Suspending the project, then stopping BOINC and restarting the rig solves the problem - BUT only for a short time.

While trying this method to recover, I am receiving the message "waiting for slot ..." (I don't remember the complete text).

I noticed that of the four WUs running per rig, one of the four gets postponed while the three others run happily.
After restarting the rig (shutdown and restart) it is the other way around: three of the four wait for slots (?) and the one previously postponed runs nicely.
For a while - then the whole process repeats itself ...

ATLAS seems to be the troublemaker - it is the one that always becomes postponed first, after all four WUs were running for a couple of minutes.
AND, please, don't suggest just not running ATLAS !!

I have this too,
and I think it is a RAM problem.
ATLAS tasks grow dynamically and use more and more RAM.
When there is no more RAM available in the PC, they are postponed...
Any better answer is welcome for us volunteers.

There is, unfortunately, not yet a recent stderr.txt from your WUs that can be analysed.

The older logs as well as the error messages
1. "postponed: VM job unmanageable ..."
2. "Postponed: Waiting to acquire slot directory lock. Another instance may be running"
point to a local problem rather than a general problem.

(1.) could be caused by the RAM setting you configured in your BOINC client.
(2.) could be caused by remains of older crashes.

While it's not clear to me either what is causing your problems, one of your lines caught my eye:

-- fast SSDs (NVMe Samsung 500GB)

Are you really sure you want to crunch LHC VM tasks on an SSD? ATLAS in particular writes tons of data to the disk.
When I started crunching ATLAS with my new PC, which was equipped with an SSD, I quickly figured out that 4 (or was it even only 3?) concurrently running ATLAS tasks were writing up to 200GB of data per day (!). So it was clear to me that the TBW value of the SSD would be reached within a year, if not earlier (although meanwhile one can read in various forums that some people's SSDs have reached several times the rated TBW).

So, once your ATLAS tasks run well again, you might consider moving VM crunching to a separate HDD.
Just my advice - whatever it's worth :-)
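
To put that write volume in perspective, here is a rough back-of-the-envelope sketch in Python. The 200 GB/day figure is the one from the post above; any TBW rating you pass in is your own drive's data sheet value (the function names are only illustrative).

```python
def tb_written_per_year(written_gb_per_day: float) -> float:
    """Convert a daily write volume in GB into TB per year (1 TB = 1000 GB)."""
    return written_gb_per_day / 1000.0 * 365.0


def years_until_tbw(tbw_rating_tb: float, written_gb_per_day: float) -> float:
    """Years needed to write the rated TBW at a constant daily write volume."""
    return tbw_rating_tb / tb_written_per_year(written_gb_per_day)


if __name__ == "__main__":
    # 200 GB/day is the observation quoted above; 150 TBW is a
    # hypothetical rating used only to show the call.
    print(tb_written_per_year(200))
    print(years_until_tbw(150, 200))
```

At 200 GB per day the drive sees roughly 73 TB of writes per year, so how soon the rated TBW is reached depends entirely on the rating of the specific drive.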

Thanks for the advice - but I am absolutely not concerned about the SSDs ...

How many cores do you use for your ATLAS WUs?
How much RAM is configured?

Thanks for your reply:

RAM settings in BOINC are: no more than 90% of total (the rig has 64GB)
No crashes ... as I said no "dead" entries in VirtualBox.
Of course I restarted the host - they don't run 24/7/365 ...
No, I haven't reinstalled BOINC - why should I ?
No, I did not reset the project.

I use one core per WU (ATLAS, Theory, LHCb) - in other words I'm playing it safe!

I would like to point out that I have had other projects running nicely in past years - some of them with up to 16GB RAM usage AND using hyperthreading (2x4 cores) ... (turned off now).

Of course I restarted the host - they don't run 24/7/365 ...

There are two thoughts on this:

- Did you wait long enough between shutting down BOINC and shutting down the computer, so that VB can close down properly? I remember getting the "Postponed: ..." error when, rarely enough, my PC froze or had some other failure, so that VB could not close the way it's supposed to.

- I seem to remember reading somewhere here that if a VB task (regardless of which one: ATLAS, CMS, LHCb, Theory) is interrupted for too long (by shutting down the PC for a while), it's unable to continue properly later on.

So, my guess would be that one of the above points, or both, is the reason for your problems.
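
For the first point, one way to avoid guessing how long is "long enough" is to poll VirtualBox until no VM is running any more before shutting the PC down. Below is a minimal sketch in Python. It assumes `VBoxManage` is on the PATH and that the script runs under the same user account as BOINC (VirtualBox VMs are registered per user); the function names, timeout and poll interval are just illustrative choices.

```python
import subprocess
import time
from typing import Callable, List


def list_running_vms() -> List[str]:
    """Return the non-empty lines printed by `VBoxManage list runningvms`."""
    result = subprocess.run(
        ["VBoxManage", "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def wait_for_vms_to_stop(
    list_vms: Callable[[], List[str]] = list_running_vms,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
) -> bool:
    """Poll until no VM is running.

    Returns True once the list of running VMs is empty,
    or False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if not list_vms():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Typical use would be: suspend the project or exit BOINC, call `wait_for_vms_to_stop()`, and only shut down the computer if it returns True.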

You might have a point there - BUT I had to shut down the computer BECAUSE of the "postponed ..." message (trying to revive the ATLAS WU).
So this "action" of mine probably is the reason for the problems that followed afterwards.

So what is the remedy for the initial problem?
Just don't crunch ATLAS.
Same remedy for CMS.

The problem with the lockfile acquisition is indeed indicative of a crash, or of not giving the VM enough time to shut down properly before shutting down the OS. BOINC will not fix that problem for you. You must find out which slot dir that lockfile is in, then delete it manually from an admin account, though I doubt that action alone will cure all your problems. Sounds like you need to remove the LHC project and then re-add it.
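
To find out which slot dir holds a leftover lockfile, something like the following Python sketch could help. It only lists candidates and deletes nothing. Two things are assumptions here: that the lock file is named `boinc_lockfile`, and that you pass in the correct BOINC data directory (on Windows that is usually `C:\ProgramData\BOINC`, but check your own installation).

```python
from pathlib import Path
from typing import List

# Assumed name of the slot lock file; verify it in your own slot dirs.
LOCKFILE_NAME = "boinc_lockfile"


def find_slot_lockfiles(data_dir: str) -> List[Path]:
    """Return the lock files found directly inside <data_dir>/slots/<n>/."""
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        return []
    return sorted(p for p in slots.glob(f"*/{LOCKFILE_NAME}") if p.is_file())


if __name__ == "__main__":
    # Hypothetical default location; adjust to your BOINC data directory.
    for path in find_slot_lockfiles(r"C:\ProgramData\BOINC"):
        print(path)
```

Run it only while BOINC is stopped: while tasks are running, a lockfile in an active slot is normal, so a hit is only suspicious when nothing should be using that slot.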

ATLAS now has an elapsed time of almost ten hours - and has so far spent the last three hours on 10 seconds of remaining run time !
At the same time the Theory WU is predicting 1 to 2 days (!) of remaining time.
LHCb has upped the estimated time to around 20 hours !

Don't rely on the times.
They are more or less fake as they are based on fixed input parameters.
ATLAS: has longer or shorter batches
Theory, LHCb (,CMS): designed to run 12h+ but time calculation is based on the watchdog limit of 18h

Theory special: Today a new app version has been introduced. It needs a couple of days until your BOINC client corrects the times.

Some stderr.txt files are now available.
Most of them show that your hosts are much too busy (for whatever reason).
Thus your BOINC client, VirtualBox, vboxwrapper and VMs run into timing/priority problems.

There are lots of blank lines in your logs plus messages like this:

2018-06-28 20:54:21 (2308): Powering off VM.
2018-06-28 20:59:22 (2308): VM did not power off when requested.
2018-06-28 20:59:22 (2308): VM was successfully terminated.
2018-06-28 21:13:37 (5104): Powering off VM.
2018-06-28 21:18:38 (5104): VM did not power off when requested.
2018-06-28 21:18:38 (5104): VM was NOT successfully terminated.
2018-06-28 11:19:33 (3948): ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time.
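
If you want to check your own stderr.txt for the same symptoms, a small Python sketch like this one counts the problem messages. The line format is taken from the excerpt above; the list of markers is only the three messages shown there, not a complete list of vboxwrapper errors.

```python
import re
from collections import Counter

# Matches lines such as: "2018-06-28 20:59:22 (2308): VM did not power off ..."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")

# Messages from the excerpt above that point to a struggling host.
PROBLEM_MARKERS = (
    "VM did not power off when requested",
    "VM was NOT successfully terminated",
    "lost communication with VirtualBox",
)


def count_problems(log_text: str) -> Counter:
    """Count how often each problem marker occurs in timestamped log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # blank lines and untimestamped output are skipped
        message = match.group(3)
        for marker in PROBLEM_MARKERS:
            if marker in message:
                counts[marker] += 1
    return counts
```

Feed it the text of a task's stderr.txt, for example `count_problems(open("stderr.txt").read())`, and compare the counts between hosts or before and after reducing the number of concurrent VMs.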

ATLAS 1-core may run with only 3500 MB, but to be on the safe side you may configure 4800 MB via an app_config.xml.
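
A minimal sketch of how such an app_config.xml could be generated with Python follows. The 4800 MB value and the single core come from this thread; the app name `ATLAS` and the plan class `vbox64_mt_mcore_atlas` are assumptions, so compare them with what your BOINC client shows for your ATLAS tasks before using the result.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str,
                     memory_mb: int, ncpus: int = 1) -> str:
    """Build an app_config.xml string with one <app_version> entry.

    The memory size is passed to vboxwrapper through <cmdline>.
    """
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # App name and plan class are assumptions, see the note above.
    print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas", 4800))
```

The output would be saved as app_config.xml in the LHC@home project folder inside the BOINC data directory and picked up after telling the client to re-read its config files or after a client restart.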

Your BOINC client was terminated before it could save all VMs and other relevant files.
How many VMs (all together) do you run concurrently? It seems to be too many.
The same problem occurs when you restart too many VMs concurrently.
This is what Erich56 already mentioned.

Thanks for taking the time to answer me.
But I would like to point out that I am not experiencing, or responsible for, any crashes - as a matter of fact: what is a crash - what do you mean by that?

Let me explain my situation again from the beginning:

1. Started my rigs (three Win7 - one Win10) - all is fine !
2. Started BOINC 7.10.2 - OK !
3. Checked VirtualBox 5.2.12 if anything was left behind - nothing there (had not run VB for some time) !
4. Requested Tasks for LHC (all subprojects checked in LHC prefs) in BOINC - no other projects running !
5. RAM 64GB !
6. SSDs fast and large enough (NVMe Samsung 500GB - more or less empty) !
7. BOINC Options fully "open" - no restrictions !
8. Hyperthreading off - so I have 4 cores !
9. No overclocking !
10. The request for tasks for LHC downloaded one LHC task per core (mixed subprojects - but always one ATLAS included) !
11. After a longer while (10 minutes ? - don't remember) ATLAS gets the "postponed" message - for no reason whatsoever !
12. The other (mixed) three keep on running fine !
13. Checking VB shows ATLAS powered off !
14. As time goes by, the "remaining est. time" keeps on going up - way up !
15. So now I have the situation, that one fourth of each rig is doing nothing - ATLAS blocking one core - nice !
16. Suspended LHC !
17. Waited for VB to shutoff its machines correctly - takes quite long !
18. THEN I stopped BOINC !
19. Waited a while - then LOGOFF for rig user !
20. RESTART for the rig !
21. Went on with point 1. above !
22. In BOINC ticked RESUME for LHC !
23. ALL four WUs start - even the postponed ATLAS WU !
24. After a while (see above point 11.) ATLAS again goes into "postponed ..." !
25. Retried the whole procedure again - same results - EXCEPT this time other non-ATLAS WUs show the other message (slots or something ...), while ATLAS runs ok !
26. Furthermore, the "remaining estimated run time" for all WUs goes up and up (on all rigs) - extremely fast - to 1 day or 2 days and more ... !

SIDE NOTE: Shouldn't these WUs run WITHOUT me/us having to fiddle around with apps etc.? - My 2 cents of griping.

Well, of course they should, and someday maybe they will, but for now they do not, and don't expect even 2 billion cents' worth of griping and exclamation marks at the end of every sentence to change that fact overnight. The programmers are getting it figured out, but they're not perfect, just human. The best we can do as crunchers is accept that reality and work with it patiently. If you can't find the patience, then you need to decide whether or not crunching is for you.

One thing that makes it easier is to walk before you run. And don't EVER "cut the line forcibly" because on an OS as unstable and poorly designed as Windoze you're just asking for tons of trouble. I would consider formatting the drive and reinstalling everything from scratch and this time going with a real OS instead of Windoze. Then learn how to walk before you run. Turning on ALL the applications at this project is likely a mistake. Setting "unlimited cores" would be another mistake.

You seem to have a number of hosts so simplify things by segregating LHC apps by computer which means run ATLAS and nothing but ATLAS on host A, Theory and nothing but Theory on host B, LHCb and nothing but LHCb on host C. Try it that way for a couple months until you get a better handle on what plays nice together and what does not. Then you'll have an idea of how much babysitting is required and whether you have the time and patience to make things even more complex by attempting to mix apps. That's the lovely thing here... you can make it as simple or as complex as you like :)

What works for me (sort of) is a non-VirtualBox machine to run native ATLAS, and then a VirtualBox machine. For the latter, I can usually run Theory and LHCb without a problem, but I would avoid CMS like the plague if you want a simple life.
And I don't normally do Sixtrack at all; it is too easy, and anyone can do it without special software, so I leave it to them.

But what used to work is not working so well now, as everything is falling apart to various degrees. I think the fact of life is that LHC is the most advanced physics project in the world, and it has the most complicated computer and network structure as a necessary part of it. Furthermore, it was not developed for home users running BOINC, as most BOINC projects are, but for the advanced computing capabilities of similar large institutions (Fermilab, etc.) around the world. They probably don't use VBox at all. So we are sort of an afterthought. It is not that they don't appreciate our efforts, but we are the tail of the dog, not the head of it.

On the Top500 list IBM has reached the top with its 200-petaflop Summit computer, which makes heavy use of NVIDIA Tesla GPU boards. The second is a Chinese supercomputer which was first last time; the third is another American computer, Sierra. The Swiss Piz Daint computer, which was third, is now sixth; the best Italian one is thirteenth. Things change rapidly in the supercomputer world.
Tullio