LHC has stopped sending me ATLAS tasks; the last one arrived on 17 Oct. The host is running an older version of VBox (5.1.30), but that shouldn't be a problem: it ran a dozen or so 2.00-version tasks successfully before I stopped receiving them. I get Theory, CMS and sixtracktest tasks without a problem, but ATLAS requests are rejected with the response 'No ATLAS tasks available' in spite of continuous requests, while the server status page shows plenty available and over 10000 in progress.

Harri,
ATLAS-VM currently has more than one problem getting work to us.
One test on -dev was to use only 6.0.x for the new CentOS image.
Now it is 5.2.32; you can see this in the log of an older finished task.
The main problem at the moment is that Thursday's change has been rolled back to the previous version.
David is away this weekend, so we have to wait until Monday for this to be cleared up.
ATLAS -native is running for the moment.

The recent ATLAS vdi includes VBoxGuestAdditions 5.2.32, which should work even with a more recent VirtualBox version on the host, although it is recommended to keep guest additions and host in sync.

What makes me wonder:
The timestamps of /opt/VBoxGuestAdditions-5.2.32 and everything below it inside the vdi are 2019-09-12, while David introduced v2.0 on 2019-10-09, including a new Linux kernel.
I suspect that the vdi's guest additions need to be recompiled to fit the new kernel.

The vdi is at kernel version 3.10.0-957.27.2.el7.x86_64, and this is the kernel against which the VBox additions were compiled.

I think Jonathan's stuck task was a victim of the "top" change that was enabled on Thursday and reverted on Friday, rather than a virtualbox version problem. I'm trying to figure out what the problem was with that change.

The other problem at the moment is that the server is not handing out many tasks; I had to click a few times on my client before I finally got a single task. That one, however, is working normally.

Maybe I missed it, but I never saw a word from the project team that VirtualBox 6.x is okay to use with ATLAS, and until then I will stay with 5.x.

David,

which version of VirtualBox do you want us to use for the V2?

Latest 6.x or 5.x?

The answer is use whatever works :)

I myself have tested with 5.2.32 and 6.0.12 successfully on Linux, but I don't have the means to test all combinations of versions and operating systems. If you find a version that works for you then it's fine to stick with it.

The image itself was created with 5.2.32 because problems were reported during the testing on LHC-dev when an image from 6.0 was used.

I've got 2 Atlas (1T) tasks, each of them running for over 7 days. Console shows:

Total number of events to be processed: 200
Total number of events already finished: 54
Time left: 0d 14h 38m
---
Last finished...
worker 1: Event nr. 54 took 397.5 s. New average 348.4 +- 9.882

top shows athena.py @ 99% CPU

The "New average" shows "+- 9.882".
It is very unusual for this value to be that large.
It suggests there must have been at least one long-runner (a very, very long one).
The value is logged directly by the scientific app.

All other values look normal, especially "athena.py @ 99% CPU", but there is still a bit of work to do.

I would let it run until it finishes or it hits the BOINC due date.

<edit>
Sorry, my fault.
It's a "." rather than a "," (in German, "." is the thousands separator, so 9.882 reads like 9882).
A typical German misinterpretation.
But then you are right to ask what the machine has been doing.
</edit>

- A small script is automatically run
- This script checks CVMFS by running the probe command
- If the probe fails, it logs messages like the ones you saw but then continues (I changed this so it does not fail the job, because the probe failure can be temporary)
- The script then copies another script (the "bootstrap script") from CVMFS and runs it. The bootstrap script takes care of setting everything up for the task then starting the real work

It is done like this so that we can make changes simply by putting a new version of the bootstrap script on CVMFS instead of having to create a new VM image and app version each time.

I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever, but I'll need to make changes in the VM and make a new app version. I will look into it next week, since I am away at a conference this week.
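The launcher flow described above, with the proposed timeout around the bootstrap copy, could be sketched roughly like this. This is only a sketch: the repository name, the script paths, and the 300-second timeout are my assumptions, not the real image layout.

```shell
#!/bin/bash
# Hypothetical sketch of the launcher flow; all paths and names are assumed.

CVMFS_REPO="${CVMFS_REPO:-atlas.cern.ch}"                            # assumed repository
BOOTSTRAP_SRC="${BOOTSTRAP_SRC:-/cvmfs/atlas.cern.ch/bootstrap.sh}"  # assumed path
BOOTSTRAP_DST="${BOOTSTRAP_DST:-/tmp/bootstrap.sh}"
COPY_TIMEOUT="${COPY_TIMEOUT:-300}"                                  # seconds, assumed value

probe_cvmfs() {
    # Log a probe failure but continue, since the failure may be temporary.
    if ! cvmfs_config probe "$CVMFS_REPO" 2>/dev/null; then
        echo "CVMFS probe failed, continuing anyway" >&2
    fi
}

copy_bootstrap() {
    # timeout(1) guards against a CVMFS mount that hangs forever.
    timeout "$COPY_TIMEOUT" cp "$BOOTSTRAP_SRC" "$BOOTSTRAP_DST" &&
        chmod +x "$BOOTSTRAP_DST"
}

run_task() {
    probe_cvmfs
    copy_bootstrap || { echo "bootstrap copy failed or timed out" >&2; return 1; }
    "$BOOTSTRAP_DST"    # the bootstrap script sets everything up and starts the real work
}
```

With this shape, a hung CVMFS copy makes the task fail fast instead of sitting forever at 99% CPU doing nothing.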

- If the probe fails, it logs messages like the ones you saw but then continues (I changed this so it does not fail the job, because the probe failure can be temporary)

So, does the script check it again, or does something else try to download a job regardless? What's the interval?

I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever, but I'll need to make changes in the VM and make a new app version. I will look into it next week, since I am away at a conference this week.

Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" lines are raised (and those three lines must be consecutive).
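Such a watchdog could look roughly like this (a sketch only: the stderr.txt location and the abort step depend on the actual BOINC setup, and the project URL is left as a placeholder):

```shell
#!/bin/bash
# Hypothetical watchdog sketch: detect three *consecutive* failed CVMFS
# probes in a task's stderr.txt. Paths and the abort step are assumed.

has_three_consecutive_failures() {
    # Count a run of "Probing /cvmfs/... Failed!" lines; any other line
    # resets the run; succeed once the run reaches 3.
    awk '
        /Probing \/cvmfs\/.* Failed!/ { if (++run >= 3) { found = 1; exit }; next }
        { run = 0 }
        END { exit !found }
    ' "$1"
}

check_task() {
    local stderr_file="$1" task_name="$2"
    if has_three_consecutive_failures "$stderr_file"; then
        echo "would abort $task_name: 3 consecutive CVMFS probe failures"
        # boinccmd --task <project_url> "$task_name" abort   # the real action
    fi
}
```

Requiring the three lines to be consecutive (the reset on any other line) is what keeps a task with occasional, recovered probe failures from being aborted.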

CVMFS configuration usually lists several servers, used either
- as a main server followed by spare servers that take over if it fails, or
- as a list of servers ranked by the CVMFS geolocation API, in which case the nearest server is used.

CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1ral-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1bnl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1unl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/@fqrn@;http://s1ihep-cvmfs.openhtc.io/cvmfs/@fqrn@"
# set to 'yes' activates the geo API, set to 'no' deactivates it
CVMFS_USE_GEOAPI=yes

The project team should check that CVMFS_SERVER_URL lists at least 4 servers; then it is very unlikely that all of them fail at the same moment.
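A volunteer can also probe the listed servers directly. This sketch substitutes atlas.cern.ch for @fqrn@ as an example and fetches .cvmfspublished, the small manifest file every CVMFS server exposes at the repository root:

```shell
#!/bin/bash
# Sketch: probe each configured stratum-1 for the repository manifest.
# Server list taken from CVMFS_SERVER_URL above; the repo name is an example.

REPO="atlas.cern.ch"
SERVERS="http://s1cern-cvmfs.openhtc.io/cvmfs/$REPO
http://s1ral-cvmfs.openhtc.io/cvmfs/$REPO
http://s1bnl-cvmfs.openhtc.io/cvmfs/$REPO
http://s1fnal-cvmfs.openhtc.io/cvmfs/$REPO"

check_server() {
    # .cvmfspublished is the tiny signed manifest at the repository root.
    curl -sf --max-time 10 "$1/.cvmfspublished" >/dev/null
}

check_all() {
    local url
    for url in $SERVERS; do
        if check_server "$url"; then echo "OK   $url"; else echo "FAIL $url"; fi
    done
}
# Usage: run check_all; as long as at least one server answers OK,
# the client should be able to get the repository from somewhere.
```

If all servers fail here, the problem is almost certainly on the client side (firewall, DNS, router), not a simultaneous outage of every stratum-1.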

Client-side issues could be:
- wrong firewall settings, e.g. closed ports or filtered destinations
- slow DNS resolution
- high load on the router (not the same as high bandwidth usage!) causing timeouts
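The DNS point is easy to check with a rough lookup timer (a sketch; the host name in the example is just one of the servers from the list above, and %N requires GNU date, as on Linux):

```shell
#!/bin/bash
# Sketch: time a name lookup in milliseconds using GNU date's %N.

dns_lookup_ms() {
    local start end
    start=$(date +%s%N)
    getent hosts "$1" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Repeated lookups taking hundreds of ms or more point at a resolver
# problem on the client or the router.
# Example: dns_lookup_ms s1cern-cvmfs.openhtc.io
```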